doppler

join:2003-03-31
Blue Point, NY

1 recommendation

Theoretical question about MD5 sums (not about how insecure it is)

I used to use CRC-32 plus the size of a file to determine if files are the same. This proved
to be lacking, so I rewrote some of my database to use the MD5 sum of each file.

The theoretical question is: is it possible for two files to hash out the same under MD5 and
not be the same file at all?

2^128 is a very large number. If a collision is possible, how likely is one without really
trying to get a match?

This question is not about how secure MD5 is; it has been proven not to be.
Using a SHA version of hash is just too overboard for my usage.

Thanks


switchman

join:1999-11-06

1 recommendation

reply to doppler
I think the OP was referring to Hash Collisions. See the link below

»www.mscs.dal.ca/~selinger/md5collision/

doppler

join:2003-03-31
Blue Point, NY
said by switchman:

I think the OP was referring to Hash Collisions. See the link below

»www.mscs.dal.ca/~selinger/md5collision/

This is why MD5 is insecure. Collisions!

Without trying to create a collision, how likely could one occur
naturally? That's my real question.


novaflare
The Dragon Was Here
Premium
join:2002-01-24
Barberton, OH
I would say that with as many files as there are out there, it has almost had to have happened naturally. Yes, 2^128 is massively huge, but given the total number of files on the web alone, not even counting internal documents, I would think it has happened at least once.

How likely? Not very. MD5 is a case of being good enough for the majority of users for day-to-day stuff. Mostly it has come to be used for checking that a file is not corrupted, and less for security, at least in day-to-day things.


jvmorris
I Am The Man Who Was Not There.
Premium,MVM
join:2001-04-03
Reston, VA
kudos:1
reply to doppler
See the table near the bottom of »en.wikipedia.org/wiki/Secure_Hash_Algorithm . If I recall correctly, there were published articles circa 2000 (if not earlier) showing that it was possible to construct a file with any desired MD5 hash, including that of another file. Now, the real question is whether you can construct a file of the same size that carries usable information differing from the original. I never saw that question addressed, but then I went to SHA-1 around 2000.

Computing the SHA-1 hash of a file is not difficult, and I was doing it routinely on 8-bit microcomputers. Indeed, I'd hash every file on the computer and then search for duplicates back then. (That got tedious, but it was certainly doable.)

A flaw in SHA-1 was documented sometime prior to 2010 and its further use was deprecated by that date.
--
Regards,
Joseph V. Morris


Pjr
Don't Panic

join:2005-12-11
UK
reply to doppler
said by doppler:

Using a SHA version of hash is just too overboard for my usage.

Why not use SHA if you're concerned about the very remote chance of a collision?
--
Overflow error in /dev/null


jvmorris
I Am The Man Who Was Not There.
Premium,MVM
join:2001-04-03
Reston, VA
kudos:1
reply to switchman
said by switchman:

I think the OP was referring to Hash Collisions. See the link below

»www.mscs.dal.ca/~selinger/md5collision/

That's a very good article, well worth reading. Thank you.
--
Regards,
Joseph V. Morris

dave
Premium,MVM
join:2000-05-04
not in ohio
kudos:8
Reviews:
·Verizon FiOS

1 recommendation

reply to doppler
said by doppler:

This is why MD5 is insecure. Collisions!

All hash algorithms have collisions.

1. If the hash size is smaller than the plaintext size, there must be collisions, by simple numerical fact.

2. If the hash size is larger than the plaintext size, it's not much use as a hash function.

The security question is, how difficult is it to modify the plaintext in any desired way while preserving the same hash?
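Both of dave's points can be demonstrated directly. The sketch below truncates MD5 to 16 bits, purely for illustration, so that the pigeonhole collisions he describes become easy to find by brute force; the full 128-bit digest behaves the same way, just with astronomically more inputs needed.

```python
import hashlib

def tiny_hash(data: bytes) -> str:
    """First 4 hex chars (16 bits) of the MD5 digest."""
    return hashlib.md5(data).hexdigest()[:4]

seen = {}        # truncated digest -> first input that produced it
collision = None
for i in range(200000):
    msg = b"message-%d" % i
    h = tiny_hash(msg)
    if h in seen and seen[h] != msg:
        collision = (seen[h], msg, h)
        break
    seen[h] = msg

# With only 2^16 buckets, a collision is guaranteed well before
# 200000 distinct inputs (pigeonhole) and typically shows up after
# a few hundred (birthday effect).
print(collision is not None)  # True: two distinct inputs, one 16-bit digest
```

Since the inputs here are all distinct, the matching digests come purely from the hash being smaller than the input space, which is dave's point 1.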


novaflare
The Dragon Was Here
Premium
join:2002-01-24
Barberton, OH
Well, he's not even worried about the security side of things, just about how common or uncommon a real-world collision may be without a person trying to cause one. I would say not very likely, but it has to have happened at some point. The real question is whether it ever affected anyone; that would be even more unlikely and possibly has not happened yet. But at some point in time it has to happen.

In fact, the only way to prevent it is to stop using MD5 before it does happen. While the number is huge, it is finite, meaning that given enough time a collision must, by its nature, occur.

What I do know is that most sites I download from say to use the hash to check the download and make sure the file is OK. That is not so much a security measure as one to save you a return trip. How many times have you, me, or anyone downloaded a file at work and found out when we got home that it was corrupted?


therube

join:2004-11-11
Randallstown, MD
Reviews:
·Comcast
·Verizon Online DSL

2 edits
reply to doppler
> Is it possible for two files to hash-out using MD5 and not at all be the same?

Yes.

> How likely without really trying to get a match?

IMO, not likely, so MD5 (or anything greater) should be more than sufficient.

If you examine the demo files (hello, erase, linked to by switchman), you will see there are only 6 bytes that differ. But... did they occur by "chance," or were they "forced"?

doppler

join:2003-03-31
Blue Point, NY
reply to Pjr
said by Pjr:

Why not use SHA if you're concerned about the very remote chance of a collision?

Ever look at the back of a lotto megabucks ticket? It always quotes really long
odds against you winning anything.

Then I always remember the guy who won a big pot, only to win again the following
year. The news always reports this as a billions-to-one chance of happening. Before you
say "yeah, it's bound to happen": this has happened more than once!

Theoretical chances and odds are always in action. Do I cross the street now, or is
there a chance I get hit by a car? Getting hit by lightning is not as rare as once
thought. NASA once thought a problem with the shuttle was extremely remote,
not listening to their engineers, who finally put the odds at 1 in 200. We all know
how that story turned out. There is a great TV movie on this subject; please
view it. It says a lot about how dangerous P.C. is when playing the odds.

dave
Premium,MVM
join:2000-05-04
not in ohio
kudos:8
Reviews:
·Verizon FiOS
reply to doppler
said by doppler:

How likely without really trying
to get a match?

Assuming that MD5 has a good distribution of hashes, and assuming random input data, the likelihood of two given files colliding is 1 in 2^128.


novaflare
The Dragon Was Here
Premium
join:2002-01-24
Barberton, OH
Here's a question: how many different files are out there on the net? I bet the number is greater than 2^128 by a fair margin.


Snowy
Premium
join:2003-04-05
Kailua, HI
kudos:6
Reviews:
·Time Warner Cable
·Clearwire Wireless
reply to doppler
said by doppler:

Theoretical chance and odds are always in action. Do I ...

There are really two questions here that don't share a common answer, IMO.
1. Can a collision happen?
2. Can it happen to me?


leibold
Premium,MVM
join:2002-07-09
Sunnyvale, CA
kudos:10
Reviews:
·SONIC.NET
reply to doppler
said by doppler:

2^128 power is a very large number. If it's possible. How likely without really trying to get a match?

I know of one company making a storage product with a de-duplication feature who found that "extremely unlikely" turned into "often" once the number of files being stored by their customers became very large. They doubled the size of the hash to greatly reduce the frequency of those hash collisions (there is no real fix for this kind of problem).

For those that perhaps don't know what de-duplication is and does: it basically discards any data that has a matching hash signature as a duplicate of data that is already stored. For performance reasons, typically only the signature is compared, not the actual data!
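What leibold describes might be sketched like this; the class and names are invented for illustration, and the point is the signature-only comparison.

```python
import hashlib

class DedupStore:
    """Toy signature-only de-duplicating store."""
    def __init__(self):
        self.by_digest = {}   # MD5 hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        key = hashlib.md5(data).hexdigest()
        # Signature comparison only: if the digest matches, the new data
        # is silently discarded as a duplicate -- this is exactly where
        # an MD5 collision would lose data.
        if key not in self.by_digest:
            self.by_digest[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self.by_digest[key]

store = DedupStore()
k1 = store.put(b"hello world")
k2 = store.put(b"hello world")            # duplicate: stored only once
print(k1 == k2, len(store.by_digest))     # True 1
```

A colliding input would return an existing key and its bytes would vanish, which is why doubling the hash size only shrinks the odds rather than fixing the problem.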
--
Got some spare cpu cycles ? Join Team Helix or Team Starfire!


jvmorris
I Am The Man Who Was Not There.
Premium,MVM
join:2001-04-03
Reston, VA
kudos:1
Yes, that's one of the uses of NIS FileCheck, which is now over ten years old. Even then, it was all too easy for a single PC to have duplicate files on it without the user being aware of it.

The real problem occurs when a user updates one copy of a file but then retrieves the other (un-updated, if that's a word) copy without realizing it.
--
Regards,
Joseph V. Morris


novaflare
The Dragon Was Here
Premium
join:2002-01-24
Barberton, OH
reply to leibold
said by leibold:

typically only the signature is compared not the actual data!

Um, doh! Not good. I can see very important data getting tossed like this...

OMG, what if Google tried this? It could decimate Google and most of the net, lol.

But yeah, it has to happen. BTW, 2^128 is not 1 followed by 128 zeros; it is a 39-digit number:

340,282,366,920,938,463,463,374,607,431,768,211,456

While that is a stupidly huge number, it is not so big as to make a natural collision impossible, or even all that unlikely. The real question is how often it will happen in practice; in other words, with a million users downloading from a million servers, how often will it occur?

I think it has happened and has probably just escaped detection because it had no ill effect. Honestly, how could you even tell? Unless the program or file you got was not the right one, you would not notice it. If it did happen and it was the wrong file, the person may have chalked it up to not knowing how to use the checksum, in which case it was detected but ignored.


Snowy
Premium
join:2003-04-05
Kailua, HI
kudos:6
Reviews:
·Time Warner Cable
·Clearwire Wireless
reply to leibold
said by leibold:

For those that perhaps don't know what de-duplication is and does, it basically discards any data that has a matching hash signature as a duplicate of data that is already stored. For performance reasons, typically only the signature is compared not the actual data!

I wasn't aware that was a commercial service; thanks.
I suppose the dupes are secluded so a set of human eyes can make the final determination, rather than just being discarded?


therube

join:2004-11-11
Randallstown, MD
Reviews:
·Comcast
·Verizon Online DSL

1 recommendation

> suppose the dupes are secluded so a set human eyes can make the final determination

I would doubt it.
If hashes were not sufficient, they'd just compare the files byte-by-byte to confirm equality.

I would think it fairly common to do this.

"One" source file, "owned" by many. On Megaupload (or whatever they're now called), someone uploads a video and the service generates a hash; someone else uploads the same video, the hashes are compared, a duplicate is found, so the second upload is deleted and the second uploader gets a link instead, all transparent to the user. A third user uploads the same video but with a one-byte difference; that creates a new, unique hash, so the file is stored. Or, suspecting the files to be very similar, the service could do a diff of the two and store only the diff of the second, recreating a "whole" file either from the original source, or from the source plus diff, as needed for each particular user.


novaflare
The Dragon Was Here
Premium
join:2002-01-24
Barberton, OH
You're talking about what is called byte-level patching, essentially. Most games use this method to decrease the size of updates. I have been in so many beta tests, though, that it has become a habit to do a fresh install after a few large updates, since with byte-level patching errors can make their way in and build up over time.

For vids, I suspect it would take a lot of byte-level patches to cause any real problem.

I know that as an end user I would not want to be downloading a program that was byte-level patched with each new upload by the developer. But I'm getting off track here, I think.

dave
Premium,MVM
join:2000-05-04
not in ohio
kudos:8
Reviews:
·Verizon FiOS
reply to Snowy
said by Snowy:

I suppose the dupes are secluded so a set human eyes can make the final determination rather than actually just being discarded?

No - typically there are no 'files' at that level. Much dedup occurs at the 'block' level: if you already have that block stored, no need to store it again.

So one logical block on storage might be a member of several files, quite possibly files with different content except for the coincidence of a couple of blocks.

Commercial storage vendors have been doing this for years.

Windows Home Server has always dedup'd backups.
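A hedged sketch of the block-level dedup dave describes: files are split into fixed-size blocks, each distinct block is stored once by digest, and a file becomes just a list of block digests. The block size and names are illustrative; real products chunk and index far more cleverly.

```python
import hashlib

BLOCK = 4096
blocks = {}            # digest -> block bytes, stored once, shared across files

def store_file(data: bytes) -> list:
    """Split data into fixed-size blocks; return the file's 'recipe'."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        digest = hashlib.md5(chunk).hexdigest()
        blocks.setdefault(digest, chunk)   # identical blocks stored only once
        recipe.append(digest)
    return recipe

def read_file(recipe: list) -> bytes:
    """Reassemble a file from its block digests."""
    return b"".join(blocks[d] for d in recipe)

a = store_file(b"A" * 8192 + b"tail-1")
b = store_file(b"A" * 8192 + b"tail-2")   # shares the repeated "A" block
print(len(a) + len(b), len(blocks))       # 6 block references, 3 unique blocks
```

This shows dave's point: one stored block (the run of "A"s) is a member of both files, even though the files differ overall.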


Snowy
Premium
join:2003-04-05
Kailua, HI
kudos:6
Reviews:
·Time Warner Cable
·Clearwire Wireless
said by dave:

So one logical block on storage might be a member of several files, quite possibly files with different content except for the coincidence of a couple of blocks.

If I ever had a question of
"Am I a Geek?"
I need not look any further than this thread to find the answer.

I wasn't aware of the science and widespread implementation that goes into de-duping prior to this thread, but I'm totally fascinated by the data Google returns on the subject.


dslcreature
Premium
join:2010-07-10
Seattle, WA
reply to doppler
Doppler, yes; the probability of a collision depends mostly on the number of separate entries in your database that are required to be unique. To give some idea of the chances involved:

One-in-1-billion chance of an MD5 collision @ ~825 trillion independent entries
One-in-1-trillion chance @ ~26 trillion entries
One-in-10-trillion chance @ ~8 trillion entries
One-in-100-trillion chance @ ~3 trillion entries

Obviously, more entries bring a higher chance of a collision between some pair. Google 'birthday problem' for details on calculating the chances.
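Those figures line up with the standard birthday-problem approximation p ≈ n²/(2·2^128), i.e. n ≈ sqrt(2·2^128·p), which can be checked directly:

```python
import math

def entries_for_collision_prob(p: float, bits: int = 128) -> float:
    """Approximate number of random entries giving collision probability p."""
    return math.sqrt(2 * (2 ** bits) * p)

for p in (1e-9, 1e-12, 1e-13, 1e-14):
    n = entries_for_collision_prob(p)
    print(f"p = {p:g}: ~{n / 1e12:.0f} trillion entries")
```

Running this reproduces the ~825 / ~26 / ~8 / ~3 trillion figures quoted above.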

8744675

join:2000-10-10
Decatur, GA
reply to doppler
Where I work, we receive and process billing files daily from over 200 clients, and we often receive duplicate files by accident. Processing and mailing duplicate bills is our worst fear, so I wrote a program that calculates and compares the MD5 hash for each incoming billing file to a database of MD5 hashes of each client's previously processed files, to determine if the file is a duplicate. It catches several duplicates per week and saves our butts.

We compare only the hashes for files sent by a given client, and since 2001 we've never had a false duplicate in over 1.9 million files. Comparing hashes across all clients, I found 214 duplicate hashes in the database, but those are files that were reprocessed, where the duplicate hash entry was forced into the database.

Be aware that a single-byte difference can cause two essentially duplicate files not to be detected as duplicates. Many times an end-of-file marker or an insignificant byte is dropped during transmission, which gives the files different hashes even if they contain the same actual usable data. For our purposes, that can be a problem. I'd like to find an algorithm or method that would catch files with just a one- or two-byte difference as duplicates.
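The per-client duplicate check described above might look like the following minimal sketch; all names and the in-memory "database" are invented for illustration.

```python
import hashlib

processed = {}   # client_id -> set of MD5 hex digests already processed

def is_duplicate(client_id: str, file_bytes: bytes) -> bool:
    """Compare an incoming file's hash only against this client's history."""
    digest = hashlib.md5(file_bytes).hexdigest()
    seen = processed.setdefault(client_id, set())
    if digest in seen:
        return True          # same client already sent this exact file
    seen.add(digest)
    return False

print(is_duplicate("acme", b"JAN-2011 billing run"))    # False: first time
print(is_duplicate("acme", b"JAN-2011 billing run"))    # True: exact resend
print(is_duplicate("acme", b"JAN-2011 billing run\n"))  # False: one byte differs
```

The last line is exactly the single-byte problem mentioned above: an insignificant trailing byte defeats the exact-hash comparison.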


therube

join:2004-11-11
Randallstown, MD
Reviews:
·Comcast
·Verizon Online DSL

4 edits
There are duplicate file checkers that allow you to look for "similar" files, with a certain percentage difference.

As for the particular algorithm or its effectiveness, I suppose you'd have to try them and see if they work for you.

A fast and accurate method for comparing similarity between text documents

If it's only (potentially) 1 or 2 bytes at the end of a file, you could do something like this:

rem   print.bat 1999-06-02 SjB
rem   somewhat like the unix lpr spooler
 
rem   FF=0 - formfeed is off
rem   FF=1 - formfeed is on
set FF=0
 
rem   search for a 'ff' as the last few (3) characters of a file
rem   3-chars because the file could end in "$" or "0d0a$"
 
rem   hard coded the ^L, would have liked to use a $0d or \h0d
rem   or something like that in the grep
 
\bin\unix\tail -c3 %1 > \tmp\tail.out
rem   the quoted pattern below is a literal formfeed character (shown as ^L)
\bin\unix\grep "^L" \tmp\tail.out
 
rem   grep exit codes:  0=matches found, 1=NO matches found
rem   other grep's may or may not have exit codes or may or
rem   may not have differing exit codes
 
rem   if grep did NOT find a 'ff', set ff to ON
 
if errorlevel 1 set FF=1
 
rem   if %FF%==1 echo "sending a FF"
rem   set
 
COPY %1 PRN:
IF %FF%==1 ECHO  ^L  > PRN:
 
del \tmp\tail.out
set FF=
 
(Trying to remember... Something like: some programs output a formfeed at the end of their output. I just checked for the FF and, if it was there, did nothing [to avoid printing a blank page]; if not, I added one [so the page would advance, formfeed].)
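The same trailing-byte idea in a hedged modern sketch: strip an insignificant trailing form feed (and trailing CR/LF) before hashing, so two files differing only in those last bytes get the same digest. The normalization rule here is an assumption; adjust it to whatever bytes your transmissions actually add or drop.

```python
import hashlib

def normalized_md5(data: bytes) -> str:
    """Hash after stripping bytes that transmission commonly mangles."""
    data = data.rstrip(b"\r\n")        # trailing line endings
    if data.endswith(b"\x0c"):         # trailing form feed (the ^L above)
        data = data[:-1]
    return hashlib.md5(data).hexdigest()

a = b"INVOICE 1234 TOTAL 99.95\x0c"
b = b"INVOICE 1234 TOTAL 99.95\r\n"
print(normalized_md5(a) == normalized_md5(b))  # True: differ only in tail bytes
```

This only catches differences the normalization rule anticipates; a one-byte difference in the middle of a file still produces distinct hashes.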

Another thought: near-duplicates.

doppler

join:2003-03-31
Blue Point, NY
reply to doppler
Well, this has been revived and become more interesting. Here is a little background
on why I am using MD5 hashes, which happen to be one tool in the search for dupes.

I am a bit of an otaku for realistic computer-generated artwork. Deviantart.com is a
great source if you have an interest. The greatest problem with this artwork is that
artists will cross-post into groups. These groups are artist conclaves where
members can post new stuff. The groups are not only for artists; any member
of deviantart can join (watch). It's likely that one artist will cross-post into any
number of groups, and that leads to duplicates. Here is a sample of the processes I
use to find these dupes.

#1. The great MD5 hash. First pass, to find exact copies. Won't find slightly
different artwork.

#2. Media duplicate detection through algorithms. There are two programs of use.

First: »www.duplicate-finder.com/photo.html
Second: »www.xnview.org
Both are free to use. Both have internal viewers, and strengths and weaknesses too.
If you want to compare video media, »www.video-comparer.com is not free,
but it's the only working and workable solution I have found.

#3. Programs I have written to handle duplicate names, database the hashes,
and generate reports.

Even with this layered "all-mighty" search for dupes, dupes will still get by. I can
only speak for media; duplicate detection is not an easy subject.

Detecting duplicate text documents, as simple as text is, is a bigger nightmare. One
changed "period" will throw everything off. You're better off reinforcing the
accounting software: everything must have an invoice #, and don't pay off an
invoice twice.
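The media-detection step (#2 above) is typically built on some form of perceptual hashing. Below is a toy average-hash (aHash) sketch, one plausible such algorithm, not necessarily what the programs above use; the 8x8 grayscale grid is supplied directly so the example needs no image library, whereas real tools would decode and resize the image first.

```python
def average_hash(grid):
    """8x8 grid of grayscale values 0-255 -> 64-bit perceptual hash."""
    pixels = [v for row in grid for v in row]
    avg = sum(pixels) / len(pixels)
    bits = 0
    for v in pixels:
        # One bit per pixel: brighter than average or not.
        bits = (bits << 1) | (1 if v >= avg else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits: a small distance means near-duplicate."""
    return bin(a ^ b).count("1")

dark = [[10] * 4 + [200] * 4 for _ in range(8)]  # left half dark, right light
tweak = [row[:] for row in dark]
tweak[0][7] = 50                                 # one visibly altered pixel
print(hamming(average_hash(dark), average_hash(tweak)))  # 1: near-duplicate
```

Unlike MD5, where one changed pixel yields a completely different digest, here the distance grows gradually with the change, which is what lets slightly different cross-posts be flagged as dupes.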


8744675

join:2000-10-10
Decatur, GA
reply to therube
Thanks. I'll have to test checking for a FF as the last byte of the files and see how that works.