doppler join:2003-03-31 Blue Point, NY
1 recommendation |
A theoretical question about the MD5 sum, not about how insecure it is. I used to use CRC-32 plus the size of a file to determine whether files are the same. That proved to be lacking, so I rewrote some of my database to use the MD5 sum of each file.
The theoretical question is: Is it possible for two files to produce the same MD5 hash and yet not be the same at all?
2^128 is a very large number. If it's possible, how likely is it without really trying to get a match?
This question is not about how secure MD5 is; it has been proven insecure. Using a SHA variant of hash is just overkill for my usage.
Thanks |
|
1 recommendation |
to doppler
I think the OP was referring to Hash Collisions. See the link below » www.mscs.dal.ca/~selinge ··· llision/ |
|
doppler join:2003-03-31 Blue Point, NY |
This is why MD5 is insecure. Collisions! Without trying to create a collision, how likely could one occur naturally? That's my real question. |
|
Nanaki (banned)aka novaflare. pull punches? Na join:2002-01-24 Akron, OH |
2014-Jul-8 12:23 pm
I would say that with as many files as there are out there, it almost has to have happened naturally. Yes, 2^128 is massively huge; however, given the total number of files on the web alone, not even counting internal documents, I would think it has happened at least once.
How likely? Not very. MD5 is a case of "good enough" for the majority of users for day-to-day stuff. Mostly it has come to be used for checking that a file is not corrupted, and less for security, at least in day-to-day things. |
|
jvmorrisI Am The Man Who Was Not There. MVM join:2001-04-03 Reston, VA |
to doppler
See the table near the bottom of » en.wikipedia.org/wiki/Se ··· lgorithm . If I recall correctly, there were published articles circa 2000 (if not earlier) showing that it was possible to construct a file with any desired MD5 hash, including that of another file. Now, the real question is whether you can construct a file of the same size, holding usable information, that differs from the original. I never saw that question addressed, but then I went to SHA-1 around 2000. Computing the SHA-1 hash of a file is not difficult, and I was doing it on 8-bit microcomputers routinely. Indeed, I'd hash every file on the computer and then search for duplicates back then. (That got tedious, but it was certainly doable.) A flaw in SHA-1 was documented sometime prior to 2010 and its further use was deprecated by that date. |
|
PjrDon't Panic join:2005-12-11 UK |
to doppler
said by doppler: Using a SHA version of hash is just too overboard for my usage.
Why not use SHA if you're concerned about the very remote chance of a collision? |
|
jvmorrisI Am The Man Who Was Not There. MVM join:2001-04-03 Reston, VA |
to switchman
That's a very good article, well worth reading. Thank you. |
|
dave Premium Member join:2000-05-04 not in ohio
1 recommendation |
to doppler
said by doppler: This is why MD5 is insecure. Collisions!
All hash algorithms have collisions.
1. If the hash size is smaller than the plaintext size, there must be collisions, by the pigeonhole principle.
2. If the hash size is larger than the plaintext size, it's not much use as a hash function.
The security question is: how difficult is it to modify the plaintext in any desired way while preserving the same hash? |
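The pigeonhole point above is easy to demonstrate by shrinking the hash: truncate MD5 to 3 bytes and a birthday search finds two different inputs with the same truncated hash in a few thousand tries. A minimal sketch in Python; `tiny_hash` and `find_collision` are illustrative names, not any standard API:

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes, nbytes: int = 3) -> bytes:
    """MD5 truncated to nbytes (24 bits here), so collisions are cheap to find."""
    return hashlib.md5(data).digest()[:nbytes]

def find_collision(nbytes: int = 3):
    """Birthday search: hash b'0', b'1', ... until two distinct inputs collide.
    With a 24-bit hash this takes on the order of a few thousand tries."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, nbytes)
        if h in seen and seen[h] != msg:
            return seen[h], msg  # two different inputs, same truncated hash
        seen[h] = msg

a, b = find_collision()
```

The same birthday effect applies to the full 128-bit MD5; the search space is just astronomically larger.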
|
Nanaki (banned)aka novaflare. pull punches? Na join:2002-01-24 Akron, OH |
2014-Jul-8 1:10 pm
Well, he's not even worried about the security side of things, just about how common or uncommon it may be for there to be a collision in the real world without a person trying to cause it. I would say not very likely, but it has happened; it had to happen. Now the only real question is: did it ever affect anyone? That would be even more unlikely, and possibly has not happened yet. But at some point in time it has to happen.
In fact, the only way to prevent it is to stop using MD5 before it does happen. While the number is huge, it is finite, meaning that given enough time it will, by its nature, have to happen.
What I do know is that most sites I download from say to use the hash to check the download and make sure the file is OK. Not so much as a security measure, but to save you a return trip. How many times have you, me, or anyone downloaded a file at work and found out when we got home that it was corrupted? |
|
therube join:2004-11-11 Randallstown, MD 2 edits |
to doppler
> Is it possible for two files to hash-out using MD5 and not at all be the same?
Yes.
> How likely without really trying to get a match?
IMO, not likely, so MD5 (or anything greater) should be more than sufficient. If you examine the demo files (hello, erase, linked to by switchman), you will see there are only 6 bytes that are different. But... did they occur by "chance" or were they "forced"? |
|
doppler join:2003-03-31 Blue Point, NY |
to Pjr
said by Pjr: Why not use SHA if you're concerned about the very remote chance of a collision?
Ever look at the back of a lotto megabucks ticket? It always quotes a really large number of chances (against you) to win something. Then I always remember the guy who won a big pot, only to win again the following year. The news always reports this as a billions-to-1 chance of happening. Before you say "yeah, it's bound to happen": this happened not just once! Theoretical chance and odds are always in action. Do I cross the street now, or is there a chance I get hit by a car? Getting hit by lightning is not as rare as once thought. NASA once thought it was extremely remote to have a problem with the shuttle, not listening to their engineers, who finally put the odds at 1 in 200. We all know how that story turned out. There is a great TV movie on this subject. Please view it. It says a lot about how P.C. is very dangerous when playing the odds. |
|
dave Premium Member join:2000-05-04 not in ohio |
to doppler
said by doppler: How likely without really trying to get a match?
Assuming that MD5 has a good distribution of hashes, and assuming random input data, the likelihood that two specific files collide is 1 in 2^128. |
|
Nanaki (banned)aka novaflare. pull punches? Na join:2002-01-24 Akron, OH |
2014-Jul-8 1:49 pm
Here's a question: how many different files are out there on the net? I'd bet the number is greater than 2^128 by a fair margin. |
|
SnowyLock him up!!! Premium Member join:2003-04-05 Kailua, HI |
to doppler
said by doppler: Theoretical chance and odds are always in action. Do I ...
There are really two questions here that don't share a common answer, IMO.
1. Can a collision happen?
2. Can it happen to me? |
|
leibold MVM join:2002-07-09 Sunnyvale, CA Netgear CG3000DCR ZyXEL P-663HN-51
|
to doppler
said by doppler: 2^128 power is a very large number. If it's possible. How likely without really trying to get a match?
I know of one company making a storage product with a de-duplication feature who found out that "extremely unlikely" turned into "often" when the number of files stored by their customers became very large. They doubled the size of the hash to greatly reduce the frequency of those hash collisions (there is no real fix for this kind of problem). For those who perhaps don't know what de-duplication is and does: it basically discards any data that has a matching hash signature as a duplicate of data that is already stored. For performance reasons, typically only the signature is compared, not the actual data! |
|
jvmorrisI Am The Man Who Was Not There. MVM join:2001-04-03 Reston, VA |
Yes, that's one of the uses of NIS FileCheck, which is now over ten years old. Even then, it was all too easy for a single PC to have duplicate files on it without the user being aware of it.
The real problem would occur when a user updated one copy of a file, but then retrieved the other (un-updated, if that's a word) copy without realizing it. |
|
Nanaki (banned)aka novaflare. pull punches? Na join:2002-01-24 Akron, OH |
to leibold
said by leibold: typically only the signature is compared not the actual data!
Um, doh! Not good; I can see very important data getting tossed like this... OMG, what if Google tried this; it could decimate Google and most of the net, lol. But yeah, it has to happen. By the way, 2^128 is about 3.4 × 10^38, a 39-digit number. While that is a stupidly huge number, it is not so big as to make a natural collision impossible, or even all that unlikely. The real question is how often it will happen in practice; in other words, with a million users downloading from a million servers, how often will it happen? I think it has happened and has probably just skipped detection because it had no ill effect. Honestly, how could you even tell? Unless the program or file you got was not the right one, you would not notice it. If it did happen and it was the wrong file, the person may in the end have chalked it up to not knowing how to use the checksum or whatnot. In that case it was detected but ignored. |
|
SnowyLock him up!!! Premium Member join:2003-04-05 Kailua, HI |
to leibold
said by leibold: For those that perhaps don't know what de-duplication is and does, it basically discards any data that has a matching hash signature as a duplicate of data that is already stored. For performance reasons, typically only the signature is compared not the actual data!
I wasn't aware that was a commercial service, thanks. I suppose the dupes are secluded so a set of human eyes can make the final determination, rather than actually just being discarded? |
|
therube join:2004-11-11 Randallstown, MD
1 recommendation |
> suppose the dupes are secluded so a set of human eyes can make the final determination
I would doubt it. If hashes were not sufficient, they'd just compare the files byte-by-byte to confirm equality.
I would think it fairly common to do this.
"One" source file, "owned" by many. Megaupload (or whatever they're now called): someone uploads a video, they generate a hash; someone else uploads a video, they compare the hash, find a duplicate, delete the upload, and provide a link to the second uploader. All transparent to the user. A third user uploads the same video, but with a one-byte difference. That creates a new, unique hash, so the file is stored. Or, suspecting the files to be very similar, they could do a diff of the two and only store the diff of the second, recreating a "whole" file either from the original source, or from the source plus the diff, as needed for each particular user. |
|
Nanaki (banned)aka novaflare. pull punches? Na join:2002-01-24 Akron, OH |
2014-Jul-8 3:21 pm
You're talking about what is called byte-level patching, essentially. Most games use this method to decrease the size of updates. I have been in so many beta tests, though, that it has become a habit to do a fresh install after a few large updates, as with byte-level patching you can end up with errors making their way in and building up over time.
For videos, I would suspect it would take a lot of byte-level patches to cause any real problem.
I know that as an end-user downloader I would not want to be downloading a program that was byte-level patched with each new upload by the developer. But I think we're getting off track here. |
|
dave Premium Member join:2000-05-04 not in ohio |
to Snowy
2014-Jul-8 4:02 pm
said by Snowy: I suppose the dupes are secluded so a set of human eyes can make the final determination rather than actually just being discarded?
No. Typically there are no 'files' at that level. Much dedup occurs at the 'block' level: if you already have that block stored, there's no need to store it again. So one logical block on storage might be a member of several files, quite possibly files with different content except for the coincidence of a couple of blocks. Commercial storage vendors have been doing this for years. Windows Home Server has always dedup'd backups. |
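A toy version of that block-level dedup idea, assuming fixed 4 KiB blocks and SHA-256 block signatures (real products typically use variable-size chunking and far more careful storage); `BlockStore` is an illustrative name, not any vendor's API:

```python
import hashlib

BLOCK = 4096  # fixed 4 KiB blocks; real systems often chunk variably

class BlockStore:
    """Toy block-level dedup: each unique block is stored exactly once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes

    def write(self, data: bytes) -> list:
        """Split data into blocks, store the new ones, and return a
        'recipe' (list of digests) from which the file is reassembled."""
        recipe = []
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            digest = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(digest, chunk)  # dedup: first copy wins
            recipe.append(digest)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reassemble a file from its recipe of block digests."""
        return b"".join(self.blocks[d] for d in recipe)
```

Two files that share most of their blocks end up sharing the stored copies; only their recipes differ.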
|
SnowyLock him up!!! Premium Member join:2003-04-05 Kailua, HI |
2014-Jul-8 7:43 pm
said by dave: So one logical block on storage might be a member of several files, quite possibly files with different content except for the coincidence of a couple of blocks.
If I ever had a question of "Am I a geek?", I need not look any further than this thread to find the answer. I wasn't aware of the science and widespread implementation that goes into de-duping prior to this thread, but I'm totally fascinated by the data Google returns on the subject. |
|
to doppler
Doppler, yes, the probability of a collision depends mostly on the number of separate entries in your database required to be unique. To give some idea of the chances involved:
A one-in-1-billion chance of an MD5 collision requires about 825 trillion independent entries
One-in-1-trillion chance @ ~26 trillion entries
One-in-10-trillion chance @ ~8 trillion entries
One-in-100-trillion chance @ ~3 trillion entries
Obviously, more entries bring higher chances of a collision between any given two. Google 'birthday problem' for details on calculating the chances. |
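Those figures follow from the birthday-problem approximation: for small p, the collision probability with n entries and a 128-bit hash is p ≈ n²/(2·2^128), so n ≈ sqrt(2 · 2^128 · p). A quick sketch that reproduces them:

```python
import math

def entries_for_collision_prob(p: float, bits: int = 128) -> float:
    """Birthday-bound estimate: number of random entries n that gives
    collision probability p with a bits-wide hash.

    Uses the small-p approximation p ~= n^2 / (2 * 2^bits),
    hence n ~= sqrt(2 * 2^bits * p)."""
    return math.sqrt(2.0 * (2.0 ** bits) * p)

# Matches the figures quoted above for a 128-bit hash:
for p in (1e-9, 1e-12, 1e-13, 1e-14):
    print(f"p = {p:g}: n = {entries_for_collision_prob(p):.3g}")
# roughly 8.25e14, 2.61e13, 8.25e12, and 2.61e12 entries respectively
```

The last figure (~2.6 trillion) rounds to the "~3 trillion" quoted above; the approximation is only meant to give the order of magnitude.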
|
to doppler
Where I work, we receive and process billing files daily from over 200 clients, and we often receive duplicate files by accident. Processing and mailing duplicate bills is our worst fear, so I wrote a program that calculates and compares the MD5 hash for each incoming billing file to a database of MD5 hashes of each client's previously processed files, to determine if the file is a duplicate. It catches several duplicates per week and saves our butts.
We compare only the hashes for files sent by a given client, and since 2001 we've never had a false duplicate on over 1.9 million files. Comparing hashes across all clients, I found 214 duplicate hashes in the database, but those are files that were reprocessed, where the duplicate hash entry was forced into the database.
Be aware that a single-byte difference can cause two essentially duplicate files not to be detected as duplicates. Many times an end-of-file marker or an insignificant byte can be dropped during transmission, which will make the files have different hashes even though they contain the same usable data. For our purposes, that can be a problem. I'd like to find an algorithm or method that would catch files with just a 1- or 2-byte difference as duplicates. |
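One hedged approach to that trailing-byte problem: normalize each file before hashing by stripping trailing whitespace and control bytes (EOF markers like ^Z, form feeds, CR/LF), so two files that differ only in a dropped end-of-file marker hash the same. The exact strip set below is an assumption to tune to your data, and note it will also merge files that legitimately differ only in trailing bytes:

```python
import hashlib

# Trailing bytes to ignore: NUL, tab, LF, FF, CR, space, and ^Z (DOS EOF).
# This strip set is an assumption; adjust it to the junk you actually see.
TRAILING_JUNK = b"\x00\x09\x0a\x0c\x0d\x20\x1a"

def normalized_md5(path: str) -> str:
    """MD5 of the file with trailing whitespace/control bytes stripped,
    so files differing only in a dropped end-of-file marker hash alike.
    Reads the whole file into memory; fine for typical billing files."""
    with open(path, "rb") as f:
        data = f.read()
    return hashlib.md5(data.rstrip(TRAILING_JUNK)).hexdigest()
```

Storing this normalized hash alongside the exact hash lets you flag "exact duplicate" and "duplicate except for trailing junk" separately.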
|
therube join:2004-11-11 Randallstown, MD 4 edits |
2014-Jul-14 11:58 am
There are duplicate file checkers that allow you to look for "similar" files, with a certain percentage difference. As for the particular algorithm or its effectiveness, I suppose you'd have to try one and see if it works for you. See also: A fast and accurate method for comparing similarity between text documents. If it's only (potentially) 1 or 2 bytes at the end of a file, you could do something like this:
rem print.bat 1999-06-02 SjB
rem somewhat like the unix lpr spooler
rem FF=0 - formfeed is off
rem FF=1 - formfeed is on
set FF=0
rem search for a 'ff' as the last few (3) characters of a file
rem 3-chars because the file could end in "$" or "0d0a$"
rem hard coded the ^L, would have liked to use a $0d or \h0d
rem or something like that in the grep
\bin\unix\tail -c3 %1% > \tmp\tail.out
\bin\unix\grep "^L" \tmp\tail.out
rem grep exit codes: 0=matches found, 1=NO matches found
rem other grep's may or may not have exit codes or may or
rem may not have differing exit codes
rem if grep did NOT find a 'ff', set ff to ON
if errorlevel 1 set FF=1
rem if %FF%==1 echo "sending a FF"
rem set
COPY %1% PRN:
IF %FF%==1 ECHO ^L > PRN:
del \tmp\tail.out
set FF=
(Trying to remember... Something like: some programs output a FormFeed at the end of their output. I just checked for the FF, and if it was there, did nothing [to avoid a blank page printing out]; if not, I added one [so the page would advance].) Another thought: near-duplicates. |
|
doppler join:2003-03-31 Blue Point, NY |
2014-Jul-14 12:44 pm
Well, this has been revived and become more interesting. Here is a little background on why I am using MD5 hashes, which happen to be one tool in the search for dupes.
I am a bit of an otaku for realistic computer-generated artwork. Deviantart.com is a great source if you have an interest. The greatest problem with this artwork is that artists will cross-post into groups. These groups are artist conclaves where members can post new stuff. The groups are not only for artists; any member of deviantart can join (watch). It's likely that one artist will cross-post into any number of groups. That leads to duplicates. Here is a sample of the processes I use to find these dupes.
#1. The great MD5 hash. First pass, to find exact copies. Won't find the slightly different artwork.
#2. Media duplication detection through algorithms. There are two programs of use. First: » www.duplicate-finder.com ··· oto.html Second: » www.xnview.org Both are free to use. Both have internal viewers, strengths, and weaknesses too. If you want to compare video media, » www.video-comparer.com is not free, but it's the only working and workable solution I have found.
#3. Programs I have written to handle duplicate names, database the hashes, and generate reports.
Even with this layered "all mighty" search for dupes, dupes will still get by. I can only speak about media. Duplication detection is not an easy subject. Detection of text documents, as simple as text is, is a bigger nightmare; one changed "period" will throw off everything. Better off reinforcing the accounting software. AKA: everything must have an invoice #; don't pay off an invoice twice. |
|
antdudeMatrix Ant Premium Member join:2001-03-25 US |
to doppler
|
to therube
Thanks. I'll have to test checking for a FF as the last byte of the files to see how that works. |
|