 yaplej (Premium) join:2001-02-10 White City, OR | How to deal with a lot of data? I want to start a project to analyze VoIP traffic and search it for common patterns. The goal is to find out whether VoIP traffic contains recurring patterns that would be suitable for building a persistent dictionary/substitution compressor.
Is it even possible to create a dataset from all this data? There are 2^1280 possible unique patterns (20-160 bytes) in a VoIP packet, excluding headers. Can that many records be created in a table? What methods could be used to distribute them? Would using the first few bytes as a table identifier be suitable?
Mapping the individual instances of each data record and counting those occurrences might also be difficult. There would be many more occurrences than unique data records, which would make for a very large table, I think.
The occurrences would be uploaded in sets, each with a unique identifier, so that if a set turned out not to be actual VoIP traffic the entire set could be purged. This would help ensure valid results.
Ideally there would be a web interface where Wireshark captures of VoIP traffic could be dumped. It would create any new unique records and map all the occurrences of those records, along with which codec was captured, so that more specific analysis could be done on each codec.
After enough data has been collected, it would then be analyzed for common patterns. The results would be published and could be used to create better network optimization techniques for VoIP traffic.
Legal aspects aside, the Wireshark capture could be processed and then sorted randomly before going into the database, to ensure the call could not be "replayed" after being uploaded. -- sk_buff what?
Open Source Network Accelerators »www.trafficsqueezer.org »www.opennop.org
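One way the storage side of this could look, as a rough sketch and not a finished design: only payloads that are actually observed get a row, so the tables grow with the captures and not with the theoretical number of patterns. The table names, column names, and helper functions are made up for illustration, and SQLite is used only because it ships with Python.

```python
import hashlib
import sqlite3
from collections import Counter

# Hypothetical schema: one row per uploaded capture, one row per distinct
# payload per codec, and one row per (capture, payload) pair with a hit count.
SCHEMA = """
CREATE TABLE upload_set (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE payload (
    id INTEGER PRIMARY KEY,
    codec TEXT NOT NULL,
    digest BLOB NOT NULL,      -- SHA-256 of the payload, used for lookups
    data BLOB NOT NULL,        -- raw payload bytes
    UNIQUE (codec, digest)
);
CREATE TABLE occurrence (
    set_id INTEGER NOT NULL REFERENCES upload_set(id),
    payload_id INTEGER NOT NULL REFERENCES payload(id),
    hits INTEGER NOT NULL,
    PRIMARY KEY (set_id, payload_id)
);
"""

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def add_set(db, label, codec, payloads):
    """Store one uploaded capture as a set and return the set's id."""
    set_id = db.execute(
        "INSERT INTO upload_set (label) VALUES (?)", (label,)
    ).lastrowid
    for data, hits in Counter(payloads).items():
        digest = hashlib.sha256(data).digest()
        # Creates the payload row only if this codec/payload pair is new.
        db.execute(
            "INSERT OR IGNORE INTO payload (codec, digest, data) VALUES (?, ?, ?)",
            (codec, digest, data),
        )
        (payload_id,) = db.execute(
            "SELECT id FROM payload WHERE codec = ? AND digest = ?",
            (codec, digest),
        ).fetchone()
        db.execute(
            "INSERT INTO occurrence (set_id, payload_id, hits) VALUES (?, ?, ?)",
            (set_id, payload_id, hits),
        )
    db.commit()
    return set_id

def purge_set(db, set_id):
    """Remove a whole uploaded set, plus payloads nothing else refers to."""
    db.execute("DELETE FROM occurrence WHERE set_id = ?", (set_id,))
    db.execute("DELETE FROM upload_set WHERE id = ?", (set_id,))
    db.execute(
        "DELETE FROM payload WHERE id NOT IN (SELECT payload_id FROM occurrence)"
    )
    db.commit()

def top_payloads(db, codec, limit=10):
    """Most frequent payloads for one codec, summed over all remaining sets."""
    return db.execute(
        "SELECT p.data, SUM(o.hits) AS total FROM payload p "
        "JOIN occurrence o ON o.payload_id = p.id "
        "WHERE p.codec = ? GROUP BY p.id ORDER BY total DESC, p.id LIMIT ?",
        (codec, limit),
    ).fetchall()
```

Keeping the hit counts per set is what makes the purge cheap: deleting a bad upload removes its counts without touching the counts contributed by other uploads.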
|
|
 Steve (I know your IP address, Consultant) join:2001-03-10 Yorba Linda, CA kudos:5 | said by yaplej: There are 2^1280 possible unique patterns (20-160 bytes) in a VoIP packet, excluding headers. Can that many records be created in a table?
If you know how to type 2^1280, you know how to figure out how much data that actually is, right? |
|
|
|
 yaplej (Premium) join:2001-02-10 White City, OR | It's huge, massive, gargantuan. I'm just asking how it might be done. Distributed tables/databases? I mean, they simulate climate change and storm effects on the world topology with huge amounts of data. How could you effectively store this data?
It was just an idea; if it's not feasible, no big deal. |
|
 yaplej (Premium) join:2001-02-10 White City, OR | It would probably be better just to insert the entire 160-byte payload as a blob and count the occurrences somehow. If the table gets full, just start a new table. If there are multiple tables, it would just take the top occurrences from each and then compare those.
Space would only be used as particular records are inserted, but each record is going to use 160 bytes. I read that the limit for table sizes is 64 terabytes, though, so that's a lot of data.
Still, it would be a neat challenge to deal with a huge amount of data like that.
|
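A small sketch of the "count them somehow" part, assuming the payloads are spread over several tables by a hash of their content. Hashing, as opposed to using the first few raw bytes as the table identifier, avoids lopsided tables if a codec tends to start its frames the same way. The shard count and function names are made up for illustration, and in-memory `Counter` objects stand in for real tables.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # hypothetical number of tables

def shard_of(payload: bytes) -> int:
    """Pick a shard from a hash of the payload, not from its raw leading bytes."""
    return hashlib.sha256(payload).digest()[0] % NUM_SHARDS

def count_payloads(payloads):
    """Return one Counter per shard; only payloads actually seen take up space."""
    shards = [Counter() for _ in range(NUM_SHARDS)]
    for payload in payloads:
        shards[shard_of(payload)][payload] += 1
    return shards

def top_overall(shards, n):
    """Take the top n entries of each shard, then compare those candidates."""
    candidates = []
    for shard in shards:
        candidates.extend(shard.most_common(n))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n]
```

Taking the top entries from each table and comparing them gives the true overall top entries only when every distinct payload lives in exactly one table, which hash sharding guarantees. With a "start a new table when the old one is full" scheme, the same payload could end up counted in several tables, and its per-table counts would have to be summed first.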
|
 cdru (Go Colts, Premium,MVM) join:2003-05-14 Fort Wayne, IN kudos:7 | said by yaplej: Still, it would be a neat challenge to deal with a huge amount of data like that.
There are tradeoffs with any type of compression: quality, speed, size. Pick two. When you pick a compression level in a program like WinZip, WinRAR, 7-Zip, etc., you keep perfect quality but choose between slow compression for a smaller file and fast compression for a larger file. For a lossy video or image codec such as JPEG, you are balancing all three.
While doing an "analysis" the way you suggest is theoretically possible, in the end I doubt it would be very useful for VoIP data. VoIP needs to be as close to real time as possible, meaning you can't do a lot with an individual packet because you're limited in processing time. Plus, each packet needs to be independent of the others, since delivery is not guaranteed and packets can't be retransmitted. Certain data payloads may come up more often than others, but I'd be surprised if it was statistically significant enough to make exceptions for those packets in an effort to "compress" them. |
|
 Steve (I know your IP address, Consultant) join:2001-03-10 Yorba Linda, CA kudos:5 | reply to yaplej
said by yaplej: It was just an idea; if it's not feasible, no big deal.
If you're a CCNA, you should be able to do some back-of-the-envelope calculations and figure out for yourself that the number is so large that you cannot even have a discussion about databases or storage.
2^1280 is around 10^385, a number which dwarfs the number of atoms in the Earth (10^50). |
|
 cdru (Go Colts, Premium,MVM) join:2003-05-14 Fort Wayne, IN kudos:7 | said by Steve: 2^1280 is around 10^385, a number which dwarfs the number of atoms in the Earth (10^50).
Not to mention the number of atoms in the universe (10^81). In fact, if every atom in the universe contained a universe of atoms, each of which contained a universe of atoms, each of which contained a universe of atoms (4 nested universes), you'd still only have ~10^324 atoms. |
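The figures in the last few posts are easy to check with Python's arbitrary-precision integers. The atom counts below are just the rough estimates quoted in the thread, not measured values.

```python
import math

patterns = 2 ** 1280                  # distinct 160-byte (1280-bit) payloads
digits = len(str(patterns))           # number of decimal digits in 2^1280
exponent = 1280 * math.log10(2)       # 2^1280 equals 10^exponent

atoms_earth = 10 ** 50                # rough estimate quoted in the thread
atoms_universe = 10 ** 81             # rough estimate quoted in the thread
nested = atoms_universe ** 4          # four nested universes of atoms

print(digits)                         # 386
print(round(exponent, 1))             # 385.3
print(nested == 10 ** 324)            # True
print(patterns > nested)              # True
print(len(str(patterns // nested)))   # 62
```

So even the four nested universes come up short by a factor of roughly 10^61, and that is before storing 160 bytes per record.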
|
 yaplej (Premium) join:2001-02-10 White City, OR | I had no idea how many atoms are in the universe. Never counted them.
So let's take this in another direction. How about analyzing each Wireshark stream individually for any possible patterns and storing only those, if any? Then analyze the collection of matched patterns for the most common ones.
It's a lot less storage if you only find a few patterns per session. There might not be any patterns found in a VoIP session, in which case there is nothing to store. Packet captures of test calls generally are not that big, so it might be possible to upload a call, analyze it, save any patterns, and be done.
It seems like if you had 1000 calls to analyze, you could quickly get an idea of whether there are any common patterns for a particular codec.
|
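A rough sketch of that per-call approach: slide a fixed-size window over each payload in one call, keep only the byte patterns that repeat within that call, then count across calls how many calls each surviving pattern showed up in. The window length, the repeat threshold, and the function names are arbitrary choices for illustration, and the payloads are assumed to be already extracted from the capture as `bytes` objects.

```python
from collections import Counter

def call_patterns(payloads, width=8, min_repeats=3):
    """Return the width-byte patterns seen at least min_repeats times in one call."""
    seen = Counter()
    for payload in payloads:
        for i in range(len(payload) - width + 1):
            seen[payload[i:i + width]] += 1
    return {pattern for pattern, hits in seen.items() if hits >= min_repeats}

def common_patterns(calls, top=10, **kwargs):
    """Count how many calls each pattern was found in; return the most widespread."""
    spread = Counter()
    for payloads in calls:
        # Each call contributes at most 1 per pattern, however often it repeats.
        spread.update(call_patterns(payloads, **kwargs))
    return spread.most_common(top)
```

Only the per-call survivors are kept, so a call with no repeats stores nothing, which is the storage saving described above. Counting calls instead of raw hits keeps one unusually repetitive call from dominating the result.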
|
 jfmezei (Premium) join:2007-01-03 Pointe-Claire, QC kudos:22 | You use RAM to do the pattern matching and then a database to store the number of occurrences of the patterns that are of interest.
Remember that you should decompress the data in RAM before you do the pattern matching. Trying to compress already-compressed data often results in more data. |
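The second point is easy to see with `zlib`. In this sketch, seeded pseudo-random bytes stand in for the output of a compressing codec (an assumption: real codec frames are not perfectly random, but they are already close to incompressible), and a run of identical bytes stands in for redundant data.

```python
import random
import zlib

rng = random.Random(0)

# Stand-in for an already-compressed 160-byte frame (assumption, see above).
codec_like = bytes(rng.getrandbits(8) for _ in range(160))
# Stand-in for highly repetitive data of the same size.
redundant = b"\x00" * 160

packed_codec_like = zlib.compress(codec_like)
packed_redundant = zlib.compress(redundant)

print(len(codec_like), len(packed_codec_like))  # second number is larger than 160
print(len(redundant), len(packed_redundant))    # second number is far smaller than 160
```

The high-entropy input grows because the compressor finds nothing to substitute and still has to add its own header and checksum.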
|
 cdru (Go Colts, Premium,MVM) join:2003-05-14 Fort Wayne, IN kudos:7 | reply to yaplej
said by yaplej: So let's take this in another direction. How about analyzing each Wireshark stream individually for any possible patterns and storing only those, if any? Then analyze the collection of matched patterns for the most common ones.
You realize you are describing almost any general data compression algorithm out there. See, for reference: PKZIP, zlib, DEFLATE, etc.
said by yaplej: It's a lot less storage if you only find a few patterns per session.
So after a few packets you notice a pattern that could be substituted. First, it's a lousy codec if that pattern is obvious and not just a function of the particular voice characteristics of that call. But even setting that aside, by the time you realize "Hey, there's a pattern here", the packets should already have been sent. If they haven't, they are now useless as part of the conversation.
But even if you do realize that 001010...1010 is a repeating pattern, you now must tell the other end that this particular pattern repeats and that if they see a particular bit pattern "key" come across, it should be replaced with the expanded value. Of course, this takes some extra data across the line. But it might save us some in the long run.
But that particular bit pattern that we've noticed repeating...may never repeat again. We don't know whether it will or won't, as it's a "real time" protocol and we can't look ahead or go back into the past. We can only look at a very small slice of time. And the overhead of dictionary-based compression (such as system resources, sharing the dictionary across the link, and encoding the substituted values) becomes far more than what we would save.
said by yaplej: It seems like if you had 1000 calls to analyze, you could quickly get an idea of whether there are any common patterns for a particular codec.
I would suggest reading up on data compression algorithms first. I have a feeling that there is a lot more to compression than what you might know already. |
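For anyone who wants to try the idea out, zlib already supports a preset dictionary that both ends agree on ahead of time, which is close to the "persistent dictionary" from the first post and keeps every packet independently decodable. A minimal sketch, assuming a dictionary has been built somehow from earlier captures; the helper names and sample data are made up, and seeded random bytes stand in for codec frames.

```python
import random
import zlib

def pack(payload: bytes, dictionary: bytes) -> bytes:
    """Compress one packet on its own, using only the shared preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(payload) + compressor.flush()

def unpack(blob: bytes, dictionary: bytes) -> bytes:
    """Reverse pack(); needs the same dictionary but no earlier packets."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()

rng = random.Random(1)

def frame() -> bytes:
    """Hypothetical 160-byte payload; random bytes stand in for codec output."""
    return bytes(rng.getrandbits(8) for _ in range(160))

known = frame()                 # a payload the dictionary already contains
unseen = frame()                # a payload the dictionary has never seen
dictionary = frame() + known    # hypothetical dictionary from earlier captures

for name, payload in (("known", known), ("unseen", unseen)):
    blob = pack(payload, dictionary)
    assert unpack(blob, dictionary) == payload
    print(name, len(payload), "->", len(blob))
```

The payload that is already in the dictionary shrinks to a handful of bytes, while the unseen one comes out larger than it went in. Whether enough real payloads recur exactly for the first case to outweigh the second is the doubt raised in the post above.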
|