How do AV engines search files for known signatures so efficiently?

Data in the form of search strings continues to grow as new virus variants are released, which prompts my question: how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies it as a threat or not based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.

UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. It's still relevant for easily detecting malware like worms, which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
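To make the size argument concrete, here is a minimal Python sketch that computes a 4-byte CRC32 fingerprint of a file. Real AV engines use their own signature formats, so this is only an illustration of how small such a fingerprint is.

    import zlib

    def crc32_of_file(path, chunk_size=65536):
        """Compute a 4-byte CRC32 fingerprint of a file, reading it in chunks."""
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                crc = zlib.crc32(chunk, crc)
        return crc  # an unsigned 32-bit value in Python 3

    # Example: hex(crc32_of_file("downloaded_file.bin"))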
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(log n) time, which is pretty fast even for large values of n (n is the number of entries). Alternatively, if a lot of memory is available, a hashtable can be used, which gives O(1) insertion, deletion and search on average. This can be faster as n grows bigger, provided a good hashing technique is used.
So what an antivirus does, roughly, is calculate the hash of the file or just of its critical sections (where malicious injections are possible), and search its signature database for it. As explained above, the search is very fast, which enables scanning huge numbers of files in a short amount of time. If the hash is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
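Here is a minimal sketch of the hashtable variant of that lookup, using a Python set as the in-memory signature database and reusing crc32_of_file from the sketch above. The signature values are made up for illustration.

    # Made-up CRC32 signatures standing in for a real signature database.
    signature_db = {0xDEADBEEF, 0x0BADF00D}

    def is_known_malware(path):
        """Fingerprint the file and check membership in the signature database (O(1) on average)."""
        return crc32_of_file(path) in signature_db

    def add_signature(crc):
        """Adding a newly published signature is just as cheap."""
        signature_db.add(crc)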
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used

Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
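For reference, here is a compact, illustrative pure-Python version of Aho-Corasick (a trie of patterns plus failure links, scanned in a single pass over the text). It is a sketch for understanding the algorithm, not what any particular AV engine actually ships.

    from collections import deque

    class AhoCorasick:
        """Minimal Aho-Corasick automaton: find every dictionary pattern in one pass."""

        def __init__(self, patterns):
            self.goto = [{}]      # per-state transition maps
            self.fail = [0]       # failure links
            self.out = [[]]       # patterns ending at each state
            for pat in patterns:  # build the trie of patterns
                state = 0
                for ch in pat:
                    if ch not in self.goto[state]:
                        self.goto[state][ch] = self._new_state()
                    state = self.goto[state][ch]
                self.out[state].append(pat)
            self._build_failure_links()

        def _new_state(self):
            self.goto.append({})
            self.fail.append(0)
            self.out.append([])
            return len(self.goto) - 1

        def _build_failure_links(self):
            queue = deque(self.goto[0].values())  # depth-1 states keep the root as fail link
            while queue:
                state = queue.popleft()
                for ch, nxt in self.goto[state].items():
                    queue.append(nxt)
                    f = self.fail[state]
                    while f and ch not in self.goto[f]:
                        f = self.fail[f]
                    self.fail[nxt] = self.goto[f].get(ch, 0)
                    self.out[nxt] += self.out[self.fail[nxt]]

        def search(self, text):
            """Yield (start_offset, pattern) for every match in the text."""
            state = 0
            for i, ch in enumerate(text):
                while state and ch not in self.goto[state]:
                    state = self.fail[state]
                state = self.goto[state].get(ch, 0)
                for pat in self.out[state]:
                    yield i - len(pat) + 1, pat

    # Example:
    # list(AhoCorasick(["evil", "worm", "virus"]).search("an evil virus"))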

I don't know how a practical AV works, but I think the question is related to finding words in a long text given a dictionary.
For that problem, data structures like a trie make it very fast: processing a text of length N against a dictionary of K words takes only O(N) time once the structure is built.
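For comparison with the Aho-Corasick automaton above, here is a minimal trie sketch in Python; the automaton is essentially this trie with failure links added so the text only needs to be scanned once.

    class Trie:
        """A minimal trie: insert words, then test membership character by character."""

        def __init__(self):
            self.root = {}

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker

        def contains(self, word):
            node = self.root
            for ch in word:
                if ch not in node:
                    return False
                node = node[ch]
            return "$" in node

    # Example:
    # t = Trie(); t.insert("virus"); t.contains("virus")  -> True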

Related

Turning unique filepath into unique integer

I oftentimes use file paths to provide some sort of unique id for some software system. Is there any way to take a file path and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
fs.statSync(filename).ino
#djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path to a big integer space. E.g. using a 128 bit murmurhash (in Java I'd use the Guava Hashing class; there are several js ports), the chance of a collision among a billion paths is still 1/2^96. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
This is just my comment turned to an answer.
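A sketch of the hashing approach, substituting hashlib's BLAKE2 (which ships with Python) for the 128-bit murmur hash mentioned above; the collision argument is the same for any well-mixed 128-bit hash.

    import hashlib

    def path_to_id(path: str) -> int:
        """Map a file path to a 128-bit integer id (BLAKE2 standing in for murmur3 here)."""
        digest = hashlib.blake2b(path.encode("utf-8"), digest_size=16).digest()
        return int.from_bytes(digest, "big")

    # Example: path_to_id("/var/log/syslog") always yields the same 128-bit integer.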
If you keep it in memory, you can use one of the standard hashmaps of your language, and not just for file names but for any similar situation. Normally, hashmaps in different programming languages resolve collisions with buckets, so the hash value and the corresponding bucket number will provide a unique id.
By the way, it is not hard to write your own hashmap, so that you have control over the underlying structure (e.g. to retrieve the number, etc.).
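A minimal sketch of that idea: an in-memory registry backed by an ordinary dict that hands out small sequential integers. The ids are only unique within one process, which is the "in memory" caveat above.

    # Hypothetical in-memory registry: each new path gets the next integer id.
    path_ids = {}

    def id_for(path: str) -> int:
        """Return a small unique integer for the path, assigning one on first sight."""
        if path not in path_ids:
            path_ids[path] = len(path_ids)
        return path_ids[path]

    # id_for("/tmp/a.txt") -> 0, id_for("/tmp/b.txt") -> 1, id_for("/tmp/a.txt") -> 0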

Given a hashing algorithm, is there a more efficient way to 'unhash' besides bruteforce?

So I have the code for a hashing function, and from the looks of it, there's no way to simply unhash it (lots of bitwise ANDs, ORs, Shifts, etc). My question is, if I need to find out the original value before being hashed, is there a more efficient way than just brute forcing a set of possible values?
Thanks!
EDIT: I should add that in my case, the original message will never be longer than several characters, for my purposes.
EDIT2: Out of curiosity, are there any ways to do this on the run, without precomputed tables?
Yes; rainbow table attacks. This is especially true for hashes of shorter strings, i.e. hashes of small strings like 'true', 'false', 'etc' can be stored in a dictionary and used as a comparison table. This speeds up the cracking process considerably. Also, if the hash size is short (e.g. MD5), the algorithm becomes especially easy to crack. Of course, the way around this issue is combining 'cryptographic salts' with passwords before hashing them. (A small sketch of such a comparison table follows this answer.)
There are two very good sources of info on the matter: Coding Horror: Rainbow Hash Cracking and
Wikipedia: Rainbow table
Edit: Rainbow tables can take tens of gigabytes, so downloading (or reproducing) them may take weeks just to make simple tests. Instead, there seem to be some online tools for reversing simple hashes: http://www.onlinehashcrack.com/ (e.g. try to reverse 463C8A7593A8A79078CB5C119424E62A, which is the MD5 hash of the word 'crack')
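The sketch promised above: a tiny precomputed dictionary table in Python. A real rainbow table uses hash chains and reduction functions to trade time for space, which this does not show; it only illustrates the lookup idea for short, common inputs.

    import hashlib

    # Hypothetical precomputed table for a handful of short, common inputs.
    candidates = ["true", "false", "yes", "no", "admin", "password"]
    table = {hashlib.md5(w.encode()).hexdigest(): w for w in candidates}

    def reverse_md5(digest_hex: str):
        """Look the digest up in the precomputed table; None if the input wasn't tabled."""
        return table.get(digest_hex.lower())

    # reverse_md5(hashlib.md5(b"admin").hexdigest()) -> "admin"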
"Unhashing" is called a "preimage attack": given a hash output, find a corresponding input.
If the hash function is "secure" then there is no better attack than trying possible inputs until a hit is found; for a hash function with an n-bit output, the average number of hash function invocations will be about 2^n, i.e. Way Too Much for current earth-based technology if n is greater than 180 or so. To state it otherwise: if an attack method faster than this brute force method is found for a given hash function, then the hash function is deemed irreparably broken.
MD5 is considered broken, but for other weaknesses (there is a published method for preimages with cost 2^123.4, which is thus about 24 times faster than the brute force cost -- but it is still so far beyond what is technologically feasible that it cannot be confirmed).
When the hash function input is known to be part of a relatively small space (e.g. it is a "password", so it could fit in the brain of a human user), then one can optimize preimage attacks by using precomputed tables: the attacker still has to pay the search cost once, but he can reuse his tables to attack multiple instances. Rainbow tables are precomputed tables with a space-efficient compressed representation: with rainbow tables, the bottleneck for the attacker is CPU power, not the size of his hard disks.
Assuming the "normal case", the original message will be many times longer than the hash. Therefore, it is in principle absolutely impossible to derive the message from the hash, simply because you cannot calculate information that is not there.
However, you can guess what's probably the right message, and there exist techniques to accelerate this process for common messages (such as passwords), for example rainbow tables. It is very likely that something which looks sensible and matches the hash is the right message.
Finally, it may not be necessary at all to find the genuine message, as long as one can be found which will pass. This is the subject of a known attack on MD5. This attack lets you create a different message which gives the same hash.
Whether this is a security problem or not depends on what exactly you use the hash for.
This may sound trivial, but if you have the code to the hashing function, you could always override a hash table container class's hash() function (or similar, depending on your programming language and environment). That way, you can hash strings of say 3 characters or less, and then you can store the hash as a key by which you obtain the original string, which appears to be exactly what you want. Use this method to construct your own rainbow table, I suppose. If you have the code to the program environment in which you want to find these values out, you could always modify it to store hashes in the hash table.
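A sketch of that suggestion for very short inputs: exhaustively enumerate every lowercase string of up to three characters and index the results by digest. MD5 stands in here for the unknown hash function from the question.

    import hashlib
    from itertools import product
    from string import ascii_lowercase

    def build_reverse_table(max_len=3, alphabet=ascii_lowercase):
        """Hash every string up to max_len characters and index the results by digest."""
        table = {}
        for length in range(1, max_len + 1):
            for chars in product(alphabet, repeat=length):
                word = "".join(chars)
                table[hashlib.md5(word.encode()).hexdigest()] = word
        return table

    # 26 + 26**2 + 26**3 = 18,278 entries -- trivial to build for 3 lowercase characters.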

Parsing bulk text with Hadoop: best practices for generating keys

I have a 'large' set of line delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into groups such that all members in a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I felt that generating some hash value of the sentence may not work because of a limited number of keys (unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Goodbye,
Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
Despite the answer that I've already given you about what a proper hash function might be, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.
Though you might want to avoid simple hash functions (for example, any half-baked idea that you could think up quickly) because they might not mix up the sentence data enough to avoid collisions in the first place, one of the standard cryptographic hash functions would probably be quite suitable, for example MD5, SHA-1, or SHA-256.
You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security intensive purposes. This isn't a security critical application, and the collisions that have been found arose through carefully constructed data and probably won't arise randomly in your own NLP sentence data. (See, for example Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, so that you can appreciate the reasoning behind this.)
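A minimal sketch of the keying idea from these answers, e.g. for a Hadoop Streaming mapper written in Python; the surrounding job setup and the NLP processing are assumed, and technique_name/result are placeholders.

    import hashlib

    def sentence_key(sentence: str) -> str:
        """Derive a fixed-length key from the original sentence."""
        return hashlib.md5(sentence.strip().encode("utf-8")).hexdigest()

    # A streaming-style mapper might emit tab-separated key/value pairs:
    # print(f"{sentence_key(line)}\t{technique_name}\t{result}")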

Performance of Long IDs

I've been wondering about this for some time. In CouchDB we have some fairly long IDs, e.g.:
"000ab56cb24aef9b817ac98d55695c6a"
Now if we're searching for this item and going through the tree structure created by the view, it seems a simple integer as an id would be much faster. If we used 64-bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using a JIT, but you get my point).
For strings, I assume we generate a hash off the ID or something, but at some point we have to do a character compare on all 33 characters...won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.
It also affects storage, as longer keys take more space, and more space takes more time.
However, the nuance you are missing is that while Couch CAN (and does) allocate new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate its own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "json" nature of Couch, it's pretty much a "text" based database. There isn't a lot of binary data stored in a normal Couch instance (attachments notwithstanding, but even those I think are stored in Base64; I may be wrong).
So, while, yes, a 64-bit key would be the most efficient, the simple fact is that Couch is designed to work with any key, and "any key" is most readily expressed in text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you want to really speed up the Couch key system, code the key routine to pack the key into 64-bit longs and compare those (like you said). 8 bytes of text is the same as a 64-bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether Erlang can generate such code, I can't say.
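That packing idea, sketched in Python rather than Erlang since I can't speak to what the Couch internals would allow: big-endian packing preserves the lexicographic order of equal-length ASCII keys, so comparing the resulting tuples compares 8 bytes per step instead of one character.

    import struct

    def pack_key(key: str):
        """Pack a hex document id into 64-bit unsigned ints (zero-padded to a multiple of 8 bytes)."""
        raw = key.encode("ascii")
        raw += b"\x00" * (-len(raw) % 8)
        return struct.unpack(f">{len(raw) // 8}Q", raw)

    # A 32-character id becomes a tuple of four ints; tuple comparison does the rest:
    # pack_key("000ab56cb24aef9b817ac98d55695c6a") < pack_key("000ab56cb24aef9b817ac98d55695c6b")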
From the CouchDB: The definitive guide book:
I need to draw a picture of this at some point, but the reason is, if you think of the idealized btree, when you use UUIDs you might be hitting any number of root nodes in that tree, so with the append-only nature you have to write each of those nodes and everything above them in the tree. But if you use monotonically increasing ids then you're invalidating the same path down the right-hand side of the tree, thus minimizing the number of nodes that need to be rewritten. It would be just the same for monotonically decreasing ids as well. And it should technically work if your updates can be guaranteed to hit one or two nodes in the inside of the tree, though that's much harder to prove.
So sequential IDs offer a performance benefit, however, you must remember this isn't maintainable when you have more than one database, as the IDs will collide.
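One hypothetical way to get roughly sequential ids that still differ across databases (this is not how CouchDB generates its own ids): prefix a timestamp, then a local counter, then a per-database suffix.

    import itertools
    import time

    _counter = itertools.count()
    NODE_ID = "a1"  # hypothetical per-database suffix to avoid cross-database collisions

    def next_doc_id() -> str:
        """Roughly monotonically increasing id: millisecond timestamp + counter + node suffix."""
        return f"{int(time.time() * 1000):013x}{next(_counter):06x}{NODE_ID}"

    # Ids sort by creation time, so inserts keep hitting the same right-hand path of the btree.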

Is it possible to create a forged file which has the same checksums using two different algorithms?

I was a bit inspired by this blog entry http://blogs.technet.com/dmelanchthon/archive/2009/07/23/windows-7-rtm.aspx (German)
The current notion is that MD5 and SHA1 are both somewhat broken. Not easily and quickly, but at least for MD5 it is in the range of practical possibility. (I'm not at all a crypto expert, so maybe I'm wrong about stuff like that.)
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
First, would it be possible at all?
Second, would it be possible in reality, with current hardware/software?
If not, wouldn't be the easiest way to provide assurance of the integrity of a file to use always two different algorithms, even if they have some kind of weakness?
Updated:
Just to clarify: the idea is to have a file A and a file A' which fulfill the conditions:
size(A) == size(A') && md5sum(A) == md5sum(A') && sha1sum(A) == sha1sum(A')
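For concreteness, a small Python sketch of the combined fingerprint the question is about; the forgery would require two different files with equal tuples.

    import hashlib
    import os

    def fingerprint(path):
        """Return (size, md5, sha1) so all three conditions can be checked together."""
        md5, sha1 = hashlib.md5(), hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                md5.update(chunk)
                sha1.update(chunk)
        return os.path.getsize(path), md5.hexdigest(), sha1.hexdigest()

    # The forged file A' would need fingerprint("A") == fingerprint("A_prime")
    # while the contents differ.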
"Would it be possible at all?" - yes, if the total size of the checksums is smaller than the total size of the file, it is impossible to avoid collisions.
"would it be possible in reality, with current hardware/software?" - if it is feasible to construct a text to match a given checksum for each of the checksums in use, then yes.
See wikipedia on concatenation of cryptographic hash functions, which is also a useful term to google for.
From that page:
"However, for Merkle-Damgård hash
functions, the concatenated function
is only as strong as the best
component, not stronger. Joux noted
that 2-collisions lead to
n-collisions: if it is feasible to
find two messages with the same MD5
hash, it is effectively no more
difficult to find as many messages as
the attacker desires with identical
MD5 hashes. Among the n messages with
the same MD5 hash, there is likely to
be a collision in SHA-1. The
additional work needed to find the
SHA-1 collision (beyond the
exponential birthday search) is
polynomial. This argument is
summarized by Finney."
For a naive answer, we'd have to make some (incorrect) assumptions:
Both the SHA1 and MD5 hashing algorithms result in an even distribution of hash values for a set of random inputs
Algorithm details aside--a random input string has an equally likely chance of producing any hash value
(Basically, no clumping and nicely distributed domains.)
If the probability of discovering a string that collides with another's SHA1 hash is p1, and similarly p2 for MD5, the naive answer is the probability of finding one that collides with both is p1*p2.
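Putting rough numbers to that naive argument, using MD5's 128-bit and SHA-1's 160-bit outputs (and remembering, per the next paragraph, that the independence assumption is exactly what is wrong):

    # Naive back-of-the-envelope numbers under the (incorrect) ideal-hash assumption.
    p_md5 = 2.0 ** -128      # chance a random input hits a fixed MD5 value
    p_sha1 = 2.0 ** -160     # chance a random input hits a fixed SHA-1 value
    p_both = p_md5 * p_sha1  # ~2**-288 if the two events were independent
    print(p_both)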
However, the hashes are both broken, so we know our assumptions are wrong.
The hashes have clumping, are more sensitive to changes in some data than in others; in other words, they aren't perfect. On the other hand, a perfect, non-broken hashing algorithm would have the above properties, and that's exactly what makes it hard to find collisions. They're random.
The probability intrinsically depends on the properties of the algorithm--basically, since our assumptions aren't valid, we can't "easily" determine how hard it is. In fact, the difficulty of finding input that collides likely depends very strongly on the characteristics of the input string itself. Some may be relatively easy (but still probably impractical on today's hardware), and due to the different nature of the two algorithms, some may actually be impossible.
So I asked myself if it would be possible to create a file A' which has the same size, the same md5 sum, and the same sha1 sum as the original file A.
Yes, make a copy of the file.
Other than that, not without large amounts of computing resources to check tons of permutations (assuming the file size is non-trivial).
You can think of it like this:
If the file size increases by n bits, the likelihood of a possible fake increases, but the computing costs necessary to test the combinations increase exponentially, by 2^n.
So the bigger your file is, the more likely there is a dupe out there, but the less likely you are at finding it.
In theory, yes, you can have it; in practice, it's a hell of a collision to find. In practice no one has even been able to create a SHA1 collision, let alone MD5 + SHA1 + size at the same time. This combination is simply impossible right now without having all the computing power in the world and running it for a while.
Although in the near future we might see more vulnerabilities in SHA1 and MD5. And with the support of better hardware (especially GPUs), why not.
In theory you could do this. In practice, if you started from the two checksums provided by MD5 and SHA1 and tried to create a file that produced the same two checksums - it would be very difficult (many times more difficult than creating a file that produced the same MD5 checksum, or SHA1 checksum in isolation).
