Data structure for fast full text search - string

A trie seems like it would work for small strings, but not for large documents (1 to hundreds of pages of text), so I'm not sure. Maybe it is possible to combine an inverted index with a suffix tree to get the best of both worlds. Or perhaps a b-tree with words stored as nodes, and a trie for each node. I'm not sure. I'm wondering what a good data structure would be (b-tree, linked list, etc.).
I'm thinking of searching documents such as regular books, web pages, and source code, so the idea of storing just words in an inverted index doesn't seem quite right. It would be helpful to know whether each of these needs its own solution, whether there is one general solution that works for them all, or whether a combination is needed.

You do need an inverted index at the end of the day for interleaving matching results from each of your query terms, but an inverted index can be built from either a trie or a hash map. A trie allows fuzzy look-ups, while a hash-map-based inverted index only allows exact look-up of a token.
To optimize for memory usage, you can use memory-optimized versions of the trie, like a radix tree or an adaptive radix tree (ART). I've had great success using ART for an open-source fuzzy search engine project I've been working on: https://github.com/typesense/typesense
With Typesense, I was able to index about 1 million Hacker News titles in about 165 MB of RAM (uncompressed size on disk was 85 MB). You can probably squeeze that down even further if your use case is more specific and you don't need some of the metadata fields I added to the data structure.
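To make the hash-map flavour concrete, here is a minimal inverted-index sketch in Python (the tokenization, class name and example documents are purely illustrative, not how Typesense actually does it):

```python
from collections import defaultdict
import re

class InvertedIndex:
    def __init__(self):
        # token -> set of document ids containing that token
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in re.findall(r"\w+", text.lower()):
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND-query: intersect the posting sets of all query tokens
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        result = set(self.postings[tokens[0]])
        for token in tokens[1:]:
            result &= self.postings[token]
        return result

idx = InvertedIndex()
idx.add(1, "A trie seems like it would work for small strings")
idx.add(2, "combine an inverted index with a suffix tree")
print(idx.search("inverted index"))   # {2}
```

Swapping the hash map (dict) for a trie keyed by the same tokens is what opens the door to prefix and fuzzy look-ups.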

Related

How does Lucene organize and walk through the inverted index?

In SQL, an index is typically some kind of balanced tree (ordered nodes that point to the real table rows, making search possible in O(log n)). Walking through such a tree actually is the searching process.
Now Lucene uses an inverted index with term frequencies: It stores for each term how often it occurs in which documents. This is easy to understand. But this doesn't explain how a search on such an index is actually performed.
The search string is analyzed and split the same way into terms, of course, and then "the index is searched" for these terms to find the documents containing them – but how? Is the Lucene index itself also ordered and tree-like organized in some way for O(log n) to be possible? Or is walking the Lucene index on search-time actually linear, so O(n)?
There is no simple answer to this. First, because the internal format has been improved from release to release, and second, because with the advent of Lucene 4 configurable codecs were introduced, which serve as an abstraction between the logical format and the actual physical format.
An index is made up of shards and replicas, each of them being a Lucene index itself. A Lucene index is then made up of multiple segments, where each segment is again a Lucene index. A segment is read-only and made up of multiple artefacts which can be held in the file system or in RAM.
What's in a Lucene index from Adrien Grand is an excellent presentation on the organisation of a Lucene index. This blog article from Michael McCandless and this blog article from Elastic are about codecs introduced with Lucene 4.
So querying a Lucene index actually results in querying multiple segments in parallel, making use of a specific codec. A codec can represent a structure in the file system or in RAM, typically optimized/compressed for a particular need. The internal format can be anything from a tree, to a hash map, to a finite state machine, just to name a few. As soon as you use wildcard characters ("?" or "*") in your search query, this automatically results in a more or less deep traversal of your index.
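As a rough illustration of the idea (this is not Lucene's actual codec or on-disk format), a sorted term dictionary gives O(log n) term lookup via binary search, and the posting lists you get back are then intersected:

```python
import bisect

# toy term dictionary: sorted terms with parallel posting lists (sorted doc ids)
terms    = ["apple", "banana", "cherry"]
postings = [[1, 3, 7], [3, 7, 9], [2, 3]]

def lookup(term):
    # binary search over the sorted term dictionary: O(log n)
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def intersect(a, b):
    # merge-style intersection of two sorted posting lists: O(len(a) + len(b))
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(lookup("apple"), lookup("banana")))   # [3, 7]
```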

How do AV engines search files for known signatures so efficiently?

Data in the form of search strings continues to grow as new virus variants are released, which prompts my question: how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies whether the file is a threat or not, based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. It's still relevant for easily detecting malware like worms which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(log n) time, which is pretty fast even for large values of n (n is the number of entries). Alternatively, if a lot of memory is available, a hash table can be used, which gives O(1) insertion, deletion and search. This can be faster as n grows, provided a good hashing technique is used.
So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
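Here is a toy sketch of the hash-table variant in Python (the CRC32 hashing and the byte strings are illustrative only; real scanners use far more elaborate signature schemes):

```python
import zlib

# pretend signature database: CRC32 hashes of known-malicious byte sequences,
# kept in a set (Python's built-in hash table) for average O(1) lookup
signature_db = {
    zlib.crc32(b"malicious payload example"),
    zlib.crc32(b"another known bad sequence"),
}

def is_known_threat(file_bytes: bytes) -> bool:
    # hash the file (or a critical section of it) and check the database
    return zlib.crc32(file_bytes) in signature_db

print(is_known_threat(b"malicious payload example"))  # True
print(is_known_threat(b"harmless file contents"))     # False
```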
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
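For reference, here is a compact (and simplified) Aho-Corasick sketch in Python; the patterns and text are just examples:

```python
from collections import deque

def build_automaton(patterns):
    # trie stored as parallel lists: goto (char -> next state), fail links, outputs
    goto, fail, output = [{}], [0], [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # BFS to compute failure links and merge outputs along them
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output

def search(text, patterns):
    goto, fail, output = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i - len(pat) + 1, pat))   # (start offset, pattern)
    return hits

# finds she@0, he@1, sea@10, she@13, he@14, shell@13 in a single pass over the text
print(search("she sells seashells", ["he", "she", "sea", "shell"]))
```

The point is that the text is scanned once, regardless of how many patterns are in the dictionary.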
I don't know how a practical AV works, but I think the question is related to finding words in a long text given a dictionary.
For that problem, data structures like a trie make it very fast: processing a text of length N against a dictionary of K words takes only O(N) time (this is essentially what the Aho-Corasick automaton mentioned above achieves).

Which search algorithm to prefer?

A binary search algorithm has a big-O value of O(log n) and a sequential search has a big-O value of O(n). But we need a sorting algorithm before a binary search, and the best big-O value for a sorting algorithm is O(n log n). So, effectively, the big-O value for binary search is O(n log n), which is greater than that of sequential search. So, which one is preferred as the searching algorithm?
In practice it depends on your use case, and in particular on how often you search. If you have to search millions of times, you want binary search, even if you have to pay the upfront cost of sorting. With binary search you also have to ensure that your inserts keep the list sorted, so inserts become slower as well.
If you need to do a lot of inserts, and very few searches, sequential search might be faster.
Keep in mind that a lot of this won't even be noticeable until you are working with a lot of data.
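A small sketch of the trade-off in Python: sort once and answer many membership queries with binary search (bisect), versus scanning linearly each time. The data sizes are arbitrary:

```python
import bisect
import random

data = [random.randrange(10_000_000) for _ in range(100_000)]
queries = random.sample(data, 500)

# sequential search: O(n) per lookup, no preprocessing
hits_linear = sum(1 for q in queries if q in data)

# binary search: pay O(n log n) once to sort, then O(log n) per lookup
sorted_data = sorted(data)

def contains(x):
    i = bisect.bisect_left(sorted_data, x)
    return i < len(sorted_data) and sorted_data[i] == x

hits_binary = sum(1 for q in queries if contains(q))
assert hits_linear == hits_binary   # same answers, very different cost profiles
```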
Sequential search is rarely used in practice in optimised applications, because it is usually much better to pick an appropriate data structure than to use one where a frequently used search costs O(n).
For example, a red-black tree is a special kind of balanced binary tree which provides insert/delete/search all in O(log n). So it is fast to create, fill and search.

Fastest Search and Insertion

I've got a million objects. What is the fastest way to look up a particular object with its name as the key, and also the fastest way to perform insertion? Would hashing be sufficient?
Probably a hash table, assuming you don't need anything other than key-based access. Make sure that the hashing of the key is good enough (so as to minimise collisions) and that the table is large enough (for the same reason).
It will depend on how often you need to do a lookup and how often you need to insert elements.
If you often have to insert elements, then a linked list would perform better.
If you often have to search for elements, a hash table is more efficient. Perhaps you can have both: your main data in a linked list, and a hash table serving as an index into the list.
You can also use a binary search tree. A BST has the advantage of fast search and fast insertion too. Use the key to route your way through the tree, and have the tree node hold the value.
Use a BST in favour of a hash table if you are not sure about the balance of the operations (i.e. lookups of key-value pairs vs. insertions, etc.) and if you know (based on your analysis) that keys may collide frequently in the hash table (which would cause bad performance for the hash table).
Several structures exist here that you can use. Each has its advantages and disadvantages.
A hash table will have great lookup and insertion times, provided you have a table that minimizes collisions. If not, lookup/insertion can take a lot more time.
A binary search tree has O(log n) insertion and lookup, provided that it's balanced. Sometimes the balancing can cause insertion to take a bit longer than O(log n), depending on the BST you go with.
You can also go with a B+ tree; it guarantees a low search cost, since you reach the leaf nodes quickly (height = log n to base k, where k is the degree of the nodes). Databases have a similar requirement and use B+ trees to maintain and retrieve data.
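For the simplest option discussed here, a hash table keyed by name, a minimal Python sketch (dict is Python's built-in hash table; the object and field names are made up):

```python
class Item:
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload

objects = {}                       # name -> Item, i.e. a hash table keyed by name

def insert(item):
    objects[item.name] = item      # average O(1)

def lookup(name):
    return objects.get(name)       # average O(1)

for i in range(1_000_000):
    insert(Item(f"item-{i}", i))

print(lookup("item-123456").payload)   # 123456
```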

Quick Filter List

Everyone is familiar with this functionality. If you open up the Outlook address book and start typing a name, the list below the search box instantly filters to only contain items that match your query. .NET Reflector has a similar feature when you're browsing types: you start typing, and regardless of how large the underlying assembly you're browsing is, it's near instantaneous.
I've always kind of wondered what the secret sauce is here. How is it so fast? I imagine there are also different algorithms depending on whether the data is present in memory or needs to be fetched from some external source (e.g. a DB, searching some file, etc.).
I'm not sure if this would be relevant, but if there are resources out there, I'm particularly interested how one might do this with WinForms ... but if you know of general resources, I'm interested in those as well :-)
What is the most common use of the trie data structure?
A Trie is basically a tree-structure for storing a large list of similar strings, which provides fast lookup of strings (like a hashtable) and allows you to iterate over them in alphabetical order.
Image from: http://en.wikipedia.org/wiki/Trie:
In this case, the Trie stores the strings:
i
in
inn
to
tea
ten
For any prefix that you enter (for example, 't' or 'te'), you can easily look up all of the words that start with that prefix. More importantly, lookups depend on the length of the string, not on how many strings are stored in the trie. Read the Wikipedia article I referenced to learn more.
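Here is a minimal trie sketch in Python that stores exactly those words and answers prefix queries (the class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix):
        # walk down the prefix: cost depends on len(prefix),
        # not on how many words are stored in the trie
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # collect everything below that node, in alphabetical order
        results = []
        def collect(n, acc):
            if n.is_word:
                results.append(acc)
            for ch in sorted(n.children):
                collect(n.children[ch], acc + ch)
        collect(node, prefix)
        return results

t = Trie()
for w in ["i", "in", "inn", "to", "tea", "ten"]:
    t.insert(w)
print(t.words_with_prefix("te"))   # ['tea', 'ten']
print(t.words_with_prefix("i"))    # ['i', 'in', 'inn']
```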
The process is called full text indexing/search.
If you want to play with the algorithms and data structures for this, I would recommend reading Programming Collective Intelligence for a good introduction to the field; if you just want the functionality, I would recommend Lucene.
