Binary search has a time complexity of O(log n) and sequential search has O(n). But binary search requires a sorted list, and the best comparison-based sorting algorithms are O(n log n). So, effectively, the cost of binary search including the sort is O(n log n), which is greater than that of sequential search. So which one is preferred as a search algorithm?
In practice it depends on how often you search. If you have to search millions of times, you want binary search, even if you have to pay the upfront cost of sorting. Keep in mind that with binary search your inserts also have to keep the list sorted, so they become slower as well.
If you need to do a lot of inserts, and very few searches, sequential search might be faster.
Keep in mind that a lot of this won't even be noticeable until you are working with a lot of data.
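As a rough sketch of that trade-off (the data and helper names here are invented for illustration), in Python it might look like this:

```python
import bisect

def sequential_search(items, target):
    """O(n): scan every element until the target is found."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log n): repeatedly halve the search range; requires sorted input."""
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = [42, 7, 19, 3, 88, 51]

# One-off lookup: a linear scan avoids paying for a sort.
print(sequential_search(data, 88))             # 4

# Many lookups: sort once (O(n log n)), then every search is O(log n).
sorted_data = sorted(data)
for target in (3, 51, 100):
    print(binary_search(sorted_data, target))  # 0, 4, -1
```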
Sequential search is rarely used in practice in optimised applications, because it is usually much better to choose an appropriate data structure than to rely on one whose frequently used search operation is O(n).
For example, a red-black tree is a special kind of self-balancing binary search tree that provides insert, delete and search all in O(log n), so it is fast to create, fill and search.
A trie seems like it would work for small strings, but not for large documents (1 to hundreds of pages of text), so I'm not sure. Maybe it is possible to combine an inverted index with a suffix tree to get the best of both worlds. Or perhaps use a B-tree with words stored as nodes, and a trie for each node. I'm not sure. I'm wondering what a good data structure would be (B-tree, linked list, etc.).
I'm thinking of searching documents such as regular books, web pages, and source code, so the idea of storing just words in an inverted index doesn't seem quite right. It would be helpful to know whether each of these needs its own solution, whether there is a general one that works for them all, or whether a combination of them is needed.
You do need an inverted index at the end of the day for interleaving the matching results from each of your query terms, but an inverted index can be built from either a trie or a hash map. A trie allows fuzzy look-ups, while a hash-map-based inverted index only allows an exact look-up of a token.
To optimize for memory usage, you can use memory-optimized versions of a trie like the radix tree or the adaptive radix tree (ART). I've had great success using ART for an open source fuzzy search engine project I've been working on: https://github.com/typesense/typesense
With Typesense, I was able to index about 1 million Hacker News titles in about 165 MB of RAM (the uncompressed size on disk was 85 MB). You can probably squeeze that down even further if your use case is more specific and you don't need some of the metadata fields I added to the data structure.
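As a rough sketch of the hash-map variant (the toy documents and function names are made up; a real engine would add tokenization, stemming and ranking on top):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (AND semantics)."""
    result = None
    for token in query.lower().split():
        postings = index.get(token, set())
        result = postings if result is None else result & postings
    return result or set()

docs = {
    1: "binary search needs sorted data",
    2: "a trie supports prefix search",
    3: "hash maps give exact lookup",
}
index = build_inverted_index(docs)
print(search(index, "search sorted"))   # {1}
print(search(index, "prefix search"))   # {2}
```

A trie-backed index would replace the dict with a trie keyed by token, which is what makes prefix and fuzzy matching possible.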
Data in the form of search strings continues to grow as new virus variants are released, which prompts my question: how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies it as a threat or not based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. It's still relevant for easily detecting malware like worms which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search can all be done in O(log n) time, which is pretty fast even for large values of n (n is the number of entries). Alternatively, if a lot of memory is available, a hash table can be used, which gives O(1) average-case insertion, deletion and search. This can be faster as n grows, provided a good hashing technique is used.
So what an antivirus roughly does is compute the hash of the file, or just of its critical sections (where malicious injections are possible), and search its signature database for it. As explained above, the search is very fast, which enables scanning huge numbers of files in a short amount of time. If the hash is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
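As a minimal sketch of that flow (the CRC32 choice follows the example above; the signature values and file path are made up):

```python
import zlib

# Hypothetical signature database; a set gives O(1) average-case lookup.
KNOWN_SIGNATURES = {0xDEADBEEF, 0x12345678}

def file_signature(path):
    """Compute a CRC32 fingerprint of a file's contents."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def is_malicious(path):
    """Flag the file if its fingerprint appears in the signature database."""
    return file_signature(path) in KNOWN_SIGNATURES

# Example usage (the path is hypothetical):
# print(is_malicious("downloaded_file.bin"))
```

A real engine would also hash critical sections rather than only the whole file, as described above.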
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
I don't know how a practical AV works, but I think the question is related to finding words in a long text given a dictionary.
For that problem, data structures like a trie make it very fast: processing a text of length N against a dictionary of K words takes only O(N) time.
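For completeness, here is a compact sketch of the Aho-Corasick idea mentioned above (a trie of the patterns plus failure links, so every pattern is matched in a single pass over the text). This is a simplification for illustration, not production AV code:

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie of the patterns plus failure links."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [[]]       # patterns that end at each state

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    """Yield (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat

print(list(find_all("she sells seashells", ["he", "she", "sell", "sea"])))
```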
I've got a million objects. What is the fastest way to look up a particular object with its name as the key, and also the fastest way to perform an insertion? Would hashing be sufficient?
Probably a hash table, assuming you don't need anything other than key-based access. Make sure that the hashing of the key is good enough (so as to minimise collisions) and the table is large enough (for the same reason).
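In Python that boils down to a plain dict, whose lookups and insertions are O(1) on average (the object fields here are invented for the example):

```python
# Index a million objects by name; dict lookups and inserts are O(1) on average.
objects = [{"name": f"object-{i}", "value": i} for i in range(1_000_000)]
by_name = {obj["name"]: obj for obj in objects}

print(by_name["object-123456"])                                # O(1) average lookup
by_name["object-new"] = {"name": "object-new", "value": -1}    # O(1) average insert
```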
It will depend on how often you need to do a lookup and how often you need to insert elements.
If you often have to insert elements, then a linked list would perform better.
If you often have to search for elements, a hash table is more efficient. Perhaps you can have both: keep your main data in a linked list, and use a hash table as an index into the list.
You can also use a binary search tree. A BST has the advantage of fast search and fast insertion. Use the key to route your way down the tree and store the value in the node.
Use a BST in favor of a hash table if you are not sure about the mix of operations (i.e. lookups of key/value pairs vs. insertions, etc.) and if your analysis shows that keys may collide frequently in the hash table, which would cause bad performance for the hash table.
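A minimal (unbalanced) BST sketch keyed by name, just to show routing by key; a production version would use a self-balancing variant such as a red-black tree:

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(root, key, value):
    """Insert key/value, routing left or right by key comparison."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    else:
        root.value = value          # overwrite an existing key
    return root

def search(root, key):
    """Return the value stored under key, or None if absent."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

root = None
for name, value in [("carol", 3), ("alice", 1), ("bob", 2)]:
    root = insert(root, name, value)
print(search(root, "bob"))    # 2
print(search(root, "dave"))   # None
```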
Several structures exist that you can use here. Each has its advantages and disadvantages.
A hash table will have great lookup and insertion times, provided you have a table that minimizes collisions. If not, lookup and insertion can take much longer.
A binary search tree has O(log n) insertion and lookup, provided that it is balanced. Sometimes the balancing can make insertion take a bit longer than O(log n), depending on the BST variant you go with.
You could also go with a B+ tree; it guarantees a lower search cost, since you reach the leaf nodes quickly (the height is log n to base k, where k is the degree of the nodes). Databases have a similar requirement, and they use B+ trees to maintain and retrieve data.
I have some candidate aspects:
1. The hash function is important; the hash code should be as close to unique as possible.
2. The backend data structure is important; the search, insert and delete operations should all have O(1) time complexity.
3. Memory management is important; the memory overhead of each hash_table entry should be as small as possible. When the hash_table expands, memory should be allocated efficiently, and when it shrinks, memory should be reclaimed efficiently. With these memory operations, aspect 2 should still be fulfilled.
4. If the hash_table will be used from multiple threads, it should be thread-safe and still be efficient.
My questions are:
Are there any more aspects worth paying attention to?
How should the hash_table be designed to fulfill these aspects?
Are there any resources I can refer to?
Many thanks!
After reading some material, I have updated my questions. :)
In a book explaining the source code of the SGI STL, I found some useful information:
The backend data structure is a bucket array of linked lists. When searching for, inserting or deleting an element in the hash_table:
Use a hash function to calculate the corresponding position in the bucket array; the elements are stored in the linked list hanging off that position.
When the number of elements becomes larger than the number of buckets, the buckets need resizing: expand to roughly twice the old size (the bucket count should stay prime), then move the old elements into the new buckets.
I didn't find any logic for shrinking (garbage collection) when the number of elements is much smaller than the number of buckets, but I think this should be considered for workloads with many inserts at first and many deletes later.
Other backends, such as arrays with linear probing or quadratic probing, are not as good as linked lists.
A good hash function can avoid clustering, and double hashing can help resolve clusters.
The question about multi-threading is still open. :D
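To make the bucket-of-linked-lists design above concrete, here is a small single-threaded sketch (Python lists stand in for the per-bucket chains, and the resize policy mirrors the doubling described above, without the prime-size refinement):

```python
class ChainedHashTable:
    """Separate chaining: an array of buckets, each a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # update an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > len(self.buckets):      # load factor above 1: grow
            self._resize(2 * len(self.buckets))

    def search(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return

    def _resize(self, new_num_buckets):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(new_num_buckets)]
        for k, v in old_items:
            self.buckets[hash(k) % new_num_buckets].append((k, v))

table = ChainedHashTable()
for i in range(20):
    table.insert(f"key{i}", i)
print(table.search("key7"))    # 7
table.delete("key7")
print(table.search("key7"))    # None
```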
There are two (slightly) orthogonal concerns.
While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:
the hash function depends on the data to be stored
the backend depends on the requirements of the storage
For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
For the back-end, there are various concerns, as you noted. Some remarks:
Are we talking average or worst case complexity? Without perfect hashing, achieving O(1) in the worst case is nigh impossible as far as I know, though the frequency and cost of the worst case can be considerably dampened.
Are we talking amortized complexity? Amortized complexity in general offers better throughput at the cost of "spikes". Linear rehashing, at the cost of slightly lower throughput, gives you a smoother curve.
With regard to multi-threading, note that the read/write pattern may impact the solution; considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general, writes are harder to parallelize because they may require modifying the structure; at worst, they might require serialization.
Garbage collection is pretty trivial in the amortized case; with linear rehashing it is a bit more complicated, but it is probably the least challenging portion.
You never said how much data you are about to use. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread it around to avoid contention.
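As an illustration of spreading writers across buckets to reduce contention (a deliberately simplified lock-striping sketch; designs like the one in the Google Talk referenced below go much further with lock-free state machines):

```python
import threading

class StripedMap:
    """A map split into shards, each guarded by its own lock, so writers that
    touch different shards do not contend with each other."""

    def __init__(self, num_shards=16):
        self.shards = [{} for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, key):
        return hash(key) % len(self.shards)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:
            self.shards[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self.locks[i]:
            return self.shards[i].get(key, default)

m = StripedMap()
threads = [threading.Thread(target=m.put, args=(f"k{i}", i)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(m.get("k42"))   # 42
```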
References:
The article on Wikipedia describes lots of different implementations; it is always good to peek at the variety.
This Google Talk from Dr. Cliff Click (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.
I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:
To go from hash to index, use a binary AND instead of modulo by a prime. This is many times faster, but your table size must be a power of two.
For hash collisions, don't use a linked list; store the values in the table itself (open addressing) to improve cache performance.
By using a state machine you can get a very fast multi-threaded implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.
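The hash-to-index trick from the first point looks like this (a tiny sketch; the table size is an arbitrary power of two chosen for the example):

```python
table_size = 1 << 16        # must be a power of two
mask = table_size - 1

h = hash("some key")
index = h & mask            # bitwise AND gives the same result as h % table_size
                            # here, but avoids the slower division/modulo
assert index == h % table_size
```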
I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search type of results that I understand Lucene is generally designed for. Numeric precision is important as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching, and the limits of the system's scalability are being reached. I'm researching ways to scale the system horizontally, and using search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while returning results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching with the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. Eventually I would like to aim for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric data as a trie; a SQL implementation will probably store it as a B-tree or an R-tree. The way Lucene stores its trie and the way SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes with Solr).
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. It can scale to handle extremely large data sets and, when indexed correctly, return results at blistering speed. For text-based searching it is possible and very probable that results will come back quicker from Lucene than from SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as they have completely different usages.