When designing a hash_table, how many aspects should be paid attention to? - multithreading

I have some candidate aspects:
The hash function is important, the hashcode should be unique as far as possible.
The backend data structure is important, the search, insert and delete operations should all have time complexity O(1).
The memory management is important, the memory overhead of every hash_table entry should be as least as possible. When the hash_table is expanding, the memory should increase efficiently, and when the hash_table is shrinking, the memory should do garbage collection efficiently. And with these memory operations, the aspect 2 should also be full filled.
If the hash_table will be used in multi_threads, it should be thread safe and also be efficient.
My questions are:
Are there any more aspects worth attention?
How to design the hash_table to full fill these aspects?
Are there any resources I can refer to?
Many thanks!
After reading some material, update my questions. :)
In a book explaining the source code of SGI STL, I found some useful informations:
The backend data structure is a bucket of linked list. When search, insert or delete an element in the hash_table:
Use a hash function to calculate the corresponding position in the bucket, and the elements are stored in the linked list after this position.
When the size of elements is larger than the size of buckets, the buckets need resize: expand the size to be 2 times larger than the old size. The size of buckets should be prime. Then copy the old buckets and elements to the new one.
I didn't find the logic of garbage collection when the number of elements is much smaller than the number of buckets. But I think this logic should be considerated when many inserts at first then many deletes later.
Other data structures such as arrays with linear detection or square detection is not as good as linked list.
A good hash function can avoid clusters, and double hash can help to resolve clusters.
The question about multi_threads is still open. :D

There are two (slightly) orthogonal concern.
While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:
the hash function depends on the data to be stored
the backend depends on the requirements of the storage
For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
For the back-end, there are various concerns, as you noted. Some remarks:
Are we talking average or worst case complexity ? Without perfect hashing, achieving O(1) is nigh-impossible as far as I know, though the worst case frequency and complexity can be considerably dampened.
Are we talking amortized complexity ? Amortized complexity in general offer better throughput at the cost of "spikes". Linear rehashing, at the cost of a slightly lower throughput, will give you a smoother curve.
With regard to multi-threading, note that the Read/Write pattern may impact the solution, considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general writes are harder to parallelize, because they may require modifying the structure. At worst, they might require serialization.
Garbage Collection is pretty trivial in the amortized case, with linear-rehashing it's a bit more complicated, but probably the least challenging portion.
You never talked about the amount of data you're about to use. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread them around to avoid contention.
References:
The article on Wikipedia exposes lots of various implementations, always good to peek at the variety
This GoogleTalk from Dr Cliff (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.

I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:
To go from hash to index, use binary AND instead of modulo a prime. This is many times faster. Your table size must be a power of two.
For hash collisions don't use a linked list, store the values in the table to improve cache performance.
By using a state machine you can get a very fast multi-thread implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.

Related

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per seconds (but the rate will increase). Each event is small, about 1 Kb.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?
I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but than filtering for only a subset is an issue to solve.
Rowlength is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. Especially because you don't want to get near those two billions, keep it lower in hundreds of millions maximum. If you put some parameter on which you will split the rows (some increasing index or rounding by day/month/year) you will have to implement that in your query logic as well.
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in SQL with filtering, or using the WHERE clause..
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Timeseries are quite a common pattern and you can define a clustering order, for example, on the timestamp of the events in order to retrieve all the events in time order. I've found this article on Datastax Academy very useful when wanted to learn about time series.
Variable data structure it's not a problem: you can store the data in a BLOB, then parse it internally from your application (i.e. store it as JSON and read it in your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- reads by hundreds of our suppliers every minute (the ajax refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need so no mySQL based solutions. If you recommend redis or mongo DB solution, please realize that the queries are not pointed-queries by key but a top-N query instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
Hope this helps!

How to score all user-product combinations in Spark MatrixFactorizationModel?

Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the current API, once could pass a cartesian product of user-product to the predict function, but it seems to me that this will do a lot of extra processing.
Would accessing the private userFeatures, productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation in an efficient way? Specifically, is there an easy way to do better than multiplying all pairs of userFeature, productFeature "by hand"?
Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing but not really optimized for recommending to all users.
I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big slow operation. Consider predicting for users that have been recently active only.
Otherwise, yes your best bet is to create your own method. The cartesian join of the feature RDDs is probably too slow as it's shuffling so many copies of the feature vectors. Choose the larger of the user / product feature set, and map that. In each worker, hold the other product / user feature set in memory in each worker. If this isn't feasible you can make this more complex and map several times against subsets of the smaller RDD in memory.
As of Spark 2.2, recommendProductsForUsers(num) would be the method.
Recommends the top "num" number of products for all users. The number of recommendations returned per user may be less than "num".
https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html

How do AV engines search files for known signatures so efficiently?

Data in the form of search strings continue to grow as new virus variants are released, which prompts my question - how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies the file as being a threat or not, based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. Its still relevant for easily detecting malware like worms which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based
detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(logn) time, which is pretty fast even for large values of n (n is the number of entries). Or else if a lot of memory is available, a hashtable can be used, which gives O(1) insertion, deletion and search. This is can be faster as n grows bigger and a good hashing technique is used.
So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.
Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.
The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.
I don't know how a practical AV works. but I think the question have some relative with finding words in a long text with a given dictionary.
For the above question, data structures like TRIE will make it very fast. processing a Length=N text dictionary of K words takes only O(N) time.

Performance of Long IDs

I've been wondering about this for some time. In CouchDB we have some fairly log IDs...eg:
"000ab56cb24aef9b817ac98d55695c6a"
Now if we're searching for this item and going through the tree structure created by the view. It seems a simple integer as an id would be much faster. If we used 64bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using JIT, but you get my point).
For strings, I assume we generate a hash off the ID or something, but at some point we have to do a character compare on all 33 characters...won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.
It also affects storage, as longer keys take more space, space takes time.
However, the nuance you are missing is that while Couch CAN (and does) allocated new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate it's own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "json" nature of couch, it's pretty much a "text" based database. There's isn't a lot of binary data stored in a normal Couch instance (attachments not withstanding, but even those I think are stored in BASE64, I may be wrong).
So, while, yes an 64-bit would be the most efficient, the simple fact is that Couch is designed to work for any key, and "any key" is most readily expressed in text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you want to really speed up the Couch key system, code the key routine to block the key in to 64Bit longs, and comapre those (like you said). 8 bytes of text is the same as a 64 bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether erlang can create such code, I can't say.
From the CouchDB: The definitive guide book:
I need to draw a picture of this at
some point, but the reason is if you
think of the idealized btree, when you
use UUID’s you might be hitting any
number of root nodes in that tree, so
with the append only nature you have
to write each of those nodes and
everything above it in the tree. but
if you use monotonically increasing
id’s then you’re invalidating the same
path down the right hand side of the
tree thus minimizing the number of
nodes that need to be rewritten. would
be just the same for monotonically
decreasing as well. and it should
technically work if you’re updates can
be guaranteed to hit one or two nodes
in the inside of the tree, though
that’s much harder to prove.
So sequential IDs offer a performance benefit, however, you must remember this isn't maintainable when you have more than one database, as the IDs will collide.

Resources