Data Structures useful in multi-threaded programming [closed] - multithreading

Besides the well-known data structures that can be used in multi-threaded programs, such as the concurrent stack, concurrent queue, concurrent list, and concurrent hash table:
Are there any other lesser-known but useful data structures that can be used in parallel/multi-threaded programming?
Even if they are just variants of the data structures listed above with some optimization, kindly do share.
Please do include some references.
Edit: Will keep listing what I find
1) ConcurrentCuckooHashing (Optimized version of ConcurrentHashing)
2) ConcurrentSkipList

I will try to answer with examples from JDK, if you do not mind:
Lists:
CopyOnWriteArrayList is a list that achieves thread safety by recreating the backing array each time the list is modified;
Lists returned by Collections.synchronizedList() are thread-safe because they use exclusive locking for most operations (iteration is an exception);
Queues and stacks:
ArrayBlockingQueue is a queue with a fixed capacity that blocks when there is nothing to take or no space to put;
ConcurrentLinkedQueue is a lock-free queue based on the Michael-Scott algorithm;
A concurrent stack can be based on the Treiber algorithm. Surprisingly, I didn't find one in the JDK;
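Since the JDK does not ship one, here is a minimal Treiber-stack sketch (my own illustration built on AtomicReference, not library code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal Treiber stack: a singly linked list whose head is swapped in with CAS.
public class TreiberStack<E> {
    private static final class Node<E> {
        final E item;
        final Node<E> next;
        Node(E item, Node<E> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    public void push(E item) {
        Node<E> oldTop;
        Node<E> newTop;
        do {
            oldTop = top.get();
            newTop = new Node<>(item, oldTop);
        } while (!top.compareAndSet(oldTop, newTop));   // retry if another thread changed the head
    }

    public E pop() {
        Node<E> oldTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null;            // empty stack
        } while (!top.compareAndSet(oldTop, oldTop.next));
        return oldTop.item;
    }
}
```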
Sets:
Sets returned by the factory Collections.newSetFromMap() with a backing ConcurrentHashMap. With these sets you can be sure that their iterators are not prone to ConcurrentModificationException, and they use a striping technique for locking - locking the whole set is not necessary to perform most operations. For example, when you want to add an element, only the part of the set determined by the element's hashCode() will be locked;
ConcurrentSkipListSet. The thread-safe set based on a Skip List data structure;
Sets, returned by Collections.synchronizedSet(). All points written about similar lists are applicable here.
Maps:
ConcurrentHashMap which I already mentioned and explained. Striping is based on item keys;
ConcurrentSkipListMap. Thread-safe map, based on skip list;
Maps, returned by Collections.synchronizedMap(). All points written about similar lists and maps are applicable here.
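A small usage sketch tying a few of the collections above together (only standard java.util / java.util.concurrent classes; the keys and values are made up):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class ConcurrentCollectionsDemo {
    public static void main(String[] args) {
        // Striped/CAS-based map: per-key atomic update without external locking.
        Map<String, Integer> hits = new ConcurrentHashMap<>();
        hits.merge("home", 1, Integer::sum);

        // Thread-safe set backed by a ConcurrentHashMap, as described above.
        Set<String> users = Collections.newSetFromMap(new ConcurrentHashMap<>());
        users.add("alice");

        // Copy-on-write list: iteration sees a stable snapshot; writes copy the backing array.
        List<Runnable> listeners = new CopyOnWriteArrayList<>();
        listeners.add(() -> System.out.println("event"));

        // Lock-free Michael-Scott queue.
        Queue<String> tasks = new ConcurrentLinkedQueue<>();
        tasks.offer("task-1");

        System.out.println(hits + " " + users + " " + tasks.poll());
    }
}
```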
These were more or less standard data structures intended for multithreaded usage, which should be enough for most practical tasks. I also found some links you may find useful:
Wait-free red-black tree;
A huge article about concurrent data structures in general;
Concurrent structures and synchronization primitives used in .NET;
Articles about transactional memory - not really a data structure, but since your request serves academic purposes, worth reading;
One more article about transactional memory - easy to read, but it is in Russian. If you can read it, it is definitely worth reading;

Related

rdb vs key-value store for django functionality [duplicate]

When would one choose a key-value data store over a relational DB? What considerations go into deciding one or the other? When is mix of both the best route? Please provide examples if you can.
Key-value, hierarchical, map-reduce, or graph database systems are much closer to implementation strategies; they are heavily tied to the physical representation. The primary reason to choose one of these is if there is a compelling performance argument and it fits your data processing strategy very closely. Beware, ad-hoc queries are usually not practical for these systems, and you're better off deciding on your queries ahead of time.
Relational database systems try to separate the logical, business-oriented model from the underlying physical representation and processing strategies. This separation is imperfect, but still quite good. Relational systems are great for handling facts and extracting reliable information from collections of facts. Relational systems are also great at ad-hoc queries, which the other systems are notoriously bad at. That's a great fit in the business world and many other places. That's why relational systems are so prevalent.
If it's a business application, a relational system is almost always the answer. For other kinds of applications, it's probably the answer too. If you have more of a data processing problem - some pipeline of things that need to happen, massive amounts of data, and all of your queries known up front - another system may be right for you.
If your data is simply a list of things and you can derive a unique identifier for each item, then a KVS is a good match. They are close implementations of the simple data structures we learned in freshman computer science and do not allow for complex relationships.
A simple test: can you represent your data and all of its relationships as a linked list or hash table? If yes, a KVS may work. If no, you need an RDB.
You still need to find a KVS that will work in your environment. Support for KVSes, even the major ones, is nowhere near what it is for, say, PostgreSQL and MySQL/MariaDB.
IMO, key-value pairs (e.g. NoSQL databases) work best when the underlying data is unstructured, unpredictable, or changing often. If you don't have structured data, a relational database is going to be more trouble than it's worth, because you will need to make lots of schema changes and/or jump through hoops to conform your data to the structure.
KVP / JSON / NoSql is great because changes to the data structure do not require completely refactoring the data model. Adding a field to your data object is simply a matter of adding it to the data. The other side of the coin is there are fewer constraints and validation checks in a KVP / Nosql database than a relational database so your data might get messy.
There are performance and space saving benefits for relational data models. Normalized relational data can make understanding and validating the data easier because there are table key relationships and constraints to help you out.
One of the worst patterns I've seen is trying to have it both ways. Trying to put a key-value pair into a relational database is often a recipe for disaster. I would recommend using the technology that suits your data foremost.
If you want O(1) lookups of values based on keys, then you want a KV store. Meaning, if you have data of the form k1={foo}, k2={bar}, etc, even when the values are larger/ nested structures, and want fast lookups, you want a KV store.
Even with proper indexing, you cannot achieve O(1) lookups in a relational DB for arbitrary keys. Sometimes this is referred to as "random lookups".
Alternatively stated: if you only ever query by one column, a "primary key" if you will, to retrieve the rest of the data, then using that column as the keyspace and the rest of the data as the value in a KV store is the most efficient way to do lookups (see the sketch after this answer).
In contrast, if you often query the data by any of several columns, aka you support a richer query API for the data, then you may want a relational database.
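As a toy illustration of that key-as-keyspace layout (an in-memory stand-in rather than any particular KV product; the key format and fields are invented):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KeyValueLookup {
    public static void main(String[] args) {
        // One key -> one opaque value blob; the store never looks inside the value.
        Map<String, String> store = new ConcurrentHashMap<>();
        store.put("user:42", "{\"name\":\"Ada\",\"email\":\"ada@example.com\"}");

        // The only query the model supports: fetch by the exact key, O(1) on average.
        String value = store.get("user:42");
        System.out.println(value);

        // Anything like "find all users whose email ends with example.com" would require
        // scanning every entry or maintaining a second map by hand - that's where a
        // relational database starts to earn its keep.
    }
}
```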
A traditional relational database has problems scaling beyond a point. Where that point is depends a bit on what you are trying to do.
All (most?) of the suppliers of cloud computing are providing key-value data stores.
However, if you have a reasonably sized application with a complicated data structure, then the support that you get from using a relational database can reduce your development costs.
In my experience, if you're even asking the question whether to use traditional vs esoteric practices, then go traditional. While esoteric practices are sexy, challenging, and fun, 99.999% of applications call for a traditional approach.
With regards to relational vs KV, the question you should be asking is:
Why would I not want to use a relational model for this scenario: ...
Since you have not described the scenario, it's impossible for anyone to tell you why you shouldn't use it. The "catch all" reason for KV is scalability, which isn't a problem now. Do you know the rules of optimization?
Don't do it.
(for experts only) Don't do it now.
KV is a highly optimized solution to scalability that will most likely be completely unnecessary for your application.

How will one implement like and dislike counts, for a post, in Couchdb/Couchbase [closed]

How does one implement live like and dislike counts [or, say, view counts] in CouchDB/Couchbase in the most efficient way?
Yes, one can use a reduce to calculate the count each time, and on the front end simply increment or decrement with one API call to get the result.
But every post may have, say, millions of views, likes and dislikes.
If we have millions of such posts [in a social networking site], the index will simply be too big.
In terms of Cloudant, the described use case requires a bit of care:
Fast writes
Ever-growing data set
Potentially global queries with aggregations
The key here is to use an immutable data model--don't update any existing documents, only create new ones. This means that you won't have to suffer update conflicts as the load increases.
So a post is its own document in one database, and the likes are stored separately. For likes, you have a few options. The classic CouchDB solution would be to have a separate database with "likes" documents containing the post id of the post they refer to, with a view emitting the post id, aggregated by the built-in _count. This would be a pretty efficient solution in this case, but yes, indexes do occupy space on Couch-like databases (just as with any other database).
The second option would be to exploit the _id field, as this is an index you get for free. If you prefix the like-documents' ids with the liked post's id, you can do an _all_docs query with a start and end key to get all the likes for that post (a sketch follows after these options).
Third - recent CouchDB releases and Cloudant have the concept of partitioned databases, which, very loosely speaking, can be viewed as a formalised version of option two above: you nominate a partition key which is used to ensure a degree of storage locality behind the scenes -- all documents within the same partition are stored in the same shard. This means retrieval is faster -- and on Cloudant, also cheaper. In your case you'd create a partitioned "likes" database with the partition key being the post id. Glynn Bird wrote a great intro to partitioned DBs here.
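As a rough sketch of option two above (the database name "likes" and the post id are hypothetical; the _all_docs range query itself is standard CouchDB):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class LikesByPostId {
    public static void main(String[] args) throws Exception {
        String postId = "post-123";                      // hypothetical post id
        String base = "http://localhost:5984/likes";     // hypothetical likes database

        // Like documents are assumed to have _id = "<postId>:<something unique>",
        // so a key range over _all_docs returns exactly the likes of one post.
        String startKey = URLEncoder.encode("\"" + postId + ":\"", StandardCharsets.UTF_8);
        String endKey   = URLEncoder.encode("\"" + postId + ":\ufff0\"", StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(base + "/_all_docs?startkey=" + startKey + "&endkey=" + endKey))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The body is JSON; the number of returned rows is the like count for the post.
        System.out.println(response.body());
    }
}
```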
Your remaining issue is that of ever-growth. At Cloudant, we'd expect to get to know you well once your data volume goes beyond single digit TBs. If you'd expect to reach this kind of volume, it's worth tackling that up-front. Any of the likes schemes above could most likely be time-boxed and aggregated once a quarter/month/week or whatever suits your model.

What is the most efficient implementation of arrays with functional updates?

I need an array-like data structure with the fastest possible functional update. I've seen a few different implementation of flexible arrays that provide me with this property (Braun, Random Access Lists) but I'm wondering if there is an implementation that is specifically optimized for the case when we are not interested in append or prepend - just updates.
Jean-Christophe Filliâtre has a very nice implementation of persistent arrays, described in the paper linked on the same page (which is about persistent union-find, of which persistent arrays are a core component). The code is directly available there.
The idea is that "the last version" of the array is represented as an ordinary array, with O(1) access and update operations, and previous versions are represented as this last version plus a list of differences. If you access a previous version of the structure, the array is "rerooted" to apply the list of differences and give you the efficient representation again.
This will of course not be O(1) under all workflows (if you constantly access and modify unrelated versions of the structure, you will pay rerooting costs frequently), but for the common workflow of mainly working with one version, and occasionally backtracking to an older version that then becomes the "last version" again and receives the updates, this is very efficient. A very nice use of mutability hidden behind an observationally pure interface.
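A rough Java transliteration of that rerooting idea, as a sketch rather than Filliâtre's actual OCaml code (single-threaded, int elements, no bounds checking):

```java
// Persistent array with rerooting: the newest version holds the flat array,
// older versions hold a chain of "undo one write" diffs pointing toward it.
public final class PersistentArray {
    private Node node;
    private PersistentArray(Node n) { node = n; }

    public static PersistentArray create(int size) {
        return new PersistentArray(new Flat(new int[size]));
    }

    public int get(int i) {
        reroot();
        return ((Flat) node).data[i];
    }

    public PersistentArray set(int i, int v) {
        reroot();
        Flat flat = (Flat) node;
        int old = flat.data[i];
        flat.data[i] = v;
        PersistentArray fresh = new PersistentArray(flat);
        // This (now old) version records the difference needed to undo the write.
        this.node = new Diff(i, old, fresh);
        return fresh;
    }

    // Make this version hold the flat array, rewriting the diff chain in the other direction.
    private void reroot() {
        if (node instanceof Flat) return;
        Diff d = (Diff) node;
        d.rest.reroot();
        Flat flat = (Flat) d.rest.node;
        int current = flat.data[d.index];
        flat.data[d.index] = d.value;
        this.node = flat;
        d.rest.node = new Diff(d.index, current, this);
    }

    private interface Node {}
    private static final class Flat implements Node {
        final int[] data;
        Flat(int[] d) { data = d; }
    }
    private static final class Diff implements Node {
        final int index, value;
        final PersistentArray rest;
        Diff(int i, int v, PersistentArray r) { index = i; value = v; rest = r; }
    }

    public static void main(String[] args) {
        PersistentArray a0 = PersistentArray.create(4);
        PersistentArray a1 = a0.set(2, 42);
        System.out.println(a0.get(2) + " " + a1.get(2));   // prints "0 42"
    }
}
```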
I have a very good experience with repa (nice demo). Very good performance, automatic parallelism, multidimensional, polymorphic. Recommended to try.
Which language are you using? In Haskell you can use mutable arrays with the state monad, and in Mercury you can use mutable arrays by threading the IO state. OCaml also has an array module, which unfortunately does not maintain referential transparency, if that is what you are after.
I also needed functional arrays and came across this SO question a few days ago. I was not satisfied with the solution proposed by Gasche, since creating a new array is a costly operation and I need to access older versions of the array quite frequently (I plan to use this for an AI alpha/beta implementation playing on an array).
(When I say costly, I guess it is O(n*h) where h is the history size, because in the worst case only one cell was updated repeatedly and the whole update list must be traversed for each cell. I also expect most of the cells not to have been updated when I need to reroot the array.)
This is why I propose another approach; maybe I can get some feedback here. My idea is to store the array like in a B-tree, except that, as it is not mutable, I can access and update any value by index quite easily.
I wrote a small introduction in the project's repository: https://github.com/shepard8/ocaml-ptarray. The order is chosen so that the depth and the order of the tree are roughly equal, which gives nice complexities depending only on the order for get/set operations, namely O(k^2).
With k = 10 I can store up to 10^10 values. Actually my arrays should not contain more than 200 values but this is to show how robust my solution is intended to be.
Any advice welcome!

When designing a hash_table, how many aspects should be paid attention to?

I have some candidate aspects:
The hash function is important; the hash code should be as unique as possible.
The backend data structure is important; the search, insert and delete operations should all have time complexity O(1).
Memory management is important; the memory overhead of every hash_table entry should be as small as possible. When the hash_table expands, memory should grow efficiently, and when the hash_table shrinks, memory should be reclaimed efficiently. With these memory operations, aspect 2 should still be fulfilled.
If the hash_table will be used from multiple threads, it should be thread-safe and also efficient.
My questions are:
Are there any more aspects worth attention?
How to design the hash_table to fulfill these aspects?
Are there any resources I can refer to?
Many thanks!
After reading some material, I have updated my questions. :)
In a book explaining the source code of the SGI STL, I found some useful information:
The backend data structure is a bucket of linked lists. To search, insert or delete an element in the hash_table:
Use a hash function to calculate the corresponding position in the bucket array; the elements are stored in the linked list at that position.
When the number of elements becomes larger than the number of buckets, the buckets need resizing: expand the size to roughly twice the old size (the bucket count should be prime), then copy the old buckets and elements over to the new array. (A sketch of this scheme follows after these notes.)
I didn't find any logic for shrinking when the number of elements is much smaller than the number of buckets, but I think this should be considered for workloads with many inserts at first and many deletes later.
Other backing structures, such as arrays with linear probing or quadratic probing, are not as good as linked lists.
A good hash function can avoid clustering, and double hashing can help resolve clustering.
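A minimal sketch of that bucket-of-linked-lists scheme (my own illustration, not the SGI STL code; it doubles the bucket count instead of stepping through primes, for brevity):

```java
import java.util.LinkedList;

// Chained hash table: hash to a bucket, chain collisions in a linked list,
// and double the bucket count when the element count exceeds it.
public class ChainedHashTable<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K k, V v) { key = k; value = v; }
    }

    private LinkedList<Entry<K, V>>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int initialBuckets) {
        buckets = new LinkedList[initialBuckets];
    }

    private int indexFor(K key, int bucketCount) {
        return (key.hashCode() & 0x7fffffff) % bucketCount;   // keys assumed non-null
    }

    public void put(K key, V value) {
        if (size + 1 > buckets.length) resize(buckets.length * 2);
        int i = indexFor(key, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }   // overwrite existing key
        }
        buckets[i].add(new Entry<>(key, value));
        size++;
    }

    public V get(K key) {
        LinkedList<Entry<K, V>> chain = buckets[indexFor(key, buckets.length)];
        if (chain != null) {
            for (Entry<K, V> e : chain) {
                if (e.key.equals(key)) return e.value;
            }
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void resize(int newBucketCount) {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = new LinkedList[newBucketCount];
        for (LinkedList<Entry<K, V>> chain : old) {
            if (chain == null) continue;
            for (Entry<K, V> e : chain) {        // rehash every entry into the new array
                int i = indexFor(e.key, buckets.length);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }
}
```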
The question about multi_threads is still open. :D
There are two (slightly) orthogonal concerns.
While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:
the hash function depends on the data to be stored
the backend depends on the requirements of the storage
For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
For the back-end, there are various concerns, as you noted. Some remarks:
Are we talking average or worst-case complexity? Without perfect hashing, achieving O(1) is nigh impossible as far as I know, though the worst-case frequency and complexity can be considerably dampened.
Are we talking amortized complexity? Amortized complexity in general offers better throughput at the cost of "spikes". Linear rehashing, at the cost of slightly lower throughput, will give you a smoother curve.
With regard to multi-threading, note that the Read/Write pattern may impact the solution, considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general writes are harder to parallelize, because they may require modifying the structure. At worst, they might require serialization.
Garbage Collection is pretty trivial in the amortized case, with linear-rehashing it's a bit more complicated, but probably the least challenging portion.
You never talked about the amount of data you're about to use. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread them around to avoid contention.
References:
The article on Wikipedia exposes lots of various implementations, always good to peek at the variety
This GoogleTalk from Dr Cliff (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.
I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:
To go from hash to index, use a binary AND instead of modulo by a prime. This is many times faster; your table size must be a power of two (see the sketch after this list).
For hash collisions don't use a linked list, store the values in the table to improve cache performance.
By using a state machine you can get a very fast multi-thread implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.
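A tiny sketch of the first point in that list (the extra bit-mixing step is my own addition, not something claimed in the blog post):

```java
public class HashIndex {
    // tableLength must be a power of two for the mask to behave like a modulo.
    static int indexFor(Object key, int tableLength) {
        int h = key.hashCode();
        h ^= (h >>> 16);               // spread the high bits down so the mask doesn't discard them
        return h & (tableLength - 1);  // bitwise AND replaces the slower "h % prime"
    }

    public static void main(String[] args) {
        System.out.println(indexFor("example-key", 64));   // some slot in [0, 63]
    }
}
```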

How to avoid mutable state (when multithreading)

Multithreading is hard. The only thing you can do is program very carefully and follow good advice. One great piece of advice I got from the answers on this forum is to avoid mutable state. I understand this is even enforced in the Erlang language. However, I fail to see how this can be done without a severe performance hit and huge amounts of caching.
For example: you have a big list of objects, each containing quite a lot of properties; in other words, a large data structure. Suppose you have a bunch of threads and they all need to access and modify the list. How can this be done without shared memory, and without having to cache the whole data structure in each of the threads?
Update: after reading the reactions so far, I would like to put some more emphasis on performance. Don't you think that copying the same data around will make the program slower than it would be with shared memory?
Not every algorithm can be parallelized successfully.
If your program doesn't exhibit any "parallel structure", then you're pretty much doomed to use locking and shared, mutable structures.
If your algorithm does exhibit structure, then you can express your computation in terms of some pattern or formalism (for example, a macro dataflow graph) that makes the choice of an immutable data structure trivial.
So: think in terms of the structure of the algorithm, not just in terms of the properties of the data structure to use.
You can get a great start in thinking about immutable collections, where they are applicable, how they can actually work without requiring lots of copying, etc. by looking through Eric Lippert's articles tagged with immutability:
http://blogs.msdn.com/ericlippert/archive/tags/Immutability/default.aspx
I guess the first question is: why do they need to modify the list? Would it be possible for them to return their changes as a list of modifications rather than actually modifying the shared list? Could they work with a list which looks like it's a mutable version of the original list, but is actually only locally mutable? Are you changing which elements are in the list, or just the properties of those elements?
These are just questions rather than answers, but I'm trying to encourage you to think about your problem in a different way. Look at the bigger picture as the task you want to achieve instead of thinking about the way you'd tackle it in a normal imperative, mutable fashion. Changing the way you think about problems is very difficult, but you may find you get some great "aha!" moments :)
There are many pitfalls when working with multiple threads and large sets of data. The advice to avoid mutable state is meant to try to make life easier for you if you can manage to follow the guideline (i.e. if you have no mutable state then multi-threading will be much easier).
If you have a large amount of data that does need to be modified, then you perhaps cannot avoid mutable state. An alternative, though, would be to partition the data into blocks, each of which is passed to a thread for manipulation. The block can be processed and then passed back, and the controller can then perform the updates where necessary (see the sketch after this answer). In this scenario you have removed the mutable state from the threads.
If this cannot be done and each thread needs update access to the full list (i.e. it could update any item on the list at any time) then you are going to have a lot of fun trying to make sure you have got your locking strategies and concurrency issues sorted. I'm sure there are scenarios where this is required, and the design pattern of avoiding mutable state may not apply.
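A rough sketch of that partition-process-merge idea (the workload is invented; the point is that workers only return results and only the controller writes to the shared list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedUpdate {
    public static void main(String[] args) throws Exception {
        List<Integer> shared = new ArrayList<>();
        for (int i = 0; i < 1000; i++) shared.add(i);

        int threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = shared.size() / threads;

        List<Future<List<Integer>>> results = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = (t == threads - 1) ? shared.size() : from + chunk;
            // Each worker gets an immutable snapshot of its block.
            final List<Integer> block = List.copyOf(shared.subList(from, to));
            results.add(pool.submit((Callable<List<Integer>>) () -> {
                List<Integer> updated = new ArrayList<>(block.size());
                for (int v : block) updated.add(v * v);   // the "modification", done locally
                return updated;
            }));
        }

        // Only the controller thread mutates the shared list, after the workers finish.
        int offset = 0;
        for (Future<List<Integer>> f : results) {
            for (int v : f.get()) shared.set(offset++, v);
        }
        pool.shutdown();
        System.out.println(shared.get(10));   // 100
    }
}
```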
Just using immutable data-objects is a big help.
Modifying lists sounds like a constructed argument, but consider granular methods that are unaware of lists.
If you really need to update the structure, one way to do this is to have a single worker thread which picks up update requests from a fixed area protected by a mutex.
If you are clever you can update the structure in place without affecting any "reading" threads. For example, if you are adding to the end of an array, do all the work to add the new structure, but only as the very last instruction increment the NoOfMembers count -- the reading threads should not see the new entry until you do this. Or arrange your data as an array of references to structures: when you want to update a structure, copy the current member, update the copy, then as the last operation replace the reference in the array (see the sketch below).
The other threads then only need to check a single simple "update in progress" mutex, and only when they actively want to update.
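A rough Java rendering of the copy-then-swap variant described above (my own sketch; AtomicReferenceArray is used here so the final reference swap is safely published to reader threads):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

public class CopyThenSwap {
    static final class Member {
        final String name;
        final int score;
        Member(String name, int score) { this.name = name; this.score = score; }
    }

    private final AtomicReferenceArray<Member> members = new AtomicReferenceArray<>(16);

    public CopyThenSwap() {
        for (int i = 0; i < members.length(); i++) {
            members.set(i, new Member("member-" + i, 0));   // pre-populate so updates have something to copy
        }
    }

    // Readers only ever see either the old object or the fully built new one.
    public Member read(int i) {
        return members.get(i);
    }

    // Single writer thread: build the replacement off to the side, publish it last.
    public void updateScore(int i, int newScore) {
        Member current = members.get(i);
        Member updated = new Member(current.name, newScore);  // all the work happens here
        members.set(i, updated);                              // the very last instruction
    }
}
```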
