(Terminology) "Contended" vs "Contented" Locks - multithreading

What is the difference, if any, between "contented" locks and "contended" locks?
I recently heard the word "contented" used for the first time in a discussion about locking and evidently the two terms are used with almost equal frequency:
contented 367,000 results
contended 353,000 results
"Contention" and "contended" make sense to me, as they are words meaning in conflict, but "contented" means satisfied/at peace, so color me confused.

"Contended" describes a lock that different threads are trying to acquire at the same time, "Heavily-contended" if numerous threads are all trying to acquire the same lock, "uncontended" describing cases where a thread doesn't have any competition to acquire a lock.
"Contented" might be a typo, an erroneous autocorrect, or maybe an eggcorn).
Here's an example from the Oracle website, on the weblog of David Dice, a senior research scientist at Oracle who specializes in concurrency. If "contented" had a meaning specific to locks or multithreading, I expect he would know about it. The "contented" typo appeared in his weblog (it has since been corrected in the article text, though it remains in the article URL), and someone commented on seeing "contented". David Dice replied:
Thanks for catching the embarrassing typo, which I just fixed! Like you, I wonder exactly what the semantics of "#contented" would mean (:>). Regards, -Dave
For some of these results Google seems to be anticipating our inability to spell. Google returns this link on the first page of matches for "contented site:oracle.com", even though the word "contented" does not appear in it.

Locks are either contended or uncontended. A lock is considered contended if a thread blocks while trying to acquire the lock. If a lock is available when a thread attempts to acquire it, the lock is considered uncontended. Contended locks can experience either high contention (a relatively large number of threads attempting to acquire the lock) or low contention (a relatively small number of threads attempting to acquire the lock). Unsurprisingly, highly contended locks tend to decrease overall performance of concurrent applications.

Related

What are the advantages of lock-free programming over spin lock?

I am wondering what the advantages of lock-free programming are over spin locks. When a thread (call it A) does lock-free programming using a CAS mechanism, if another thread changes the value behind the CAS, thread A still needs to loop and try again. It seems just like using a spin lock!
I am confused about this. Although I know that CAS and spin locks are both suitable when lock contention is not fierce, can someone explain in which scenarios lock-free code should be used and in which a spin lock should be used?
Lock-freedom provides what is called a progress guarantee. You are right that in your example thread A has to perform a retry (i.e., loop again), but only if some other thread changed the value, which implies that that thread was able to make progress.
In contrast, a thread (let's call it X) that holds a spin-lock prevents all other threads from making progress until the lock is released. So if thread X is preempted, execution of all threads waiting for the lock is effectively stalled until X can resume execution and finally release the lock. If X were to be stalled indefinitely, then all other threads would also be blocked indefinitely.
Such a situation is not possible with lock-free algorithms, since it is guaranteed that at any time at least one thread can make progress.
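To make the progress guarantee concrete, here is a minimal Java sketch (class and method names are made up for illustration) contrasting the two approaches: in the lock-free version a failed CAS means another thread's CAS succeeded, so the system as a whole made progress; in the spin-lock version, if the holder is preempted, every waiting thread is stalled.

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;

    class ProgressDemo {
        private final AtomicInteger counter = new AtomicInteger();

        // Lock-free increment: a failed CAS means some other thread's
        // CAS succeeded, i.e. the system as a whole made progress.
        int lockFreeIncrement() {
            while (true) {
                int current = counter.get();
                if (counter.compareAndSet(current, current + 1)) {
                    return current + 1;
                }
                // retry: another thread won the race and made progress
            }
        }

        private final AtomicBoolean spinLock = new AtomicBoolean(false);
        private int plainCounter = 0;

        // Spin-lock increment: if the lock holder is preempted,
        // every spinning thread is stalled until it resumes.
        int spinLockIncrement() {
            while (!spinLock.compareAndSet(false, true)) {
                Thread.onSpinWait(); // busy-wait (Java 9+); nobody progresses
            }
            try {
                return ++plainCounter;
            } finally {
                spinLock.set(false);
            }
        }
    }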
Which should be used depends on the situation. Lock-free algorithms are inherently difficult to design, especially for more complex data structures like trees. And even if you have a lock-free algorithm, it is almost always slower than a serial one, so a serial version protected by a lock might perform better. Then again, if the data structure is heavily contended, a lock-free version will scale better than one protected by a lock. However, if your workload is mostly read-only, a read-write-lock will also provide good scalability. Unfortunately, there is no general rule here...
If you want to learn more about lock-freedom (and more) I recommend the book The Art of Multiprocessor Programming.
If you prefer free alternatives I recommend Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul McKenney or Practical lock-freedom by Keir Fraser.

Many threads and critical region

If I have many threads running at the same time how can I allow only 1 thread to enter the critical region? Also what will happen if I had more than 1 thread in the critical region?
There are several kinds of bugs that can happen as a result of improperly-protected critical sections, but the most common is called a race condition. This occurs when the correctness of your program's behavior or output is dependent on events occurring in a particular sequence, but it's possible that the events will occur in a different sequence. This will tend to cause the program to behave in an unexpected or unpredictable manner. (Sorry to be a little vague on that last point but by its very nature it's often difficult to predict in advance what the exact result will be other than to say "it probably won't be what you wanted").
Generally, to fix this, you'd use some kind of lock to ensure that only one thread can access the critical section at once. The most common mechanism for this is a mutex lock, which is used for the "simple" case - you have some kind of a shared resource and only one thread can access it at one time.
There are some other mechanisms available too for more complicated cases, such as:
Reader-Writer Locks - either one thread can be writing to the resource, or any number of threads can be reading from it.
Counting semaphore - some specified number of threads can access a particular resource at one time (see the sketch after this list). As an analogy, think of a parking lot that only has, for example, 100 spaces - once 100 cars are parked there, it can't accept any more (or, at least, not until one of them leaves).
The .NET Framework provides a ManualResetEvent - basically, the threads in question have to wait until an event happens.
This isn't a lock per se, but it's becoming increasingly common to use immutable data structures to obviate the need for locking in the first place. The idea here is that no thread can modify another thread's data, they're always working against either a local version or the unmodified "central" version.
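As an illustration of the counting-semaphore item above, here is a minimal Java sketch of the parking-lot analogy using java.util.concurrent.Semaphore (the class and method names are invented for the example):

    import java.util.concurrent.Semaphore;

    class ParkingLot {
        // 100 permits: at most 100 cars (threads) inside at once.
        private final Semaphore spaces = new Semaphore(100);

        void park(Runnable visit) throws InterruptedException {
            spaces.acquire();     // blocks while the lot is full
            try {
                visit.run();      // occupy a space
            } finally {
                spaces.release(); // leaving frees a space for the next car
            }
        }
    }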

Usefulness of lock-free programming on user-level threads

AFAIK, as opposed to time-slice-based scheduler preemption, pure user-level threads (ULTs) cooperatively yield the processor to other threads. However, from my reading I see that several preemptive user-thread mechanisms now exist.
Keeping this in mind, I wanted to start a discussion on the benefits of lock-free programming on user-level threads. My understanding is that, regardless of whether a preemptive scheduler is present, the performance of lock-free programming should surpass that of mutex/semaphore-based programs.
However, I am still confused; since the acquire operation on a mutex also takes a fast path in the absence of contention, the performance gain may not be attractive enough to justify migrating to a lock-free approach.
In the case of semaphores, there is a system call leading to a context switch, so lock-free approaches can be seen as a much better option.
Please comment on both situations - a ULT system equipped with a preemptive mechanism and one without it.
This is not an easy question to answer, as it is very general; it will boil down to what your requirements are.
I have recently been working with systems where the use of lock-free structures was considered, but when we sat down and wrote out our requirements, we realized that they are in fact not what we want. Our system didn't really require them, and in fact locking helps us, because we typically have a producer/consumer architecture where, if there is nothing being produced (i.e. nothing being added to a queue), the consumer should be idle (i.e. blocked).
I recently wrote about this in more detail:
http://blog.chrisd.info/a-simple-thread-safe-queue-for-use-in-multi-threaded-c-applications/
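The blocking producer/consumer shape described in that answer can be sketched with java.util.concurrent's BlockingQueue, which gives exactly the "consumer idles while nothing is produced" behaviour; the names below are illustrative:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class ProducerConsumer {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        void produce(String item) throws InterruptedException {
            queue.put(item); // blocks if the queue is full
        }

        void consumeLoop() throws InterruptedException {
            while (true) {
                String item = queue.take(); // blocks (idles) while queue is empty
                process(item);
            }
        }

        private void process(String item) { /* application work goes here */ }
    }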

Why might a lock become a bottleneck in a multithreaded program?

Why might a lock become a bottleneck in a multithreaded program?
If multiple threads frequently pop() and push() my queue, which lock should I use?
The lock you use depends on your platform but will generally be some flavour of mutex. On Windows, you would use a critical section, and in .NET, you'd use a monitor. I'm not very familiar with locking mechanisms on other platforms. I'd stay away from lock-free approaches. They are very difficult to program correctly, and the performance gains are often not as great as you would expect.
Locks become a bottleneck in your program when they are under heavy contention. That is, a very large number of threads all try to acquire the lock at the same time. This wastes a lot of CPU cycles as threads become blocked and the OS spends a greater and greater portion of its time switching between threads. This sort of problem most frequently manifests itself in the server world. For desktop applications, it's rare that locks will cause a performance issue.
"Why lock may become a bottleneck of multithreaded program?" - think of a turnstile (also called a baffle gate), which only lets one person through at a time, with a crowd of people waiting to go through it.
For a queue, use the simplest lock your environment has to offer.
For a queue, it is easy to write a lock-free implementation (google away)
Locks are bottlenecks because they force all other threads which encounter them to stop doing what they're doing and wait for the lock to open, thus wasting time. One of the ideas behind multithreading is to use as many processors as possible at any given time. By forcing threads to wait on the locks the application essentially gives up processing power which it might have used.
"Why lock may become a bottleneck of multithreaded program?"
Because waiting threads remain blocked until shared memory is unlocked.
Suggest you read this article on "Concurrency: What Every Dev Must Know About Multithreaded Apps" http://msdn.microsoft.com/en-au/magazine/cc163744.aspx
Locks are expensive both because they require operating system calls in the middle of your algorithm and because they are hard to get right.
As a programmer, it is best to leave the locks in the middle of your data structures to the experts and instead use a good multithreaded library such as Intel's TBB
For queues, you would want to use atomic instructions (hard) or a spinlock (easier) if possible, because they are cheap compared to a mutex. Use a mutex if you are doing a lot of work that needs to be locked, e.g. modifying a complex tree structure.
In the threading packages that I'm familiar with, your options for mutexes are recursive and non-recursive. You should opt for non-recursive -- all of your accesses will be lock(); queue_op(); unlock(), so there's no need to be able to acquire the lock twice.
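A minimal sketch of that lock(); queue_op(); unlock() pattern in Java (note that Java's intrinsic locks are recursive, unlike the non-recursive mutex the answer recommends, but the access pattern is the same):

    import java.util.ArrayDeque;
    import java.util.Deque;

    class LockedQueue<T> {
        private final Deque<T> queue = new ArrayDeque<>();
        private final Object lock = new Object();

        void push(T item) {
            synchronized (lock) {    // lock()
                queue.addLast(item); // queue_op()
            }                        // unlock()
        }

        T pop() {
            synchronized (lock) {
                return queue.pollFirst(); // null if the queue is empty
            }
        }
    }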

How can I write a lock free structure?

In my multithreaded application I see heavy lock contention, preventing good scalability across multiple cores. I have decided to use lock-free programming to solve this.
How can I write a lock free structure?
Short answer is:
You cannot.
Long answer is:
If you are asking this question, you probably do not know enough to be able to create a lock-free structure. Creating lock-free structures is extremely hard, and only experts in this field can do it. Instead of writing your own, search for an existing implementation. When you find one, check how widely it is used, how well it is documented, whether it is well proven, and what its limitations are - even some lock-free structures other people have published are broken.
If you do not find a lock-free structure corresponding to the structure you are currently using, consider adapting your algorithm so that you can use an existing one.
If you still insist on creating your own lock free structure, be sure to:
start with something very simple
understand the memory model of your target platform (including read/write reordering constraints and which operations are atomic)
study a lot about problems other people encountered when implementing lock free structures
do not just guess if it will work, prove it
heavily test the result
More reading:
Lock free and wait free algorithms at Wikipedia
Herb Sutter: Lock-Free Code: A False Sense of Security
Use a library such as Intel's Threading Building Blocks; it contains quite a few lock-free structures and algorithms. I really wouldn't recommend attempting to write lock-free code yourself, it's extremely error prone and hard to get right.
Writing thread-safe lock free code is hard; but this article from Herb Sutter will get you started.
As sblundy pointed out, if all objects are immutable, read-only, you don't need to worry about locking; however, this means you may have to copy objects a lot. Copying usually involves malloc, and malloc uses locking to synchronize memory allocations across threads, so immutable objects may buy you less than you think (malloc itself scales rather badly and is slow; if you do a lot of malloc in a performance-critical section, don't expect good performance).
When you only need to update simple variables (e.g. 32- or 64-bit ints or pointers), perform simple addition or subtraction operations on them, or just swap the values of two variables, most platforms offer "atomic operations" for that (GCC offers these as well). Atomic is not the same as thread-safe. However, atomic makes sure that if one thread writes, say, a 64-bit value to a memory location and another thread reads from it, the reader gets either the value before the write operation or the value after it, but never a broken value in between (e.g. one where the first 32 bits are already the new value while the last 32 bits are still the old value; this can happen if you don't use atomic access on such a variable).
However, if you have a C struct with three values that you want to update, even if you update all three with atomic operations, these are three independent operations; a reader might thus see the struct with one value already updated and two not yet updated. Here you will need a lock if you must ensure the reader sees either all old values or all new values.
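To illustrate the three-value point in Java (hypothetical names): making each field individually atomic still lets a reader observe a mixed state; only the locked variant guarantees an all-old or all-new snapshot.

    import java.util.concurrent.atomic.AtomicInteger;

    class TripleState {
        // Individually atomic: no torn reads per field, but a reader doing
        // x.get(), y.get(), z.get() may still see one new and two old values.
        final AtomicInteger x = new AtomicInteger();
        final AtomicInteger y = new AtomicInteger();
        final AtomicInteger z = new AtomicInteger();

        void atomicButInconsistent(int a, int b, int c) {
            x.set(a); y.set(b); z.set(c); // three independent operations
        }

        private int lx, ly, lz;

        // Locked: readers that also synchronize see all-old or all-new values.
        synchronized void consistentUpdate(int a, int b, int c) {
            lx = a; ly = b; lz = c;
        }

        synchronized int[] consistentRead() {
            return new int[] { lx, ly, lz };
        }
    }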
One way to make locks scale a lot better is using R/W locks. In many cases, updates to data are rather infrequent (write operations) but accessing the data is very frequent (read operations); think of collections (hashtables, trees). In that case R/W locks will buy you a huge performance gain, as many threads can hold a read lock at the same time (they won't block each other), and only when one thread wants a write lock are all other threads blocked for the duration of the update.
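A minimal sketch of that R/W-lock pattern using Java's ReentrantReadWriteLock around a map (names illustrative): many readers proceed in parallel, and a writer excludes everyone only for the duration of the update.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class RwCache {
        private final Map<String, String> map = new HashMap<>();
        private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

        String get(String key) {
            rw.readLock().lock(); // many threads may hold the read lock at once
            try {
                return map.get(key);
            } finally {
                rw.readLock().unlock();
            }
        }

        void put(String key, String value) {
            rw.writeLock().lock(); // exclusive: blocks all readers and writers
            try {
                map.put(key, value);
            } finally {
                rw.writeLock().unlock();
            }
        }
    }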
The best way to avoid threading issues is to not share any data across threads. If every thread deals most of the time with data no other thread has access to, you won't need locking for that data at all (and no atomic operations either). So try to share as little data as possible between threads. Then you only need a fast way to move data between threads if you really have to (ITC, Inter-Thread Communication). Depending on your operating system, platform and programming language (unfortunately you told us none of these), various powerful methods for ITC might exist.
And finally, another trick for working with shared data without any locking is to make sure threads don't access the same parts of the shared data. E.g. if two threads share an array, but one will only ever access even indexes and the other only odd indexes, you need no locking. Or if both share the same memory block and one only uses the upper half of it and the other only the lower half, you need no locking. That's not to say this will lead to good performance, though; especially not on multi-core CPUs. Write operations of one thread to this shared data (running on one core) might force the cache to be flushed for another thread (running on another core) - the "false sharing" problem - and these cache flushes are often the bottleneck for multithreaded applications running on modern multi-core CPUs.
As my professor (Nir Shavit from "The Art of Multiprocessor Programming") told the class: please don't. The main reason is testability - you can't test synchronization code. You can run simulations, you can even stress test, but it's a rough approximation at best. What you really need is a mathematical proof of correctness. And very few are capable of understanding them, let alone writing them.
So, as others have said: use existing libraries. Joe Duffy's blog surveys some techniques (section 28). The first one you should try is tree-splitting - break into smaller tasks and combine.
Immutability is one approach to avoid locking. See Eric Lippert's discussion and implementation of things like immutable stacks and queues.
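In the same spirit, here is a minimal persistent-stack sketch in Java (a hypothetical class, not Lippert's code): push and pop return new stacks that share structure with the old one, so no thread ever observes a mutation.

    // Minimal persistent (immutable) stack: every operation returns a new stack.
    final class ImmutableStack<T> {
        private static final ImmutableStack<Object> EMPTY = new ImmutableStack<>(null, null);

        private final T head;
        private final ImmutableStack<T> tail;

        private ImmutableStack(T head, ImmutableStack<T> tail) {
            this.head = head;
            this.tail = tail;
        }

        @SuppressWarnings("unchecked")
        static <T> ImmutableStack<T> empty() { return (ImmutableStack<T>) EMPTY; }

        ImmutableStack<T> push(T value) {
            return new ImmutableStack<>(value, this); // the old stack is untouched
        }

        T peek() {
            if (this == EMPTY) throw new IllegalStateException("empty stack");
            return head;
        }

        ImmutableStack<T> pop() {
            if (this == EMPTY) throw new IllegalStateException("empty stack");
            return tail; // shares all remaining nodes with the old stack
        }
    }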
In re. Suma's answer, Maurice Herlihy shows in The Art of Multiprocessor Programming that actually anything can be written without locks (see chapter 6). IIRC, this essentially involves splitting tasks into processing node elements (like a function closure) and enqueuing each one. Threads calculate the state by following all nodes from the latest cached one. Obviously this could, in the worst case, result in sequential performance, but it does have important lockless properties, preventing scenarios where threads could get scheduled out for long periods of time while they are holding locks. Herlihy also achieves theoretical wait-free performance, meaning that one thread will not end up waiting forever to win the atomic enqueue (this is a lot of complicated code).
A multi-threaded queue / stack is surprisingly hard (check the ABA problem). Other things may be very simple. Become accustomed to while(true) { atomicCAS until I swapped it } blocks; they are incredibly powerful. An intuition for what's correct with CAS can help development, though you should use good testing and maybe more powerful tools (maybe SKETCH, upcoming MIT Kendo, or spin?) to check correctness if you can reduce it to a simple structure.
Please post more about your problem. It's difficult to give a good answer without details.
Edit: immutability is nice, but its applicability is limited, if I'm understanding it right. It doesn't really overcome write-after-read hazards; consider two threads executing "mem = NewNode(mem)"; they could both read mem, then both write it; not correct for a classic increment function. Also, it's probably slow due to heap allocation (which has to be synchronized across threads).
Immutability would have this effect. Changes to the object result in a new object. Lisp works this way under the covers.
Item 13 of Effective Java explains this technique.
Cliff Click has done some major research on lock-free data structures by utilizing finite state machines, and has also posted a lot of implementations for Java. You can find his papers, slides and implementations at his blog: http://blogs.azulsystems.com/cliff/
Use an existing implementation, as this area of work is the realm of domain experts and PhDs (if you want it done right!)
For example there is a library of code here:
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
Most lock-free algorithms or structures start with some atomic operation, i.e. a change to some memory location that once begun by a thread will be completed before any other thread can perform that same operation. Do you have such an operation in your environment?
See here for the canonical paper on this subject.
Also try this Wikipedia article for further ideas and links.
The basic principle for lock-free synchronisation is this:
whenever you are reading the structure, you follow the read with a test to see if the structure was mutated since you started the read, and retry until you succeed in reading without something else coming along and mutating while you are doing so;
whenever you are mutating the structure, you arrange your algorithm and data so that there is a single atomic step which, if taken, causes the entire change to become visible to the other threads, and arrange things so that none of the change is visible unless that step is taken. You use whatever lockfree atomic mechanism exists on your platform for that step (e.g. compare-and-set, load-linked+store-conditional, etc.). In that step you must then check to see if any other thread has mutated the object since the mutation operation began, commit if it has not and start over if it has.
There are plenty of examples of lock-free structures on the web; without knowing more about what you are implementing and on what platform it is hard to be more specific.
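As one concrete instance of the "single atomic step" principle above, here is a Treiber-style lock-free stack sketch in Java, where the atomic step is a CAS on the head pointer (in a garbage-collected language the ABA problem mentioned elsewhere on this page is largely defused, since a node's memory cannot be reused while a thread still holds a reference to it):

    import java.util.concurrent.atomic.AtomicReference;

    class LockFreeStack<T> {
        private static final class Node<T> {
            final T value;
            final Node<T> next;
            Node(T value, Node<T> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void push(T value) {
            while (true) {
                Node<T> current = head.get();
                Node<T> node = new Node<>(value, current);
                // The single atomic step: the new node becomes visible to all
                // threads at once, or not at all, in which case we retry.
                if (head.compareAndSet(current, node)) return;
            }
        }

        T pop() {
            while (true) {
                Node<T> current = head.get();
                if (current == null) return null; // stack is empty
                if (head.compareAndSet(current, current.next)) {
                    return current.value;
                }
                // another thread pushed or popped first; retry
            }
        }
    }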
If you are writing your own lock-free data structures for a multi-core CPU, do not forget about memory barriers! Also, consider looking into Software Transactional Memory techniques.
Well, it depends on the kind of structure, but you have to make the structure so that it carefully and silently detects and handles possible conflicts.
I doubt you can make one that is 100% lock-free, but again, it depends on what kind of structure you need to build.
You might also need to shard the structure so that multiple threads work on individual items, and then later on synchronize/recombine.
As mentioned, it really depends on what type of structure you're talking about. For instance, you can write a limited lock-free queue, but not one that allows random access.
Reduce or eliminate shared mutable state.
In Java, utilize the java.util.concurrent packages in JDK 5+ instead of writing your own. As was mentioned above, this is really a field for experts, and unless you have a spare year or two, rolling your own isn't an option.
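For example, ConcurrentLinkedQueue (a lock-free queue based on Michael and Scott's algorithm) can be dropped in directly:

    import java.util.concurrent.ConcurrentLinkedQueue;

    class Example {
        public static void main(String[] args) throws InterruptedException {
            ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();

            Thread producer = new Thread(() -> {
                for (int i = 0; i < 1000; i++) queue.offer(i); // lock-free enqueue
            });
            Thread consumer = new Thread(() -> {
                int seen = 0;
                while (seen < 1000) {
                    if (queue.poll() != null) seen++;          // lock-free dequeue
                }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }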
Can you clarify what you mean by structure?
Right now, I am assuming you mean the overall architecture. You can accomplish it by not sharing memory between processes, and by using an actor model for your processes.
Take a look at my link ConcurrentLinkedHashMap for an example of how to write a lock-free data structure. It is not based on any academic papers and doesn't require years of research as others imply. It simply takes careful engineering.
My implementation does use a ConcurrentHashMap, which is a lock-per-bucket algorithm, but it does not rely on that implementation detail. It could easily be replaced with Cliff Click's lock-free implementation. An idea I borrowed from Cliff, but use much more explicitly, is to model all CAS operations with a state machine. This greatly simplifies the model, as you'll see that I have pseudo-locks via the 'ing states. Another trick is to allow laziness and resolve as needed. You'll see this often with backtracking or letting other threads "help" with cleanup. In my case, I decided to allow dead nodes on the list to be evicted when they reach the head, rather than deal with the complexity of removing them from the middle of the list. I may change that, but I didn't entirely trust my backtracking algorithm and wanted to put off a major change like adopting a 3-node locking approach.
The book "The Art of Multiprocessor Programming" is a great primer. Overall, though, I'd recommend avoiding lock-free designs in the application code. Often times it is simply overkill where other, less error prone, techniques are more suitable.
If you see lock contention, I would first try to use more granular locks on your data structures rather than completely lock-free algorithms.
For example, I currently work on a multithreaded application that has a custom messaging system (a list of queues, one per thread; each queue contains messages for its thread to process) to pass information between threads. There is a global lock on this structure. In my case, I don't need speed so much, so it doesn't really matter. But if this lock became a problem, it could be replaced by individual locks at each queue, for example. Then adding/removing an element to/from a specific queue wouldn't affect the other queues. There would still be a global lock for adding a new queue and such, but it wouldn't be contended as much.
Even a single multi-producer/consumer queue can be written with granular locking on each element, instead of having a global lock. This may also eliminate contention.
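One classic shape for such granular locking is the two-lock queue from Michael and Scott's paper: producers and consumers contend on separate locks, so an enqueue never blocks a dequeue. A minimal Java sketch with hypothetical names:

    import java.util.concurrent.locks.ReentrantLock;

    class TwoLockQueue<T> {
        private static final class Node<T> {
            T value;
            volatile Node<T> next; // written under one lock, read under the other
            Node(T value) { this.value = value; }
        }

        private Node<T> head; // always points at a dummy node
        private Node<T> tail;
        private final ReentrantLock headLock = new ReentrantLock();
        private final ReentrantLock tailLock = new ReentrantLock();

        TwoLockQueue() {
            head = tail = new Node<>(null); // start with a dummy node
        }

        void enqueue(T value) {
            Node<T> node = new Node<>(value);
            tailLock.lock(); // producers contend only here
            try {
                tail.next = node;
                tail = node;
            } finally {
                tailLock.unlock();
            }
        }

        T dequeue() {
            headLock.lock(); // consumers contend only here
            try {
                Node<T> first = head.next;
                if (first == null) return null; // queue is empty
                T value = first.value;
                first.value = null; // first becomes the new dummy
                head = first;
                return value;
            } finally {
                headLock.unlock();
            }
        }
    }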
If you read several implementations and papers regarding the subject, you'll notice there is the following common theme:
1) Shared state objects are Lisp/Clojure-style immutable: that is, all write operations are implemented by copying the existing state into a new object, making modifications to the new object, and then attempting to update the shared state (obtained from an aligned pointer that can be updated with the CAS primitive). In other words, you NEVER EVER modify an existing object that might be read by more than the current thread. Immutability can be optimized using copy-on-write semantics for big, complex objects, but that's another tree of nuts. (A sketch of this copy-then-CAS pattern follows this list.)
2) You clearly specify which transitions between the current and the next state are valid: validating that the algorithm is correct then becomes orders of magnitude easier.
3) Handle discarded references in per-thread hazard-pointer lists. Once the referenced objects are safe to reclaim, reuse them if possible.
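Theme (1), copy-then-CAS on an immutable state object, looks roughly like this in Java (hypothetical state class; the per-thread hazard-pointer lists of theme 3 are unnecessary here because the garbage collector handles reclamation):

    import java.util.concurrent.atomic.AtomicReference;

    final class Counters {
        final int reads, writes; // immutable state object
        Counters(int reads, int writes) { this.reads = reads; this.writes = writes; }
    }

    class SharedState {
        private final AtomicReference<Counters> state =
                new AtomicReference<>(new Counters(0, 0));

        void recordRead() {
            while (true) {
                Counters current = state.get();
                // never modify `current`: build a fresh object instead
                Counters next = new Counters(current.reads + 1, current.writes);
                if (state.compareAndSet(current, next)) return; // publish atomically
            }
        }
    }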
See another related post of mine where some code implemented with semaphores and mutexes is (partially) reimplemented in a lock-free style:
Mutual exclusion and semaphores
