Why can a lock become a bottleneck in a multithreaded program? - multithreading

Why can a lock become a bottleneck in a multithreaded program?
If multiple threads will frequently push() and pop() my queue, which lock should I use?

The lock you use depends on your platform but will generally be some flavour of mutex. On Windows, you would use a critical section, and in .NET you'd use a monitor. I'm not very familiar with locking mechanisms on other platforms. I'd stay away from lock-free approaches: they are very difficult to program correctly, and the performance gains are often not as great as you would expect.
Locks become a bottleneck in your program when they are under heavy contention. That is, a very large number of threads all try to acquire the lock at the same time. This wastes a lot of CPU cycles as threads become blocked and the OS spends a greater and greater portion of its time switching between threads. This sort of problem most frequently manifests itself in the server world. For desktop applications, it's rare that locks will cause a performance issue.

"Why lock may become a bottleneck of multithreaded program?" - think of a turnstile (also called a baffle gate), which only lets one person through at a time, with a crowd of people waiting to go through it.
For a queue, use the simplest lock your environment has to offer.
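For example, here is a minimal C++ sketch of a queue guarded by a single mutex (assuming std::mutex is the "simplest lock" available; the class and method names are illustrative, not a standard API):

    #include <mutex>
    #include <optional>
    #include <queue>
    #include <utility>

    // Sketch: every push/pop takes the same mutex, so the queue is safe to
    // share between threads, at the cost of serializing all access to it.
    template <typename T>
    class LockedQueue {
        std::queue<T> q;
        std::mutex m;
    public:
        void push(T v) {
            std::lock_guard<std::mutex> guard(m);  // held only for the push
            q.push(std::move(v));
        }
        std::optional<T> pop() {
            std::lock_guard<std::mutex> guard(m);
            if (q.empty()) return std::nullopt;
            T v = std::move(q.front());
            q.pop();
            return v;
        }
    };

Keeping the critical sections this short is also the best defence against the contention problem described above.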

For a queue, it is easy to write a lock-free implementation (google away).
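The single-producer/single-consumer case is the most approachable. As an illustration (a hedged sketch, not a production implementation; the general multi-producer/multi-consumer case is considerably harder), here is a lock-free SPSC ring buffer in C++:

    #include <atomic>
    #include <cstddef>

    // Sketch: one producer thread calls push(), one consumer thread calls
    // pop(); the atomics make the handoff safe without any lock. Capacity
    // is N-1 because one slot is sacrificed to distinguish full from empty.
    template <typename T, std::size_t N>
    class SpscQueue {
        T buf[N];
        std::atomic<std::size_t> head{0};  // advanced by the consumer
        std::atomic<std::size_t> tail{0};  // advanced by the producer
    public:
        bool push(const T& v) {            // producer thread only
            std::size_t t = tail.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire))
                return false;              // full
            buf[t] = v;
            tail.store(next, std::memory_order_release);  // publish the item
            return true;
        }
        bool pop(T& v) {                   // consumer thread only
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;              // empty
            v = buf[h];
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };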
Locks are bottlenecks because they force all other threads that encounter them to stop doing what they're doing and wait for the lock to be released, thus wasting time. One of the ideas behind multithreading is to use as many processors as possible at any given time. By forcing threads to wait on locks, the application essentially gives up processing power it might otherwise have used.

"Why lock may become a bottleneck of multithreaded program?"
Because waiting threads remain blocked until shared memory is unlocked.
Suggest you read this article on "Concurrency: What Every Dev Must Know About Multithreaded Apps" http://msdn.microsoft.com/en-au/magazine/cc163744.aspx

Locks are expensive both because they require operating system calls in the middle of your algorithm and because they are hard to implement efficiently at the CPU level.
As a programmer, it is best to leave the locks inside your data structures to the experts and instead use a good multithreaded library such as Intel's TBB.
For queues, you would want to use atomic instructions (hard) or a spinlock (easier) if possible, because they are cheap compared to a mutex. Use a mutex if you are doing a lot of work that needs to be locked, e.g. modifying a complex tree structure.
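As a rough illustration of the spinlock option, here is a minimal C++ sketch built on std::atomic_flag (a sketch of the idea, not a tuned implementation):

    #include <atomic>

    // A spinlock busy-waits instead of making an operating system call,
    // which is cheap as long as the lock is only ever held briefly.
    class Spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            while (flag.test_and_set(std::memory_order_acquire)) {
                // spin until the holder calls unlock()
            }
        }
        void unlock() { flag.clear(std::memory_order_release); }
    };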

In the threading packages that I'm familiar with, your options for mutexes are recursive and non-recursive. You should opt for non-recursive -- all of your accesses will be lock(); queue_op(); unlock(), so there's no need to be able to acquire the lock twice.

Related

What happens when multiple threads try to access a critical section exactly at the same time?

I've been trying to find an answer to that, and all I could find is that once a thread reaches a critical section, it locks it ahead of the other threads (or some other locking mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.
Thanks.
I think you are missing the point of the fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well be simultaneous. The operating system guarantees that no two threads will enter the critical section. Even on multicore machines, this part is specially implemented (with lots of trickery, even) to provide that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have arrived within the same microsecond, BUT if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally, the program should not malfunction. But any code can have bugs - your code, and also the operating system's semaphore code. So it is safe to assume that in some edge cases the program will indeed malfunction. But this assumption is true for any code in general.
Locking and critical sections are rather tricky to implement correctly, so for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose primitives like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations, which provide somewhat softer guarantees but higher performance. As I said, when working with critical sections, it is critical to choose the correct primitive and also to use it correctly.
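To make the "timing does not matter" point concrete, here is a minimal C++ sketch (using std::mutex in place of a raw semaphore): however the two threads' arrivals at the critical section interleave, only one enters at a time, so the result is deterministic.

    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;
    long counter = 0;

    void work() {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> guard(m);  // a competing thread waits here
            ++counter;                             // the critical section
        }
    }

    int main() {
        std::thread a(work), b(work);
        a.join();
        b.join();
        std::cout << counter << '\n';  // always 2000000, never less
    }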
...But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Short answer: memory-system hardware makes it impossible for two different processors to access the same memory location at the same time. I'm not a computer architect, so I can't explain exactly how it works, but the memory system serializes all accesses to shared main memory by the various CPUs in a multi-CPU system.
"Entering a critical section" means locking a mutex, and a mutex is basically just a flag in shared memory that is accessed via a specific protocol.
It is the task of the cache coherence protocol to make sure there are no two writes to the same chunk of memory (cache line) at the same time. With MESI there can be multiple readers of the same cache line, but only one writer.
So if two threads want to write to the same cache line at the same time, their requests are serialized by the cache coherence protocol.
Most CPU architectures support atomic operations like CAS (compare-and-swap). On x86 this can be done using the lock prefix: the CPU locks the cache line when it starts the CAS instruction and does not respond to cache coherence requests from other cores until it has finished the atomic operation.
So if two CPUs both want to do a CAS on the same location, those operations are serialized by the underlying hardware.
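For illustration, this is what a CAS retry loop looks like in C++; the hardware serialization described above is what makes each compare_exchange atomic (a hedged sketch, with illustrative names):

    #include <atomic>

    std::atomic<int> value{0};

    void add(int delta) {
        int expected = value.load(std::memory_order_relaxed);
        // On failure, 'expected' is refreshed with the current value, so the
        // loop retries with up-to-date data - exactly the serialization the
        // cache coherence protocol enforces.
        while (!value.compare_exchange_weak(expected, expected + delta,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
        }
    }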

In concurrent programming is it possible that, by using locks, a program might sometimes use more processors than are necessary?

This is an exam question (practice exam, not the real one). It's about concurrent programming using a multi-core processor and the problems with using locks.
"In concurrent programming is it possible that, by using locks, a program might sometimes use more processors than are necessary?"
In other words, is this ever possible? It's a true/false question. I can't find an answer anywhere and I'm revising for my exams.
A concurrent program with N threads of execution using locks can, at any point in time, have M = 0 .. N-1 threads waiting for locks; the program can therefore be utilizing only N-M processors, since waiting for a lock does not require a processor.
Thus, no, using locks does not increase the number of processors required by a concurrent program.
With an efficient implementation of multi-threading and locks, if a thread blocks waiting for a lock for any significant time, the scheduler / lock implementation will reassign the core to do something else.
But since the exam question is asking if it is ever possible to use more processors than are strictly necessary, the answer is that it depends on the implementation of threads / locks / scheduling. For instance, there is a kind of lock called a spinlock where the lock implementation does NOT surrender control of the processor while waiting to acquire a lock. Instead, it polls the lock in a tight loop trying to acquire it.
Why would you do that? Well, if the lock is likely to become available in a short enough period of time, then the CPU time wasted "spinning" on the lock is less than what would be spent performing a full context switch.
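A hedged sketch of that trade-off in C++: spin for a bounded number of iterations (cheap if the lock frees quickly), then fall back to yielding the processor (the iteration count is an arbitrary illustrative value):

    #include <atomic>
    #include <thread>

    class HybridLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            for (int i = 0; i < 4096; ++i)               // bounded spin phase
                if (!flag.test_and_set(std::memory_order_acquire)) return;
            while (flag.test_and_set(std::memory_order_acquire))
                std::this_thread::yield();               // surrender the core
        }
        void unlock() { flag.clear(std::memory_order_release); }
    };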
So I don't think your exam question has a simple yes / no answer.

What are the advantages of lock-free programming over spin lock?

I am wondering what the advantages of lock-free programming over spin locks are. I think that when we do lock-free programming using a CAS mechanism in a thread (call it A), if another thread changes the value during the CAS, thread A still needs to loop again. And I think that is just like using a spin lock!
I am so confused about this. Although I know that CAS and spin locks are suitable when lock contention is not fierce, can someone explain in which scenarios lock-free should be used and in which a spin lock should be used?
Lock-freedom provides what is called a progress guarantee. You are right that in your example thread A has to perform a retry (i.e., loop again), but only if some other thread changed the value, which implies that that thread was able to make progress.
In contrast, a thread (let's call it X) that holds a spin-lock prevents all other threads from making progress until the lock is released. So if thread X is preempted, execution of all threads waiting for the lock is effectively stalled until X can resume execution and finally release the lock. If X were to be stalled indefinitely, then all other threads would also be blocked indefinitely.
Such a situation is not possible with lock-free algorithms, since it is guaranteed that at any time at least one thread can make progress.
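To illustrate the progress guarantee, here is a hedged sketch of the push half of a classic lock-free (Treiber) stack in C++; a failed CAS means some other thread succeeded, i.e. the system as a whole made progress:

    #include <atomic>
    #include <utility>

    template <typename T>
    class LockFreeStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), head.load(std::memory_order_relaxed)};
            // On failure, n->next is refreshed with the current head; retry.
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed)) {
            }
        }
        // pop() is deliberately omitted: safe memory reclamation (ABA,
        // hazard pointers, etc.) is exactly the hard part alluded to below.
    };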
Which should be used depends on the situation. Lock-free algorithms are inherently difficult to design, especially for more complex data structures like trees. And even if you have a lock-free algorithm, it is almost always slower than a serial one, so a serial version protected by a lock might perform better. Then again, if the data structure is heavily contended, a lock-free version will scale better than one protected by a lock. However, if your workload is mostly read-only, a read-write-lock will also provide good scalability. Unfortunately, there is no general rule here...
If you want to learn more about lock-freedom (and more) I recommend the book The Art of Multiprocessor Programming.
If you prefer free alternatives, I recommend Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul McKenney or Practical lock-freedom by Keir Fraser.

Lightweight Threads in Operating Systems

It is said that one of the main benefits of Node (and presumably Twisted et al.) over more conventional threaded servers is the very high concurrency enabled by the event loop model. The biggest reason for this is that each thread has a high memory footprint, and context switching is comparatively expensive. When you have thousands of threads, the server spends most of its time switching from thread to thread.
My question is: why don't operating systems or the underlying hardware support much more lightweight threads? If they did, could you solve the C10K problem with plain threads? If they can't, why is that?
Modern operating systems can support the execution of a very large number of threads.
More generally, hardware keeps getting faster (and recently, it has been getting faster in a way that is much friendlier to multithreading and multiprocessing than to single-threaded event loops - ie, increased number of cores, rather than increased processing throughput capabilities in a single core). If you can't afford the overhead of a thread today, you can probably afford it tomorrow.
What the cooperative multitasking systems of Twisted (and presumably Node.js et al.) offer over pre-emptive multithreading (at least in the form of pthreads) is ease of programming.
Correctly using multithreading involves being much more careful than correctly using a single thread. An event loop is just the means of getting multiple things done without going beyond your single thread.
Considering the proliferation of parallel hardware, it would be ideal for multithreading or multiprocessing to get easier to do (and easier to do correctly). Actors, message passing, maybe even Petri nets are some of the solutions people have attempted for this problem. They are still very marginal compared to the mainstream multithreading approach (pthreads). Another approach is SEDA, which uses multiple threads to run multiple event loops. This also hasn't caught on.
So, the people using event loops have probably decided that programmer time is worth more than CPU time, and the people using pthreads have probably decided the opposite, and the people exploring actors and such would like to value both kinds of time more highly (clearly insane, which is probably why no one listens to them).
The issue isn't really how heavyweight the threads are. To write correct multithreaded code you need locks on shared items, and locks prevent scaling with the number of threads: threads end up waiting for each other to acquire locks, and you rapidly reach the point where adding threads has no effect, or even slows the system down through extra lock contention.
In many cases you can avoid locking, but it's very difficult to get right, and sometimes you simply need a lock.
So if you are limited to a small number of threads, you might well find that removing the overhead of having to lock resources at all, or even think about locking, makes a single-threaded program faster than a multithreaded program, no matter how many threads you add.
Basically locks can (depending on your program) be really expensive and can stop your program scaling beyond a few threads. And you almost always need to lock something.
It's not the overhead of a thread that's the problem, it's the synchronization between the threads. Even if you could switch between threads instantly and had infinite memory, none of that helps if each thread just ends up waiting in a queue for its turn at some shared resource.

Linux thread synchronization

I am new to Linux and Linux threads. I have spent some time googling to try to understand the differences between all the functions available for thread synchronization. I still have some questions.
I have found all of these different types of synchronizations, each with a number of functions for locking, unlocking, testing the lock, etc.
gcc atomic operations
futexes
mutexes
spinlocks
seqlocks
rculocks
conditions
semaphores
My current (but probably flawed) understanding is this:
Semaphores are process-wide, involve the filesystem (virtually, I assume), and are probably the slowest.
Futexes might be the base locking mechanism used by mutexes, spinlocks, seqlocks, and rculocks. Futexes might be faster than the locking mechanisms that are based on them.
Spinlocks don't block and thus avoid context switches. However, they avoid the context switch at the expense of consuming all the cycles on a CPU until the lock is released (spinning). They should only be used on multi-processor systems, for obvious reasons. Never sleep while holding a spinlock.
The seqlock just tells you, when you have finished your work, whether a writer changed the data the work was based on. If so, you have to go back and repeat the work.
Atomic operations are the fastest synchronization calls and are probably used in all of the above locking mechanisms. You do not want to use atomic operations on all the fields in your shared data. You want to use a lock (mutex, futex, spin, seq, rcu) or a single atomic operation on a lock flag when you are accessing multiple data fields.
My questions go like this:
Am I right so far with my assumptions?
Does anyone know the CPU-cycle cost of the various options? I am adding parallelism to the app so we can get better wall-time response at the expense of running fewer app instances per box. Performance is the utmost consideration. I don't want to consume CPU with context switching, spinning, or lots of extra CPU cycles to read and write shared memory. I am absolutely concerned with the number of CPU cycles consumed.
Which (if any) of the locks prevent interruption of a thread by the scheduler or by interrupts... or am I just an idiot and all synchronization mechanisms do this? What kinds of interruption are prevented? Can I block all threads, or just threads on the locking thread's CPU? This question stems from my fear of interrupting a thread holding a lock for a very commonly used function. I expect that the scheduler might schedule any number of other workers who will likely run into this function and then block because it was locked. A lot of context switching would be wasted until the thread with the lock gets rescheduled and finishes. I can rewrite this function to minimize lock time, but still, it is so commonly called that I would like to use a lock that prevents interruption... across all processors.
I am writing user code, so I get software interrupts, not hardware ones... right? I should stay away from any functions (spin/seq locks) that have the word "irq" in them.
Which locks are for writing kernel or driver code and which are meant for user mode?
Does anyone think using an atomic operation to have multiple threads move through a linked list is nuts? I am thinking of atomically changing the current-item pointer to the next item in the list. If the attempt works, then the thread can safely use the data the current item pointed to before it was moved. Other threads would now be moved along the list.
Futexes? Any reason to use them instead of mutexes?
Is there a better way than using a condition to sleep a thread when there is no work?
When using gcc atomic ops, specifically test_and_set, can I get a performance increase by doing a non-atomic test first and then using test_and_set to confirm? I know this will be case-specific, so here is the case. There is a large collection of work items, say thousands. Each work item has a flag that is initialized to 0. When a thread has exclusive access to the work item, the flag will be one. There will be lots of worker threads. Any time a thread is looking for work, it can non-atomically test for 1. If it reads a 1, we know for certain that the work is unavailable. If it reads a zero, it needs to perform the atomic test_and_set to confirm. So if the atomic test_and_set is 500 CPU cycles, because it is disabling pipelining, causing CPUs to communicate, and forcing L2 caches to flush/fill... and a simple test is 1 cycle... then as long as I had a better ratio than 500 to 1 when it came to stumbling upon already-completed work items, this would be a win. (A sketch of this pattern appears after this question.)
I hope to use mutexes or spinlocks sparingly to protect sections of code that I want only one thread on the SYSTEM (not just the CPU) to access at a time. I hope to use gcc atomic ops sparingly to select work and minimize use of mutexes and spinlocks. For instance: a flag in a work item can be checked to see whether a thread has worked it (0 = no, 1 = yes or in progress). A simple test_and_set tells the thread whether it has the work or needs to move on. I hope to use conditions to wake up threads when there is work.
Thanks!
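Regarding the double-checked test_and_set idea in question 9: that is the classic test-and-test-and-set (TTAS) pattern. A hedged sketch using the legacy GCC __sync builtins the question refers to (the struct and function names are illustrative):

    // A cheap plain read filters out items that are clearly taken; only a
    // read of 0 pays for the expensive atomic read-modify-write.
    struct WorkItem {
        volatile int claimed;  // 0 = available, 1 = taken or in progress
        /* ...payload... */
    };

    bool try_claim(WorkItem* w) {
        if (w->claimed != 0)          // intentionally non-atomic, ~free
            return false;             // definitely unavailable, move on
        // The plain read said "free": confirm with the atomic test-and-set.
        return __sync_lock_test_and_set(&w->claimed, 1) == 0;
    }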
Application code should probably use POSIX thread functions. I assume you have man pages, so type
man pthread_mutex_init
man pthread_rwlock_init
man pthread_spin_init
Read up on them and the functions that operate on them to figure out what you need.
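For illustration, here is a minimal sketch of the pthread mutex API from those man pages (compile with -pthread):

    #include <pthread.h>
    #include <cstdio>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    long shared_counter = 0;

    void* worker(void*) {
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&lock);    // blocks if another thread holds it
            ++shared_counter;             // critical section
            pthread_mutex_unlock(&lock);
        }
        return nullptr;
    }

    int main() {
        pthread_t a, b;
        pthread_create(&a, nullptr, worker, nullptr);
        pthread_create(&b, nullptr, worker, nullptr);
        pthread_join(a, nullptr);
        pthread_join(b, nullptr);
        std::printf("%ld\n", shared_counter);  // always 200000
    }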
If you're doing kernel mode programming then it's a different story. You'll need to have a feel for what you are doing, how long it takes, and what context it gets called in to have any idea what you need to use.
Thanks to all who answered. We resorted to using gcc atomic operations to synchronize all of our threads. The atomic ops were about 2x slower than setting a value without synchronization, but magnitudes faster than locking a mutex, changing the value, and then unlocking the mutex (this becomes super slow once threads start banging into the locks...). We only use pthread_create, attr, cancel, and kill. We use pthread_kill to signal sleeping threads to wake up. This method is 40x faster than cond_wait. So basically... use pthread mutexes if you have time to waste.
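For context, the "gcc atomic operations" referred to here are compiler builtins. A hedged sketch of the comparison the answer describes (the speed ratios are the answer's own measurements, not guarantees):

    // Legacy GCC __sync builtins; modern code would use __atomic_* or std::atomic.
    long counter = 0;

    void plain_increment()  { ++counter; }                          // fastest, but racy
    void atomic_increment() { __sync_fetch_and_add(&counter, 1); }  // safe, ~2x slower per the answer
    // versus: pthread_mutex_lock(&m); ++counter; pthread_mutex_unlock(&m);
    // - safe, but by far the slowest under contention.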
In addition, you should check out these books:
Pthreads Programming: A POSIX
Standard for Better Multiprocessing
and
Programming with POSIX(R) Threads
Regarding question #8:
Is there a better way than using a condition to sleep a thread when there is no work?
Yes, I think the best approach, instead of using sleep, is to use functions like sem_post() and sem_wait() from "semaphore.h".
Regards
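A minimal sketch of that suggestion, using POSIX semaphores from semaphore.h (the function names other than sem_* are illustrative):

    #include <semaphore.h>

    // Workers block on the semaphore instead of sleeping and polling.
    sem_t work_ready;

    void init()        { sem_init(&work_ready, 0, 0); }  // thread-shared, count 0
    void submit_work() { /* enqueue an item, then: */ sem_post(&work_ready); }

    void worker_loop() {
        for (;;) {
            sem_wait(&work_ready);  // sleeps until submit_work() posts
            /* dequeue and process one item */
        }
    }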
A note on futexes - they are more descriptively called fast userspace mutexes. With a futex, the kernel is involved only when arbitration is required, which is what provides the speedup and savings.
Implementing a futex can be extremely tricky, and debugging one can lead to madness. Unless you really, really, really need the speed, it's usually best to use the pthread mutex implementation.
Synchronization is never exactly easy, but trying to implement your own in userspace makes it inordinately difficult.
