Why is the throughput of the MCS lock poor when the number of threads is greater than the number of logical CPUs - multithreading

Why is the throughput of the MCS lock poor when the number of threads is greater than the number of logical CPUs?
Could it be because of increased contention for places on the CPUs, which leads to a lot of threads being pre-empted?

I am not 100% sure on this, but Microsoft's documentation gives this definition of the Sleep() function:
After the sleep interval has passed, the thread is ready to run. If you specify 0 milliseconds, the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately. Consequently, the thread may not run until some time after the sleep interval elapses.
In my experience, if I use an MCS lock to, let's say, update a data structure, and I run it on 16 threads, the drop-off from 8 to 16 threads (excluding the massive drop-off from 1 to 2 threads, and assuming you are just doubling the number of threads each time) is quite large. Throughput drops to about a third after one thread and then slowly decreases as the number of threads approaches the number of CPUs. Obviously, if you are using a lock, the more threads that are trying to acquire it, the more cache coherency work there is for the CPUs to do.
If you use any atomic instructions (again, assuming you are), the more threads you add, the slower this will become.
"I don't think the problem is that atomic operations will take longer themselves; the real problem might be that an atomic operation might block bus operations on other processors (even if they perform non-atomic operations)."
This was taken from another Stack Overflow member's answer to a similar issue. Couple that with the fact that a thread may or may not sleep (even with the use of Sleep()) and may or may not wake immediately, and this could cause a serious loss in throughput. You also have the increased bus traffic to deal with...
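For reference, a minimal sketch of the classic Mellor-Crummey/Scott (MCS) queue lock (my own illustration with std::atomic, not the questioner's code). Each waiter spins only on its own node, so the uncontended cost is low; but the lock is handed to the next node in FIFO order, so if that next thread happens to be preempted (which is frequent once threads outnumber logical CPUs), every thread behind it is stuck too:

#include <atomic>

struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool>     locked{false};
};

class McsLock {
    std::atomic<McsNode*> tail{nullptr};
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        McsNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev) {                                   // queue was not empty
            prev->next.store(&me, std::memory_order_release);
            while (me.locked.load(std::memory_order_acquire))
                ;                                     // spin on our own cache line
        }
    }
    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (!succ) {
            McsNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;                               // nobody is waiting
            while (!(succ = me.next.load(std::memory_order_acquire)))
                ;                                     // successor is still enqueueing
        }
        succ->locked.store(false, std::memory_order_release);  // hand the lock over
    }
};

If the thread that succ points to has been descheduled, the store to succ->locked does not help it: the lock stays logically held until the scheduler runs that thread again, which is one plausible explanation for the throughput collapse once threads outnumber logical CPUs.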

Related

Does having more threads lead to smaller time slices?

I'm trying to understand why having too many threads can reduce CPU usage due to the increased overhead of context switching. An explanation that sounded plausible to me is that increasing # of threads also increases the frequency of context switches, meaning we end up spending more time context switching and less time doing useful work. Is this correct? Do individual time slices get compressed (with more context switches in between) as we have more threads to schedule?
Generally no. The primary mechanism for lower overhead is that if the scheduler picks the same thread to run on a core for two timeslices in a row, there is no context-switch overhead: no stale caches and no FP save/restore.
A "tickless" kernel might set a timer farther in the future if there aren't any other tasks to schedule, instead of the traditional design of having a timer interrupt every 1 or 10 milliseconds where it always calls a scheduler function. (And if there aren't any waiting tasks, it can trivially decide to keep running this one.)

Is the improvement in execution time of a multi-threaded application limited by the number of physical cores?

I was doing some testing with multi-threading on a Linux virtual machine, and I implemented a benchmark with 10 threads (in this application each instruction is executed 10x as many times as in the single-thread scenario) while tweaking the number of "physical cores" in the VM settings. In the single-thread case I obtain 3s on average, independently of the number of physical cores. If the number of cores is set to 1 and I run the multi-thread version, the execution time is 30s; with 2 cores I obtain 15s, and with 8 cores (the maximum number I can set) I obtain 6s. Do I obtain this dependency because I am executing each instruction 10x as many times, or is it always like this?
If you have N threads running on N cores, and if they are all doing pure computation (i.e., not waiting for any I/O devices), and if they are all completely independent of each other, then they should be able to do N times as much work in a given amount of time as a single thread can do in the same amount of time.
But, that's if they are completely independent. That's a hard thing to achieve. For example, if the threads can't each do all of their work in their own, independent cache (e.g., in L1 cache), then they will compete with each other for access to the main memory. They will sometimes have to wait for one another, because only one core can access main memory at any given moment. So, if the threads need to use memory, then the speedup will be somewhat less than N times.
If the threads need to share data in main memory, then it gets worse because then they will need to use mutual exclusion locks. One thread may keep a lock locked while it executes dozens of instructions, and any other thread that wants the same lock will have to wait until it is finished.
If the threads need to synchronize with each other/communicate with each other, then it gets worse still because unless their work loads are carefully balanced, a thread with less work to do may spend long periods of time awaiting signals from threads that have more work to do.
It's not unusual for a novice programmer to invent a multi-threaded version of some single-threaded algorithm, and find out that the multi-threaded version actually is slower than the single-threaded version.
There are some algorithms for which even an expert programmer can't get much speedup by throwing more threads at them.
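As a rough illustration of the memory and locking points above (a sketch with a trivial summing workload, not a rigorous benchmark): the first version gives each thread its own accumulator and scales close to N; the second funnels every update through one mutex-protected variable and mostly measures contention.

#include <mutex>
#include <thread>
#include <vector>

static void sum_independent(int n_threads, long iters, std::vector<double>& out)
{
    out.assign(n_threads, 0.0);
    std::vector<std::thread> ts;
    for (int i = 0; i < n_threads; ++i)
        ts.emplace_back([&, i] {
            double local = 0.0;                        // each thread owns its accumulator
            for (long j = 0; j < iters; ++j)
                local += j * 1e-9;
            out[i] = local;                            // one shared write per thread, at the end
        });
    for (auto& t : ts) t.join();
}

static void sum_shared(int n_threads, long iters, double& total)
{
    std::mutex m;
    std::vector<std::thread> ts;
    for (int i = 0; i < n_threads; ++i)
        ts.emplace_back([&] {
            for (long j = 0; j < iters; ++j) {
                std::lock_guard<std::mutex> g(m);      // every iteration serializes on one lock
                total += j * 1e-9;                     // and bounces one cache line between cores
            }
        });
    for (auto& t : ts) t.join();
}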

Context switch: what happens in a worst case scenario?

I want to understand how a certain worst case scenario of context switch happens. Say I have 10 CPU cores running a single process. Everything is CPU intensive, no thread is sleeping (waiting for I/O).
(I am mainly concerned with mainstream modern personal computer architectures and systems, typically x64 with Windows, Linux...)
Correct me if I'm wrong: running 10 CPU/RAM intensive independent threads is most often a near optimal situation. The amount of time spent in context switches is rather negligible. While the system may sometimes decide to reassign threads to different cores in a round-robin fashion, causing the caches to be reset, it has a minor effect and works almost as if each thread was running on a single fixed core.
Only the main RAM bus may be a limitation since all threads share it, but it's not the point I'm interested in here. Reducing the number of threads will not increase the throughput anyway.
Now assume you still have 10 cores but run 1000 threads. The scheduler could theoretically decide to switch rarely (say every second) running 10 threads for a second, then 10 others... and the whole thing would still be close to optimal performance (throughput).
But it does not seem to be the case and it looks like threads are switched intensively causing a strongly suboptimal performance (throughput). Am I right about it? What is the main cause for this suboptimal performance? A few numbers would be nice if you have any idea of orders of magnitude of (for example): switches per second, performance loss caused by switching...
I'm going to answer my own question (after some search).
On windows, the number of context switches can be measured with performance counters: https://technet.microsoft.com/en-us/library/cc938606.aspx
I measured it on my machine (Core i7/Windows 10) and the order of magnitude is around 1000/s per core when the number of running threads is more than the number of cores (and these threads are pure CPU).
The time needed for a context switch varies quite a bit depending on:
what registers need to be saved
if FPU registers need to be saved
the processor model (of course)
You can read: https://www.quora.com/How-long-does-a-context-switch-take or http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
A slightly pessimistic avg. order of magnitude seems to be 1000 ns. Thus the total time for all context switches on each core is 1ms per second, that is 0.1%.
This does not depend on the number of threads: whether you run 100 or 1000 threads, the number of switches does not change. As a conclusion, the time spent in context switching is essentially negligible.
This reasoning is correct as long as the threads are pure CPU with only small memory read/write like a few local variables. I ran a test with full CPU threads and the difference between a few and 1000 threads is not noticeable.
But the situation changes when RAM is involved and switches make the CPU (memory) caches less efficient. A worst case is when:
computation can be split into 1000 independent "data" parts
each part of the data fits just into the memory cache (say L1 or L2) of a core
each part needs to be read many times
In this situation, running 10 threads to completion, then ten others, and so on, would take full advantage of the cache, while running 1000 threads at a time would cause the cache to be useful only during each ~1 ms slice.
But if the data of several threads could fit into the cache, or if the threads read common data to some degree, or if each thread reads the data just once, then it is possible that running 1000 threads vs. running 10 threads a hundred times will have similar throughput.
It is more a matter of adapting parallelism to memory access, and it depends very much on the way memory needs to be accessed.
The time spent in context switching is negligible; the time lost because of poor cache usage may sometimes be a problem and sometimes not, depending on how memory is accessed and shared.
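If you want to make the same kind of measurement on Linux rather than through Windows performance counters, getrusage() exposes the process's voluntary and involuntary context-switch counts (a sketch, assuming Linux/glibc; sample it before and after a run and divide by the elapsed time to get switches per second):

#include <sys/resource.h>
#include <cstdio>

void print_context_switches()
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        std::printf("voluntary: %ld, involuntary: %ld\n",
                    ru.ru_nvcsw,      // the thread blocked (I/O, lock, sleep)
                    ru.ru_nivcsw);    // the scheduler preempted it
}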

How efficient is locking an unlocked mutex? What is the cost of a mutex?

In a low-level language (C, C++ or whatever): I have the choice between either having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables in my code as I have int variables and it doesn't really matter?
(I am not sure how much differences there are between different hardware. If there is, I would also like to know about them. But mostly, I am interested about common hardware.)
The point is, by using many mutexes which each cover only a part of the object instead of a single mutex for the whole object, I could avoid a lot of blocking. And I am wondering how far I should go with this. I.e. should I try to avoid every possible block as far as possible, no matter how much more complicated and how many more mutexes this means?
WebKit's blog post about locking (2016) is very related to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.
I have the choice in between either having a bunch of mutexes or a single one for an object.
If you have many threads and the access to the object happens often, then multiple locks would increase parallelism. At the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least of the overhead of a mutex; the memory/cache coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.
Changing the flag is just a few instructions and is normally done without a system call. If the mutex is already locked, a syscall happens to add the calling thread to the wait queue and start the waiting. Unlocking, if the wait queue is empty, is cheap, but otherwise it needs a syscall to wake up one of the waiting processes. (On some systems cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
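A minimal sketch of that two-part structure, assuming a Linux futex underneath (this follows the well-known scheme from Ulrich Drepper's "Futexes Are Tricky" paper; real libc mutexes are considerably more involved):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class FutexMutex {
    // 0 = unlocked, 1 = locked with no waiters, 2 = locked and waiters possible
    std::atomic<int> state{0};

    long futex(int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(&state), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        // Fast path: an uncontended lock is a single atomic CAS, no syscall.
        if (state.compare_exchange_strong(c, 1))
            return;
        // Slow path: advertise that waiters exist, then sleep in the kernel.
        if (c != 2)
            c = state.exchange(2);
        while (c != 0) {
            futex(FUTEX_WAIT, 2);       // sleep until the value is no longer 2
            c = state.exchange(2);
        }
    }

    void unlock() {
        // Fast path: if nobody was waiting (state was 1), just clear the flag.
        if (state.exchange(0) == 2)
            futex(FUTEX_WAKE, 1);       // wake exactly one waiter
    }
};

The point to notice is that the uncontended lock and unlock are one atomic operation each in user space; the futex() syscall only appears once a second thread is actually contending.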
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spell a nightmare for debugging. Simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism. (*)
2. Fewer locks means fewer problems debugging multi-threading issues.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with having fewer, more frequently locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of the other CPUs to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by the CPUs transparently, but they do introduce so-called stalls (search for "stall").
And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some do not.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz), locking and unlocking an unlocked mutex that is in its own cache line and is already cached takes 47 clocks (13 ns). Due to synchronization between two cores (I used CPU #0 and #1), I could only call a lock/unlock pair once every 102 ns on two threads, so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state (not reproduced here) shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
  uint64_t start;
  uint64_t end;
  int __d0;
  // Read the TSC immediately before and after the lock/unlock pair.
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
  mutex.lock();
  mutex.unlock();
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
  // Delay loop; it is one count shorter for thread 1 than for thread 0.
  asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
  return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Even though the function is slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then following the points at the bottom, towards the left, the loop_count is reduced by one each measurement. When it becomes 77 the function is called every 102 ns in both threads. If subsequently loop_count is reduced even further it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased amount of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them, say, outside a loop would be if that loop loops faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I have become a lot more knowledgeable on the subject now and have started to doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy, because the delays between two other cores are much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in such a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock (which no doubt will be extremely slow). So what I have measured here is the absolute most ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock cannot be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because that thread has to sleep anyway, but it can be a problem with high contention where a thread can't obtain the lock for the time that it normally spins, so it does the system call, but can take the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
This depends on what you actually call a "mutex", the OS mode, etc.
At minimum it's the cost of an interlocked memory operation. That is a relatively heavy operation (compared to other primitive assembler instructions).
However, it can be very much higher. If what you call a "mutex" is a kernel object (i.e., an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transition, which is very heavy.
For example, on an Intel Core Duo processor running Windows XP:
Interlocked operation: about 40 CPU cycles.
Kernel-mode call (i.e. system call): about 2000 CPU cycles.
If this is the case, you may consider using critical sections. They are a hybrid of a kernel mutex and interlocked memory access.
I'm completely new to pthreads and mutexes, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran a simple program with a thread pool in which the task was just to compute a sum into a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y ;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
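A self-contained reconstruction of that experiment might look like the following sketch (my own code, not the original; compile with -pthread and pass the number of threads as an argument):

#include <pthread.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double x = 0.0;                       // the shared sum, protected by 'lock'
static const long kTotal = 10000000;

static void* worker(void* arg)
{
    long iterations = *static_cast<long*>(arg);
    for (long j = 0; j < iterations; ++j) {
        double y = std::exp(-j * 0.0001);    // a little computation outside the lock
        pthread_mutex_lock(&lock);           // every iteration contends for the same mutex
        x += y;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main(int argc, char** argv)
{
    int num_threads = argc > 1 ? std::atoi(argv[1]) : 2;
    long per_thread = kTotal / num_threads;
    std::vector<pthread_t> threads(num_threads);
    for (auto& t : threads) pthread_create(&t, nullptr, worker, &per_thread);
    for (auto& t : threads) pthread_join(t, nullptr);
    std::printf("sum = %f\n", x);
    return 0;
}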
The cost will vary depending on the implementation but you should keep in mind two things:
the cost will most likely be minimal, since it's both a fairly primitive operation and it will be optimised as much as possible due to its usage pattern (it is used a lot).
it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
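For illustration, a minimal test-and-set spinlock along those lines (a sketch using std::atomic_flag; a real mutex would eventually put the waiter to sleep instead of spinning forever):

#include <atomic>

class TasSpinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // test_and_set atomically sets the flag and returns its previous value,
        // so we retry until we are the thread that flipped it from clear to set.
        while (flag.test_and_set(std::memory_order_acquire))
            ;   // spin
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};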
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)
I just measured it on my Windows 10 system.
This is testing single-threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.

Algorithm to optimize # threads used in a calculation

I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
Any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm?
Start with a guess. 1 thread per CPU core seems good, but it depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the one before it, then add a thread for the new generation, but if it had fewer, then use one fewer, obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
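For illustration, a minimal sketch of that hill-climbing loop (the run_generation callback is a hypothetical stand-in for one generation of CalculateSomeData; the tuner just times it and nudges the thread count up or down):

#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

int tune_threads(int generations,
                 const std::function<void(int)>& run_generation,
                 int threads = 8, int min_threads = 1, int max_threads = 16)
{
    int direction = +1;                                              // which way we moved last time
    double prev_seconds = std::numeric_limits<double>::infinity();   // "generation 0" baseline

    for (int g = 0; g < generations; ++g) {
        auto t0 = std::chrono::steady_clock::now();
        run_generation(threads);
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - t0;

        if (elapsed.count() > prev_seconds)     // this generation was slower: reverse course
            direction = -direction;
        prev_seconds = elapsed.count();
        threads = std::clamp(threads + direction, min_threads, max_threads);
    }
    return threads;    // typically ends up oscillating around the optimum
}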
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
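A minimal sketch of the "many small tasks, one thread per core" pattern (here the queue is just an atomic index into a vector of tasks, which is enough when all the tasks are known up front):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void run_tasks(const std::vector<std::function<void()>>& tasks)
{
    std::atomic<std::size_t> next{0};
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i)
        workers.emplace_back([&] {
            std::size_t j;
            while ((j = next.fetch_add(1)) < tasks.size())
                tasks[j]();          // run one task to completion, then claim the next
        });
    for (auto& w : workers) w.join();
}

Because there are many more tasks than threads, a slow task near the end only idles one core rather than leaving the whole generation waiting on it.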

Resources