Cost of Synchronizing OpenMP threads

Generally, what is the overhead of an OpenMP barrier in terms of clock cycles?
I mean the following:
Suppose all threads have already finished their work at hand at the same time. They all reach the start of the barrier at the same time.
How many extra clock cycles does it take to get past the barrier?
Does synchronizing existing threads on Linux involve calls to the kernel of the OS?
Thanks.
Related:
How is thread synchronization implemented, at the assembly language level?
https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf

Related

Do any operating systems utilize user threads only?

We're reading a basic/simple guide to Operating Systems in my CS class. The text gives multiple examples of OSs that use 1:1 threading, and some that formerly did hybrid/ M:N. But there are no examples of user threads/N:1.
This isn't a homework question, I'm just genuinely curious if this is or was a thing. Have any OSs utilized exclusively user threads? Or is there any software or programming language that does? It seems like with the right scheduling it could be very fast? Thank you!
Spent forever on Google and can't find any explicit answer to this!

Do any operating systems utilize user threads only?
No (and not in the way you're expecting, but by definition). Whatever a program feels like doing in user-space is none of the operating system's business and cannot be considered something the OS itself does.
Essentially there are 3 cases:
the OS is a single-tasking OS (and user-space programs use libraries or whatever to provide threading if/when they want it). E.g. MS-DOS.
the OS is a multi-tasking OS, where the OS only knows about processes (and user-space programs use libraries or whatever to provide threading if/when they want it). E.g. early Unix.
the OS/kernel provides threads (leading to 1:1 or M:N).
It seems like with the right scheduling it could be very fast?
User-space threading isn't "very fast", it's significantly worse for most things. The reasons are:
it can't use more than one CPU at a time (so 7 cores of the nice 8-core CPU you're currently using sit idle, i.e. 87.5% is wasted). You need "M:N threading" at a minimum to avoid this performance disaster.
it breaks thread priorities badly - e.g. CPU/s wasting time doing unimportant work while important work isn't being done, because one process doesn't know anything about threads that belong to any other process (or their priorities). The scheduler must be aware of all threads to avoid this performance disaster (and if one process knows about all threads belonging to all other processes it becomes a security disaster).
almost all thread switches are caused by devices (threads having to wait for disk, network, keyboard, "wall clock time", ... causing scheduler to have to find some other thread to run; and things a thread was waiting for occurring causing the thread to be able to run again and possibly preempt less important work that was running at the time); and all devices involve the kernel (even for micro-kernels where kernel is needed to pass messages, etc); so almost all thread switches involve the kernel. By doing threading in user-space you just end up with kernel wasting time notifying user-space (so user-space can do some scheduling) instead of kernel doing the scheduling itself (without wasting time on notifications).
User-space threading is better for rare situations where kernel doesn't have to be involved anyway, which is limited to:
thread creation and termination; but only if memory (for thread state, thread stack, thread local storage) is pre-allocated and recycled, and only if "thread recycling" isn't done (e.g. pre-create kernel threads and put them back in a "free thread pool" instead of telling kernel to terminate and create them again later).
locking (e.g. mutexes) where all threads using the lock belong to the same process; where 1 kernel thread (and no need for locks) is still better than "multiple user-space threads (sharing 1 kernel thread) fighting for the same lock with extra pointless overhead".
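
To illustrate the "1 kernel thread (and no need for locks)" point, here is a minimal sketch using Python's asyncio, where all tasks share one kernel thread and can only be switched at an explicit await. The names worker and counter are made up for the example; the shared counter is updated with a plain read-then-write and still ends up correct, because no other task can run between the two statements.

```python
import asyncio

counter = 0  # shared state, deliberately not protected by any lock


async def worker(increments):
    """Cooperative task: a switch to another task can only happen at `await`."""
    global counter
    for _ in range(increments):
        value = counter          # read
        counter = value + 1      # write; nothing else ran in between
        await asyncio.sleep(0)   # explicit switch point back to the scheduler


async def main():
    # Four user-level tasks multiplexed onto the single thread running the loop.
    await asyncio.gather(*(worker(1000) for _ in range(4)))


asyncio.run(main())
print(counter)  # 4000
```

With kernel threads the same read-then-write would need a mutex, since preemption could occur between the two lines.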

Why spinlocks can become performance issue in multithreaded programs?

I know what spinlocks are and that they use busy waiting. But why can it become a performance problem in multithreaded programs on a multicore processor?
And what can be done about it?

Your first issue is that a spinlock-protected section usually becomes contended in situations where there are more threads ready for execution than you have cores available. That means each thread wasting time in a spinlock is potentially starving another thread which would have had something proper to do.
Then there is the cost of the spinlock itself. You are burning through your budget of memory transactions, and that budget is actually shared between processor cores. Effectively, this can result in slowing down the operations within the critical sections.
A good example for that would be the memory allocator in the Windows kernel, in versions between 1703 and 1803. On systems with more than 16 threads, once 50% total CPU utilization was exceeded, a spinlock in that path went out of control and would start eating up 90% of the CPU time. Time spent inside the critical section increased over tenfold due to the competing threads burning the memory bandwidth.
The naive solution is to use nano-sleeps in between spin cycles in order to at least reduce the performance burnt on the locks themselves. But that's pretty bad as well, as the cores still remain blocked, not doing any real work.
Try to yield in the spin locks instead? That just turns out even slower, and you end up with a minimum delay proportional to the scheduling rate of the operating system. At a rate of 1ms (Windows realtime mode, active when any process requests it), 5ms (Linux default 200Hz scheduler), or 10ms (Windows default mode), that's a huge delay to introduce into execution. And if you happen to hit the critical section again, it was wasteful, as you have now added the overhead of a context switch without any gain.
Ultimately, use operating system primitives for critical sections. The common approach is to use atomic operations to probe if any contention has occurred, and when it has, only then to involve the operating system.
Either way, the operating system below has better means to resolve the contention, mostly in the form of wait lists. Meaning threads waiting on a semaphore only wake up exactly when they are allowed to resume, and are guaranteed to hold the corresponding lock. When leaving the contended region, the thread owning the lock checks via lightweight means if there had been any contention, and only if so notifies the OS to resume operation on the other threads.
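
The "probe cheaply first, involve the OS only on contention" idea can be sketched roughly as follows. This is an illustration of the control flow only, not how any real mutex is implemented: SpinThenBlockLock and spin_limit are made-up names, and Python's threading.Lock already wraps an OS primitive, so the spinning here buys nothing in practice.

```python
import threading


class SpinThenBlockLock:
    """Try a few cheap non-blocking acquires, then let the OS park the thread."""

    def __init__(self, spin_limit=100):
        self._lock = threading.Lock()
        self._spin_limit = spin_limit   # hypothetical tuning knob

    def acquire(self):
        # Fast path: bounded number of probes that never block.
        for _ in range(self._spin_limit):
            if self._lock.acquire(blocking=False):
                return
        # Slow path: contention persists, so block and wait to be woken.
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc_info):
        self.release()


lock = SpinThenBlockLock()
total = 0


def add(n):
    global total
    for _ in range(n):
        with lock:
            total += 1


threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 40000
```

The bounded spin is the important part: a thread that fails to get the lock quickly stops burning cycles and is put on the wait list instead.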
Not that you should actually reinvent the wheel though...
In Windows, that's already how Slim Reader/Writer Locks are implemented.
If you use a plain std::mutex or the like, you will usually already end up with such a mechanism under the hood.
"Old" literature (10-15 years) will still warn you not to use OS primitives for scheduling, but that's seriously outdated and does not reflect the improvements made on the OS side. What used to be 10ms+ delay for every context switch is essentially down to being barely measurable nowadays.

Benefits of user-level threads

I was looking at the differences between user-level threads and kernel-level threads, which I basically understood.
What's not clear to me is the point of implementing user-level threads at all.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
An answer to a previously asked question (similar to mine) was something like:
No modern operating system actually maps n user-level threads to 1
kernel-level thread.
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
Could you help me understand this, please?

I strongly recommend Modern Operating Systems 4th Edition by Andrew S. Tanenbaum (starring in shows such as the debate about Linux; also participating: Linus Torvalds). Costs a whole lot of bucks but it's definitely worth it if you really want to know stuff. For eager students and desperate enthusiasts it's great.
Your questions answered
[...] what's not clear to me is the point of implementing User-level threads
at all.
Read my post. It is comprehensive, I daresay.
If the kernel is unaware of the existence of multiple threads within a
single process, then which benefits could I experience?
Read the section "Disadvantages" below.
I have read a couple of articles that stated that user-level
implementation of threads is advisable only if such threads do not
perform blocking operations (which would cause the entire process to
block).
Read the subsection "No coordination with system calls" in "Disadvantages."
All citations are from the book I recommended in the top of this answer, Chapter 2.2.4, "Implementing Threads in User Space."
Advantages
Enables threads on systems without threads
The first advantage is that user-level threads are a way to work with threads on a system without threads.
The first, and most obvious, advantage is that
a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to
fall into this category, and even now some still do.
No kernel interaction required
A further benefit is the light overhead when switching threads, as opposed to switching to the kernel mode, doing stuff, switching back, etc. The lighter thread switching is described like this in the book:
When a thread does something that may cause it to become blocked
locally, for example, waiting for another thread in its process to
complete some work, it calls a run-time system procedure. This
procedure checks to see if the thread must be put into blocked state.
If so, it stores the thread’s registers (i.e., its own) [...] and
reloads the machine registers with the new thread’s saved values. As soon as the stack
pointer and program counter have been switched, the new thread comes
to life again automatically. If the machine happens to have an
instruction to store all the registers and another one to load them
all, the entire thread switch can be done in just a handful of
instructions. Doing thread switching like this is at least an order of
magnitude—maybe more—faster than trapping to the kernel and is a
strong argument in favor of user-level threads packages.
This efficiency is also nice because it spares us from incredibly heavy context switches and all that stuff.
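
As a rough illustration of such a switch without any kernel involvement, here is a minimal sketch using Python generators. It is an analogy, not the book's implementation: the interpreter saving and restoring a generator's frame stands in for storing and reloading registers, and thread_body and schedule are names invented for the example.

```python
from collections import deque


def thread_body(name, steps, trace):
    """A user-level thread; each `yield` is a call into the run-time system."""
    for i in range(steps):
        trace.append(f"{name}{i}")
        yield                      # state is saved here, control goes to the scheduler


def schedule(threads):
    """Round-robin run-time system living entirely in user space."""
    ready = deque(threads)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # restore saved state, run until the next yield
        except StopIteration:
            continue               # thread finished: do not requeue it
        ready.append(current)


trace = []
schedule([thread_body("A", 2, trace), thread_body("B", 3, trace)])
print(trace)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```

No system call is made anywhere in the switch path, which is the source of the speed advantage described above.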
Individually adjusted scheduling algorithms
Also, since there is no central scheduling algorithm, every process can have its own scheduling algorithm and is far more flexible in its choices. In addition, the "private" scheduling algorithm is far more flexible concerning the information it gets from the threads. The amount of information can be adjusted manually and per process, so it's very fine-grained. This is because, again, there is no central scheduling algorithm that has to fit the needs of every process; a central one has to be very general and must deliver adequate performance in every case. User-level threads allow an extremely specialized scheduling algorithm.
This is only restricted by the disadvantage "No automatic switching to the scheduler."
They [user-level threads] allow each process to have its own
customized scheduling algorithm. For some applications, for example,
those with a garbage-collector thread, not having to worry about a
thread being stopped at an inconvenient moment is a plus. They also
scale better, since kernel threads invariably require some table space
and stack space in the kernel, which can be a problem if there are a
very large number of threads.
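
A per-process scheduling policy could look something like the following sketch, which assumes generator-based user-level threads and a strict-priority policy chosen purely for illustration (task, run_by_priority and the "ui"/"gc" names are all hypothetical). Here the low-priority "gc" thread only runs once the "ui" thread has nothing left to do.

```python
import heapq
from itertools import count


def task(name, steps, log):
    """A user-level thread that records each step it executes."""
    for _ in range(steps):
        log.append(name)
        yield


def run_by_priority(tasks):
    """tasks: iterable of (priority, generator); a lower number runs first."""
    tie = count()   # tie-breaker so two generators are never compared directly
    heap = [(prio, next(tie), gen) for prio, gen in tasks]
    heapq.heapify(heap)
    while heap:
        prio, _, gen = heapq.heappop(heap)
        try:
            next(gen)                                  # run one step
        except StopIteration:
            continue                                   # finished, drop it
        heapq.heappush(heap, (prio, next(tie), gen))   # still runnable


log = []
run_by_priority([(1, task("ui", 2, log)), (5, task("gc", 2, log))])
print(log)  # ['ui', 'ui', 'gc', 'gc']
```

Swapping the heap for a plain queue would give round-robin instead; the point is that the choice is local to the process.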
Disadvantages
No coordination with system calls
The user-level scheduling algorithm has no idea if some thread has called a blocking read system call. OTOH, a kernel-level scheduling algorithm would've known because it can be notified by the system call; both belong to the kernel code base.
Suppose that a thread reads from the keyboard before any keys have
been hit. Letting the thread actually make the system call is
unacceptable, since this will stop all the threads. One of the main
goals of having threads in the first place was to allow each one to
use blocking calls, but to prevent one blocked thread from affecting
the others. With blocking system calls, it is hard to see how this
goal can be achieved readily.
He goes on to say that system calls could be made non-blocking, but that would be very inconvenient and compatibility with existing OSes would be drastically hurt.
Mr Tanenbaum also says that the library wrappers around the system calls (as found in glibc, for example) could be modified to predict when a system call blocks using select, but he notes that this is inelegant.
Building upon that, he says that threads do block often. Often blocking requires many system calls. And many system calls are bad. And without blocking, threads become less useful:
For applications that are essentially entirely CPU bound and rarely
block, what is the point of having threads at all? No one would
seriously propose computing the first n prime numbers or playing chess
using threads because there is nothing to be gained by doing it that
way.
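
The select-based wrapper idea mentioned above can be sketched like this, assuming generator-based user-level threads and a local socket pair standing in for the keyboard. The helper names read_or_yield, reader and writer are invented for the example; only select.select, socket.socketpair, recv and sendall are real library calls.

```python
import select
import socket


def read_or_yield(sock, nbytes=1024):
    """Wrapper: yield to other threads until `sock` is readable, then read."""
    while True:
        readable, _, _ = select.select([sock], [], [], 0)  # timeout 0: just poll
        if readable:
            return sock.recv(nbytes)   # data is pending, so this will not block
        yield                          # would block: let another thread run


def reader(sock, out):
    data = yield from read_or_yield(sock)
    out.append(data)


def writer(sock):
    yield                              # pretend to do some other work first
    sock.sendall(b"key")


a, b = socket.socketpair()
received = []
threads = [reader(a, received), writer(b)]
while threads:                         # minimal round-robin run-time system
    current = threads.pop(0)
    try:
        next(current)
        threads.append(current)
    except StopIteration:
        pass
a.close()
b.close()
print(received)  # [b'key']
```

Every potentially blocking call needs such a wrapper, which is the inelegance the book complains about.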
Page faults block per-process if unaware of threads
The OS has no notion of threads. Therefore, if a page fault occurs, the whole process will be blocked, effectively blocking all user-level threads.
Somewhat analogous to the problem of blocking system calls is the
problem of page faults. [...] If the program calls or jumps to an
instruction that is not in memory, a page fault occurs and the
operating system will go and get the missing instruction (and its
neighbors) from disk. [...] The process is blocked while the necessary
instruction is being located and read in. If a thread causes a page
fault, the kernel, unaware of even the existence of threads, naturally
blocks the entire process until the disk I/O is complete, even though
other threads might be runnable.
I think this can be generalized to all interrupts.
No automatic switching to the scheduler
Since there is no per-process clock interrupt, a thread acquires the CPU forever unless some OS-dependent mechanism (such as a context switch) occurs or it voluntarily releases the CPU.
This prevents usual scheduling algorithms from working, including the Round-Robin algorithm.
[...] if a thread starts running, no other thread in that process
will ever run unless the first thread voluntarily gives up the CPU.
Within a single process, there are no clock interrupts, making it
impossible to schedule processes round-robin fashion (taking turns).
Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
He says that a possible solution would be
[...] to have the run-time system request a clock signal (interrupt) once a
second to give it control, but this, too, is crude and messy to
program.
I would even go further and say that such a "request" would require some system call to happen, whose drawback is already explained in "No coordination with system calls." Without a system call, the program would need free access to the timer, which is a security hole and unacceptable in modern OSes.

What's not clear to me is the point of implementing user-level threads at all.
User-level threads largely came into the mainstream due to Ada and its requirement for threads (tasks in Ada terminology). At the time, there were few multiprocessor systems and most multiprocessors were of the master/slave variety. Kernel threads simply did not exist. User threads had to be created to implement languages like Ada.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
If you have kernel threads, multiple threads within a single process can run simultaneously. In user threads, the threads always execute interleaved.
Using threads can simplify some types of programming.
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
That is true on Unix, and maybe not on all Unix implementations. User threads on many operating systems function perfectly fine with blocking I/O.
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
In user threads, there is never parallel execution. In kernel threads, there can be parallel execution IF there are multiple processors. On a single processor system, there is not much advantage to using kernel threads over single threads (contra: note the blocking I/O issue on Unix and user threads).
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
In user threads, the process manages its own "threads" by interleaving execution within itself. The process can only have a thread run in the processor that the process is running in.
If the operating system provides system services to schedule code to run on a different processor, user threads could run on multiple processors.
I conclude by saying that for practical purposes there are no advantages to user threads over kernel threads. There are those who will assert that there are performance advantages, but any such advantage would be system dependent.

How do user level threads (ULTs) and kernel level threads (KLTs) differ with regards to concurrent execution?

Here's what I understand; please correct/add to it:
In pure ULTs, the multithreaded process itself does the thread scheduling. So, the kernel essentially does not notice the difference and considers it a single-thread process. If one thread makes a blocking system call, the entire process is blocked. Even on a multicore processor, only one thread of the process would be running at a time, unless the process is blocked. I'm not sure how ULTs are much help though.
In pure KLTs, even if a thread is blocked, the kernel schedules another (ready) thread of the same process. (In case of pure KLTs, I'm assuming the kernel creates all the threads of the process.)
Also, using a combination of ULTs and KLTs, how are ULTs mapped into KLTs?

Your analysis is correct. The OS kernel has no knowledge of user-level threads. From its perspective, a process is an opaque black box that occasionally makes system calls. Consequently, if that program has 100,000 user-level threads but only one kernel thread, then the process can only run one user-level thread at a time because there is only one kernel-level thread associated with it. On the other hand, if a process has multiple kernel-level threads, then it can execute multiple commands in parallel if there is a multicore machine.
A common compromise between these is to have a program request some fixed number of kernel-level threads, then have its own thread scheduler divvy up the user-level threads onto these kernel-level threads as appropriate. That way, multiple ULTs can execute in parallel, and the program can have fine-grained control over how threads execute.
As for how this mapping works - there are a bunch of different schemes. You could imagine that the user program uses any one of multiple different scheduling systems. In fact, if you do this substitution:
Kernel thread <---> Processor core
User thread <---> Kernel thread
Then any scheme the OS could use to map kernel threads onto cores could also be used to map user-level threads onto kernel-level threads.
Hope this helps!

Before anything else, templatetypedef's answer is beautiful; I simply wanted to extend his response a little.
There is one area which I felt the need for expanding a little: combinations of ULT's and KLT's. To understand the importance (what Wikipedia labels hybrid threading), consider the following examples:
Consider a multi-threaded program (multiple KLT's) where there are more KLT's than available logical cores. In order to efficiently use every core, as you mentioned, you want the scheduler to switch out KLT's that are blocking with ones that are in a ready state and not blocking. This ensures the core is reducing its amount of idle time. Unfortunately, switching KLT's is expensive for the scheduler and it consumes a relatively large amount of CPU time.
This is one area where hybrid threading can be helpful. Consider a multi-threaded program with multiple KLT's and ULT's. Just as templatetypedef noted, only one ULT can be running at one time for each KLT. If a ULT is blocking, we still want to switch it out for one which is not blocking. Fortunately, ULT's are much more lightweight than KLT's, in the sense that there are fewer resources assigned to a ULT and they require no interaction with the kernel scheduler. Essentially, it is almost always quicker to switch out ULT's than it is to switch out KLT's. As a result, we are able to significantly reduce a core's idle time relative to the first example.
Now, of course, all of this depends on the threading library being used for implementing ULT's. There are two ways (which I can come up with) for "mapping" ULT's to KLT's.
A collection of ULT's for all KLT's
This situation is ideal on a shared memory system. There is essentially a "pool" of ULT's to which each KLT has access. Ideally, the threading library scheduler would assign ULT's to each KLT upon request as opposed to the KLT's accessing the pool individually. The latter could cause race conditions or deadlocks if not implemented with locks or something similar.
A collection of ULT's for each KLT (Qthreads)
This situation is ideal on a distributed memory system. Each KLT would have a collection of ULT's to run. The drawback is that the user (or the threading library) would have to divide the ULT's between the KLT's. This could result in load imbalance since it is not guaranteed that all ULT's will have the same amount of work to complete and will complete in roughly the same amount of time. The solution to this is allowing for ULT migration; that is, migrating ULT's between KLT's.
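
For the first scheme (one shared pool of ULT's for all KLT's), a minimal sketch might look like the following. It assumes generator-based ULT's and uses Python threads as the KLT's; ult, run_m_on_n and the "klt-i" names are made up for the example. A thread-safe queue plays the role of the pool so the KLT's cannot race on it. Note that CPython's global interpreter lock prevents the KLT's from running Python code truly in parallel, so this shows the mapping only, not a speedup.

```python
import queue
import threading


def ult(name, steps, log):
    """A user-level thread: one `yield` per unit of work."""
    for i in range(steps):
        log.append((name, i, threading.current_thread().name))
        yield


def run_m_on_n(ults, n_klts):
    """Run M generator-based ULTs on N kernel threads sharing one ready pool."""
    if not ults:
        return
    pool = queue.Queue()              # shared pool of runnable ULTs
    for u in ults:
        pool.put(u)
    remaining = len(ults)
    remaining_lock = threading.Lock()

    def klt_worker():
        nonlocal remaining
        while True:
            task = pool.get()         # blocks until a ULT (or sentinel) arrives
            if task is None:          # sentinel: every ULT has finished
                return
            try:
                next(task)            # run the ULT up to its next yield
            except StopIteration:
                with remaining_lock:
                    remaining -= 1
                    last_one = remaining == 0
                if last_one:          # wake every KLT so it can exit
                    for _ in range(n_klts):
                        pool.put(None)
                continue
            pool.put(task)            # still runnable: any KLT may take it next

    klts = [threading.Thread(target=klt_worker, name=f"klt-{i}")
            for i in range(n_klts)]
    for k in klts:
        k.start()
    for k in klts:
        k.join()


log = []
run_m_on_n([ult(f"u{n}", 3, log) for n in range(5)], n_klts=2)
print(len(log))  # 15 steps in total, spread over klt-0 and klt-1
```

Because a ULT goes back into the shared pool after every step, it can migrate freely between KLT's, which sidesteps the load-imbalance problem of the per-KLT scheme at the cost of contention on the pool.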

Can thread creation within OS internals run concurrently?

Suppose we have a dual-core machine with a mainstream, modern OS capable of utilizing both cores.
If I have two threads, P1 and Q1 within the same process, and they happen to commence creating child threads, say, P2 and Q2, at approximately the same machine cycle, will OS perform the thread creation concurrently?
I heard thread creation is expensive, so the question came forth...
Thanks in advance.

Any reasonably well designed OS can have multiple processors executing kernel code at the same time. Therefore some of the tasks involved in a thread creation can be happening concurrently. But there will be some necessary serialization to manipulate some shared data structures (e.g. allocating memory, inserting a newly created thread structure into a global list). The processors could contend for the same lock, thereby reducing concurrency.
Systems/applications which make new threads so often that the overhead of thread creation actually matters are probably designed wrong (doing too little useful work in a thread relative to the startup time, and not taking advantage of the obvious optimization of reusing short-lived threads from a pool).
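
Reusing threads from a pool can be sketched with Python's standard concurrent.futures module. In this small example (the job function is made up), 100 short tasks are served by at most 4 worker threads, so thread creation is paid at most 4 times instead of 100.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def job(n):
    """A short task; also report which pooled thread ran it."""
    return n * n, threading.get_ident()


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(job, range(100)))   # results come back in input order

squares = [sq for sq, _ in results]
workers_used = {tid for _, tid in results}
print(len(workers_used))   # never more than 4
```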

It will be sorta-concurrently. There are aspects of thread-creation that cannot proceed in parallel - it would be unfortunate if the kernel memory-manager allocated both threads the same stack!
Thread creation is sufficiently expensive that it's worthwhile to avoid doing it at all during an app's run, hence the popularity of thread pools. Long-running tasks that block can be threaded off and left for the life of the app - often this means that explicit thread termination (awkward at best, almost impossible at worst, from user code) is not necessary.
I think developers continually start and stop threads because they like to think of them as 'functions', where you 'pass parameters' in at the start and 'return' results when the thread ends. This is not the best way of conceptualizing threads.
