mlock() and Threading

mlock() and Threading - linux

Can mlock() be called safely from independently executing OpenMP or Posix threads, given that each thread is operating on a different region of virtual memory? Does it create a systemwide synchronization barrier or force all threads to stall in some way?
I apologize if this is a duplicate; I was surprised when google searches for "mlock openmp"/"mlock thread safety" did not turn up the answer immediately. Closest I could find was the second answer of Non-blocking mlock(), which seems to indicate that mlock() CAN be called from separate threads and does not enforce or require any synchronization barriers.

mlock() is safe to call from multiple threads at once.
As to whether it synchronises against other calls to mlock(), it's a quality-of-implementation issue - in principle any system call could synchronise against any other, there's no text in POSIX that disallows it. In practice you will often find that system calls that work on the process's memory map tend to contend with each other (so mlock() might not just contend with other mlock() calls but also mmap()). You will need to test to see if contention is a actually a problem in your use case.

Related

Why threads implemented in kernel space are slow?

When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude-maybe more-faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
Source: Modern Operating Systems (Andrew S. Tanenbaum | Herbert Bos)
The above argument is made in favor of user-level threads. The user-level thread implementation is depicted as kernel managing all the processes, where individual processes can have their own run-time (made available by a library package) that manages all the threads in that process.
Of course, merely calling a function in the run-time than trapping to kernel might have a few less instructions to execute but why the difference is so huge?
For example, if threads are implemented in kernel space, every time a thread has to be created the program is required to make a system call. Yes. But the call only involves adding an entry to the thread table with certain attributes (which is also the case in user space threads). When a thread switch has to happen, kernel can simply do what the run-time (at user-space) would do. The only real difference I can see here is that the kernel is being involved in all this. How can the performance difference be so significant?

Threads implemented as a library package in user space perform significantly better. Why?
They're not.
The fact is that most task switches are caused by threads blocking (having to wait for IO from disk or network, or from user, or for time to pass, or for some kind of semaphore/mutex shared with a different process, or some kind of pipe/message/packet from a different process) or caused by threads unblocking (because whatever they were waiting for happened); and most reasons to block and unblock involve the kernel in some way (e.g. device drivers, networking stack, ...); so doing task switches in kernel when you're already in the kernel is faster (because it avoids the overhead of switching to user-space and back for no sane reason).
Where user-space task switching "works" is when kernel isn't involved at all. This mostly only happens when someone failed to do threads properly (e.g. they've got thousands of threads and coarse-grained locking and are constantly switching between threads due to lock contention, instead of something sensible like a "worker thread pool"). It also only works when all threads are the same priority - you don't want a situation where very important threads belonging to one process don't get CPU time because very unimportant threads belonging to a different process are hogging the CPU (but that's exactly what happens with user-space threading because one process has no idea about threads belonging to a different process).
Mostly; user-space threading is a silly broken mess. It's not faster or "significantly better"; it's worse.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude-maybe more-faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
This is talking about a situation where the CPU itself does the actual task switch (and either the kernel or a user-space library tells the CPU when to do a task switch to what). This has some relatively interesting history behind it...
In the 1980s Intel designed a CPU ("iAPX" - see https://en.wikipedia.org/wiki/Intel_iAPX_432 ) for "secure object oriented programming"; where each object has its own isolated memory segments and its own privilege level, and can transfer control directly to other objects. The general idea being that you'd have a single-tasking system consisting of global objects using cooperating flow control. This failed for multiple reasons, partly because all the protection checks ruined performance, and partly because the majority of software at the time was designed for "multi-process preemptive time sharing, with procedural programming".
When Intel designed protected mode (80286, 80386) they still had hopes for "single-tasking system consisting of global objects using cooperating flow control". They included hardware task/object switching, local descriptor table (so each task/object can have its own isolated segments), call gates (so tasks/objects can transfer control to each other directly), and modified a few control flow instructions (call far and jmp far) to support the new control flow. Of course this failed for the same reason iAPX failed; and (as far as I know) nobody has ever used these things for the "global objects using cooperative flow control" they were originally designed for. Some people (e.g. very early Linux) did try to use the hardware task switching for more traditional "multi-process preemptive time sharing, with procedural programming" systems; but found that it was slow because the hardware task switch did too many protection checks that could be avoided by software task switching and saved/reloaded too much state that could be avoided by a software task switching;p and didn't do any of the other stuff needed for a task switch (e.g. keeping statistics of CPU time used, saving/restoring debug registers, etc).
Now.. Andrew S. Tanenbaum is a micro-kernel advocate. His ideal system consists of isolated pieces in user-space (processes, services, drivers, ...) communicating via. synchronous messaging. In practice (ignoring superficial differences in terminology) this "isolated pieces in user-space communicating via. synchronous messaging" is almost entirely identical to Intel's twice failed "global objects using cooperative flow control".
Mostly; in theory (if you ignore all the practical problems, like CPU not saving all of the state, and wanting to do extra work on task switches like tracking statistics), for a specific type of OS that Andrew S. Tanenbaum prefers (micro-kernel with synchronous message passing, without any thread priorities), it's plausible that the paragraph quoted above is more than just wishful thinking.

I think the answer to this can use a lot of OS and parallel distributive computing knowledge (And I am not sure about the answer but I will try my best)
So if you think about it. The library package will have a greater amount of performance than you write in the kernel itself. In the package thing, interrupt given by this code will be held at once and al the execution will be done. While when you write in kernel different other interrupts can come before. Plus accessing threads again and again is harsh on the kernel since everytime there will be an interrupt. I hope it will be a better view.

it's not correct to say the user-space threads are better that the kernel-space threads since each one has its own pros and cons.
in terms of user-space threads, as the application is responsible for managing thread, its easier to implement such threads and that kind of threads have not much reliance on OS. however, you are not able to use the advantages of multi processing.
In contrary, the kernel space modules are handled by OS, so you need to implement them according to the OS that you use, and it would be a more complicated task. However, you have more control over your threads.
for more comprehensive tutorial, take a look here.

Benefits of user-level threads

I was looking at the differences between user-level threads and kernel-level threads, which I basically understood.
What's not clear to me is the point of implementing user-level threads at all.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
An answer to a previously asked question (similar to mine) was something like:
No modern operating system actually maps n user-level threads to 1
kernel-level thread.
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
Could you help me understand this, please?

I strongly recommend Modern Operating Systems 4th Edition by Andrew S. Tanenbaum (starring in shows such as the debate about Linux; also participating: Linus Torvalds). Costs a whole lot of bucks but it's definitely worth it if you really want to know stuff. For eager students and desperate enthusiasts it's great.
Your questions answered
[...] what's not clear to me is the point of implementing User-level threads
at all.
Read my post. It is comprehensive, I daresay.
If the kernel is unaware of the existence of multiple threads within a
single process, then which benefits could I experience?
Read the section "Disadvantages" below.
I have read a couple of articles that stated that user-level
implementation of threads is advisable only if such threads do not
perform blocking operations (which would cause the entire process to
block).
Read the subsection "No coordination with system calls" in "Disadvantages."
All citations are from the book I recommended in the top of this answer, Chapter 2.2.4, "Implementing Threads in User Space."
Advantages
Enables threads on systems without threads
The first advantage is that user-level threads are a way to work with threads on a system without threads.
The first, and most obvious, advantage is that
a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to
fall into this category, and even now some still do.
No kernel interaction required
A further benefit is the light overhead when switching threads, as opposed to switching to the kernel mode, doing stuff, switching back, etc. The lighter thread switching is described like this in the book:
When a thread does something that may cause it to become blocked
locally, for example, waiting for another thread in its process to
complete some work, it calls a run-time system procedure. This
procedure checks to see if the thread must be put into blocked state.
If, so it stores the thread’s registers (i.e., its own) [...] and
reloads the machine registers with the new thread’s saved values. As soon as the stack
pointer and program counter have been switched, the new thread comes
to life again automatically. If the machine happens to have an
instruction to store all the registers and another one to load them
all, the entire thread switch can be done in just a handful of in-
structions. Doing thread switching like this is at least an order of
magnitude—maybe more—faster than trapping to the kernel and is a
strong argument in favor of user-level threads packages.
This efficiency is also nice because it spares us from incredibly heavy context switches and all that stuff.
Individually adjusted scheduling algorithms
Also, hence there is no central scheduling algorithm, every process can have its own scheduling algorithm and is way more flexible in its variety of choices. In addition, the "private" scheduling algorithm is way more flexible concerning the information it gets from the threads. The number of information can be adjusted manually and per-process, so it's very finely-grained. This is because, again, there is no central scheduling algorithm needing to fit the needs of every process; it has to be very general and all and must deliver adequate performance in every case. User-level threads allow an extremely specialized scheduling algorithm.
This is only restricted by the disadvantage "No automatic switching to the scheduler."
They [user-level threads] allow each process to have its own
customized scheduling algorithm. For some applications, for example,
those with a garbage-collector thread, not having to worry about a
thread being stopped at an inconvenient moment is a plus. They also
scale better, since kernel threads invariably require some table space
and stack space in the kernel, which can be a problem if there are a
very large number of threads.
Disadvantages
No coordination with system calls
The user-level scheduling algorithm has no idea if some thread has called a blocking read system call. OTOH, a kernel-level scheduling algorithm would've known because it can be notified by the system call; both belong to the kernel code base.
Suppose that a thread reads from the keyboard before any keys have
been hit. Letting the thread actually make the system call is
unacceptable, since this will stop all the threads. One of the main
goals of having threads in the first place was to allow each one to
use blocking calls, but to prevent one blocked thread from affecting
the others. With blocking system calls, it is hard to see how this
goal can be achieved readily.
He goes on that system calls could be made non-blocking but that would be very inconvenient and compatibility to existing OSes would be drastically hurt.
Mr Tanenbaum also says that the library wrappers around the system calls (as found in glibc, for example) could be modified to predict when a system cal blocks using select but he utters that this is inelegant.
Building upon that, he says that threads do block often. Often blocking requires many system calls. And many system calls are bad. And without blocking, threads become less useful:
For applications that are essentially entirely CPU bound and rarely
block, what is the point of having threads at all? No one would
seriously propose computing the first n prime numbers or playing chess
using threads because there is nothing to be gained by doing it that
way.
Page faults block per-process if unaware of threads
The OS has no notion of threads. Therefore, if a page fault occurs, the whole process will be blocked, effectively blocking all user-level threads.
Somewhat analogous to the problem of blocking system calls is the
problem of page faults. [...] If the program calls or jumps to an
instruction that is not in memory, a page fault occurs and the
operating system will go and get the missing instruction (and its
neighbors) from disk. [...] The process is blocked while the necessary
instruction is being located and read in. If a thread causes a page
fault, the kernel, unaware of even the existence of threads, naturally
blocks the entire process until the disk I/O is complete, even though
other threads might be runnable.
I think this can be generalized to all interrupts.
No automatic switching to the scheduler
Since there is no per-process clock interrupt, a thread acquires the CPU forever unless some OS-dependent mechanism (such as a context switch) occurs or it voluntarily releases the CPU.
This prevents usual scheduling algorithms from working, including the Round-Robin algorithm.
[...] if a thread starts running, no other thread in that process
will ever run unless the first thread voluntarily gives up the CPU.
Within a single process, there are no clock interrupts, making it
impossible to schedule processes round-robin fashion (taking turns).
Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
He says that a possible solution would be
[...] to have the run-time system request a clock signal (interrupt) once a
second to give it control, but this, too, is crude and messy to
program.
I would even go on further and say that such a "request" would require some system call to happen, whose drawback is already explained in "No coordination with system calls." If no system call then the program would need free access to the timer, which is a security hole and unacceptable in modern OSes.

What's not clear to me is the point of implementing user-level threads at all.
User-level threads largely came into the mainstream due to Ada and its requirement for threads (tasks in Ada terminology). At the time, there were few multiprocessor systems and most multiprocessors were of the master/slave variety. Kernel threads simply did not exist. User threads had to be created to implement languages like Ada.
If the kernel is unaware of the existence of multiple threads within a single process, then which benefits could I experience?
If you have kernel threads, threads multiple threads within a single process can run simultaneously. In user threads, the threads always execute interleaved.
Using threads can simplify some types of programming.
I have read a couple of articles that stated user-level implementation of threads is advisable only if such threads do not perform blocking operations (which would cause the entire process to block).
That is true on Unix and maybe not all unix implementations. User threads on many operating systems function perfectly fine with blocking I/O.
This being said, what's the difference between a sequential execution of all the threads and a "parallel" execution of them, considering they cannot take advantage of multiple processors and independent scheduling?
In user threads. there is never parallel execution. In kernel threads, the can be parallel execution IF there are multiple processors. On a single processor system, there is not much advantage to using kernel threads over single threads (contra: note the blocking I/O issue on Unix and user threads).
But for some reason, many people on the Internet state that user-level threads can never take advantage of multiple processors.
In user threads, the process manages its own "threads" by interleaving execution within itself. The process can only have a thread run in the processor that the process is running in.
If the operating system provides system services to schedule code to run on a different processor, user threads could run on multiple processors.
I conclude by saying that for practicable purposes there are no advantages to user threads over kernel threads. There are those that will assert that there are performance advantages, but for there to be such an advantage it would be system dependent.

sched_yield slow down other threads

We have code that makes use of sched_yield inside a loop. When we do this we seem to get a slower performance of other threads, in particular those involving kernel calls (like IO and mutex/event handling). I'm trying to determine the exact cause of this behaviour.
Can excessive calls to sched_yield lead to a bottleneck in the kernel?
My suspicion is if we keep asking the kernel to check its process list then other threads will suffer as key data structures may be continually locked -- whereas if we didn't call sched_yield those kernel locks would tend to be uncontested. Does this make sense, or should it be totally okay to repeatedly call sched_yield.

Have a look at the sched_yield man page for Linux:
Avoid calling sched_yield()
unnecessarily or inappropriately (e.g., when resources needed by other
schedulable threads are still held by the caller), since doing so will result
in unnecessary context switches, which will degrade system performance.
Calling it in a tight loop will cause problems. Reduce the rate at which you're calling it.
(And check that you need to call it in the first place. The scheduler often does the Right Thing all by itself.)
Other options you could find interesting to investigate if you have a low priority thread:
sched_setscheduler - with SCHED_IDLE or SCHED_BATCH maybe (affects the whole process)
thread_setschedparam - per thread, but might have restrictions on what policies you can use (can't find it right now).
Or the good old nice command of course.

thread synchronization vs process synchronization

can we use the same synchronization mechanisams for both thread synchronization and process synchronization
what are thes synchronization mechanisams that are avilable only within the process

semaphores are generally what are used for multi process synchronization in terms of shared memory access, etc.
critical sections, mutexes and conditions are the more common tools for thread synchronization within a process.
generally speaking, the methods used to synchronize threads are not used to synchronize processes, but the reverse is usually not true. In fact its fairly common to use semaphores for thread synchronization.

There are several synchronization entities. They have different purposes and scope. Different languages and operating system implement them differently. On Windows, for one, you can use monitors for synching threads within a processes, or mutex for synching processes. There are semaphores, events, barriers... It all depends on the case. .NET provides so called slim versions that have improved performance but target only in-process synching.
One thing to remember though. Synching processes requires system resource, allocation and manipulation (locking and releasing) of which take quite a while.

An application consists of one or more
processes. A process, in the simplest
terms, is an executing program. One or
more threads run in the context of the
process. A thread is the basic unit to
which the operating system allocates
processor time. A thread can execute
any part of the process code,
including parts currently being
executed by another thread.
Ref.
As to specific synchronisation constructs, that will depend on the OS/Environment/language

One difference: Threads within a process have equal access to the memory of the process. Memory is typically private to a process, but can be explicitly shared.

linux thread synchronization

I am new to linux and linux threads. I have spent some time googling to try to understand the differences between all the functions available for thread synchronization. I still have some questions.
I have found all of these different types of synchronizations, each with a number of functions for locking, unlocking, testing the lock, etc.
gcc atomic operations
futexes
mutexes
spinlocks
seqlocks
rculocks
conditions
semaphores
My current (but probably flawed) understanding is this:
semaphores are process wide, involve the filesystem (virtually I assume), and are probably the slowest.
Futexes might be the base locking mechanism used by mutexes, spinlocks, seqlocks, and rculocks. Futexes might be faster than the locking mechanisms that are based on them.
Spinlocks dont block and thus avoid context swtiches. However they avoid the context switch at the expense of consuming all the cycles on a CPU until the lock is released (spinning). They should only should be used on multi processor systems for obvious reasons. Never sleep in a spinlock.
The seq lock just tells you when you finished your work if a writer changed the data the work was based on. You have to go back and repeat the work in this case.
Atomic operations are the fastest synch call, and probably are used in all the above locking mechanisms. You do not want to use atomic operations on all the fields in your shared data. You want to use a lock (mutex, futex, spin, seq, rcu) or a single atomic opertation on a lock flag when you are accessing multiple data fields.
My questions go like this:
Am I right so far with my assumptions?
Does anyone know the cpu cycle cost of the various options? I am adding parallelism to the app so we can get better wall time response at the expense of running fewer app instances per box. Performances is the utmost consideration. I don't want to consume cpu with context switching, spinning, or lots of extra cpu cycles to read and write shared memory. I am absolutely concerned with number of cpu cycles consumed.
Which (if any) of the locks prevent interruption of a thread by the scheduler or interrupt...or am I just an idiot and all synchonization mechanisms do this. What kinds of interruption are prevented? Can I block all threads or threads just on the locking thread's CPU? This question stems from my fear of interrupting a thread holding a lock for a very commonly used function. I expect that the scheduler might schedule any number of other workers who will likely run into this function and then block because it was locked. A lot of context switching would be wasted until the thread with the lock gets rescheduled and finishes. I can re-write this function to minimize lock time, but still it is so commonly called I would like to use a lock that prevents interruption...across all processors.
I am writing user code...so I get software interrupts, not hardware ones...right? I should stay away from any functions (spin/seq locks) that have the word "irq" in them.
Which locks are for writing kernel or driver code and which are meant for user mode?
Does anyone think using an atomic operation to have multiple threads move through a linked list is nuts? I am thinking to atomicly change the current item pointer to the next item in the list. If the attempt works, then the thread can safely use the data the current item pointed to before it was moved. Other threads would now be moved along the list.
futexes? Any reason to use them instead of mutexes?
Is there a better way than using a condition to sleep a thread when there is no work?
When using gcc atomic ops, specifically the test_and_set, can I get a performance increase by doing a non atomic test first and then using test_and_set to confirm? I know this will be case specific, so here is the case. There is a large collection of work items, say thousands. Each work item has a flag that is initialized to 0. When a thread has exclusive access to the work item, the flag will be one. There will be lots of worker threads. Any time a thread is looking for work, they can non atomicly test for 1. If they read a 1, we know for certain that the work is unavailable. If they read a zero, they need to perform the atomic test_and_set to confirm. So if the atomic test_and_set is 500 cpu cycles because it is disabling pipelining, causes cpu's to communicate and L2 caches to flush/fill .... and a simple test is 1 cycle .... then as long as I had a better ratio of 500 to 1 when it came to stumbling upon already completed work items....this would be a win.
I hope to use mutexes or spinlocks to sparilngly protect sections of code that I want only one thread on the SYSTEM (not jsut the CPU) to access at a time. I hope to sparingly use gcc atomic ops to select work and minimize use of mutexes and spinlocks. For instance: a flag in a work item can be checked to see if a thread has worked it (0=no, 1=yes or in progress). A simple test_and_set tells the thread if it has work or needs to move on. I hope to use conditions to wake up threads when there is work.
Thanks!

Application code should probably use posix thread functions. I assume you have man pages so type
man pthread_mutex_init
man pthread_rwlock_init
man pthread_spin_init
Read up on them and the functions that operate on them to figure out what you need.
If you're doing kernel mode programming then it's a different story. You'll need to have a feel for what you are doing, how long it takes, and what context it gets called in to have any idea what you need to use.

Thanks to all who answered. We resorted to using gcc atomic operations to synchronize all of our threads. The atomic ops were about 2x slower than setting a value without synchronization, but magnitudes faster than locking a mutex, changeing the value, and then unlocking the mutex (this becomes super slow when you start having threads bang into the locks...) We only use pthread_create, attr, cancel, and kill. We use pthread_kill to signal threads to wake up that we put to sleep. This method is 40x faster than cond_wait. So basicly....use pthreads_mutexes if you have time to waste.

in addtion you should check the nexts books
Pthreads Programming: A POSIX
Standard for Better Multiprocessing
and
Programming with POSIX(R) Threads

regarding question # 8
Is there a better way than using a condition to sleep a thread when there is no work?
yes i think that the best aproach instead of using sleep
is using function like sem_post() and sem_wait of "semaphore.h"
regards

A note on futexes - they are more descriptively called fast userspace mutexes. With a futex, the kernel is involved only when arbitration is required, which is what provides the speed up and savings.
Implementing a futex can be extremely tricky (PDF), debugging them can lead to madness. Unless you really, really, really need the speed, its usually best to use the pthread mutex implementation.
Synchronization is never exactly easy, but trying to implement your own in userspace makes it inordinately difficult.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string