Mechanism apart from locking (spin lock, semaphore) for synchronization - multithreading

I know that for synchronization in C there are several mechanisms like spin locks, semaphores, seqlocks, condition variables, etc. Each has its pros and cons, and which to use depends on the situation.
But every synchronization mechanism above adds some extra processing to the program.
This is an interview question: "Is there any other way, apart from locking, to synchronize?" I said we could use barriers or our own wait_queue, but those are useful only in some situations, maybe one or two.
So is there any mechanism apart from locking (spin lock, semaphore) for synchronization?

Locking is expensive in the kernel. In addition to the above, you have RCU (Read-Copy-Update), which separates update from reclamation so that readers can proceed without taking any lock at all.
It's not a complete alternative to the above, as it depends on what you are trying to serialize. You could also consider using per-CPU data structures to avoid expensive global locks; this still requires synchronizing with ISRs, and preemption must be disabled. I forgot to add: atomic operations on bits and integers.
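For a flavour of how RCU avoids read-side locking, here is a minimal sketch of the kernel pattern. `struct foo` and `gp` are illustrative names, error handling is elided, and updaters are assumed to be serialized among themselves (e.g. by a spinlock):

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo { int a; };
static struct foo __rcu *gp;      /* RCU-protected global pointer */

int reader(void)
{
    struct foo *p;
    int val;

    rcu_read_lock();              /* read-side critical section: no lock taken */
    p = rcu_dereference(gp);      /* safely fetch the current version */
    val = p ? p->a : -1;
    rcu_read_unlock();
    return val;
}

void updater(int new_a)           /* callers must serialize updaters */
{
    struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
    struct foo *oldp = rcu_dereference_protected(gp, 1);

    newp->a = new_a;
    rcu_assign_pointer(gp, newp); /* publish the new version */
    synchronize_rcu();            /* wait for pre-existing readers to finish */
    kfree(oldp);                  /* reclaim: no reader can still see it */
}
```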


What is the difference between an atomic operation and a critical section? Which of the two prevents context switching?

A programming language or the processor already provides "default" atomic operations, and we can use them, as far as I understand.
https://en.wikipedia.org/wiki/Linearizability
What is the difference between an atomic operation and a critical section?
Atomic operations are instructions that guarantee atomic accesses/updates of shared (small) variables. This generally includes operations like increment, decrement, addition, subtraction, compare-and-swap (aka CAS), exchange, logical operations (and, or, xor), as well as basic loads/stores. If you want to perform a non-trivial operation that is not supported by the target platform (or one involving large variables), then you cannot use a single atomic operation. This means either multiple of them are required or another mechanism should be used instead (eg. critical section, transactional memory). Note that using multiple atomic operations often makes things significantly more complex (see the ABA problem). On mainstream CPUs, atomic operations are generally implemented by locking cache lines of shared caches (eg. L3) so that only one thread can access them at a time.
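For instance, a sketch using C11 <stdatomic.h> (the GCC __atomic builtins are equivalent; names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter;   /* shared; zero-initialized */

void hit(void)
{
    atomic_fetch_add(&counter, 1);   /* one indivisible read-modify-write */
}

/* CAS: store new_val only if the variable still holds `expected`.
 * A false return means another thread got there first - the heart
 * of the ABA-prone retry loops mentioned above. */
bool try_update(atomic_int *v, int expected, int new_val)
{
    return atomic_compare_exchange_strong(v, &expected, new_val);
}
```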
Critical sections are meant to protect one or more instructions from being executed by multiple threads at the same time. They are generally protected using a system mutex. The thread entering the critical section locks the associated mutex and unlocks it when leaving the section. System mutexes cause a thread entering a critical section to wait if the associated mutex is already locked. This is generally done using a context switch (the thread is descheduled and rescheduled later).
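As a sketch, the classic shape of a critical section built on a POSIX mutex, here guarding two fields that must stay consistent with each other (illustrative names):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;    /* these two fields must change together, */
static long shared_count;    /* which no single atomic op can guarantee */

void add_sample(long sample)
{
    pthread_mutex_lock(&lock);     /* enter the critical section */
    shared_total += sample;
    shared_count += 1;
    pthread_mutex_unlock(&lock);   /* leave; a sleeping waiter may be woken */
}
```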
Critical sections can be efficient when the lock is very rarely already taken by another thread. Context switches can significantly impact performance. Atomic operations are not great either when many threads perform atomic operations on the same variable. Contention effects can make atomic accesses significantly slower (eg. spin locks). This is especially true for atomic CAS operations. Some platforms (eg. GPUs) can execute atomic operations very quickly since they have dedicated units for executing them efficiently.
which of the two prevents context switching?
Neither of the two prevents context switching. Modern operating systems can perform a context switch at any time. That being said, critical sections generally cause context switches: a thread trying to enter a critical section already locked by another thread will typically go to sleep and be awakened by the OS scheduler when the other thread unlocks the section. Atomic operations do not impact the scheduling of the system (at least not on mainstream platforms).
Note that the above text is also true for processes.
Speaking only to the nomenclature question:
"Atomic" means "cannot be broken down into smaller parts." In programming, an operation performed by one thread is "atomic" (as seen from other threads) if there is no possible way for the other threads to see the operation in a half-way done state. From the point of view of other threads, it's as if the entire operation happened in a single instant. It either has already happened, or it hasn't happened yet. There is no in between.
As Jérôme Richard points out, modern computer hardware provides atomic operations on simple variables. We can use those to make more complex operations seem "atomic" from the point of view of other threads either by using the hardware atomics in tricky non-blocking algorithms, or by using the hardware atomics in the implementation of mutex locks.
"Critical section" comes from a time before multi-threading. In operating system kernel code, and in "bare metal" application code, there has always been a limited form of concurrency between the main body of code and the interrupt handlers. "Critical section," back in the day, referred to a routine in the main body of code that was protected from interference by the interrupt handlers by executing it with interrupts disabled.
Systems programmers today still use "critical section" with the original meaning, but now we also sometimes say it to talk about a routine that is executed by a thread while the thread has a mutex locked.
IMO, "critical section" encourages a somewhat less useful way of thinking about mutex locks though because it's never the code that needs protection from interference. It's always about protecting the integrity of shared data. Sometimes a programmer who worries about defining The critical section can lose sight of the fact that there may be multiple routines in the program that all access the same shared data.
IMO, this is one place where an object-oriented style of programming shines, because it's easier to keep track of what needs to be protected if it is encapsulated in private members of some object and can only be accessed through the object's thread-safe public methods.

Do I ever *not* want to use a read/write lock instead of a vanilla mutex?

When synchronizing access to a shared resource, is there ever a reason not to use a read/write lock instead of a vanilla mutex (which is basically just a write lock), besides the philosophical reason of it having more features than I may need?
In other words, if I just default to read/write locks as my preferred synchronization construct of choice, am I shooting myself in the foot?
It seems to me that one good rationale for always choosing a read/write lock, and using the read vs. write lock accordingly, is that I can implement some synchronization once and never have to think about it again, while gaining the possible benefit of better scalability if one day I drop the code into a higher-contention environment. So, assuming it has a potential benefit with no actual cost, it would make sense to use it all the time. Does that make sense?
This is on a system that isn't really resource limited, it's probably more of a performance question. Also I've phrased this question generally but I have Qt's QReadWriteLock and QMutex (C++) specifically in mind, if it matters.
In practice, the write lock in a read/write lock pair is more expensive than a simple mutex. Read/write locks always have some coordination strategy that must be applied when you acquire or release a lock. Depending on the particular implementation, this strategy can be cheap or expensive but it always exists.
In the case of QReadWriteLock, there's some logic that gives priority to writers. Even if the implementation of that logic is efficient and there are no readers in the waiting queue, it's never totally free.
I'm not familiar with all the details of the QMutex and QReadWriteLock implementation, but the documentation says that QMutex is heavily optimized for non-contended cases. QReadWriteLock carries no such remark. Maybe they just forgot to add the note, or maybe its behaviour under those circumstances is not as good as that of QMutex.
I think in the best case the penalty for using a read/write lock is negligible, but in the worst case, when you are fighting for every nanosecond, it might be noticeable.
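To make the trade-off concrete, here is the pattern in POSIX terms (QReadWriteLock/QMutex behave analogously; names are illustrative):

```c
#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int config_value;

int read_config(void)              /* many readers may hold this at once */
{
    int v;
    pthread_rwlock_rdlock(&rw);
    v = config_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_config(int v)           /* writers get exclusive access */
{
    pthread_rwlock_wrlock(&rw);
    config_value = v;
    pthread_rwlock_unlock(&rw);
}
```

For a read section this short, the extra bookkeeping of the read/write lock can easily cost more than the reader concurrency gains; the read lock only pays off when readers hold the lock long enough, and often enough, to actually overlap.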
It really boils down to the contention characteristics of your lock. The case where a simple mutex will perform significantly better is heavy contention with writer preference.
This is a very lengthy and debatable subject. May I recommend reading Spinlocks and Read-Write Locks and Sleeping Read-Write Locks, so you can make an educated decision?

Linux Kernel - Can I lock and unlock Spinlock in different functions?

I'm new to Kernel programming and programming with locks.
Is it safe to lock and unlock a spinlock in different functions? I am doing this to synchronize the code flow.
Also, is it safe to use spinlock (lock & unlock) in __schedule()? Is it safe to keep the scheduler waiting to acquire a lock?
Thanks in advance.
Instead of a spinlock, you can use a semaphore or a mutex. If you do use a spinlock, lock and unlock it in the same function, around the smallest possible set of operations.
A good reason NOT to lock and unlock a spinlock in different functions isn't so obvious.
One big and very good reason not to do it is that taking a spinlock bumps a flag in the scheduler's bookkeeping (the preempt count), and your kernel is in atomic context from that moment up to the moment you unlock the spinlock. Try it with a kernel compiled with debug flags - you'll see a lot of BUG messages in your klog.
Good luck.
If you design your code correctly, there is no harm in acquiring and releasing the same spinlock from multiple locations; in fact, that's pretty much the point of it. You can use a single spinlock to implement a set of functions that are similar to the Linux atomic operations but with whatever additional internal complexity you need. As long as within each function you acquire and release the lock around the shared resource(s), it should work just fine.
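For example, a sketch of that pattern with one spinlock shared by several small functions (illustrative names; process context is assumed, so the plain spin_lock() variants suffice):

```c
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(stats_lock);
static unsigned long hits, misses;

void record_hit(void)
{
    spin_lock(&stats_lock);
    hits++;
    spin_unlock(&stats_lock);
}

void record_miss(void)
{
    spin_lock(&stats_lock);
    misses++;
    spin_unlock(&stats_lock);
}

/* Both counters are read under the same lock, so callers always
 * see a consistent snapshot of the pair. */
void read_stats(unsigned long *h, unsigned long *m)
{
    spin_lock(&stats_lock);
    *h = hits;
    *m = misses;
    spin_unlock(&stats_lock);
}
```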
The main considerations are:
keep the code between each claim/release pair as brief as possible - it's an atomic context
this will work fine on a single core system and scale to pre-emptive SMP
you still need to consider what type of code you are implementing and what context(s) it might be running on, and use the correct type of spinlock for that
As long as you treat spinlocks with care - keeping in mind the potential for deadlocks - and understand that anything you do within the spinlock can affect system latency, then they are a very useful tool.
If you know that all the places in your code where you've claimed the lock always complete and release quickly, then you can be equally sure that any other bit of your code won't ever be spinning for ages waiting on the lock. This is potentially much more efficient than using a mutex.
The other value of taking the spinlock is that it acts as an implicit memory barrier, so by taking a lock around manipulating some resource (e.g. a member of a structure) you can be sure that any other thread through your code which also takes the lock before reading/writing that resource is seeing the current state of it, and not some out-of-date value due to cache coherency issues.
It's a potentially complex subject but hopefully that explanation helps a bit.

linux thread synchronization

I am new to Linux and Linux threads. I have spent some time googling to try to understand the differences between all the functions available for thread synchronization. I still have some questions.
I have found all of these different types of synchronizations, each with a number of functions for locking, unlocking, testing the lock, etc.
gcc atomic operations
futexes
mutexes
spinlocks
seqlocks
rculocks
conditions
semaphores
My current (but probably flawed) understanding is this:
semaphores are process wide, involve the filesystem (virtually I assume), and are probably the slowest.
Futexes might be the base locking mechanism used by mutexes, spinlocks, seqlocks, and rculocks. Futexes might be faster than the locking mechanisms that are based on them.
Spinlocks don't block and thus avoid context switches. However, they avoid the context switch at the expense of consuming CPU cycles until the lock is released (spinning). They should only be used on multiprocessor systems, for obvious reasons. Never sleep while holding a spinlock.
The seqlock just tells you, when you have finished your work, whether a writer changed the data the work was based on. If one did, you have to go back and repeat the work.
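For reference, that retry pattern looks like this sketch using the kernel's seqlock API (illustrative names; user-space code would need its own equivalent):

```c
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(pos_lock);
static int pos_x, pos_y;

void set_pos(int x, int y)          /* writer: takes the lock briefly */
{
    write_seqlock(&pos_lock);       /* sequence becomes odd */
    pos_x = x;
    pos_y = y;
    write_sequnlock(&pos_lock);     /* sequence becomes even again */
}

void get_pos(int *x, int *y)        /* reader: lock-free, may retry */
{
    unsigned seq;
    do {
        seq = read_seqbegin(&pos_lock);
        *x = pos_x;
        *y = pos_y;
    } while (read_seqretry(&pos_lock, seq));  /* writer intervened: redo */
}
```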
Atomic operations are the fastest sync calls, and are probably used in all of the above locking mechanisms. You do not want to use atomic operations on every field of your shared data, though. You want to use a lock (mutex, futex, spin, seq, rcu) or a single atomic operation on a lock flag when you are accessing multiple data fields.
My questions go like this:
Am I right so far with my assumptions?
Does anyone know the CPU cycle cost of the various options? I am adding parallelism to the app so we can get better wall time response at the expense of running fewer app instances per box. Performance is the utmost consideration. I don't want to consume CPU with context switching, spinning, or lots of extra CPU cycles to read and write shared memory. I am absolutely concerned with the number of CPU cycles consumed.
Which (if any) of the locks prevent interruption of a thread by the scheduler or interrupts... or am I just an idiot and all synchronization mechanisms do this? What kinds of interruption are prevented? Can I block all threads, or just threads on the locking thread's CPU? This question stems from my fear of interrupting a thread holding a lock for a very commonly used function. I expect that the scheduler might schedule any number of other workers who will likely run into this function and then block because it was locked. A lot of context switching would be wasted until the thread with the lock gets rescheduled and finishes. I can rewrite this function to minimize lock time, but still, it is so commonly called that I would like to use a lock that prevents interruption... across all processors.
I am writing user code...so I get software interrupts, not hardware ones...right? I should stay away from any functions (spin/seq locks) that have the word "irq" in them.
Which locks are for writing kernel or driver code and which are meant for user mode?
Does anyone think using an atomic operation to move multiple threads through a linked list is nuts? I am thinking of atomically changing the current-item pointer to the next item in the list. If the attempt works, the thread can safely use the data the current item pointed to before it was moved. Other threads would then be moved along the list.
futexes? Any reason to use them instead of mutexes?
Is there a better way than using a condition to sleep a thread when there is no work?
When using gcc atomic ops, specifically test_and_set, can I get a performance increase by doing a non-atomic test first and then using test_and_set to confirm? I know this will be case specific, so here is the case. There is a large collection of work items, say thousands. Each work item has a flag that is initialized to 0. When a thread has exclusive access to the work item, the flag will be one. There will be lots of worker threads. Any time a thread is looking for work, it can non-atomically test the flag for 1. If it reads a 1, we know for certain that the work is unavailable. If it reads a zero, it needs to perform the atomic test_and_set to confirm. So if the atomic test_and_set is 500 CPU cycles, because it disables pipelining, makes CPUs communicate, and flushes/fills L2 caches... and a simple test is 1 cycle... then as long as the ratio of stumbling upon already-completed work items is better than 500 to 1, this would be a win.
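For what it's worth, that double-checked claim would look something like this sketch with the legacy GCC __sync builtins the question mentions (`work_item` and `claimed` are illustrative names):

```c
/* 0 = free, 1 = taken or done, as described above. */
struct work_item {
    volatile int claimed;
    /* ... payload ... */
};

/* Returns nonzero if the calling thread won the item. */
int try_claim(struct work_item *w)
{
    if (w->claimed)        /* cheap plain read filters out finished items */
        return 0;
    /* Rare path: confirm with the atomic RMW. It returns the previous
     * value, so 0 means this thread is the one that set it to 1. */
    return __sync_lock_test_and_set(&w->claimed, 1) == 0;
}
```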
I hope to use mutexes or spinlocks sparingly, to protect sections of code that I want only one thread on the SYSTEM (not just the CPU) to access at a time. I hope to sparingly use gcc atomic ops to select work and minimize the use of mutexes and spinlocks. For instance: a flag in a work item can be checked to see if a thread has worked it (0 = no, 1 = yes or in progress). A simple test_and_set tells the thread whether it has the work or needs to move on. I hope to use conditions to wake up threads when there is work.
Thanks!
Application code should probably use POSIX thread functions. I assume you have man pages, so type
man pthread_mutex_init
man pthread_rwlock_init
man pthread_spin_init
Read up on them and the functions that operate on them to figure out what you need.
If you're doing kernel mode programming then it's a different story. You'll need to have a feel for what you are doing, how long it takes, and what context it gets called in to have any idea what you need to use.
Thanks to all who answered. We resorted to using gcc atomic operations to synchronize all of our threads. The atomic ops were about 2x slower than setting a value without synchronization, but orders of magnitude faster than locking a mutex, changing the value, and then unlocking the mutex (this becomes super slow when you start having threads bang into the locks...). We only use pthread_create, attr, cancel, and kill. We use pthread_kill to signal sleeping threads to wake up. This method is 40x faster than cond_wait. So basically... use pthread mutexes if you have time to waste.
In addition, you should check these books: Pthreads Programming: A POSIX Standard for Better Multiprocessing and Programming with POSIX(R) Threads.
Regarding question #8:
Is there a better way than using a condition to sleep a thread when there is no work?
Yes, I think the best approach, instead of using sleep, is to use functions like sem_post() and sem_wait() from semaphore.h.
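For illustration, a minimal sketch of that approach (illustrative names, error handling elided). Unlike a condition variable, a semaphore remembers posts, so a wakeup issued before the worker sleeps is not lost:

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t work_available;       /* counts queued work items */
                                   /* startup: sem_init(&work_available, 0, 0); */
void producer_adds_work(void)
{
    /* ... enqueue an item (with appropriate locking) ... */
    sem_post(&work_available);     /* wake exactly one sleeping worker */
}

void *worker(void *arg)
{
    for (;;) {
        sem_wait(&work_available); /* sleeps until work is posted */
        /* ... dequeue and process one item ... */
    }
    return NULL;
}
```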
regards
A note on futexes - they are more descriptively called fast userspace mutexes. With a futex, the kernel is involved only when arbitration is required, which is what provides the speed up and savings.
Implementing a futex can be extremely tricky (PDF), and debugging one can lead to madness. Unless you really, really, really need the speed, it's usually best to use the pthread mutex implementation.
Synchronization is never exactly easy, but trying to implement your own in userspace makes it inordinately difficult.

Are "benaphores" worth implementing on modern OS's?

Back in my days as a BeOS programmer, I read this article by Benoit Schillings, describing how to create a "benaphore": a method of using an atomic variable to enforce a critical section that avoids the need to acquire/release a mutex in the common (no-contention) case.
I thought that was rather clever, and it seems like you could do the same trick on any platform that supports atomic-increment/decrement.
On the other hand, this looks like something that could just as easily be included in the standard mutex implementation itself... in which case implementing this logic in my program would be redundant and wouldn't provide any benefit.
Does anyone know if modern locking APIs (e.g. pthread_mutex_lock()/pthread_mutex_unlock()) use this trick internally? And if not, why not?
What your article describes is in common use today. Most often it's called a "Critical Section", and it consists of an interlocked variable, a bunch of flags, and an internal synchronization object (a Mutex, if I remember correctly). Generally, in scenarios with little contention, the Critical Section executes entirely in user mode, without involving the kernel synchronization object. This guarantees fast execution. When contention is high, the kernel object is used for waiting, which releases the time slice and makes for faster turnaround.
Generally, there is very little sense in implementing your own synchronization primitives in this day and age. Operating systems come with a big variety of such objects, and they are optimized and tested in a significantly wider range of scenarios than a single programmer can imagine. It literally takes years to invent, implement, and test a good synchronization mechanism. That's not to say that there is no value in trying :)
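For reference, a minimal sketch of the benaphore idea using C11 atomics and a POSIX semaphore (not the article's exact code; no fairness, no recursion, error handling elided):

```c
#include <stdatomic.h>
#include <semaphore.h>

typedef struct {
    atomic_int count;    /* number of threads holding or wanting the lock */
    sem_t      sem;      /* touched only under contention */
} benaphore_t;

void ben_init(benaphore_t *b)
{
    atomic_init(&b->count, 0);
    sem_init(&b->sem, 0, 0);       /* starts at 0: nothing to hand over yet */
}

void ben_lock(benaphore_t *b)
{
    /* previous value > 0: someone already holds or wants the lock */
    if (atomic_fetch_add(&b->count, 1) > 0)
        sem_wait(&b->sem);         /* contended path: block in the kernel */
}

void ben_unlock(benaphore_t *b)
{
    /* previous value > 1: at least one thread is blocked in sem_wait */
    if (atomic_fetch_sub(&b->count, 1) > 1)
        sem_post(&b->sem);         /* hand the lock directly to one waiter */
}
```

The uncontended lock/unlock is a single atomic add each way; the kernel object is involved only when count reveals an actual collision.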
Java's AbstractQueuedSynchronizer (and its sibling AbstractQueuedLongSynchronizer) works similarly, or at least it could be implemented similarly. These types form the basis for several concurrency primitives in the Java library, such as ReentrantLock and FutureTask.
It works by way of using an atomic integer to represent state. A lock may define the value 0 as unlocked, and 1 as locked. Any thread wishing to acquire the lock attempts to change the lock state from 0 to 1 via an atomic compare-and-set operation; if the attempt fails, the current state is not 0, which means that the lock is owned by some other thread.
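That 0-to-1 transition maps onto a compare-and-set like this sketch (written in C for consistency with the rest of this page; AQS itself is Java, and its real tryAcquire also handles queuing and reentrancy):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_state;      /* 0 = unlocked, 1 = locked */

bool try_acquire(void)
{
    int expected = 0;
    /* Succeeds only if the state is still 0; failure means another
     * thread owns the lock, so the caller must queue up or retry. */
    return atomic_compare_exchange_strong(&lock_state, &expected, 1);
}

void release(void)
{
    atomic_store(&lock_state, 0);  /* reopen; a queued waiter can now win */
}
```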
AbstractQueuedSynchronizer also facilitates waiting on locks and notification of conditions by maintaining CLH queues, which are lock-free linked lists representing the line of threads waiting either to acquire the lock or to receive notification via a condition. Such notification moves one or all of the threads waiting on the condition to the head of the queue of those waiting to acquire the related lock.
Most of this machinery can be implemented in terms of an atomic integer representing the state as well as a couple of atomic pointers for each waiting queue. The actual scheduling of which threads will contend to inspect and change the state variable (via, say, AbstractQueuedSynchronizer#tryAcquire(int)) is outside the scope of such a library and falls to the host system's scheduler.
