What is the difference between spin_lock and raw_spin_lock()? - linux

There is a raw variant of each spin lock available in the Linux kernel, and I want to know its usage, e.g.:
raw_spin_lock(), raw_spin_lock_irqsave(), etc.

The spin_lock* functions do the same as the raw_spin_lock* ones, plus, when lock debugging is enabled (CONFIG_DEBUG_LOCK_ALLOC), they perform additional runtime checks on lock operations, such as checks for deadlock. These checks are performed by the lockdep subsystem.
As a rule, the spin_lock* functions should be used whenever possible.
Only in rare cases of very tricky locking policy, where lockdep can produce false warnings, should the raw_spin_lock* functions be used.
The raw_* functions can also be preferred over the common ones to reduce memory usage or for performance reasons, but only when actual time/space measurements show significant wins from these optimizations.

The main difference is that the spin_lock variants map to the raw_spin_lock variants on non-RT kernels, whereas if CONFIG_PREEMPT_RT is set they map to rt_spin_lock, which can sleep.
By decoupling the spin_lock from sleeping vs non-sleeping variations depending on whether we are RT or not, the spin_lock API can be kept consistent across the kernel code.
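A rough sketch of that mapping (simplified from what the kernel headers actually do, not compilable on its own, and omitting the lockdep annotations mentioned above):

```c
/* Simplified sketch; the real definitions live in include/linux/spinlock.h
 * and the RT variants, and carry extra lockdep/annotation machinery. */
#ifndef CONFIG_PREEMPT_RT
/* Non-RT: spinlock_t wraps a raw spinlock, and spin_lock() is a thin
 * wrapper that ends up in raw_spin_lock(). */
static __always_inline void spin_lock(spinlock_t *lock)
{
        raw_spin_lock(&lock->rlock);
}
#else
/* PREEMPT_RT: spinlock_t is backed by an rtmutex, so spin_lock()
 * becomes a sleeping lock via rt_spin_lock(). */
static __always_inline void spin_lock(spinlock_t *lock)
{
        rt_spin_lock(lock);
}
#endif
```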

Related

What happens when multiple threads try to access a critical section exactly at the same time?

I've been trying to find an answer to that, and all I could find is that once a thread reaches a critical section it locks it ahead of the other threads (or some other locking mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.
Thanks.
I think you are missing the point of fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well be simultaneous. The operating system guarantees that no two threads will enter the critical section. Even on multicore machines, this bit is specially implemented (with lots of trickery, even) to give that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have arrived in the same microsecond, BUT if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally the program should not malfunction. But any code will have bugs - including your code and the operating system's code for semaphores. So it is safe to assume that in some edge cases the program will indeed malfunction. But this assumption is true of any code in general.
Locking and critical sections are rather tricky to implement correctly. So for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose things like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations which provide somewhat softer guarantees but with higher performance. As I said, when dealing with critical sections, it is critical to choose the correct primitive and also to use it correctly.
...But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Short answer: memory system hardware makes it impossible for two different processors to access the same memory location at the same time. I'm not a computer architect, so I can't explain how it works, but the memory system serializes all of the accesses to the shared main memory by the various CPUs in a multi-CPU system.
"Entering a critical section" means locking a mutex, and a mutex basically is just a flag in shared memory that is accessed by a specific protocol.
It is the task of the cache coherence protocol to make sure there are no 2 writes to the same chunk of memory (cache line) at the same time. With MESI there can be multiple readers of the same cache line, but only 1 writer.
So if 2 threads want to write to the same cache line at the same time, their requests will be serialized by the cache coherence protocol.
Most CPU architectures support atomic operations like CAS (compare-and-swap). On x86 this can be done using a lock prefix. The CPU will lock the cache line when it starts the CAS instruction and will not respond to cache coherence requests from other cores until it has finished the atomic operation.
So if 2 CPUs both want to do a CAS at the same time, these operations are serialized by the underlying hardware.
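As a hedged illustration (not part of the original answer), here is a minimal C11 sketch of two threads racing to CAS the same variable; the hardware serialization described above is exactly why only one of them can succeed:

```c
/* Two threads race to CAS the same variable from 0 to their own id.
 * The hardware serializes the two read-modify-write operations, so
 * exactly one CAS succeeds. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int owner = 0;

static void *claim(void *arg)
{
        int me = *(int *)arg;
        int expected = 0;
        /* Returns true only for the thread whose CAS changed the value. */
        if (atomic_compare_exchange_strong(&owner, &expected, me))
                printf("thread %d won\n", me);
        else
                printf("thread %d lost, owner is %d\n", me, expected);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        pthread_create(&t1, NULL, claim, &id1);
        pthread_create(&t2, NULL, claim, &id2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}
```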

Guaranteed CPU cache update after a certain time

Let's say I have a variable var located somewhere in memory that an arbitrary number of processors/threads could read and modify at any given time, but it's guaranteed that at least n seconds will have elapsed between one processor modifying var and any other one reading var. Is there a value of n that guarantees that the processor reading var will read the updated value?
If your concern really is cache coherence, you should generally be safe¹.
In specific cases, however, you may not be.
Cache coherence is usually handled by the hardware² without the help of the software.
However, this is very implementation-specific: NUMA may be non-cache-coherent, a compute shader may need specific built-in functions, while IA32e and ARM generally hide cache coherence from the programmer.
To answer your question directly: No, you have no guarantees whatsoever.
The point is that cache coherence is something you deal with in clustered and parallel non-uniform architectures.
While in these situations the programming model is inherently multi-threaded, the two concepts³ are separate, and what should really concern you is how to properly handle multi-threading, specifically synchronization and memory ordering.
Your question seems to suggest a simple case, where the readers are executed long after the writer is done.
If this property is really enforced, you don't need any synchronization or memory barriers. Beware, however, that sleep functions don't qualify as valid enforcement.
If you instead need to synchronize (and so to order the memory accesses), then you need to use language-specific constructs, for example volatile in C# and Java, atomics in C and C++, or specific instructions in assembly, as sketched below.
You may need to implement critical sections too.
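A minimal sketch of the C/C++ atomics route (my own illustration, not from the original answer; the names data and ready are made up):

```c
/* Release/acquire publication pattern in C11: the writer stores the data,
 * then sets a ready flag with release ordering; a reader that observes
 * ready == 1 with acquire ordering is guaranteed to also see the data
 * written before the flag. */
#include <stdatomic.h>

static int data;                       /* plain payload */
static atomic_int ready = 0;           /* publication flag */

void writer(void)
{
        data = 42;
        atomic_store_explicit(&ready, 1, memory_order_release);
}

int reader(void)
{
        if (atomic_load_explicit(&ready, memory_order_acquire))
                return data;           /* guaranteed to see 42 */
        return -1;                     /* not published yet */
}
```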
If you actually need to manually control cache coherence for your architecture, then you have to check the relevant specifications (usually datasheets and formal papers), because there is no uniform way to deal with it; the compiler should provide some intrinsics or the runtime should provide a library.
So, to add something to the direct answer above: No, you have no guarantees whatsoever, but when a usual CPU, in a usual architecture, needs that data, it will be able to use the most up-to-date value anyway. So you don't need to worry about that aspect.
Please note the use of the words usual and that.
¹ For example, if you use an Intel/AMD/ARM CPU, don't even think about cache coherence.
² Either the CPU itself, a local monitor, a system monitor or a specific device.
³ Multi-threading and cache-coherence.
The cache will tend to get flushed on operating system tick interrupts when it goes into the scheduler to see if there's a different task to run.
However, as operating systems get smarter with things such as tickless operation (NO_HZ) and as CPU core counts go up, this gets less and less likely, and you shouldn't count on it.
Supercomputer clusters may not task switch for minutes at a time because they're using customized operating system code that doesn't interrupt the running jobs, ever. Compute jobs are assigned to cores 1-7 with no interrupts, and all of the other work runs on core 0.
There are two concepts mixed in your question: software synchronization and hardware coherency. Hardware coherency has already been covered by Margaret, so I won't cover it here.
Software Synchronization
x86 guarantees that a quadword access is carried out atomically if it is aligned on a 64-bit boundary. But this only guarantees that another processor won't read a partial result (e.g. a weird [32-bit new][32-bit old] mixture). It does not guarantee a hard deadline before which another processor will see the newly assigned value. Making another thread wait for some fixed time is not an elegant solution, because the two threads would first need a synchronized notion of the starting time. So, if you need such a guarantee, you need a condition variable to make the other thread wait until it is signaled.
https://en.wikipedia.org/wiki/Monitor_(synchronization)
In a word, use a condition variable if you need a sequencing effect, and use locks/transactional memory, etc. to protect variables longer than a quadword or not 64-bit aligned. A sketch of the condition-variable approach follows below.
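As a hedged sketch (mine, not from the original answer) of enforcing such a sequencing effect with a POSIX condition variable in C:

```c
/* The reader blocks on a condition variable until the writer has
 * published the new value, instead of relying on elapsed time. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t updated = PTHREAD_COND_INITIALIZER;
static int var;
static int var_ready = 0;

void writer(int new_value)
{
        pthread_mutex_lock(&lock);
        var = new_value;
        var_ready = 1;
        pthread_cond_broadcast(&updated);  /* wake any waiting readers */
        pthread_mutex_unlock(&lock);
}

int reader(void)
{
        int value;
        pthread_mutex_lock(&lock);
        while (!var_ready)                 /* guard against spurious wakeups */
                pthread_cond_wait(&updated, &lock);
        value = var;
        pthread_mutex_unlock(&lock);
        return value;
}
```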
By the way, here is some useful material on cache coherency if you are interested.
http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/10_coherence.pdf

Why cannot a Lock for `2`-threads be implemented using only `1` shared variable satisfying mutual exclusion and deadlock freedom?

I've been working a lot with concurrency at the practical level, and therefore I've also started to study it theoretically to gain insight into this field of computer science.
However, I have trouble understanding the following:
Why cannot a Lock for 2-threads be implemented using only 1 shared variable satisfying mutual exclusion and deadlock freedom?
More generally, why are at least n shared variables needed for an n-thread lock satisfying mutual exclusion and deadlock freedom?
Consider two threads A and B. I see that A must write to this variable in order to signify that it acquires the lock. The variable could be a boolean. Is it because A needs to read the variable before writing it, and these are two operations (not done atomically)?
Most likely, you're reading things that make assumptions about the platform's capabilities that are no longer realistic. You're probably considering the case where a CPU has no prefetching, no posted writes, total read and store ordering, no compiler optimizations that affect memory visibility or memory-operation ordering, and no risk of word tearing, but also does not have an atomic "read-modify-write" operation like increment or compare-exchange. With these assumptions, there's really no way to do it with one variable.
This is an interesting theoretical problem, but it has very little practical relevance. Modern CPUs do have all of those optimizations -- they prefetch reads, they post writes to buffers, they re-order reads and stores, and compilers optimize away memory operations. Word tearing is typically not an issue for aligned operations on native integer types. But, more importantly, modern CPUs have sophisticated, high-performance atomic operations such as increment, decrement, compare-exchange, and so on.
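To make that concrete, here is a hedged sketch (mine, not from the answer) of a lock built on a single shared variable, which becomes possible precisely because of an atomic read-modify-write:

```c
/* A single shared variable is enough once the hardware offers an atomic
 * read-modify-write: test-and-set atomically reads the old value and sets
 * the flag, so only one thread can win the transition from clear to set. */
#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;   /* the one shared variable */

void lock(void)
{
        while (atomic_flag_test_and_set_explicit(&lock_var, memory_order_acquire))
                ;   /* spin until we are the thread that flipped it */
}

void unlock(void)
{
        atomic_flag_clear_explicit(&lock_var, memory_order_release);
}
```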
When you write synchronization primitives, the exercise is highly platform-specific. The combination of capabilities available to you varies from platform to platform. Even more importantly, their costs vary drastically from platform to platform, so even if many solutions are possible, they may not be equally good.
Lastly, you have to have a deep understanding of what each primitive actually makes the platform do. For example, on modern Intel CPUs, there is hyper-threading. It's important that, for example, a thread waiting for a spinlock doesn't starve another thread sharing the physical core. That requires deep understanding of how hyper-threading actually works. Similarly, it's easy to code a spinlock so that you take the mother of all mispredicted branches when you acquire the lock and blow out the pipelines at the instant where performance is the most critical. You need to understand how branch prediction works and how it interacts with instruction pipelining to avoid this issue.
The vast majority of programmers should never, ever write synchronization primitives and use them in actual, real world code. Getting them to work with assured reliability is hard, and getting them to perform properly is much, much harder. And to top it off, it's not possible to measure their performance easily. (Of course, it's great to experiment, so long as you don't get an exaggerated sense of the usefulness of your experimental code.)

c - kernel - spinlocks vs queues

Despite the large amount of documentation available, I don't understand why one has to busy-wait on a spin lock in a kernel context.
Why isn't there a dedicated queue of processes requiring the lock, with an atomic counter/index, so that, with preemption disabled, they are handled in the order they entered this list, and when the counter drops to 0 on this list, control goes back to the main scheduling list?
Two situations:
system underloaded, maybe the spinlock is faster (depends on the lock concurrency at this moment);
system heavily loaded, maybe this strategy is faster (no more wait).
I may be missing something very clever here, and I would like to understand it, please.
Thank you
Spinlocks are primarily for use in (or to interoperate with) contexts that cannot block / reschedule. They should only be used where the likelihood of actually waiting for them is relatively low and the lock will not be held long. For example, assume an interrupt handler (and/or other contexts as well) has created a data structure and needs to link it into a doubly-linked list. That will only take nanoseconds to complete and the likelihood of colliding with another process is low, yet it must have an atomic effect: no other cpu/thread should see the list in an intermediate (partially linked) state.
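As a hedged sketch of that scenario (the names my_lock, my_list, struct item, and add_item are made up for illustration; spin_lock_irqsave(), list_add(), and friends are the standard kernel APIs):

```c
/* An interrupt handler links a new node into a shared doubly-linked list.
 * The critical section is only a few pointer updates, so a spinlock (with
 * local interrupts disabled) is the appropriate tool; sleeping is not
 * allowed in this context. */
#include <linux/spinlock.h>
#include <linux/list.h>

struct item {
        struct list_head node;
        int payload;
};

static LIST_HEAD(my_list);                 /* hypothetical shared list */
static DEFINE_SPINLOCK(my_lock);           /* hypothetical lock protecting it */

/* Called from an interrupt handler (or any non-sleeping context). */
void add_item(struct item *it)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);    /* nanosecond-scale hold time */
        list_add(&it->node, &my_list);         /* atomic effect w.r.t. readers */
        spin_unlock_irqrestore(&my_lock, flags);
}
```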

Usage of registers by the compiler in multithreaded program

It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think it's not, since keeping global variables in registers may hide or clobber values written by other threads.
And how about using registers to store local variables defined within a function?
I think it is OK, since no other thread will be able to access these variables.
Please correct me if I'm wrong.
Thank you!
Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides on may not be what actually gets executed, as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.
'Safe' is not really the right word to use. Many higher-level languages (e.g. C) do not have a threading model, and so the language specification says nothing about multi-threaded interactions.
If you are not using any kind of locking primitives then you have no guarantees whatsoever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking, the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again, the compiler still has no way of knowing whether it has to read the variable from memory again or can use the earlier value it stored in a register.
In C/C++, declaring a variable as volatile will force the compiler to always reload the variable from memory, which solves this particular instance.
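A hedged sketch of that read / lock / re-read situation in C with POSIX threads (the names shared_count, m, and example are made up):

```c
/* The read / lock / re-read pattern described above. As the answer notes,
 * without 'volatile' (or an atomic load) the compiler may reuse the value
 * of shared_count it already holds in a register for the second read;
 * declaring it volatile forces a fresh load each time. */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static volatile int shared_count;

int example(void)
{
        int before = shared_count;  /* unsynchronized peek */

        pthread_mutex_lock(&m);
        int after = shared_count;   /* re-read under the lock */
        pthread_mutex_unlock(&m);

        return after - before;      /* may differ if another thread changed it */
}
```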
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics, which can be used to ensure certain operations are thread-safe. Locking primitives are typically built on these low-level operations.
In a multithreaded program, you have one of two cases: if it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work, since the threads operate in the same virtual memory space) - all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context-switch routines that the OS uses, and the register set is considered part of a thread's (or process's) context.
If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...
