Single write - single read big memory buffer sharing without locks - multithreading

Let's suppose I have a big memory buffer used as a framebuffer, what is constantly written by a thread (or even multiple threads, guaranteed that no two threads write the same byte concurrently). These writes are indeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer for generating a display frame. This read should not be blocked, too. Tearing is not a problem in my case. In other words, my only goal is that every change done by the write thread(s) should eventually appear in the reading thread. The ordering or some (negligible compared to a display refresh rate) delay does not matter.
Reading and writing the same memory location concurrently is a data race, which results in an undefined behavior in c++11, and this article lists same really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of data race.
Still, I need some solution without completely redesigning this legacy code. Every advice counts what is safe from practical standpoints, independent of if it is theoretically correct or not. I am also open to not-fully-portable solutions, too.
Aside from that I have a data race, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls to key points in the writer thread(s).
My ideas to target the data race:
Use assembly for the reading thread. I would try to avoid that.
Make the memory buffer volatile, thus preventing the compiler to optimize such nasty things what are described in the referenced article.
Put the reading thread's code in a separate compile unit, and compile with -O0
+1. Leave everything as is, and cross my fingers (as currently I do not notice issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question is concretizing a previous one what was a little too generic.)

Related

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to my project I 'm working on.
Main:
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
In general, yes. Otherwise, the thread editing the value could change the value only locally so that the other thread will never see the value change. This can happens because of compilers (that could use registers to read/store variables) but also because of the hardware (regarding the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronizations.
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores executing each one a thread can load/store the same variable at the same time (but often not in practice). It is very dependent of the target platform.
For processor having (coherent) caches (ie. all modern mainstream processors) cache lines (ie. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevent two load/store being done exactly at the same time in the same cache line. If the variable cross multiple cache lines, then there is no protection.
On widespread x86/x86-64 platforms, variables having primitive types of <= 8 bytes can be modified atomically (because the bus support that as well as the DRAM and the cache) assuming the address is correctly aligned (it does not cross cache lines). However, this does not means all such accesses are atomic. You need to specify this to the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-bytes atomics. There is also an instruction set extension for the support of transactional memory. For wider types (or possibly composite ones) you likely need a lock or an atomic state to control the atomicity of the access to the target variable.

Do I need always lock before read even if very occasionally write to the memory?

I have a shared data structure that is read in one thread and modified in another thread. However, its data changes very occasionally. Most of time, it is read by the thread. I now have a Mutex (or RW lock) locked before read/write and unlocked after read/write.
Because the data rarely changes, lock-unlock every time it is read seems inefficient. If no change is made to the data, I can get rid of the lock because only read to the same structure can run simultaneously without lock.
My question is:
Is there a lock-free solution that allows me changes the data without a lock?
Or, the lock-unlock in read (one thread, in other words, no contention) don't take much of time/resources (no enter to the kernel) at all?
If there's no contention, not kernel call is needed, but still atomic lock acquisition is needed. If the resource is occupied for a short period of time, then spinning can be attempted before kernel call.
Mutex and RW lock implementations, such as (an usual quality implementation of) std::mutex / std::shared_mutex in C++ or CRITICAL_SECTION / SRW_LOCK in Windows already employ above mentioned techniques on their own. Linux mutexes are usually based on futex, so they also avoid kernel call when it its not needed. So you don't need to bother about saving a kernel call yourself.
And there are alternatives to locking. There are atomic types that can be accessed using lock-free reads and writes, they can always avoid lock. There are other patterns, such as SeqLock. There is transaction memory.
But before going there, you should make sure that locking is performance problem. Because use of atomics may be not simple (although it is simple for some languages and simple cases), and other alternatives have their own pitfalls.
An uncontrolled data race may be dangerous. Maybe not. And there may be very thin boundary between cases where it is and where it is not. For example, copying a bunch of integer could only result in garbage integers occasionally obtained, if integers are properly sized and aligned, then there may be only a mix up, but not garbage value of a single integer, and if you add some more complex type, say string, you may have a crash. So most of the times uncontrolled data race is treated as Undefined Behavior.

Is it safe to update an object in a thread without locks if other threads will not access it?

I have a vector of entities. At update cycle I iterate through vector and update each entity: read it's position, calculate current speed, write updated position. Also, during updating process I can change some other objects in other part of program, but each that object related only to current entity and other entities will not touch that object.
So, I want to run this code in threads. I separate vector into few chunks and update each chunk in different threads. As I see, threads are fully independent. Each thread on each iteration works with independent memory regions and doesn't affect other threads work.
Do I need any locks here? I assume, that everything should work without any mutexes, etc. Am I right?
Short answer
No, you do not need any lock or synchronization mechanism as your problem appear to be a embarrassingly parallel task.
Longer answer
A race conditions that can only appear if two threads might access the same memory at the same time and at least one of the access is a write operation. If your program exposes this characteristic, then you need to make sure that threads access the memory in an ordered fashion. One way to do it is by using locks (it is not the only one though). Otherwise the result is UB.
It seems that you found a way to split the work among your threads s.t. each thread can work independently from the others. This is the best case scenario for concurrent programming as it does not require any synchronization. The complexity of the code is dramatically decreased and usually speedup will jump up.
Please note that as #acelent pointed out in the comment section, if you need changes made by one thread to be visible in another thread, then you might need some sort of synchronization due to the fact that depending on the memory model and on the HW changes made in one thread might not be immediately visible in the other.
This means that you might write from Thread 1 to a variable and after some time read the same memory from Thread 2 and still not being able to see the write made by Thread 1.
"I separate vector into few chunks and update each chunk in different threads" - in this case you do not need any lock or synchronization mechanism, however, the system performance might degrade considerably due to false sharing depending on how the chunks are allocated to threads. Note that the compiler may eliminate false sharing using thread-private temporal variables.
You can find plenty of information in books and wiki. Here is some info https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
Also there is a stackoverflow post here does false sharing occur when data is read in openmp?

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following pictures are picked from Memory Reordering Caught in the Act:
Below is recorded:
I think the recorded instructions can also cause issue on a single CPU, because both r1 and r2 aren't 1.
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows it's own reordering, and can do clever tricks to pretend it's not. Thus, things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for this. Imagine if CPU #1 is waiting for a write to variableA, then it reads from variableB. If CPU#2 wrotes to variableB, then variableA like the code says, no problems occur. If CPU#2 reorders to write to variableA first, then CPU#1 doesn't know and tries to read from variableB before it has a value. This can cause crashes or any "random" behavior. (Intel chips have more magic that makes this not happen)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1 and be interrupted by process 2 and execute that between the two instructions. But if interrupted between them, it knows it has to abort both of them since it knows about it's own reordering, and knows it's in a dangerous state. So it will either do them in order, or do both before switching to process 2, or do neither before switching to process 2. All of which avoid the reordering problem.
There are multiple effects at work, but they are modeled as just one effect. Makes it easier to reason about them. Yes, a modern core already re-orders instructions by itself. But it maintains logical flow between them, if two instructions have an inter-dependency between them then they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon, it would be next to impossible to write a program if that wasn't the case. But that same guarantee cannot be provided by the memory controller. It has the un-enviable job of giving multiple processors access to the same shared memory.
First is the prefetcher, it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes. Ensures the core won't stall waiting for the read to complete. With the problem that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the read instruction executes. To an outside observer it looks like the instruction executed early.
And the store buffer, it takes the data of a write instruction and writes it lazily to memory. Later, after the instruction executed. Ensures the core won't stall waiting on the memory bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.

Deciding the critical section of kernel code

Hi I am writing kernel code which intends to do process scheduling and multi-threaded execution. I've studied about locking mechanisms and their functionality. Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking (mutex/semaphores/spinlocks)?
I know that where ever there is chance of concurrency in part of code, we require lock. But how do we decide, what if we miss and test cases don't catch them. Earlier I wrote code for system calls and file systems where I never cared about taking locks.
Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking?
Any object (global variable, field of the structure object, etc.), accessed concurrently when one access is write access requires some locking discipline for access.
But how do we decide, what if we miss and test cases don't catch them?
Good practice is appropriate comment for every declaration of variable, structure, or structure field, which requires locking discipline for access. Anyone, who uses this variable, reads this comment and writes corresponded code for access. Kernel core and modules tend to follow this strategy.
As for testing, common testing rarely reveals concurrency issues because of their low probability. When testing kernel modules, I would advice to use Kernel Strider, which attempts to prove correctness of concurrent memory accesses or RaceHound, which increases probability of concurrent issues and checks them.
It is always safe to grab a lock for the duration of any code that accesses any shared data, but this is slow since it means only one thread at a time can run significant chunks of code.
Depending on the data in question though, there may be shortcuts that are safe and fast. If it is a simple integer ( and by integer I mean the native word size of the CPU, i.e. not a 64 bit on a 32 bit cpu ), then you may not need to do any locking: if one thread tries to write to the integer, and the other reads it at the same time, the reader will either get the old value, or the new value, never a mix of the two. If the reader doesn't care that he got the old value, then there is no need for a lock.
If however, you are updating two integers together, and it would be bad for the reader to get the new value for one and the old value for the other, then you need a lock. Another example is if the thread is incrementing the integer. That normally involves a read, add, and write. If one reads the old value, then the other manages to read, add, and write the new value, then the first thread adds and writes the new value, both believe they have incremented the variable, but instead of being incremented twice, it was only incremented once. This needs either a lock, or the use of an atomic increment primitive to ensure that the read/modify/write cycle can not be interrupted. There are also atomic test-and-set primitives so you can read a value, do some math on it, then try to write it back, but the write only succeeds if it still holds the original value. That is, if another thread changed it since the time you read it, the test-and-set will fail, then you can discard your new value and start over with a read of the value the other thread set and try to test-and-set it again.
Pointers are really just integers, so if you set up a data structure then store a pointer to it where another thread can find it, you don't need a lock as long as you set up the structure fully before you store its address in the pointer. Another thread reading the pointer ( it will need to make sure to read the pointer only once, i.e. by storing it in a local variable then using only that to refer to the structure from then on ) will either see the new structure, or the old one, but never an intermediate state. If most threads only read the structure via the pointer, and any that want to write do so either with a lock, or an atomic test-and-set of the pointer, this is sufficient. Any time you want to modify any member of the structure though, you have to copy it to a new one, change the new one, then update the pointer. This is essentially how the kernel's RCU ( read, copy, update ) mechanism works.
Ideally, you must enumerate all the resources available in your system , the related threads and communication, sharing mechanism during design. Determination of the following for every resource and maintaining a proper check list whenever change is made can be of great help :
The duration for which the resource will be busy (Utilization of resource) & type of lock
Amount of tasks queued upon that particular resource (Load) & priority
Type of communication, sharing mechanism related to resource
Error conditions related to resource
If possible, it is better to have a flow diagram depicting the resources, utilization, locks, load, communication/sharing mechanism and errors.
This process can help you in determining the missing scenarios/unknowns, critical sections and also in identification of bottlenecks.
On top of the above process, you may also need certain tools that can help you in testing / further analysis to rule out hidden problems if any :
Helgrind - a Valgrind tool for detecting synchronisation errors.
This can help in identifying data races/synchronization issues due
to improper locking, the lock ordering that can cause deadlocks and
also improper POSIX thread API usage that can have later impacts.
Refer : http://valgrind.org/docs/manual/hg-manual.html
Locksmith - For determining common lock errors that may arise during
runtime or that may cause deadlocks.
ThreadSanitizer - For detecting race condtion. Shall display all accesses & locks involved for all accesses.
Sparse can help to lists the locks acquired and released by a function and also identification of issues such as mixing of pointers to user address space and pointers to kernel address space.
Lockdep - For debugging of locks
iotop - For determining the current I/O usage by processes or threads on the system by monitoring the I/O usage information output by the kernel.
LTTng - For tracing race conditions and interrupt cascades possible. (A successor to LTT - Combination of kprobes, tracepoint and perf functionalities)
Ftrace - A Linux kernel internal tracer for analysing /debugging latency and performance related issues.
lsof and fuser can be handy in determining the processes having lock and the kind of locks.
Profiling can help in determining where exactly the time is being spent by the kernel. This can be done with tools like perf, Oprofile.
The strace can intercept/record system calls that are called by a process and also the signals that are received by a process. It shall show the order of events and all the return/resumption paths of calls.

Resources