Usage of registers by the compiler in multithreaded program

Usage of registers by the compiler in multithreaded program - multithreading

It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think its not, since storing global variables in registers may change saved values
for other threads.
And how about using registers to store local variables defined within a function?
I think it is ok,since no other thread will be able to get these variables.
Please correct me if im wrong.
Thank you!

Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides, may not be what actually gets executed as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.

'Safe' is not really the right word to use. Many higher level languages (eg. C) do not have a threading model and so the language specification says nothing about mutli-threaded interactions.
If you are not using any kind of locking primitives then you have no guarantees what so ever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again the compiler still has no way of knowing if it has to read the variable from memory again, or can use the earlier value it stored in a register.
In C/C++ declaring a variable as volatile will force the compiler to always reload the variable from memory and solve this particular instance.
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics which can be used to ensure certain operations are threadsafe. Locking primitives are typically built on these low level operations.

In a multithreaded program, you have one of two cases: if it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work since the threads operate in the same virtual memory space) - all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context switch routines that the OS uses, and the register set is considered part of a threads (or processes) context. If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...

Related

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to my project I 'm working on.
Main:
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?

a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
In general, yes. Otherwise, the thread editing the value could change the value only locally so that the other thread will never see the value change. This can happens because of compilers (that could use registers to read/store variables) but also because of the hardware (regarding the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronizations.
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores executing each one a thread can load/store the same variable at the same time (but often not in practice). It is very dependent of the target platform.
For processor having (coherent) caches (ie. all modern mainstream processors) cache lines (ie. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevent two load/store being done exactly at the same time in the same cache line. If the variable cross multiple cache lines, then there is no protection.
On widespread x86/x86-64 platforms, variables having primitive types of <= 8 bytes can be modified atomically (because the bus support that as well as the DRAM and the cache) assuming the address is correctly aligned (it does not cross cache lines). However, this does not means all such accesses are atomic. You need to specify this to the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-bytes atomics. There is also an instruction set extension for the support of transactional memory. For wider types (or possibly composite ones) you likely need a lock or an atomic state to control the atomicity of the access to the target variable.

Guaranteed CPU cache update after a certain time

Let's say I have a variable var located somewhere in memory and that an arbitrary number of processors/threads could read and modify it at any given time. But it's guaranteed that at least n seconds will have elapsed between a processor modifying var and any other one reading var. Is it possible to be certain that, if time in seconds is n, there's a value for n that guarantees that the processor reading var will read the updated value?

If your concern really is Cache coherence you should generally be safe 1.
Specifically, however, you may be not.
Cache coherence is usually handled by the hardware2 without the help of the software.
However this is very implementation specific: NUMA may be non cache-coherent, a Compute Shader may need specific built-in functions, IA32e and ARM generally hide cache coherence from the programmer.
To answer you question directly: No, you have no guarantees whatsoever.
The point is that cache coherence is something you deal with in clustered and parallel non uniform architectures.
While in this situations the programming model is inherently multi-threading, the two concepts3 are separated and what really should bug you is how to properly handle multi-threading, specifically synchronization and memory order.
Your question seems to suggest a simple case, where the readers are executed long after the writer is done.
If this property is really enforced you don't need any synchronization nor memory barrier. Beware however that sleep functions don't qualify as a valid enforcement.
If you instead need to synchronize (and so to order the memory accesses) then you need to use language specific constructs, for example volatile in C# and Java, atomics in C and C++ or specific instructions in assembly.
You may need to implement Critical sections too.
If you actually need to manually control the cache coherence for your architecture, than you have to check the specifications of interest (usually datasheets and formal papers) because there is no uniform way to deal with it and the compiler should provide some intrinsic or the runtime should provide a library.
So to add something to the direct answer above: No, you have no guarantees whatsoever, but when an usual CPU, in an usual architecture, need that data, it will be able to use the most updated one anyway. So you don't need to worry about that aspect.
Please note the use of the words common and that
1 For example if you use an Intel/AMD/ARM CPU, don't even think about cache coherence.
2 Either the CPU itself, a local monitor, a system monitor or a specific device.
3 Multi-threading and cache-coherence.

The cache will tend to get flushed on operating system tick interrupts when it goes into the scheduler to see if there's a different task to run.
However, as operating systems get smarter with things such as tickless NoHz and as CPU core counts go up, this gets less and less likely, and you shouldn't count on it.
Supercomputer clusters may not task switch for minutes at a time because they're using customized operating system code that doesn't interrupt the running jobs, ever. Compute jobs are assigned to a core from 1-7 with no interrupts and all of the other work runs on core-0.

There are two concepts mixed in you question: software synchronization and hardware coherency. Hardware coherency is talked by Margaret already so I won't cover it here.
Software Synchronization
x86 provides guarantee that quadword access would be carried out atomically if aligned on 64-bit boundary. But this guarantees that other processor won't read partial result (e.g. [32bit New]<32bit Old> weird mixture). It does not guarantee a hard time deadline before which another processor would see the newly assigned value. Let another thread wait for some time is not quite an elegant solution because first the two threads need to have the same starting time synchronized. So, if you need such guarantee, you need conditional variable to make sure another thread should wait.
https://en.wikipedia.org/wiki/Monitor_(synchronization)
In a word, use conditional variable if you need a sequencing effect and use locks/transactional memory, etc. to protect variable longer than quadword or not 64bit aligned.
Btw, here is an useful material for cache coherency if you are interested.
http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/10_coherence.pdf

Perl ithreads :shared variables - multiprocessor kernel threads - visibility

perlthrtut excerpt:
Note that a shared variable guarantees that if two or more threads try
to modify it at the same time, the internal state of the variable will
not become corrupted. However, there are no guarantees beyond this, as
explained in the next section.
Working on Linux supporting multiprocessor kernel threads.
Is there a guarantee that all threads will see the updated shared variable value ?
Consulting the perlthrtut doc as stated above there is no such guarantee.
Now the question: What can be done programmatically to guarantee that?

You ask
Is there a guarantee that all threads will see the updated shared variable value ?
Yes. :shared is that guarantee. The value will be safely and consistently and freshly updated.
The problem is simply that, without other synchronization, you don't know the order of these updates.
Consulting the perlthrtut doc as stated above there is no such guarantee.
You didn't read far enough. :)
The very next section in perlthrtut explains the kind of pitfalls you do face with perl threads: data races, which is to say, application logic races concerning shared data. Again, the shared data will be consistent and fresh and immune to corruption from (more-or-less) atomic perl opcodes. However, the high-level perl operations you perform on that shared data are not guaranteed to be atomic. $shared_var++, for instance, might be more than one atomic operation.
(If I may hazard a guess, you are perhaps thinking too much about other languages' lower level threading interfaces with their cache inconsistencies, torn words, reordered instructions, and lions and tigers and bears. Perl's model takes care of those low-level concerns for you.)

Using :shared on a variable causes all threads to reference it in the same physical memory address, so it doesn't matter which processor/core/hyper-thread they happen to be executing in. The perlthrtut talk of guarantees is in reference to race conditions, and in short, that you need to take into account that shared variables can be modified by any thread at any time. If this is a problem you'll need to make use of synchronization functions (e.g. lock() and cond_wait()) to control access.

You seem to be confused as to what :shared does. It makes it so a variable is shared by all threads.
A variable is indeed guaranteed to have the value it has, no matter which thread accesses it. It's a tautology, so nothing can be done to programmatically guarantee that.

Is it possible to share a register between threads?

I know that when the OS/Hardware switch between the execution of different threads it manage the store/restore the context of each thread, however I do not know many of the details. My question is: are there any register that I can use to share information between threads? In x86? mips? arm? etc,. linux? windows?
Any suggestion on how this can be done is highly apreciated.

There are some processor architectures where certain registers are not stored during context switch. From memory, 29K has some registers like that, which are essentially just global variables - gr112 .. gr115 from looking at the web. Now, this is a machine that has 192 physical registers, so it's not really a surprise it can afford sacrificing a few for this sort of purpose.
I know for a fact that x86 and x86-64 use "all registers", as does ARM. From what I can gather, MIPS also doesn't have any registers "reserved for the user". This applies to both Windows and Linux operating systems.
For any processor with a small number of registers (less or equal to 32), I would say that "wasting" registers are globals just to hold some value that some other thread/process may want to read is a waste of resource - generic code will run faster if that register is used as a general purpose register available for the compiler.

If you are writing all the code that will go in a system, you may dedicate registers to whatever purpose you want, subject to the limitation that any register which is dedicated to a particular function will be unusable for any other purpose. There are some very specialized situations where this may be worth doing; these generally entail, bizarre as it may seem, programs that are very simple but need to run very fast. Some compilers like gcc can facilitate such usage by allowing a programmer to specify particular registers that the code it generates should not use for any purpose unless explicitly requested. In general, because the efficiency of compiled code will be reduced by restricting the number of registers the compiler can use, it will be more efficient to simply use statically-defined memory locations to exchange information between threads. While memory locations cannot be accessed as quickly as registers, one can reserve many of them for various purposes without affecting the compiler's ability to optimize register usage.
The one situation I've seen on the ARM where using a dedicated register was helpful was a situation where a significant plurality of methods needed to share a common static data structure. Specifying that a certain register should always be assumed to hold a pointer to that data structure, and that code must never modify it, eliminates the need for code to load the address of that structure before accessing items therein. If you want to share information among threads, that might be a useful approach, since accessing an arbitrary static location generally requires a PC-relative load to fetch the address followed by a load of the actual data; having a dedicated register would eliminate one of the loads.

Your question seems reasonable at first glance. Other people have tried to answer the question directly. First we have two fairly nebulous concepts,
Threads
Registers
If you talk to Ada folks, they will freak out at the lack of definition of a linux or posix threads. They like something more like Java's green threads with very deterministic scheduling. I think you mean threads that are fast for the processor, like posix threads.
The 2nd issue is what is a register? To most people they are limited to 8,16 or 32 registers that are hard coded in the CPU's instruction set. There are often second class registers that can be accessed by other means. Mainly they are are amazingly fast.
The inverse
The inverse of your question is quite common. How to set a register to a different value for each thread. The general purpose registers are use by the compiler and the ABI of the compiler is intimately familiar to the OS context switch code. What may not be clear is that things like the upper bits of a stack register may be constant every time a thread runs; but are different for each thread. That is to say that each thread has its own stack.
With ARM Linux, a special co-processor register is used to implement thread local storage. The co-processor register is slower to access than a general purpose register, but it is still quite fast. That takes us to the difference between a process and a thread.
Endemic to threads
A process has a completely different memory layout. Ie, the mmu page tables switch for different processes. For a thread, the register set may be different, but all of regular memory is shared between threads. For this reason, there is lots of mutexes when you do thread programming.
Now, consider a CPU cache. It is ultra-fast memory just like a general purpose register. The only difference is the amount of instructions it takes to address it.
Answer
All of the OS's and CPUs already have this! Each thread shares memory and that memory is cached. Loading a global variable in two threads from cache is near as fast as register access. As the thread register you propose can only hold a pointer, you would need to de-reference it to access some larger entity. Loading a global variable will be nearly as fast and the compiler is free to put this in any register it likes. It is also possible for the compiler to use these registers in routines that don't need this access. So even, if there was an OS that reserved a general purpose register to be the same between threads, it would only be faster for a very small set of applications.

Multithreading and out-of-order execution

Assume we have a multithreaded C program (pthreads), and the (unsynchronized) shared variable accesses of the individual threads are not reordered by the compiler. Does an x86 CPU respect the order of the shared variable accesses (within a single thread), or is it possible that it reorders some memory accesses?

Unsynchronized shared variable accesses are dangerous, and out-of-order is one reason for it.
The x86 keeps writes in order (within a thread), but not reads.
This can get you into trouble, if you assume the order remains. For example:
Thread A writes to x and then to y. Assuming the compiler didn't reorder it, the cpu won't reorder it (x86 won't, others might).
thread B reads y and then x. You might think that if it got y's new value, then surely you'll get x's new value as well.
Not so. The CPU may reorder thread B's reads, so y will be actually read earlier.
EDIT: as "Man of One Way" pointed out, in this case, x86 (but not all processors!) guarantees ordering.
I quote the Intel software developer's manual:
Writes by a single processor are observed in the same order by all
processors.
This isn't true for writes by multiple processors - they may seem to be ordered differently by different processors.
However, I highly recommend not relying on it, and using use proper synchronization instead.
The synchronization primitives are implemented with atomic operations and/or barriers, which keep you safe.

Some reorderings are possible, due to the presence of a store buffer. See e.g. https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
However, the reorderings are seen only with several threads, within a single thread all accesses from that thread appear to happen in order.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string