Can gcc's thread-local storage share cache lines across threads?

Can gcc's thread-local storage share cache lines across threads? - linux

If a variable or constant-sized array is declared with __thread, can the backing virtual address range share a cache line across threads? (For example, if two copies of a thread-local integer land on the same cache line, performance will suffer because of cache line bouncing.) Does the answer depend on gcc/Linux version and hardware architecture?

According to Ultrich Drepper who is an infamous expert former glibc maintainer, "not allocated in the normal data segment; instead each thread has its own separate area where such variables are stored. The variables can have static initializers. All thread-local variables are addressable by all other threads but, unless a thread passes a pointer to a thread-local variable to those other threads, there is no way the other threads can find that variable. Due to the variable being thread-local, false sharing is not a problem—unless the program artificially creates a problem."
If you study, Memory part 6: More things programmers can do thread local variables, if a pointer is passed, could be accessing the same cache line.
To avoid this issue, simply don't pass the addresses in question around. As specific technique of grouping frequent rw variables together to share a cache line in a struct is described, with padding to AVOID multiple threads writing to same cache line when NOT using TLS, so long as the __thread variable pointers are not used by other threads, cache line bouncing ought be avoided.
Hopefully the linker implementers know the CPU architecture, you can't choose where in virtual & physical memory address space the multiplied thread local storage is allocated, it ought be safely assumed they plus CPU designers do a reasonable job of avoiding performance problems due to false clashes. You'ld have the same issues happening by accident between seperate processes, nevermind threads if cache associativity bounces was a common problem.

Related

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to my project I 'm working on.
Main:
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?

a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
In general, yes. Otherwise, the thread editing the value could change the value only locally so that the other thread will never see the value change. This can happens because of compilers (that could use registers to read/store variables) but also because of the hardware (regarding the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronizations.
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores executing each one a thread can load/store the same variable at the same time (but often not in practice). It is very dependent of the target platform.
For processor having (coherent) caches (ie. all modern mainstream processors) cache lines (ie. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevent two load/store being done exactly at the same time in the same cache line. If the variable cross multiple cache lines, then there is no protection.
On widespread x86/x86-64 platforms, variables having primitive types of <= 8 bytes can be modified atomically (because the bus support that as well as the DRAM and the cache) assuming the address is correctly aligned (it does not cross cache lines). However, this does not means all such accesses are atomic. You need to specify this to the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-bytes atomics. There is also an instruction set extension for the support of transactional memory. For wider types (or possibly composite ones) you likely need a lock or an atomic state to control the atomicity of the access to the target variable.

Is synchronization for variable change cheaper then for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I never saw those things happen I assume that there is some system level synchronization when writing variables. I assume, that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a topic and not totally fictious from me.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?

Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads processes, OS, etc. It is not a matter of synchronization, just required to insure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 et locked when at 1.
The basic method to do that is
got_the_lock=0
while(!got_the_lock)
fetch lock value from memory
set lock value in memory to 1
got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one need atomic memory access. An atomic access is typically a read-modify-write cycle to a data in memory that cannot interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operation and it is realized thanks tp specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be a regular access. Atomic access is mostly use to synchronize threads to insure that only one thread is in a critical section.
So why are atomic access expensive ? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction order may be different from instruction program order, provided the semantic of the program is respected. This is heavily exploited to improve performances : compiler reorder instructions, processor execute them out-of-order, write-back caches write data in memory in any order, and memory write buffer do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffer that requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.

Synchronization always has a cost involved. And the cost increases with contention due to threads waking up, fighting for lock and only one gets it, and the rest go to sleep resulting in lot of context switches.
However, such contention can be kept at a minimum by using synchronization at a much granular level as in a CAS (compare and swap) operation by CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing a IO and therefore run high chance of creating a contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason the non-blocking algorithms like Treiber Stack Michael Scott exist. They do a their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.

isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, None of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

Is it possible to share a register between threads?

I know that when the OS/Hardware switch between the execution of different threads it manage the store/restore the context of each thread, however I do not know many of the details. My question is: are there any register that I can use to share information between threads? In x86? mips? arm? etc,. linux? windows?
Any suggestion on how this can be done is highly apreciated.

There are some processor architectures where certain registers are not stored during context switch. From memory, 29K has some registers like that, which are essentially just global variables - gr112 .. gr115 from looking at the web. Now, this is a machine that has 192 physical registers, so it's not really a surprise it can afford sacrificing a few for this sort of purpose.
I know for a fact that x86 and x86-64 use "all registers", as does ARM. From what I can gather, MIPS also doesn't have any registers "reserved for the user". This applies to both Windows and Linux operating systems.
For any processor with a small number of registers (less or equal to 32), I would say that "wasting" registers are globals just to hold some value that some other thread/process may want to read is a waste of resource - generic code will run faster if that register is used as a general purpose register available for the compiler.

If you are writing all the code that will go in a system, you may dedicate registers to whatever purpose you want, subject to the limitation that any register which is dedicated to a particular function will be unusable for any other purpose. There are some very specialized situations where this may be worth doing; these generally entail, bizarre as it may seem, programs that are very simple but need to run very fast. Some compilers like gcc can facilitate such usage by allowing a programmer to specify particular registers that the code it generates should not use for any purpose unless explicitly requested. In general, because the efficiency of compiled code will be reduced by restricting the number of registers the compiler can use, it will be more efficient to simply use statically-defined memory locations to exchange information between threads. While memory locations cannot be accessed as quickly as registers, one can reserve many of them for various purposes without affecting the compiler's ability to optimize register usage.
The one situation I've seen on the ARM where using a dedicated register was helpful was a situation where a significant plurality of methods needed to share a common static data structure. Specifying that a certain register should always be assumed to hold a pointer to that data structure, and that code must never modify it, eliminates the need for code to load the address of that structure before accessing items therein. If you want to share information among threads, that might be a useful approach, since accessing an arbitrary static location generally requires a PC-relative load to fetch the address followed by a load of the actual data; having a dedicated register would eliminate one of the loads.

Your question seems reasonable at first glance. Other people have tried to answer the question directly. First we have two fairly nebulous concepts,
Threads
Registers
If you talk to Ada folks, they will freak out at the lack of definition of a linux or posix threads. They like something more like Java's green threads with very deterministic scheduling. I think you mean threads that are fast for the processor, like posix threads.
The 2nd issue is what is a register? To most people they are limited to 8,16 or 32 registers that are hard coded in the CPU's instruction set. There are often second class registers that can be accessed by other means. Mainly they are are amazingly fast.
The inverse
The inverse of your question is quite common. How to set a register to a different value for each thread. The general purpose registers are use by the compiler and the ABI of the compiler is intimately familiar to the OS context switch code. What may not be clear is that things like the upper bits of a stack register may be constant every time a thread runs; but are different for each thread. That is to say that each thread has its own stack.
With ARM Linux, a special co-processor register is used to implement thread local storage. The co-processor register is slower to access than a general purpose register, but it is still quite fast. That takes us to the difference between a process and a thread.
Endemic to threads
A process has a completely different memory layout. Ie, the mmu page tables switch for different processes. For a thread, the register set may be different, but all of regular memory is shared between threads. For this reason, there is lots of mutexes when you do thread programming.
Now, consider a CPU cache. It is ultra-fast memory just like a general purpose register. The only difference is the amount of instructions it takes to address it.
Answer
All of the OS's and CPUs already have this! Each thread shares memory and that memory is cached. Loading a global variable in two threads from cache is near as fast as register access. As the thread register you propose can only hold a pointer, you would need to de-reference it to access some larger entity. Loading a global variable will be nearly as fast and the compiler is free to put this in any register it likes. It is also possible for the compiler to use these registers in routines that don't need this access. So even, if there was an OS that reserved a general purpose register to be the same between threads, it would only be faster for a very small set of applications.

Usage of registers by the compiler in multithreaded program

It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think its not, since storing global variables in registers may change saved values
for other threads.
And how about using registers to store local variables defined within a function?
I think it is ok,since no other thread will be able to get these variables.
Please correct me if im wrong.
Thank you!

Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides, may not be what actually gets executed as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.

'Safe' is not really the right word to use. Many higher level languages (eg. C) do not have a threading model and so the language specification says nothing about mutli-threaded interactions.
If you are not using any kind of locking primitives then you have no guarantees what so ever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again the compiler still has no way of knowing if it has to read the variable from memory again, or can use the earlier value it stored in a register.
In C/C++ declaring a variable as volatile will force the compiler to always reload the variable from memory and solve this particular instance.
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics which can be used to ensure certain operations are threadsafe. Locking primitives are typically built on these low level operations.

In a multithreaded program, you have one of two cases: if it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work since the threads operate in the same virtual memory space) - all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context switch routines that the OS uses, and the register set is considered part of a threads (or processes) context. If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...

Process VS thread : can two processes share the same shared memory ? can two threads ?

After thinking about the the whole concept of shared memory , a question came up:
can two processes share the same shared memory segment ? can two threads share the same shared memory ?
After thinking about it a little more clearly , I'm almost positive that two processes can share the same shared memory segment , where the first is the father and the second is the son , that was created with a fork() , but what about two threads ?
Thanks

can two processes share the same shared memory segment?
Yes and no. Typically with modern operating systems, when another process is forked from the first, they share the same memory space with a copy-on-write set on all pages. Any updates made to any of the read-write memory pages causes a copy to be made for the page so there will be two copies and the memory page will no longer be shared between the parent and child process. This means that only read-only pages or pages that have not been written to will be shared.
If a process has not been forked from another then they typically do not share any memory. One exception is if you are running two instances of the same program then they may share code and maybe even static data segments but no other pages will be shared. Another is how some operating systems allow applications to share the code pages for dynamic libraries that are loaded by multiple applications.
There are also specific memory-map calls to share the same memory segment. The call designates whether the map is read-only or read-write. How to do this is very OS dependent.
can two threads share the same shared memory?
Certainly. Typically all of the memory inside of a multi-threaded process is "shared" by all of the threads except for some relatively small stack spaces which are per-thread. That is usually the definition of threads in that they all are running within the same memory space.
Threads also have the added complexity of having cached memory segments in high speed memory tied to the processor/core. This cached memory is not shared and updates to memory pages are flushed into central storage depending on synchronization operations.

In general, a major point of processes is to prevent memory being shared! Inter-process comms via a shared memory segment is certainly possible on the most common OS, but the mechanisms are not there by default. Failing to set up, and manage, the shared area correctly will likely result in a segFault/AV if you're lucky and UB if not.
Threads belonging to the same process, however, do not have such hardware memory-management protection can pretty much share whatever they like, the obvious downside being that they can corrupt pretty much whatever they like. I've never actually found this to be a huge problem, esp. with modern OO languages that tend to 'structure' pointers as object instances, (Java, C#, Delphi).

Yes, two processes can both attach to a shared memory segment. A shared memory segment wouldn't be much use if that were not true, as that is the basic idea behind a shared memory segment - that's why it's one of several forms of IPC (inter-Process communication).
Two threads in the same process could also both attach to a shared memory segment, but given that they already share the entire address space of the process they are part of, there likely isn't much point (although someone will probably see that as a challenge to come up with a more-or-less valid use case for doing so).

In general terms, each process occupies a memory space isolated from all others in order to avoid unwanted interactions (including those which would represent security issues). However, there is usually a means for processes to share portions of memory. Sometimes this is done to reduce RAM footprint ("installed files" in VAX/VMS is/was one such example). It can also be a very efficient way for co-operating processes to communicate. How that sharing is implemented/structured/managed (e.g. parent/child) depends on the features provided by the specific operating system and design choices implemented in the application code.
Within a process, each thread has access to exactly the same memory space as all other threads of the same process. The only thing a thread has unique to itself is "execution context", part of which is its stack (although nothing prevents one thread from accessing or manipulating the stack "belonging to" another thread of the same process).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string