Is it possible to share a register between threads? - linux

I know that when the OS/Hardware switch between the execution of different threads it manage the store/restore the context of each thread, however I do not know many of the details. My question is: are there any register that I can use to share information between threads? In x86? mips? arm? etc,. linux? windows?
Any suggestion on how this can be done is highly apreciated.

There are some processor architectures where certain registers are not stored during context switch. From memory, 29K has some registers like that, which are essentially just global variables - gr112 .. gr115 from looking at the web. Now, this is a machine that has 192 physical registers, so it's not really a surprise it can afford sacrificing a few for this sort of purpose.
I know for a fact that x86 and x86-64 use "all registers", as does ARM. From what I can gather, MIPS also doesn't have any registers "reserved for the user". This applies to both Windows and Linux operating systems.
For any processor with a small number of registers (less or equal to 32), I would say that "wasting" registers are globals just to hold some value that some other thread/process may want to read is a waste of resource - generic code will run faster if that register is used as a general purpose register available for the compiler.

If you are writing all the code that will go in a system, you may dedicate registers to whatever purpose you want, subject to the limitation that any register which is dedicated to a particular function will be unusable for any other purpose. There are some very specialized situations where this may be worth doing; these generally entail, bizarre as it may seem, programs that are very simple but need to run very fast. Some compilers like gcc can facilitate such usage by allowing a programmer to specify particular registers that the code it generates should not use for any purpose unless explicitly requested. In general, because the efficiency of compiled code will be reduced by restricting the number of registers the compiler can use, it will be more efficient to simply use statically-defined memory locations to exchange information between threads. While memory locations cannot be accessed as quickly as registers, one can reserve many of them for various purposes without affecting the compiler's ability to optimize register usage.
The one situation I've seen on the ARM where using a dedicated register was helpful was a situation where a significant plurality of methods needed to share a common static data structure. Specifying that a certain register should always be assumed to hold a pointer to that data structure, and that code must never modify it, eliminates the need for code to load the address of that structure before accessing items therein. If you want to share information among threads, that might be a useful approach, since accessing an arbitrary static location generally requires a PC-relative load to fetch the address followed by a load of the actual data; having a dedicated register would eliminate one of the loads.

Your question seems reasonable at first glance. Other people have tried to answer the question directly. First we have two fairly nebulous concepts,
Threads
Registers
If you talk to Ada folks, they will freak out at the lack of definition of a linux or posix threads. They like something more like Java's green threads with very deterministic scheduling. I think you mean threads that are fast for the processor, like posix threads.
The 2nd issue is what is a register? To most people they are limited to 8,16 or 32 registers that are hard coded in the CPU's instruction set. There are often second class registers that can be accessed by other means. Mainly they are are amazingly fast.
The inverse
The inverse of your question is quite common. How to set a register to a different value for each thread. The general purpose registers are use by the compiler and the ABI of the compiler is intimately familiar to the OS context switch code. What may not be clear is that things like the upper bits of a stack register may be constant every time a thread runs; but are different for each thread. That is to say that each thread has its own stack.
With ARM Linux, a special co-processor register is used to implement thread local storage. The co-processor register is slower to access than a general purpose register, but it is still quite fast. That takes us to the difference between a process and a thread.
Endemic to threads
A process has a completely different memory layout. Ie, the mmu page tables switch for different processes. For a thread, the register set may be different, but all of regular memory is shared between threads. For this reason, there is lots of mutexes when you do thread programming.
Now, consider a CPU cache. It is ultra-fast memory just like a general purpose register. The only difference is the amount of instructions it takes to address it.
Answer
All of the OS's and CPUs already have this! Each thread shares memory and that memory is cached. Loading a global variable in two threads from cache is near as fast as register access. As the thread register you propose can only hold a pointer, you would need to de-reference it to access some larger entity. Loading a global variable will be nearly as fast and the compiler is free to put this in any register it likes. It is also possible for the compiler to use these registers in routines that don't need this access. So even, if there was an OS that reserved a general purpose register to be the same between threads, it would only be faster for a very small set of applications.

Related

Multi-thread context switching in register-less machine

As I know, multi-thread context switching is pre-emptive initiated by the OS, transparent from the perspective of thread. Generally, when context switching, OS saves all the register values and restores them later switching back to that thread. This includes stack-pointer of that thread.
But consider a hypothetical register-less machine. In this, I can use a fixed address in memory to store the stack-pointer. But here, when the context-switches, stack-pointer is not guaranteed to be preserved. The other thread stores its stack-pointer in the same address. Since it is a global variable, it will corrupt stack-pointer of all other process in the same thread. How to avoid this? How can I store stack-pointers without necessarily needing registers, but also keeping stack-pointer valid after a context-switch?
I'm asking this because, since every computer is equivalent to basic Turing machine, and basic Turing machine does not contain registers this should be somehow doable. I thought about it for some time now, and I was unable to come up with anything.
EDIT: As #JérômeRichard mentioned in comments, all modern processors have registers, and their ISAs are dependent on them. So even memory copying is not possible in them without registers. So here I am going to define a simple architecture for argument purposes.
It is a machine with 2^x addressable units, with each of them having a unique x-bit address. For simplicity, assume there is no concept of virtual memory, and entire address space is mapped directly to physical memory. So, there is no need for requesting memory, process can use its entire x-bit address space freely. Let OS be present outside this address-space and not interfere with user-process.
Also, there is only one user-process running on the machine. But it can multiple threads. All threads share same address space. But they all can independently execute different instructions and on different data. Again, for simplicity, switching of instruction pointer is managed by the OS.
This processor's ISA is capable of reading, writing, copying data on any part of memory given their address. It can also do any arithmetic, logic, bit-manip operations on the data. Let me leave out all other floating-point and vector instructions.
When we have a single thread running, we can fix some global address and store the current function's data like frame pointer and return address and use them. Now, the challenge is to store and restore thread-specific data like these across context switches. Assume this OS does pre-emptive context switches.

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, not close to finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to my project I 'm working on.
Main:
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
a. If a process has two threads, one that edits a set of variables, the other only reads said variables and never edits their values; Then do we need any sort of synchronization for guaranteeing the validity of the read values by the reading thread?
In general, yes. Otherwise, the thread editing the value could change the value only locally so that the other thread will never see the value change. This can happens because of compilers (that could use registers to read/store variables) but also because of the hardware (regarding the cache coherence mechanism used on the target platform). Generally, locks, atomic variables and memory barriers are used to perform such synchronizations.
b. Is it possible for the OS scheduling these two threads to cause the reading-thread to read a variable in a memory location in the exact same moment while the writing-thread is writing into the same memory location, or that's just a hardware/bus situation will never be allowed happen and a software designer should never care about that? What if the variable is a large struct instead of a little int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores executing each one a thread can load/store the same variable at the same time (but often not in practice). It is very dependent of the target platform.
For processor having (coherent) caches (ie. all modern mainstream processors) cache lines (ie. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache coherence protocol prevent two load/store being done exactly at the same time in the same cache line. If the variable cross multiple cache lines, then there is no protection.
On widespread x86/x86-64 platforms, variables having primitive types of <= 8 bytes can be modified atomically (because the bus support that as well as the DRAM and the cache) assuming the address is correctly aligned (it does not cross cache lines). However, this does not means all such accesses are atomic. You need to specify this to the compiler/interpreter/etc. so it produces/executes the correct instructions. Note that there is also an extension for 16-bytes atomics. There is also an instruction set extension for the support of transactional memory. For wider types (or possibly composite ones) you likely need a lock or an atomic state to control the atomicity of the access to the target variable.

Guaranteed CPU cache update after a certain time

Let's say I have a variable var located somewhere in memory and that an arbitrary number of processors/threads could read and modify it at any given time. But it's guaranteed that at least n seconds will have elapsed between a processor modifying var and any other one reading var. Is it possible to be certain that, if time in seconds is n, there's a value for n that guarantees that the processor reading var will read the updated value?
If your concern really is Cache coherence you should generally be safe 1.
Specifically, however, you may be not.
Cache coherence is usually handled by the hardware2 without the help of the software.
However this is very implementation specific: NUMA may be non cache-coherent, a Compute Shader may need specific built-in functions, IA32e and ARM generally hide cache coherence from the programmer.
To answer you question directly: No, you have no guarantees whatsoever.
The point is that cache coherence is something you deal with in clustered and parallel non uniform architectures.
While in this situations the programming model is inherently multi-threading, the two concepts3 are separated and what really should bug you is how to properly handle multi-threading, specifically synchronization and memory order.
Your question seems to suggest a simple case, where the readers are executed long after the writer is done.
If this property is really enforced you don't need any synchronization nor memory barrier. Beware however that sleep functions don't qualify as a valid enforcement.
If you instead need to synchronize (and so to order the memory accesses) then you need to use language specific constructs, for example volatile in C# and Java, atomics in C and C++ or specific instructions in assembly.
You may need to implement Critical sections too.
If you actually need to manually control the cache coherence for your architecture, than you have to check the specifications of interest (usually datasheets and formal papers) because there is no uniform way to deal with it and the compiler should provide some intrinsic or the runtime should provide a library.
So to add something to the direct answer above: No, you have no guarantees whatsoever, but when an usual CPU, in an usual architecture, need that data, it will be able to use the most updated one anyway. So you don't need to worry about that aspect.
Please note the use of the words common and that
1 For example if you use an Intel/AMD/ARM CPU, don't even think about cache coherence.
2 Either the CPU itself, a local monitor, a system monitor or a specific device.
3 Multi-threading and cache-coherence.
The cache will tend to get flushed on operating system tick interrupts when it goes into the scheduler to see if there's a different task to run.
However, as operating systems get smarter with things such as tickless NoHz and as CPU core counts go up, this gets less and less likely, and you shouldn't count on it.
Supercomputer clusters may not task switch for minutes at a time because they're using customized operating system code that doesn't interrupt the running jobs, ever. Compute jobs are assigned to a core from 1-7 with no interrupts and all of the other work runs on core-0.
There are two concepts mixed in you question: software synchronization and hardware coherency. Hardware coherency is talked by Margaret already so I won't cover it here.
Software Synchronization
x86 provides guarantee that quadword access would be carried out atomically if aligned on 64-bit boundary. But this guarantees that other processor won't read partial result (e.g. [32bit New]<32bit Old> weird mixture). It does not guarantee a hard time deadline before which another processor would see the newly assigned value. Let another thread wait for some time is not quite an elegant solution because first the two threads need to have the same starting time synchronized. So, if you need such guarantee, you need conditional variable to make sure another thread should wait.
https://en.wikipedia.org/wiki/Monitor_(synchronization)
In a word, use conditional variable if you need a sequencing effect and use locks/transactional memory, etc. to protect variable longer than quadword or not 64bit aligned.
Btw, here is an useful material for cache coherency if you are interested.
http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/10_coherence.pdf

Usage of registers by the compiler in multithreaded program

It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think its not, since storing global variables in registers may change saved values
for other threads.
And how about using registers to store local variables defined within a function?
I think it is ok,since no other thread will be able to get these variables.
Please correct me if im wrong.
Thank you!
Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides, may not be what actually gets executed as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.
'Safe' is not really the right word to use. Many higher level languages (eg. C) do not have a threading model and so the language specification says nothing about mutli-threaded interactions.
If you are not using any kind of locking primitives then you have no guarantees what so ever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again the compiler still has no way of knowing if it has to read the variable from memory again, or can use the earlier value it stored in a register.
In C/C++ declaring a variable as volatile will force the compiler to always reload the variable from memory and solve this particular instance.
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics which can be used to ensure certain operations are threadsafe. Locking primitives are typically built on these low level operations.
In a multithreaded program, you have one of two cases: if it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work since the threads operate in the same virtual memory space) - all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context switch routines that the OS uses, and the register set is considered part of a threads (or processes) context. If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...

Is it possible to create threads without system calls in Linux x86 GAS assembly?

Whilst learning the "assembler language" (in linux on a x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary as your program runs in user-space.
However system calls are rather expensive in terms of performance as they require an interrupt (and of course a system call) which means that a context switch must be made from your current active program in user-space to the system running in kernel-space.
The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
my question is therefore twofold (with an extra bonus question underneath it):
Is it possible to write assembler
code which can run multiple threads
simultaneously on multiple cores at
once, without the need of system
calls?
Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficient as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is you can do multithreaded programming without
calling expensive OS task management primitives. Simply ignore the OS for thread
scheduling operations. This means you have to write your own thread
scheduler, and simply never pass control back to the OS.
(And you have to be cleverer somehow about your thread overhead
than the pretty smart OS guys).
We chose this approach precisely because windows process/thread/
fiber calls were all too expensive to support computation
grains of a few hundred instructions.
Our PARLANSE programming langauge is a parallel programming language:
See http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism
construct, and schedules such grains by a combination of a highly
tuned hand-written scheduler and scheduling code generated by the
PARLANSE compiler that takes into account the context of grain
to minimimze scheduling overhead. For instance, the compiler
ensures that the registers of a grain contain no information at the point
where scheduling (e.g., "wait") might be required, and thus
the scheduler code only has to save the PC and SP. In fact,
quite often the scheduler code doesnt get control at all;
a forked grain simply stores the forking PC and SP,
switches to compiler-preallocated stack and jumps to the grain
code. Completion of the grain will restart the forker.
Normally there's an interlock to synchronize grains, implemented
by the compiler using native LOCK DEC instructions that implement
what amounts to counting semaphores. Applications
can fork logically millions of grains; the scheduler limits
parent grains from generating more work if the work queues
are long enough so more work won't be helpful. The scheduler
implements work-stealing to allow work-starved CPUs to grab
ready grains form neighboring CPU work queues. This has
been implemented to handle up to 32 CPUs; but we're a bit worried
that the x86 vendors may actually swamp use with more than
that in the next few years!
PARLANSE is a mature langauge; we've been using it since 1997,
and have implemented a several-million line parallel application in it.
Implement user-mode threading.
Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-model threads. Modern useage is 1:1, but it wasn't always like that and it doesn't have to be like that.
You are free to maintain in a single kernel thread an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatted yield() calls throughout your own code to ensure regular switching occurs.
If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running will still be running at 100% either way.
System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.
Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.
Obligatory BLUF:
Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.
Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread creation process would greatly exceed the time used by the tread before it exits). If you create N threads (where N is ~# of cores/hyper-threads on the system) and re-task them then the answer COULD be yes depending on your implementation.
Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
Onto my 2-cents worth of content.
I recently created what effectively operate as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between would be smart these would require additional syscalls).
This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.
To implement this system (entirely in userspace and also via non-root access if desired) the following were required:
A notion of what threads boil down to:
A stack for stack operations (kinda self explaining and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents
What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).
A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
So, it is indeed possible to (entirely in assembly and without system calls other than initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.
I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.
System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
First you should learn how to use threads in C (pthreads, POSIX theads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.
Here are some pointers:
Posix threads: link text
A tutorial where you will learn how to call C functions from assembly: link text
Butenhof's book on POSIX threads link text

Resources