Multi-thread context switching in a register-less machine

As far as I know, multi-thread context switching is pre-emptive, initiated by the OS, and transparent from the perspective of the thread. Generally, on a context switch, the OS saves all of the thread's register values and restores them later when switching back to that thread. This includes the thread's stack pointer.
But consider a hypothetical register-less machine. On it, I can use a fixed address in memory to store the stack pointer. But then, across a context switch, the stack pointer is not guaranteed to be preserved: the other thread stores its stack pointer at the same address. Since it is effectively a global variable, each thread will corrupt the stack pointers of all the other threads in the same process. How do I avoid this? How can I store stack pointers without needing registers, while also keeping each stack pointer valid after a context switch?
I'm asking this because every computer is equivalent to a basic Turing machine, and a basic Turing machine does not contain registers, so this should somehow be doable. I have thought about it for some time now, but I was unable to come up with anything.
EDIT: As @JérômeRichard mentioned in the comments, all modern processors have registers, and their ISAs depend on them; even a memory-to-memory copy is impossible on them without registers. So here I am going to define a simple architecture for the sake of argument.
It is a machine with 2^x addressable units, each having a unique x-bit address. For simplicity, assume there is no concept of virtual memory and the entire address space is mapped directly to physical memory. So there is no need to request memory; the process can use its entire x-bit address space freely. Let the OS live outside this address space and not interfere with the user process.
Also, there is only one user process running on the machine, but it can have multiple threads. All threads share the same address space, yet each can independently execute different instructions on different data. Again, for simplicity, switching of the instruction pointer is managed by the OS.
This processor's ISA can read, write, and copy data anywhere in memory, given an address. It can also perform any arithmetic, logic, or bit-manipulation operation on that data. Let me leave out floating-point and vector instructions.
When we have a single thread running, we can fix some global addresses and store the current function's data, such as the frame pointer and return address, there. Now, the challenge is to store and restore thread-specific data like these across context switches. Assume this OS performs pre-emptive context switches.
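To make this concrete, here is a minimal sketch, under the assumptions of this hypothetical machine, of how the OS could treat the fixed stack-pointer slot as part of each thread's context: on every switch, it snapshots the slot into a per-thread save area in OS-owned memory and repopulates it for the incoming thread. All names, addresses, and sizes here are invented for illustration.

    /* Hypothetical sketch for this register-less machine; all names,
     * addresses, and sizes are invented for illustration. */

    #define SP_SLOT      ((unsigned long *)0x0100)  /* fixed address the process uses as its "SP" */
    #define MAX_THREADS  8

    /* Per-thread save areas live in OS-owned memory, outside the
     * process's address space, so user code can never clobber them. */
    static unsigned long saved_sp[MAX_THREADS];
    static unsigned long saved_ip[MAX_THREADS];

    /* Called by the OS on a timer interrupt, with interrupts disabled. */
    void context_switch(int old_thread, int new_thread, unsigned long old_ip)
    {
        saved_sp[old_thread] = *SP_SLOT;   /* snapshot the shared slot */
        saved_ip[old_thread] = old_ip;

        *SP_SLOT = saved_sp[new_thread];   /* repopulate it for the next thread */
        /* ...then resume execution at saved_ip[new_thread] */
    }

The point is that the fixed memory location plays the role of a register, and the OS saves and restores it on every switch exactly as it would save and restore a hardware register.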

Related

Does a variable only read by one thread, read and written by another, need synchronization?

Motive:
I am just learning the fundamentals of multithreading, nowhere near finishing them, but I'd like to ask a question this early in my learning journey to guide me toward the topics most relevant to the project I'm working on.
Main:
a. If a process has two threads, one that edits a set of variables and another that only reads those variables and never edits their values, do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
b. Is it possible for the OS's scheduling of these two threads to make the reading thread read a memory location at the exact same moment the writing thread is writing into that same location? Or is that a hardware/bus situation that will never be allowed to happen, which a software designer should never have to care about? What if the variable is a large struct instead of a small int or char?
a. If a process has two threads, one that edits a set of variables and another that only reads those variables and never edits their values, do we need any sort of synchronization to guarantee the validity of the values read by the reading thread?
In general, yes. Otherwise, the thread editing the value could change it only locally, so that the other thread would never see the change. This can happen because of the compiler (which may keep variables in registers) but also because of the hardware (depending on the cache-coherence mechanism of the target platform). Generally, locks, atomic variables, and memory barriers are used to perform such synchronization.
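As a concrete illustration, here is a minimal C11 sketch of such a writer/reader pair using an atomic variable (thread names and the value 42 are purely illustrative; compile with -pthread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int shared_value = 0;

    static void *writer(void *arg)
    {
        (void)arg;
        /* The release store makes the new value visible to other threads. */
        atomic_store_explicit(&shared_value, 42, memory_order_release);
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        /* Without the atomic, the compiler could keep the value in a
         * register and never observe the writer's update. */
        int v = atomic_load_explicit(&shared_value, memory_order_acquire);
        printf("read %d\n", v);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }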
b. Is it possible for the OS's scheduling of these two threads to make the reading thread read a memory location at the exact same moment the writing thread is writing into that same location? Or is that a hardware/bus situation that will never be allowed to happen, which a software designer should never have to care about? What if the variable is a large struct instead of a small int or char?
In general, there is no guarantee that accesses are done atomically. Theoretically, two cores, each executing one thread, can load/store the same variable at the same time (though often not in practice). It is very dependent on the target platform.
For processors with (coherent) caches (i.e. all modern mainstream processors), cache lines (i.e. chunks of typically 64 or 128 bytes) have a huge impact on the implicit synchronization between threads. This is a complex topic, but you can first read more about cache coherence in order to understand how the memory hierarchy works on modern platforms.
The cache-coherence protocol prevents two loads/stores from happening at exactly the same time within the same cache line. If the variable crosses multiple cache lines, there is no such protection.
On widespread x86/x86-64 platforms, variables of primitive types of <= 8 bytes can be modified atomically (because the bus supports that, as do the DRAM and the cache), assuming the address is correctly aligned (i.e. it does not cross cache lines). However, this does not mean all such accesses are atomic: you need to state this to the compiler/interpreter/etc. so that it produces/executes the correct instructions. Note that there is also an extension for 16-byte atomics, as well as an instruction-set extension for transactional memory. For wider types (or possibly composite ones), you likely need a lock or an atomic state to control the atomicity of access to the target variable.
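For the wide-type case, a hedged sketch of the lock-based approach (all names are illustrative):

    #include <pthread.h>

    struct sample {                /* wider than any hardware atomic access */
        double values[16];
        long   count;
    };

    static struct sample shared;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void update(const struct sample *src)
    {
        pthread_mutex_lock(&lock);
        shared = *src;             /* whole-struct copy done under the lock */
        pthread_mutex_unlock(&lock);
    }

    void snapshot(struct sample *dst)
    {
        pthread_mutex_lock(&lock);
        *dst = shared;             /* readers also take the lock: no torn reads */
        pthread_mutex_unlock(&lock);
    }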

What happens, with two execution units, if two threads within the same process managed by the OS access the same virtual address at the same time?

What happens in a computing system with two execution units, such as a Core 2 Duo, if two threads within the same process, managed by the OS (kernel-level threads), access the same virtual memory address during runtime?
Hi, I'm trying to understand the differences between user-level and kernel-level threads, and whether they access virtual memory addresses or physical memory addresses...
I don't know whether such programs will run without crashing and give expected results, run without crashing but give unexpected results, or maybe even crash.
What should guide me?
Thanks.
From what I can make out from your question, you are mixing different concepts, namely: multiple execution units and race conditions, virtual memory, and user-level vs. kernel-level threads.
What happens in a computing system with two execution units such as a Core 2 Duo, if two threads within the same process managed by the OS (kernel-level threads) access the same virtual memory during runtime
Well, this is almost always the case; the number of cores/processors does not matter. This is a basic concept of multithreading. So what happens is the same as what happens whenever multiple threads access a shared resource: the usual race conditions, which always need to be addressed by the developer.
Now, don't mix virtual memory into this. User- and kernel-level threads both simply access memory addresses in virtual addressing mode. This is because, once virtual memory is enabled in protected mode, it is up to the processor to translate virtual addresses to physical addresses implicitly (using the page tables, etc. that the OS has set up).
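To make the race condition above concrete, here is the classic demonstration (illustrative only; the iteration count is arbitrary): two threads increment one unsynchronized counter, and increments are routinely lost.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;       /* shared and unsynchronized on purpose */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;             /* load, add, store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Usually prints less than 2000000: increments are lost when the
         * two cores interleave their read-modify-write sequences. */
        printf("counter = %ld\n", counter);
        return 0;
    }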
What happens in a computing system with two execution units such as a Core 2 Duo, if two threads within the same process managed by the OS (kernel-level threads) access the same virtual memory during runtime?
One of two things happens:
If the author of the software running in the execution units followed the platform's rules for such accesses, they get whatever behavior the platform specifies for such accesses.
If the author of the software running in the execution units did not follow the platform's rules for such accesses, the results are often unpredictable or undefined.
A typical platform rule might be that an object may not be accessed in one thread or execution unit while another thread or execution unit is, or might be, modifying it. Typical modern platforms have no issue with simultaneous reads.
There are two cases:
The memory page is already mapped and has compatible access permissions. In this case both threads simply access the memory.
The memory page is not mapped or has incompatible access permissions. In this case the accessing thread generates a hardware exception, and the kernel handles it by mapping the page, or by generating a SIGSEGV signal for the thread if the page cannot be mapped. When another thread accesses the same page while it is being mapped, that thread blocks on a synchronization primitive in the kernel until the mapping operation is complete (on Linux that is the rw_semaphore mmap_sem member of struct mm_struct). In other words, the OS kernel protects its data structures from race conditions.
The short answer to "what happens during concurrent memory access" is: it's complicated.
Probably the most canonical report on the subject is Memory Consistency Models for Shared-Memory Multiprocessors, by Kourosh Gharachorloo. The first paragraph of the intro may seem slightly dated by now, but he provides enough background information to make the report quite readable to non-experts, and there haven't really been any fundamental changes since then.
The Core 2 Duo follows the processor-consistency model (like every other x86 CPU), so without any explicit synchronization all processors will effectively agree on the order of writes to any given memory location, but when reading from different locations they may observe the writes in different orders.
Core2 Duo implements cache coherence via the MESI protocol.
Keep in mind that since x86 allows unaligned access, what looks like a single memory access in a single instruction could actually be two memory accesses to different locations if it happens to cross a cache-line boundary. If you're programming in a higher-level language (higher than assembly), the language may impose its own semantics on concurrent memory accesses (and usually does).
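A hedged C11 sketch of that alignment point (the 64-byte line size and the packed-struct counter-example are illustrative; __attribute__((packed)) is a GCC/Clang extension):

    #include <stdalign.h>
    #include <stdint.h>

    /* Aligning to 64 bytes guarantees this 8-byte value never straddles
     * a 64-byte cache-line boundary. */
    static alignas(64) uint64_t shared_word;

    /* By contrast, an 8-byte field at an odd offset inside a packed
     * struct may straddle two cache lines, turning one apparent access
     * into two. */
    struct __attribute__((packed)) awkward {
        char     pad[60];
        uint64_t value;            /* may cross a cache-line boundary */
    };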
The question "What happens..." touches different levels of abstraction (from user land down to the hardware).
At the hardware level, what actually happens is:
There are two concurrent accesses (basically occurring in the same machine cycle) to the same physical memory address.
Three possibilities then arise:
(Given a memory address X whose content C(X) is initially x0.)
a) two reads
This has a deterministic outcome: both readers observe x0 as the value.
b) one read, one write (value: x1)
The reader may observe x0 or x1, depending on the ordering of the accesses (see below).
c) two writes (values: x1, x2)
The final content C(X) may end up as x1 or x2, depending on the ordering (see below).
In cases b) and c) an observer may get the impression of non-deterministic behavior. However, the underlying behavior is actually still deterministic.
The actual outcome of those concurrent (at the same time) accesses is determined by the details of the hardware. Mostly:
bus allocation strategy
cache coherence strategy
The actual OS being used will not have significant influence on this behavior.
The question already presumes shared memory among the (higher-level) execution units (threads, in the question), but processes (which have separate memory by default) using shared-memory segments will behave similarly.
Any use of synchronization mechanisms (locks, semaphores) simply prevents concurrent accesses from occurring in the first place.
You may think of atomic operations (usually an exclusive read-modify-write cycle on the physical memory) as allowing the executing unit to know, in some sense, whether any "other" operation occurred, and thus to impose a more deterministic ordering of operations.
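For example, a C11 atomic read-modify-write (names illustrative): the hardware performs the load-add-store exclusively, so concurrent increments are never lost.

    #include <stdatomic.h>

    static _Atomic long hits = 0;

    void record_hit(void)
    {
        atomic_fetch_add(&hits, 1);   /* exclusive read-modify-write cycle */
    }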

Is it possible to share a register between threads?

I know that when the OS/hardware switches between the execution of different threads it manages the storing/restoring of each thread's context; however, I do not know many of the details. My question is: are there any registers that I can use to share information between threads? On x86? MIPS? ARM? etc. On Linux? Windows?
Any suggestion on how this can be done is highly appreciated.
There are some processor architectures where certain registers are not stored during a context switch. From memory, the AMD 29K has some registers like that, which are essentially just global variables - gr112..gr115, from looking at the web. Now, this is a machine that has 192 physical registers, so it's not really a surprise that it can afford to sacrifice a few for this sort of purpose.
I know for a fact that x86 and x86-64 use "all registers", as does ARM. From what I can gather, MIPS also doesn't have any registers "reserved for the user". This applies to both Windows and Linux operating systems.
For any processor with a small number of registers (32 or fewer), I would say that "wasting" registers as globals just to hold some value that some other thread/process may want to read is a waste of resources - generic code will run faster if that register is available to the compiler as a general-purpose register.
If you are writing all the code that will go into a system, you may dedicate registers to whatever purpose you want, subject to the limitation that any register dedicated to a particular function will be unusable for anything else. There are some very specialized situations where this may be worth doing; these generally entail, bizarre as it may seem, programs that are very simple but need to run very fast. Some compilers, like gcc, can facilitate such usage by allowing the programmer to specify particular registers that the generated code should not use for any purpose unless explicitly requested. In general, because restricting the number of registers the compiler can use reduces the efficiency of compiled code, it will be more efficient to simply use statically defined memory locations to exchange information between threads. While memory locations cannot be accessed as quickly as registers, one can reserve many of them for various purposes without affecting the compiler's ability to optimize register usage.
The one situation I've seen on ARM where a dedicated register was helpful was one where a significant plurality of methods needed to share a common static data structure. Specifying that a certain register should always be assumed to hold a pointer to that data structure, and that code must never modify it, eliminates the need for code to load the structure's address before accessing items within it. If you want to share information among threads, that might be a useful approach, since accessing an arbitrary static location generally requires a PC-relative load to fetch the address followed by a load of the actual data; having a dedicated register eliminates one of the loads.
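A hedged sketch of that technique using GCC's global register variable extension: r9 and all names here are illustrative choices, and the whole program would have to be built so that nothing else uses that register (e.g. with -ffixed-r9 on ARM).

    struct app_state {
        int config;
        int counters[8];
    };

    /* Tell GCC that r9 always holds a pointer to the shared structure.
     * Every function can then reach it directly, with no address load. */
    register struct app_state *app asm("r9");

    int read_config(void)
    {
        return app->config;        /* single load via r9 */
    }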
Your question seems reasonable at first glance, and other people have tried to answer it directly. First, we have two fairly nebulous concepts:
Threads
Registers
If you talk to Ada folks, they will freak out at the loose definition of Linux or POSIX threads. They prefer something more like Java's green threads, with very deterministic scheduling. I think you mean threads that are fast for the processor, like POSIX threads.
The second issue is: what is a register? To most people, registers are limited to the 8, 16, or 32 registers hard-coded in the CPU's instruction set. There are often second-class registers that can be accessed by other means. Mainly, the first-class registers are amazingly fast.
The inverse
The inverse of your question is quite common: how to set a register to a different value for each thread. The general-purpose registers are used by the compiler, and the compiler's ABI is intimately familiar to the OS context-switch code. What may not be clear is that something like the upper bits of a stack register may be constant every time a given thread runs, yet different for each thread. That is to say, each thread has its own stack.
With ARM Linux, a special co-processor register is used to implement thread-local storage. The co-processor register is slower to access than a general-purpose register, but it is still quite fast. That takes us to the difference between a process and a thread.
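For comparison, this is what thread-local storage looks like from C; on ARM Linux the __thread variable below is ultimately reached through that co-processor register (__thread is a GCC extension, spelled _Thread_local in C11; names illustrative, compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static __thread int per_thread_counter = 0;   /* one copy per thread */

    static void *work(void *arg)
    {
        (void)arg;
        per_thread_counter++;                     /* touches only this thread's copy */
        printf("my counter: %d\n", per_thread_counter);   /* always prints 1 */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, work, NULL);
        pthread_create(&b, NULL, work, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }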
Endemic to threads
A process has a completely different memory layout, i.e., the MMU page tables are switched between different processes. For threads, the register set may differ, but all regular memory is shared between threads. For this reason, there are lots of mutexes when you do thread programming.
Now, consider a CPU cache. It is ultra-fast memory, just like a general-purpose register. The main difference is the number of instructions it takes to address it.
Answer
All of the OSes and CPUs already have this! Each thread shares memory, and that memory is cached. Loading a global variable from cache in two threads is nearly as fast as register access. Since the thread register you propose could only hold a pointer, you would need to dereference it to access some larger entity. Loading a global variable will be nearly as fast, and the compiler is free to put it in any register it likes. The compiler can also use those registers in routines that don't need this access. So even if there were an OS that reserved a general-purpose register to be shared between threads, it would only be faster for a very small set of applications.

Usage of registers by the compiler in a multithreaded program

It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think it is not, since keeping global variables in registers may leave other threads seeing stale values.
And what about using registers to store local variables defined within a function?
I think that is OK, since no other thread will be able to access those variables.
Please correct me if I'm wrong.
Thank you!
Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have two or three levels of cache between the processor and the memory).
To make things worse, the order of instructions that the compiler decides on may not be what actually gets executed, as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should take care never to access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it; these include the required memory barriers that also guarantee the data is what it's supposed to be (a minimal sketch follows this list).
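Here is that minimal sketch of the reserve/relinquish pattern, using a pthread mutex (names illustrative); the lock and unlock operations include the memory barriers just mentioned:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_data;

    void set_data(int v)
    {
        pthread_mutex_lock(&lock);     /* reserve access */
        shared_data = v;
        pthread_mutex_unlock(&lock);   /* relinquish, publishing the write */
    }

    int get_data(void)
    {
        pthread_mutex_lock(&lock);
        int v = shared_data;
        pthread_mutex_unlock(&lock);
        return v;
    }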
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.
'Safe' is not really the right word to use. Many higher-level languages (e.g. C before C11) do not have a threading model, and so the language specification says nothing about multi-threaded interactions.
If you are not using any kind of locking primitives, then you have no guarantees whatsoever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking, the behaviour can still be tricky: if you read a variable, then grab a lock and read the variable again, the compiler has no way of knowing whether it has to reload the variable from memory or can reuse the earlier value it kept in a register.
In C/C++, declaring a variable volatile forces the compiler to always reload it from memory, which addresses this particular instance (though volatile by itself is not a general synchronization mechanism).
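For instance (names illustrative):

    static volatile int flag = 0;   /* set to 1 by some other thread */

    void wait_for_flag(void)
    {
        /* volatile forces a fresh load of flag on every iteration;
         * without it, the compiler could cache the value in a register
         * and spin forever. */
        while (flag == 0)
            ;
    }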
There are also 'Interlocked*' primitives on most systems, which have guaranteed atomicity semantics and can be used to ensure certain operations are thread-safe. Locking primitives are typically built on these low-level operations.
In a multithreaded program, you have one of two cases. If it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work, since the threads operate in the same virtual memory space): all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context-switch routines the OS uses, and the register set is considered part of a thread's (or process's) context. If you have a multiprocessor system - either multiple CPUs or multiple cores on a single CPU - each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU saves the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS; your compiler(s) generally have that type of platform knowledge built in. You need to be aware of those registers, though, if you're doing inline assembly or certain other "low-level" things...

Memory addressing in assembly / multitasking

I understand how programs in machine code can load values from memory into registers, perform jumps, or store values from registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses?
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next one? Is this done with timed interrupts? If so, how are the values in a thread's registers preserved? Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control? In the case of timed interrupts, what happens if a thread is given processor time and doesn't need it? Does it have to waste it, can it trigger the interrupt manually, or can it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?
That's not how it works. All modern operating systems virtualize the available memory, giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component of a machine that does this is the MMU, nowadays built into the processor itself. Another core feature of this virtualization is that it isolates processes: one misbehaving process cannot bring another down with it.
Yes, a clock-tick interrupt is used to interrupt the currently running code. Processor state is simply saved on the stack. The operating-system scheduler then checks whether any other thread is ready to run and has a high enough priority to go first in line; some extra code ensures that everybody gets a fair share. Then it is just a matter of setting up the MMU and resuming execution in the other thread. If no thread is ready to run, the CPU is idled with the HALT instruction, to be woken again by the next clock interrupt.
This is the ten-thousand-foot view; it is well covered in any book about operating-system design.
A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various addressing modes were more about local versus remote memory: relative addressing was for memory addresses close to the current address, while absolute addressing was more expensive but could address anything. With modern virtual memory systems, these distinctions may no longer be necessary.
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. It is normally taken care of by the compiler and linker. Again, modern virtual memory systems make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB); then a new context is chosen, the registers and other context are loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory, or the OS can utilize processor-specific operations to optimize this process.
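As a rough illustration, a PCB might look like the following; the exact fields vary by OS and architecture, and these x86-64-flavoured names are purely illustrative:

    /* Hedged sketch of a thread/process control block. */
    struct pcb {
        unsigned long rip;          /* saved instruction pointer */
        unsigned long rsp;          /* saved stack pointer */
        unsigned long gpr[14];      /* remaining general-purpose registers */
        unsigned long rflags;       /* saved flags */
        unsigned long cr3;          /* page-table root (address space) */
        int           state;        /* RUNNING, READY, WAITING, ... */
        int           priority;
    };

On a context switch, the kernel fills in the outgoing task's pcb and then loads every field from the incoming task's pcb.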
Or, rather than timed interrupts, does the thread simply choose a good time to give up control.
The thread can yield control, putting itself back at the end of the run queue. It can also wait for some IO or sleep. The thread library then puts the thread in a wait queue and switches to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks: the thread waits for the lock in a wait queue, and once the lock is available, it is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting, either on IO or on a sleep. It can ask to yield, but typically it does so by [again] sleeping or waiting on IO.
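For example, on a POSIX system a thread can give up the remainder of its slice like this (a sketch; the 1 ms sleep is an arbitrary choice):

    #include <sched.h>   /* sched_yield */
    #include <time.h>    /* nanosleep */

    void be_polite(void)
    {
        sched_yield();                        /* rejoin the end of the run queue */

        struct timespec ts = { 0, 1000000 };  /* 1 ms */
        nanosleep(&ts, NULL);                 /* parked in a wait queue until expiry */
    }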
I probably walked into this question quite late, but it may still be of use to other programmers. First, the theory.
A modern operating system virtualizes memory, and to do so it maintains, within its system memory area, a series of page pointers. Each page is of a fixed size (usually 4K), and when any program asks for memory, it is allocated memory addresses that are virtualized through the memory page pointers. This approximates the behaviour of the "segment" registers in earlier generations of processors.
Now, when the scheduler decides to run another process, it may or may not keep the previous process in memory. If it keeps it in memory, then all the scheduler does is save the entire register snapshot (nowadays including the YMM registers - this used to be a complex issue, as there was no single instruction that saved the entire context: read up on XSAVE), which has a fixed format (documented in the Intel software manual). This is stored in the scheduler's own memory space, along with information on the memory pages that were in use.
If, however, the scheduler needs to "dump" to the hard disk the context of the process that is about to go to sleep - a situation that usually arises when the process that is waking up needs an extraordinary amount of memory - then the scheduler writes the memory pages to disk blocks (the pagefile, a reserved area of disk - also the source of the "old grandmother wisdom" that the pagefile must be equal in size to real memory), and it preserves the memory-page-pointer addresses as offsets into the pagefile. When the process wakes up, the scheduler reads the offset addresses from the pagefile, allocates real memory, populates the memory page pointers, and then loads the contents from the disk blocks.
Now, to answer your specific questions:
1. Do you need to use only relative addressing, or can you use absolute?
Ans. You may use either - whatever you perceive as absolute is also relative, since the memory page pointer relativizes that address invisibly. There is no truly absolute memory address anywhere (including the IO device memories) except inside the kernel of the operating system itself. To test this, you may disassemble any .EXE program and see that the entry point is always something like CALL 0010, which implies that each thread gets a different "0010" at which to start execution.
2. How do threads get their time slice, and what if a thread surrenders the unused part of it?
Ans. Threads usually get a slice - 20 ms is the usual standard on modern systems, though this is sometimes changed in special-purpose builds for servers that do not have many hardware interrupts to deal with - in order of their position in the process queue. A thread usually surrenders its slice by calling the function sleep(), which is the formal (and very nice) way to surrender the remaining part of your time slice. Most libraries implementing asynchronous reads or interrupt actions call sleep() internally, but in many instances top-level programs also call sleep(), e.g. to create a time gap. An invocation of sleep() will certainly change the process context - the CPU is not actually given the liberty to sleep using NOPs.
The other method is to wait for an IO to complete, and this is handled differently. On requesting an IO operation, the program cedes its time slice, and the process scheduler flags the thread as being in a "WAITING FOR IO" state; the thread will not be given a time slice until its intended IO is completed or times out. This feature helps programmers, as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.
