Inspired by this question: In Complexity Analysis why is ++ considered to be 2 operations?
Take the following pseudocode:
class test
{
    int _counter;

    void Increment()
    {
        _counter++;
    }
}
Would this be considered thread-safe on an x86 architecture? Furthermore, are the INC / DEC assembly instructions thread-safe?
No, incrementing is not thread-safe. Neither are the INC and DEC instructions. They all require a load and a store, and a thread running on another CPU could do its own load or store on the same memory location interleaved between those operations.
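In other words, that one statement hides three separate steps, and a thread on another CPU can slip in between any two of them. Roughly (a sketch of what the compiler effectively emits, in C-like terms):

    int tmp = _counter;   // load the current value
    tmp = tmp + 1;        // increment it in a register
    _counter = tmp;       // store it back; an update made by another thread in between is lost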
Some languages have built-in support for thread synchronization, but it's usually something you have to ask for, not something you get automatically on every variable. Those that don't have built-in support usually have access to a library that provides similar functionality.
In a word, no.
You can use something like InterlockedIncrement() depending on your platform. On .NET you can use the Interlocked class methods (Interlocked.Increment() for example).
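For illustration, here is a minimal sketch of the same fix in C++ terms, where std::atomic plays the role that InterlockedIncrement()/Interlocked.Increment() play on those platforms (the class and member names are just carried over from the question):

    #include <atomic>

    class test
    {
        std::atomic<int> _counter{0};

        void Increment()
        {
            // fetch_add does the load, add, and store as one indivisible operation
            // (a LOCK-prefixed instruction on x86).
            _counter.fetch_add(1, std::memory_order_relaxed);
        }
    };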
As Rob Kennedy mentioned, even if the operation is implemented in terms of a single INC instruction, as far as the memory is concerned a read/increment/write set of steps is performed. There is the opportunity on a multi-processor system for corruption.
There's also the volatile issue, which would be a necessary part of making the operation thread-safe - however, marking the variable volatile is not sufficient to make it thread-safe. Use the interlocked support the platform provides.
This is true in general, and on x86/x64 platforms certainly.
When using atomics in Go (and other languages like C++) it's advised to use an atomic load operation for reading a concurrently written value.
If the definition (as I understand it) of an atomic write (be it a store or an integer increment) is that no thread can view a partial write, why is an atomic load required?
Would a plain load of the memory address always be safe from a torn view, if only atomic stores are used on that memory address?
This answer is mainly for C and C++ as I am not directly familiar with atomics in many other languages, but I suspect they are similar.
It's true that many actual machines work this way, in some cases. For instance, on x86-64, ordinary load instructions are atomic with respect to ordinary stores or locked read-modify-write instructions. So for types that can be loaded with a single instruction, you could in principle use ordinary assignment and avoid tearing.
But there are cases where this would not work. For instance:
Types which are not lock-free (e.g. structs of more than a couple words). In this case, several instructions are needed to load or store, and so a lock must be taken around them, or tearing is entirely possible. The atomic load function knows to take the lock, an ordinary assignment wouldn't.
Types which can be lock-free but need special handling. For example, 64-bit long long int on x86-32. An ordinary load would execute two 32-bit integer load instructions (which are individually atomic), and so even if the store is atomic, it could happen in between. But the atomic load function can emit a 64-bit floating point or SIMD load, which is less efficient but does it in one atomic instruction. Example on godbolt.
As such, your "definition" is not accurate for C or C++: the language promises atomicity only when the store and the load both use the provided atomic functions. By requiring the programmer to always use an atomic load, the language provides a "hook" where implementations can take appropriate action if needed. In cases where an ordinary load would suffice, the implementation can optimize accordingly and nothing is lost.
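For example, a minimal C++ sketch of that contract (the 64-bit counter is illustrative; on x86-32 it is exactly the case described above):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> counter{0};

    void writer()
    {
        counter.store(42, std::memory_order_relaxed);   // atomic store, never seen half-written
    }

    std::uint64_t reader()
    {
        // Must also be an atomic load: on x86-32 the implementation can emit a single
        // 64-bit SSE/x87 load here instead of two 32-bit loads, preventing a torn read.
        return counter.load(std::memory_order_relaxed);
    }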
Another point is that the atomic load provides a place to put a memory barrier when one is wanted (any ordering except relaxed). Some architectures include load instructions with a built-in barrier (e.g. ARM64's ldar), and making the barrier part of the load at the language level makes it easier for the compiler to take advantage of this. If you had to do a regular assignment followed by a call to a barrier function, it would be harder for the compiler to figure out that it could optimize them into ldar.
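As a sketch of that last point, the usual publish/consume pattern in C++ looks like this; the acquire load is what an ARM64 compiler is free to turn into ldar:

    #include <atomic>

    int payload;                        // ordinary data written by the producer
    std::atomic<bool> ready{false};

    void producer()
    {
        payload = 123;                                   // plain write
        ready.store(true, std::memory_order_release);    // release: publish payload before the flag
    }

    int consumer()
    {
        while (!ready.load(std::memory_order_acquire))   // acquire: may compile to ldar on ARM64
            ;                                            // spin until the flag is set
        return payload;                                  // guaranteed to observe 123
    }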
I know that when the OS/hardware switches between the execution of different threads it manages storing/restoring the context of each thread, but I do not know many of the details. My question is: are there any registers that I can use to share information between threads? On x86? MIPS? ARM? etc. On Linux? Windows?
Any suggestion on how this can be done is highly appreciated.
There are some processor architectures where certain registers are not stored during context switch. From memory, 29K has some registers like that, which are essentially just global variables - gr112 .. gr115 from looking at the web. Now, this is a machine that has 192 physical registers, so it's not really a surprise it can afford sacrificing a few for this sort of purpose.
I know for a fact that x86 and x86-64 use "all registers", as does ARM. From what I can gather, MIPS also doesn't have any registers "reserved for the user". This applies to both Windows and Linux operating systems.
For any processor with a small number of registers (32 or fewer), I would say that "wasting" registers as globals just to hold some value that some other thread/process may want to read is a waste of resources - generic code will run faster if that register is available to the compiler as a general-purpose register.
If you are writing all the code that will go in a system, you may dedicate registers to whatever purpose you want, subject to the limitation that any register which is dedicated to a particular function will be unusable for any other purpose. There are some very specialized situations where this may be worth doing; these generally entail, bizarre as it may seem, programs that are very simple but need to run very fast. Some compilers like gcc can facilitate such usage by allowing a programmer to specify particular registers that the code it generates should not use for any purpose unless explicitly requested. In general, because the efficiency of compiled code will be reduced by restricting the number of registers the compiler can use, it will be more efficient to simply use statically-defined memory locations to exchange information between threads. While memory locations cannot be accessed as quickly as registers, one can reserve many of them for various purposes without affecting the compiler's ability to optimize register usage.
The one situation I've seen on the ARM where using a dedicated register was helpful was a situation where a significant plurality of methods needed to share a common static data structure. Specifying that a certain register should always be assumed to hold a pointer to that data structure, and that code must never modify it, eliminates the need for code to load the address of that structure before accessing items therein. If you want to share information among threads, that might be a useful approach, since accessing an arbitrary static location generally requires a PC-relative load to fetch the address followed by a load of the actual data; having a dedicated register would eliminate one of the loads.
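As a rough sketch of that approach using GCC's global-register-variable extension (this assumes every translation unit is built with the register kept free, e.g. -ffixed-r9 on 32-bit ARM, and the structure and names are made up for illustration):

    // Shared structure that many methods need to reach quickly.
    struct SharedState { int flags; int counter; };

    // GCC extension: tell the compiler that r9 always holds this pointer
    // and must never be used for anything else.
    register SharedState *g_shared asm("r9");

    int read_flags()
    {
        return g_shared->flags;   // one load; no PC-relative load of the address is needed
    }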
Your question seems reasonable at first glance. Other people have tried to answer the question directly. First we have two fairly nebulous concepts,
Threads
Registers
If you talk to Ada folks, they will freak out at the lack of definition of Linux or POSIX threads. They like something more like Java's green threads with very deterministic scheduling. I think you mean threads that are fast for the processor, like POSIX threads.
The 2nd issue is: what is a register? To most people they are limited to the 8, 16, or 32 registers that are hard-coded in the CPU's instruction set. There are often second-class registers that can be accessed by other means. Mainly, they are amazingly fast.
The inverse
The inverse of your question is quite common: how to set a register to a different value for each thread. The general purpose registers are used by the compiler, and the ABI of the compiler is intimately familiar to the OS context-switch code. What may not be clear is that things like the upper bits of a stack register may be constant every time a thread runs, but different for each thread. That is to say that each thread has its own stack.
With ARM Linux, a special co-processor register is used to implement thread local storage. The co-processor register is slower to access than a general purpose register, but it is still quite fast. That takes us to the difference between a process and a thread.
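For what it's worth, that register is the machinery behind ordinary thread-local variables; a small C++ sketch of what ends up using it on ARM Linux:

    // Each thread gets its own copy; the compiler finds it by reading the
    // thread-pointer (the CP15 co-processor register on ARM Linux).
    thread_local int per_thread_counter = 0;

    int bump()
    {
        return ++per_thread_counter;   // no locking needed: nothing is shared
    }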
Endemic to threads
A process has a completely different memory layout, i.e. the MMU page tables switch for different processes. For a thread, the register set may be different, but all of regular memory is shared between threads. For this reason, there are lots of mutexes when you do thread programming.
Now, consider a CPU cache. It is ultra-fast memory just like a general purpose register. The only difference is the amount of instructions it takes to address it.
Answer
All of the OSes and CPUs already have this! Each thread shares memory and that memory is cached. Loading a global variable in two threads from cache is nearly as fast as register access. As the thread register you propose could only hold a pointer, you would need to de-reference it to access some larger entity. Loading a global variable will be nearly as fast, and the compiler is free to put it in any register it likes. It is also possible for the compiler to use these registers in routines that don't need this access. So even if there were an OS that reserved a general-purpose register to be the same between threads, it would only be faster for a very small set of applications.
I have four threads, and I need to transfer data among these threads; the function looks like the following:
threadFunc() {
    processing;
    __sync();
    processing;
}
Are there any sync functions in Linux that make sure the threads will arrive at the same point?
In Windows, I use atomic add and atomic compare to implement __sync(), and I didn't find the atomic compare function in Linux.
You can use GCC's Atomic builtins to do a compare and swap, but you may want to consider using a pthreads barrier instead. See the documentation for pthread_barrier_init and pthread_barrier_wait for more information. You can also read this pthreads primer for a working example of barrier usage.
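A minimal sketch of the barrier approach for the four threads in the question (compile with -pthread; the printouts just stand in for the two processing phases):

    #include <pthread.h>
    #include <cstdio>

    constexpr int kThreads = 4;
    pthread_barrier_t barrier;

    void *threadFunc(void *arg)
    {
        long id = reinterpret_cast<long>(arg);
        std::printf("thread %ld: first phase\n", id);    // processing
        pthread_barrier_wait(&barrier);                   // the __sync() point
        std::printf("thread %ld: second phase\n", id);   // processing
        return nullptr;
    }

    int main()
    {
        pthread_t threads[kThreads];
        pthread_barrier_init(&barrier, nullptr, kThreads);
        for (long i = 0; i < kThreads; ++i)
            pthread_create(&threads[i], nullptr, threadFunc, reinterpret_cast<void *>(i));
        for (int i = 0; i < kThreads; ++i)
            pthread_join(threads[i], nullptr);
        pthread_barrier_destroy(&barrier);
        return 0;
    }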
I just read a MSDN article, "Synchronization and Multiprocessor Issues", that addresses memory cache consistency issues on multiprocessor machines. This was really eye opening to me, because I would not have thought there could be a race condition in the example they provide. This article explains that writes to memory might not actually occur (from the perspective of the other cpu) in the order written in my code. This is a new concept to me!
This article provides 2 solutions:
Using the "volatile" keyword on variables that need cache consistency across multiple cpus. This is a C/C++ keyword, and not available to me in Delphi.
Using InterlockedExchange() and InterlockedCompareExchange(). This is something I could do in Delphi if I had to. It just seems a little messy.
The article also mentions that "The following synchronization functions use the appropriate barriers to ensure memory ordering: •Functions that enter or leave critical sections".
This is the part I don't understand. Does this mean that any writes to memory that are limited to functions that use critical sections are immune from cache consistency and memory ordering issues? I have nothing against the Interlock*() functions, but another tool in my tool belt would be good to have!
This MSDN article is just the first step of multi-thread application development: in short, it means "protect your shared variables with locks (aka critical sections), because you are not sure that the data you read/write is the same for all threads".
The CPU per-core cache is just one of the possible issues, which will lead into reading wrong values. Another issue which may lead into race condition is two threads writing to a resource at the same time: it's impossible to know which value will be stored afterward.
Since code expects the data to be coherent, some multi-thread programs may behave wrongly. With multi-threading, you are not sure that the code you write, via individual instructions, is executed as expected, when it deals with shared variables.
InterlockedExchange/InterlockedIncrement functions are low-level asm opcodes with a LOCK prefix (or locked by design, like the XCHG EDX,[EAX] opcode), which will indeed force the cache coherency for all CPU cores, and therefore make the asm opcode execution thread-safe.
For instance, here is how a string reference count is implemented when you assign a string value (see _LStrAsg in System.pas - this is from our optimized version of the RTL for Delphi 7/2002 - since Delphi original code is copyrighted):
     MOV     ECX,[EDX-skew].StrRec.refCnt
     INC     ECX                                { thread-unsafe increment ECX = reference count }
     JG      @@1                                { ECX=-1 -> literal string -> jump not taken }
     .....
@@1: LOCK INC [EDX-skew].StrRec.refCnt          { ATOMIC increment of reference count }
     MOV     ECX,[EAX]
     ...
There is a difference between the first INC ECX and LOCK INC [EDX-skew].StrRec.refCnt: not only does the first increment ECX rather than the reference count variable, but the first is also not thread-safe, whereas the 2nd is prefixed by LOCK and is therefore thread-safe.
By the way, this LOCK prefix is one of the problems of multi-thread scaling in the RTL - it's better with newer CPUs, but still not perfect.
So using critical sections is the easiest way of making a code thread-safe:
var
  GlobalVariable: string;
  GlobalSection: TRTLCriticalSection;

procedure TThreadOne.Execute;
var
  LocalVariable: string;
begin
  ...
  EnterCriticalSection(GlobalSection);
  LocalVariable := GlobalVariable+'a'; { modify GlobalVariable }
  GlobalVariable := LocalVariable;
  LeaveCriticalSection(GlobalSection);
  ....
end;

procedure TThreadTwo.Execute;
var
  LocalVariable: string;
begin
  ...
  EnterCriticalSection(GlobalSection);
  LocalVariable := GlobalVariable; { thread-safe read of GlobalVariable }
  LeaveCriticalSection(GlobalSection);
  ....
end;
Using a local variable makes the critical section shorter, therefore your application will better scale and make use of the full power of your CPU cores. Between EnterCriticalSection and LeaveCriticalSection, only one thread will be running: other threads will wait in EnterCriticalSection call... So the shorter the critical section is, the faster your application is. Some wrongly designed multi-threaded applications can actually be slower than mono-threaded apps!
And do not forget that if your code inside the critical section may raise an exception, you should always write an explicit try ... finally LeaveCriticalSection() end; block to protect the lock release and prevent any deadlock of your application.
Delphi is perfectly thread-safe if you protect your shared data with a lock, i.e. a Critical Section. Be aware that even reference-counted variables (like strings) should be protected, even if there is a LOCK inside their RTL functions: this LOCK is there to ensure correct reference counting and avoid memory leaks, but it won't make your use of the variable thread-safe. To make it as fast as possible, see this SO question.
The purpose of InterlockedExchange and InterlockedCompareExchange is to change a shared pointer variable value. You can see them as a "light" version of the critical section to access a pointer value.
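A hedged C++ analogue of that idea, using compare-and-exchange to install a shared pointer only if nobody else got there first (Node and the names are made up for illustration):

    #include <atomic>

    struct Node { int value; };

    std::atomic<Node *> shared{nullptr};

    // Install our node only if the pointer is still nil; if another thread won the race,
    // the call fails and we can use the value it installed instead - the same pattern as
    // checking the result of InterlockedCompareExchangePointer.
    bool publish(Node *mine)
    {
        Node *expected = nullptr;
        return shared.compare_exchange_strong(expected, mine);
    }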
In all cases, writing working multi-threaded code is not easy - it's even hard, as a Delphi expert just wrote in his blog.
You should either write simple threads with no shared data at all (make a private copy of the data before the thread starts, or use read-only shared data - which is thread-safe by essence), or call some well designed and proven libraries - like http://otl.17slon.com - which will save you a lot of debugging time.
First of all, according to the language standards, volatile doesn't do what the article says it does. The acquire and release semantics of volatile are MSVC specific. This can be a problem if you compile with other compilers or on other platforms. C++11 introduces language supported atomic variables which will hopefully, in due course, finally put an end to the (mis-)use of volatile as a threading construct.
Critical sections and mutexes are indeed implemented so that reads and writes of protected variables will be seen correctly from all threads.
I think the best way to think of critical sections and mutexes (locks) is as devices to bring about serialization. That is, blocks of code protected by such locks are executed serially, one after another without overlap. The serialization applies to memory access also. There can be no problems due to cache coherence or read/write reordering.
Interlocked functions are implemented using hardware-based locks on the memory bus. These functions are used by lock-free algorithms. What this means is that they don't use heavyweight locks like critical sections, but rather these lightweight hardware locks.
Lock-free algorithms can be more efficient than those based on locks, but lock-free algorithms can be very much harder to write correctly. Prefer critical sections over lock-free unless the performance implications are discernible.
Another article well worth reading is The "Double-Checked Locking is Broken" Declaration.
Back in my days as a BeOS programmer, I read this article by Benoit Schillings describing how to create a "benaphore": a method of using an atomic variable to enforce a critical section that avoids the need to acquire/release a mutex in the common (no-contention) case.
I thought that was rather clever, and it seems like you could do the same trick on any platform that supports atomic-increment/decrement.
On the other hand, this looks like something that could just as easily be included in the standard mutex implementation itself... in which case implementing this logic in my program would be redundant and wouldn't provide any benefit.
Does anyone know if modern locking APIs (e.g. pthread_mutex_lock()/pthread_mutex_unlock()) use this trick internally? And if not, why not?
What your article describes is in common use today. Most often it's called a "Critical Section", and it consists of an interlocked variable, a bunch of flags, and an internal synchronization object (a Mutex, if I remember correctly). Generally, in scenarios with little contention, the Critical Section executes entirely in user mode, without involving the kernel synchronization object. This guarantees fast execution. When contention is high, the kernel object is used for waiting, which releases the time slice, making for faster turnaround.
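For concreteness, a sketch of the benaphore idea in C++ with a POSIX semaphore standing in for the kernel synchronization object (this is not the actual BeOS or Windows implementation):

    #include <atomic>
    #include <semaphore.h>

    class Benaphore
    {
    public:
        Benaphore()  { sem_init(&sem_, 0, 0); }    // semaphore starts at 0
        ~Benaphore() { sem_destroy(&sem_); }

        void lock()
        {
            // fetch_add returns the previous count: 0 means nobody held the lock,
            // so the uncontended path never touches the kernel object.
            if (counter_.fetch_add(1, std::memory_order_acquire) > 0)
                sem_wait(&sem_);                   // contended: block in the kernel
        }

        void unlock()
        {
            // If another thread incremented the count while we held the lock, wake one waiter.
            if (counter_.fetch_sub(1, std::memory_order_release) > 1)
                sem_post(&sem_);
        }

    private:
        std::atomic<int> counter_{0};
        sem_t sem_;
    };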
Generally, there is very little sense in implementing synchronization primitives in this day and age. Operating systems come with a big variety of such objects, and they are optimized and tested in a significantly wider range of scenarios than a single programmer can imagine. It literally takes years to invent, implement and test a good synchronization mechanism. That's not to say that there is no value in trying :)
Java's AbstractQueuedSynchronizer (and its sibling AbstractQueuedLongSynchronizer) works similarly, or at least it could be implemented similarly. These types form the basis for several concurrency primitives in the Java library, such as ReentrantLock and FutureTask.
It works by way of using an atomic integer to represent state. A lock may define the value 0 as unlocked, and 1 as locked. Any thread wishing to acquire the lock attempts to change the lock state from 0 to 1 via an atomic compare-and-set operation; if the attempt fails, the current state is not 0, which means that the lock is owned by some other thread.
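In C++ terms, that state transition is just a compare-and-set on an atomic integer (a sketch, not the actual AbstractQueuedSynchronizer code):

    #include <atomic>

    std::atomic<int> state{0};   // 0 = unlocked, 1 = locked

    bool tryAcquire()
    {
        int expected = 0;
        // Succeeds only if state was still 0; a failure means another thread owns the lock.
        return state.compare_exchange_strong(expected, 1, std::memory_order_acquire);
    }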
AbstractQueuedSynchronizer also facilitates waiting on locks and notification of conditions by maintaining CLH queues, which are lock-free linked lists representing the line of threads waiting either to acquire the lock or to receive notification via a condition. Such notification moves one or all of the threads waiting on the condition to the head of the queue of those waiting to acquire the related lock.
Most of this machinery can be implemented in terms of an atomic integer representing the state as well as a couple of atomic pointers for each waiting queue. The actual scheduling of which threads will contend to inspect and change the state variable (via, say, AbstractQueuedSynchronizer#tryAcquire(int)) is outside the scope of such a library and falls to the host system's scheduler.