I just read an MSDN article, "Synchronization and Multiprocessor Issues", that addresses memory cache consistency issues on multiprocessor machines. This was really eye-opening to me, because I would not have thought there could be a race condition in the example they provide. The article explains that writes to memory might not actually occur (from the perspective of another CPU) in the order written in my code. This is a new concept to me!
The article provides two solutions:
Using the "volatile" keyword on variables that need cache consistency across multiple cpus. This is a C/C++ keyword, and not available to me in Delphi.
Using InterlockExchange() and InterlockCompareExchange(). This is something I could do in Delphi if I had to. It just seems a little messy.
The article also mentions that "the following synchronization functions use the appropriate barriers to ensure memory ordering: ... functions that enter or leave critical sections".
This is the part I don't understand. Does this mean that any writes to memory that happen only inside functions using critical sections are immune to cache consistency and memory ordering issues? I have nothing against the Interlocked*() functions, but another tool in my tool belt would be good to have!
This MSDN article covers just the first step of multi-threaded application development. In short, it says: protect your shared variables with locks (a.k.a. critical sections), because you cannot be sure that the data you read/write is the same for all threads.
The per-core CPU cache is just one of the possible issues that can lead to reading wrong values. Another issue that may lead to a race condition is two threads writing to a resource at the same time: it is impossible to know which value will be stored afterward.
Since code expects the data to be coherent, some multi-threaded programs may behave incorrectly. With multi-threading, you cannot be sure that the code you write, as individual instructions, is executed as expected when it deals with shared variables.
The InterlockedExchange/InterlockedIncrement functions compile down to low-level asm opcodes with a LOCK prefix (or opcodes that are locked by design, like XCHG EDX,[EAX]), which will indeed force cache coherency for all CPU cores and therefore make the opcode execution thread-safe.
For instance, here is how a string reference count is handled when you assign a string value (see _LStrAsg in System.pas - this code is from our optimized version of the RTL for Delphi 7/2002, since the original Delphi code is copyrighted):
MOV ECX,[EDX-skew].StrRec.refCnt
INC ECX               { thread-unsafe increment; ECX = reference count }
JG @@1                { refCnt = -1 -> literal string -> jump not taken }
.....
@@1: LOCK INC [EDX-skew].StrRec.refCnt   { ATOMIC increment of the reference count }
MOV ECX,[EAX]
...
There is a difference between the first INC ECX and the LOCK INC [EDX-skew].StrRec.refCnt: not only does the first increment ECX rather than the reference count variable, but the first is also not thread-safe, whereas the second is prefixed by LOCK and is therefore thread-safe.
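If it helps to see the same distinction outside of asm, here is a minimal C++11 sketch (my own illustration with made-up counter names, not the RTL code): the plain increment compiles to an unlocked read-modify-write, while the atomic one emits a LOCK-prefixed instruction on x86.

#include <atomic>

int plain_counter = 0;               // ++ becomes load/increment/store
                                     // with no LOCK: not thread-safe
std::atomic<int> locked_counter{0};  // fetch_add emits LOCK XADD on x86

void increment_both() {
    ++plain_counter;                 // racy if another thread writes it too
    locked_counter.fetch_add(1);     // atomic across all cores
}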
By the way, this LOCK prefix is one of the problems of multi-thread scaling in the RTL - it behaves better on newer CPUs, but it is still not perfect.
So using critical sections is the easiest way of making code thread-safe:
var
  GlobalVariable: string;
  GlobalSection: TRTLCriticalSection;

procedure TThreadOne.Execute;
var
  LocalVariable: string;
begin
  ...
  EnterCriticalSection(GlobalSection);
  LocalVariable := GlobalVariable + 'a'; { modify GlobalVariable }
  GlobalVariable := LocalVariable;
  LeaveCriticalSection(GlobalSection);
  ...
end;

procedure TThreadTwo.Execute;
var
  LocalVariable: string;
begin
  ...
  EnterCriticalSection(GlobalSection);
  LocalVariable := GlobalVariable; { thread-safe read of GlobalVariable }
  LeaveCriticalSection(GlobalSection);
  ...
end;
Using a local variable makes the critical section shorter, so your application will scale better and make use of the full power of your CPU cores. Between EnterCriticalSection and LeaveCriticalSection, only one thread will be running the protected code: the other threads will wait in the EnterCriticalSection call... So the shorter the critical section is, the faster your application will be. Some badly designed multi-threaded applications can actually be slower than single-threaded ones!
And do not forget that if the code inside the critical section may raise an exception, you should always write an explicit try ... finally LeaveCriticalSection() end; block to protect the lock release and prevent any deadlock of your application.
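For readers coming from C++, the analogous protection is an RAII guard, which releases the lock even when an exception propagates; a minimal sketch with hypothetical names:

#include <mutex>
#include <string>

std::mutex g_section;      // plays the role of GlobalSection
std::string g_variable;    // plays the role of GlobalVariable

void append_a() {
    std::lock_guard<std::mutex> guard(g_section);  // EnterCriticalSection
    g_variable += "a";     // may throw (e.g. std::bad_alloc)
}   // guard's destructor always releases the lock, even on an exception -
    // the moral equivalent of try ... finally LeaveCriticalSection end;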
Delphi is perfectly thread-safe if you protect your shared data with a lock, i.e. a critical section. Be aware that even reference-counted variables (like strings) should be protected, even though there is a LOCK inside their RTL functions: that LOCK is there to ensure correct reference counting and avoid memory leaks, but it is not enough to make the variables themselves thread-safe. To make it as fast as possible, see this SO question.
The purpose of InterlockedExchange and InterlockedCompareExchange is to change a shared pointer variable value. You can see them as "light" versions of a critical section for accessing a pointer-sized value.
In all cases, writing working multi-threaded code is not easy - it is even hard, as a Delphi expert recently wrote in his blog.
You should either write simple threads with no shared data at all (make a private copy of the data before the thread starts, or use read-only shared data, which is thread-safe by nature), or call some well-designed and proven libraries - like http://otl.17slon.com - which will save you a lot of debugging time.
First of all, according to the language standards, volatile doesn't do what the article says it does. The acquire and release semantics of volatile are MSVC-specific; this can be a problem if you compile with other compilers or on other platforms. C++11 introduces language-supported atomic variables, which will hopefully, in due course, finally put an end to the (mis)use of volatile as a threading construct.
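As a rough sketch of what the language-supported replacement looks like (hypothetical names, assuming a C++11 compiler), the volatile flag becomes a std::atomic with explicit acquire/release ordering:

#include <atomic>
#include <thread>

std::atomic<bool> ready{false};   // replaces "volatile bool ready"
int payload = 0;                  // ordinary, non-atomic data

void producer() {
    payload = 42;                                  // plain write...
    ready.store(true, std::memory_order_release);  // ...published by a release store
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // acquire pairs with release
        std::this_thread::yield();
    // payload is guaranteed to be 42 here on every conforming compiler,
    // not just under MSVC's extended volatile semantics
}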
Critical sections and mutexes are indeed implemented so that reads and writes of protected variables will be seen correctly from all threads.
I think the best way to think of critical sections and mutexes (locks) is as devices to bring about serialization. That is, blocks of code protected by such locks are executed serially, one after another without overlap. The serialization applies to memory access also. There can be no problems due to cache coherence or read/write reordering.
Interlocked functions are implemented using hardware-based locks on the memory bus, and they are what lock-free algorithms are built from. What this means is that they don't use heavyweight locks like critical sections, but rather these lightweight hardware locks.
Lock-free algorithms can be more efficient than those based on locks, but they are also very much harder to write correctly. Prefer critical sections over lock-free code unless the performance implications are discernible.
Another article well worth reading is The "Double-Checked Locking is Broken" Declaration.
I have started learning multicore programming, beginning with C++11 atomics. I would like to know: do all shared variables need to be atomic?
The only time a variable needs to be "atomic", i.e. updated in one fell swoop without any other thread being able to read it meanwhile, is if it can even be read while someone else is updating it. For instance, if it is set on initialization and then never changes, no one can ever read it while it is changing, and so it doesn't have to be atomic. On the other hand, if it ever changes after initialization and there is a risk that any thread other than the changing one reads it while it is changing, then it needs to be atomic (via atomic intrinsics, protected by a mutex, or otherwise).
If multiple threads access (read/write) the same variable, then it should be atomic.
Also, you can go through this.
Not necessarily for all scenarios. Also, note that atomicity of variable access alone does not guarantee full thread safety; it just ensures that the particular variable being read is obtained as a whole. On some architectures, a read operation does not happen in a single assembly instruction. For example, if you are reading a 64-bit value, the compiler might implement the read using two assembly load instructions, such that the first instruction reads the lower 32 bits and the second reads the higher 32 bits. This in turn can lead to a race condition. So atomic reads are preferred.
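A small sketch of that torn-read scenario (made-up names, assuming a 32-bit target where a 64-bit access is split into two instructions):

#include <atomic>
#include <cstdint>

uint64_t plain_value = 0;               // on a 32-bit CPU this may be read and
                                        // written as two 32-bit halves, so a
                                        // concurrent reader can see half old,
                                        // half new data
std::atomic<uint64_t> atomic_value{0};  // always read/written as a whole

void writer() {
    plain_value = 0xFFFFFFFF00000000ull;        // may be observed torn
    atomic_value.store(0xFFFFFFFF00000000ull);  // never observed torn
}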
I was implementing a simple Producer/Consumer program that had some semaphores and shared memory. To keep things simple, let's assume that there's just a block of shared memory and a semaphore in my program.
At first, I thought that I only had to treat as critical sections the bits of code that write to the shared memory block. But as the shared memory block consists of, let's say, 1024 bytes, I can't read all the data in one step (it's not an atomic operation), so it is indeed possible that while I'm reading from it, the Producer comes and starts writing to it, and the reader will get half old data, half new data. From this, I can only conclude that I also have to put the shared-memory reading logic inside a "semaphore" block.
Now, I have lots of code that looks like this:
if (sharedMemory[0] == '0') { ... }
In this case, I am just looking at a single char in memory. I guess I don't have to worry about putting a semaphore around this, do I?
And what if instead I have something like
if (sharedMemory[0] == '0' && sharedMemory[1] == '1') { ... }
From my perspective, I guess that as these are two operations, I'd have to consider this a critical section, thus having to put a semaphore around it. Am I right?
Thanks!
Technically, on a multicore or multiprocessor system the only things that are atomic are the assembly opcodes which are specifically documented as being atomic. Even reading a single byte presents a (quite small) chance that another processor will come along and modify it before you're done reading it, except in some cases involving the CPU cache and aligned chunks of memory (fun thread: http://software.intel.com/en-us/forums/showthread.php?t=76744, interesting read: http://www.corensic.com/CorensicBlog/tabid/101/EntryId/8/Memory-Consistency-Models.aspx).
You must either use types which internally guarantee atomicity or specifically protect accesses on multithreaded multicore systems.
(The answer may change slightly on IL platforms like .NET and JVMs since they make their own guarantees about what's atomic and what isn't).
Definitely lock around non-atomic operations, and checking two different values counts as a non-atomic operation, although there are tricks you can use to check up to four bytes or more, provided your processor doesn't cache the results. You have to consider how your data is used, but basically, any access to shared memory should have a semaphore around it.
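A minimal sketch of that guarded two-byte check, with a std::mutex standing in for the semaphore from the question (names are hypothetical):

#include <mutex>

char sharedMemory[1024];
std::mutex shmGuard;   // stands in for the semaphore

bool header_matches() {
    std::lock_guard<std::mutex> lock(shmGuard);
    // both reads happen under the same lock, so the producer cannot
    // rewrite the buffer between the two comparisons
    return sharedMemory[0] == '0' && sharedMemory[1] == '1';
}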
Is it safe for a thread to READ a variable set by a Delphi VCL event?
When a user clicks on a VCL TCheckbox, the main thread sets a boolean to the checkbox's Checked state.
CheckboxState := CheckBox1.Checked;
At any time, a thread reads that variable
if CheckBoxState then ...
It doesn't matter if the thread "misses" a change to the boolean, because the thread checks the variable in a loop as it does other things. So it will see the state change eventually...
Is this safe? Or do I need special code? Is surrounding the read and write of the variable (in the thread and main thread respectively) with critical code calls necessary and sufficient?
As I said, it doesn't matter if the thread gets the "wrong" value, but I keep thinking that there might be a low-level problem if one thread tries to read a variable while the main thread is in the middle of writing it, or vice versa.
My question is similar to this one: Cross thread reading of a variable whose value is not considered important.
(Also related to my previous question: Using EnterCriticalSection in Thread to update VCL label)
This is safe, for three reasons:
Only one thread writes to the variable.
The variable is only one byte, so there is no way to read an inconsistent value. It will be read either as True or as False. There can't be alignment issues with Delphi boolean values.
The Delphi compiler does not check extensively whether a variable is actually written to, and it does not "optimize" away any reads of it. Non-local variables are always read from memory, so there is no need for a volatile specifier.
Having said that, if you are really unsure about this you could use an integer value instead of the boolean, and use the InterlockedExchange() function to write to the variable. This is overkill here, but it's a good technique to know about, because for values up to a single machine word in size it can eliminate the need for locks.
You can also replace the boolean by a proper synchronization primitive, like an event, and have the thread block on that - this would help you eliminate busy loops in the thread.
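For comparison, here is the same flag pattern sketched in C++ terms (hypothetical names; the Delphi version works the same way with a plain boolean, as explained above):

#include <atomic>

std::atomic<bool> checkbox_state{false};  // written by the main thread only

// main thread, in the click handler (CheckboxState := CheckBox1.Checked):
void on_checkbox_click(bool checked) {
    checkbox_state.store(checked);
}

// worker thread, somewhere in its loop (if CheckBoxState then ...):
void worker_iteration() {
    if (checkbox_state.load()) {
        // ...react to the checkbox state...
    }
}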
In your case (the Checked property) the read operation is atomic, so it is safe. It is the same as with the TThread.Terminated property: simple read and write operations on properly aligned bytes, words and doublewords are atomic. You can check the Intel documentation for more information:
CHAPTER 8 - MULTIPLE-PROCESSOR MANAGEMENT
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
For the example you give, it will be safe. Technically, the main issue arises in cases where your variable is larger than a machine word, in which case you might get the high word and low word out of sync. For small values, though, this is not an issue.
If you think it is a potential problem, for example when sharing a pointer, then the thing to use is a TCriticalSection to control reads and writes of the item. This is fast enough for all practical situations, and ensures you are 100% safe.
While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level.
I imagine there being a set of memory "flags" saying:
lock A is held by thread 1
lock B is held by thread 3
lock C is not held by any thread
etc
But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:
mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
In practice, these tend to be implemented with CAS and LL/SC.
(...and some spinning before giving up the time slice of the thread - usually by calling into a kernel function that switches context.)
If you only need a spinlock, Wikipedia gives you an example which trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, a CAS is not needed for crafting a spinlock - but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit more: the lock prefix asserts the #LOCK signal, which ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily carried out this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between reading and writing, since instructions will not be interrupted halfway. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not be quite the same - it could be preempted just between the two movs.)
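Here is a compact sketch of that xchg-style spinlock written with C++11 atomics (my illustration; on x86, exchange() compiles down to the implicitly locked XCHG):

#include <atomic>

std::atomic<int> lock_word{0};   // 0 = free, 1 = held

void spin_lock() {
    // exchange() writes 1 and returns the previous value in one atomic
    // step - the portable equivalent of XCHG reg, [mem]
    while (lock_word.exchange(1, std::memory_order_acquire) == 1)
        ;                        // someone else holds it: keep spinning
}

void spin_unlock() {
    lock_word.store(0, std::memory_order_release);
}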
On current architectures, as pointed out in the comments, you mostly end up using the atomic primitives of the CPU and the coherency protocols provided by the memory subsystem.
For this reason, you not only have to use these primitives, but also account for the cache/memory coherency guaranteed by the architecture.
There may be implementation nuances as well. Considering e.g. a spinlock:
instead of a naive implementation, you should probably use e.g. a TTAS spin-lock with some exponential backoff (see the sketch after this list),
on a Hyper-Threaded CPU, you should probably issue pause instructions that serve as hints that you're spinning - so that the core you are running on can do something useful during this
you should really give up on spinning and yield control to other threads after a while
etc...
this is still user mode - if you are writing a kernel, you might have some other tools that you can use as well (since you are the one that schedules threads and handles/enables/disables interrupts).
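A sketch combining those refinements: a TTAS lock that spins on a plain read first, issues pause while waiting, and yields after a while (the threshold is an arbitrary assumption and _mm_pause is x86-specific; real implementations tune this, often with true exponential backoff):

#include <atomic>
#include <thread>
#include <immintrin.h>   // _mm_pause - x86-specific

std::atomic<bool> held{false};

void ttas_lock() {
    int spins = 0;
    for (;;) {
        // "test" part: spin on a plain load so we hit the local cache
        // line instead of locking the bus on every iteration
        while (held.load(std::memory_order_relaxed)) {
            _mm_pause();                    // hint to the core: we're spinning
            if (++spins > 4096) {           // arbitrary give-up threshold
                std::this_thread::yield();  // let other threads run
                spins = 0;
            }
        }
        // "test-and-set" part: only now pay for the atomic exchange
        if (!held.exchange(true, std::memory_order_acquire))
            return;                         // lock acquired
    }
}

void ttas_unlock() {
    held.store(false, std::memory_order_release);
}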
The x86 architecture has long had an instruction called xchg which will exchange the contents of a register with a memory location. xchg has always been atomic.
There has also always been a lock prefix that could be applied to any single instruction to make that instruction atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of a locked instruction. (xchg was implicitly locked.)
This article has some sample code using xchg to implement a spinlock
http://en.wikipedia.org/wiki/Spinlock
When multi-CPU and later multi-core systems began to be built, more sophisticated mechanisms were needed to ensure that lock and xchg would synchronize all of the memory subsystems, including the L1 cache on all of the processors. Around this time, new research into locking and lockless algorithms showed that an atomic compare-and-set was a more flexible primitive to have, so more modern CPUs have that as an instruction.
Addendum: In comments andras supplied a "dusty old" list of instructions which allow the lock prefix. http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm
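In C++11 terms, that compare-and-set primitive is exposed as compare_exchange; a minimal sketch of the usual retry loop built on top of it:

#include <atomic>

std::atomic<int> value{0};

// add `delta` without any lock: the classic CAS retry loop
void lock_free_add(int delta) {
    int observed = value.load();
    // compare_exchange_weak stores observed+delta only if `value` still
    // equals `observed`; on failure it reloads `observed`, and we retry
    while (!value.compare_exchange_weak(observed, observed + delta)) {
        // observed now holds whatever another thread wrote; loop again
    }
}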
I like to think of thread synchronization as built bottom-up, where the processor and the operating system provide primitive constructs on top of which more sophisticated ones are built.
At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation... You also have other processor constructs that allow you to disable and enable interrupts (however, these are considered dangerous; under certain circumstances you have no other option but to use them).
The operating system provides the ability to context-switch between tasks, which can happen every time a thread has used up its time slice, or for other reasons (I will come to that).
Then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think of a spinning mutex): they continuously wait for a condition to become true, checking that condition atomically.
These spinning mutexes can then use the functionality provided by the OS (the context switch, and system calls like yield that relinquish control to another thread) to give us full mutexes.
These constructs are further utilized by higher-level constructs like condition variables (which can keep track of how many threads are waiting for the mutex, and which thread to allow first when the mutex becomes available).
These constructs can then be used to provide even more sophisticated synchronization constructs, for example semaphores - one is sketched below.
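To make that layering concrete, here is a sketch of a counting semaphore built from the two layers below it, a mutex and a condition variable (pre-C++20 style, since std::counting_semaphore did not exist before then):

#include <condition_variable>
#include <mutex>

class Semaphore {
    std::mutex m;
    std::condition_variable cv;  // tracks and wakes waiting threads
    int count;
public:
    explicit Semaphore(int initial) : count(initial) {}

    void acquire() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });  // sleep, don't spin
        --count;
    }

    void release() {
        std::lock_guard<std::mutex> lock(m);
        ++count;
        cv.notify_one();         // hand the freed slot to one waiter
    }
};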
Inspired by this question: In Complexity Analysis why is ++ considered to be 2 operations?
Take the following pseudocode:
class test
{
int _counter;
void Increment()
{
_counter++;
}
}
Would this be considered thread-safe on an x86 architecture? Furthermore, are the INC/DEC assembly instructions thread-safe?
No, incrementing is not thread-safe. Neither are the INC and DEC instructions. They all require a load and a store, and a thread running on another CPU could do its own load or store on the same memory location interleaved between those operations.
Some languages have built-in support for thread synchronization, but it's usually something you have to ask for, not something you get automatically on every variable. Those that don't have built-in support usually have access to a library that provides similar functionality.
In a word, no.
You can use something like InterlockedIncrement() depending on your platform. On .NET you can use the Interlocked class methods (Interlocked.Increment() for example).
As Rob Kennedy mentioned, even if the operation is implemented in terms of a single INC instruction, as far as the memory is concerned a read/increment/write set of steps is performed. There is an opportunity for corruption on a multi-processor system.
There's also the volatile issue, which would be a necessary part of making the operation thread-safe - however, marking the variable volatile is not sufficient to make it thread-safe. Use the interlocked support the platform provides.
This is true in general, and on x86/x64 platforms certainly.
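A sketch of the thread-safe version of the class from the question, using C++11 atomics (assuming C++ is the target language; on .NET, Interlocked.Increment() plays the same role):

#include <atomic>

class test {
    std::atomic<int> _counter{0};
public:
    void Increment() {
        // fetch_add performs the load, the add and the store as one
        // indivisible operation - LOCK XADD / LOCK INC on x86
        _counter.fetch_add(1);
    }
};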