I plan on writing a multithreaded part in my game-project:
Thread A: loads bunch of objects from disk, which takes up to several seconds. Each object loaded increments a counter.
Thread B: a game loop, in which I either display loading screen with number of loaded objects, or start to manipulate objects once loading is done.
In code I believe it will look as following:
int Counter = 0;
std::vector<Object> Objects;

THREAD A:
for (int i = 0; i < ObjectsToLoad; ++i) {
    Objects.push_back(LoadObject());
    ++Counter;
}
return;

THREAD B:
...
while (true) {
    ...
    int C = Counter;
    if (C < ObjectsToLoad)
        RenderLoadscreen(C);
    else
        WorkWithObjects(Objects);
    ...
}
...
Technically, this can be counted as a race condition: an object may already be loaded while the counter is not yet incremented, so B reads an old value. I also need to cache the counter in B so its value won't change between the check and the rendering.
Now the question is: should I implement any synchronization here, like making the counter atomic or introducing a mutex or condition variable? The point is that I can safely sacrifice an iteration of the loop until the counter changes. And from what I understand, as long as A only writes the value and B only reads it, everything is fine.
I've been discussing this question with a friend, but we couldn't come to an agreement, so we decided to ask for the opinion of someone more competent in multithreading. The language is C++, if that helps.
You have to consider memory visibility / caching. Without memory barriers this can very well lead to delays of several seconds until the data is visible to Thread B(1).
This applies to both kinds of data: the Counter and the Objects list.
The C++11 standard(2) guarantees that multithreaded programs are executed correctly only if you don't introduce data races. Without synchronization your program basically has undefined behaviour(3). However, in practice it might still work without it.
Yes, use a mutex and synchronize access to Counter and Objects.
(1) This is because each CPU core has its own registers and cache. If you don't tell core A that some other core B might be interested in the data, it can apply optimizations such as leaving the data in a register. Core A has to write the data to a higher-level memory region (L2/L3 cache or RAM) so core B can load the changes.
(2) Versions before C++11 did not address multithreading at all. There was support for mutexes, atomics, etc. through third-party libraries, but the language itself was thread-agnostic.
See: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
(3) The problem is that your code can be reordered (for more efficient execution) at different stages: by the compiler, by the assembler, and also by the CPU. You must tell the computer which instructions need to stay in that order by adding memory barriers, through atomics or mutexes. This works the same way in most languages.
I'd recommend watching these very interesting videos about the C++11 memory model:
atomic<> weapons by Herb Sutter
IMO: If you identify data that is accessed by multiple threads, use synchronization. Multithreading bugs are hard to track down and reproduce, so it's better to avoid them altogether.
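To make this concrete for the OP's scenario, here is a minimal sketch of the mutex approach; Object, ObjectsToLoad, LoadObject, RenderLoadscreen and WorkWithObjects are the OP's own names:

#include <mutex>
#include <vector>

std::mutex Mtx;               // guards both Counter and Objects
int Counter = 0;
std::vector<Object> Objects;  // Object, LoadObject, etc. are the OP's names

// THREAD A
void Loader(int ObjectsToLoad) {
    for (int i = 0; i < ObjectsToLoad; ++i) {
        Object Obj = LoadObject();                 // slow disk I/O, done outside the lock
        std::lock_guard<std::mutex> Lock(Mtx);
        Objects.push_back(std::move(Obj));
        ++Counter;                                 // list and counter change together
    }
}

// THREAD B
void GameLoop(int ObjectsToLoad) {
    while (true) {
        int C;
        {
            std::lock_guard<std::mutex> Lock(Mtx);
            C = Counter;                           // cache the value for this frame
        }
        if (C < ObjectsToLoad)
            RenderLoadscreen(C);
        else
            WorkWithObjects(Objects);              // loading finished; A no longer writes
    }
}

Taking the lock only around the push and the increment keeps the slow disk I/O outside the critical section, so thread B is never blocked for seconds at a time.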
A race condition typically arises only when two threads concurrently perform non-atomic read-modify-write operations on the same datum. In this case, only one thread writes (thread A), while the other thread only reads (thread B).
The only "incorrectness" you'll encounter is, as you said, that an object may already be loaded while the counter hasn't been incremented yet. This causes B to read stale data, because the load-and-increment pair is not executed atomically.
If you don't mind this innocent anomaly, then it works just fine. :)
If this annoys you, then you need to execute all of the load-and-increment statements in one go (by using locks or any other synchronization primitive).
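If the OP prefers to keep the lock-free "A writes, B reads" scheme in C++, a hedged alternative sketch makes the counter a std::atomic and uses release/acquire ordering, so that the increment also publishes the push_back just before it (again reusing the OP's names):

#include <atomic>
#include <vector>

std::atomic<int> Counter{0};
std::vector<Object> Objects;    // written only by thread A

// THREAD A
for (int i = 0; i < ObjectsToLoad; ++i) {
    Objects.push_back(LoadObject());
    Counter.store(i + 1, std::memory_order_release);    // publish the push above
}

// THREAD B
while (true) {
    int C = Counter.load(std::memory_order_acquire);    // cache the value for this frame
    if (C < ObjectsToLoad)
        RenderLoadscreen(C);
    else
        WorkWithObjects(Objects);   // all of A's pushes happen-before this read
}

Once B observes Counter == ObjectsToLoad, all of A's pushes happen-before that load, so B may then read Objects without a lock; while C < ObjectsToLoad, B must not touch Objects at all, which the OP's loop already respects.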
Related
In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let's say 89ABCDEF) while the first thread reads the variable, couldn't the first thread read complete garbage from the variable if the access weren't synchronized (on some system level)?
foo = 12345678   (before)
      89ABCDEF   (new data to be written)
      •••••      (writing thread's progress so far)
foo = 89ABC678   (memory content a reader could see: half new, half old)
Since I have never seen those things happen, I assume there is some system-level synchronization when writing variables. I assume that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the act of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and should there be a big warning sign when using, say, long and double, because they always implicitly require synchronization?
Concerning your first point: when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from other threads, processes, the OS, etc. It is not a matter of synchronization; it is simply required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids any access to this datum until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need atomicity and can be regular accesses. Atomic access is mostly used to synchronize threads and to ensure that only one thread is in a critical section.
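For illustration, in C++ a minimal spinlock built on such a test-and-set primitive might look like this sketch:

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock() {
    // test_and_set is an atomic read-modify-write: it sets the flag and returns
    // the previous value in one indivisible step.
    while (lock_flag.test_and_set(std::memory_order_acquire))
        ;   // another thread holds the lock; keep spinning
}

void unlock() {
    lock_flag.clear(std::memory_order_release);
}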
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that execution order may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: the compiler reorders instructions, the processor executes them out of order, write-back caches write data to memory in any order, and memory write buffers do the same. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining, and instruction 2 can be executed before it (an optimizing compiler will probably remove instruction 1 entirely). But if instruction 4 is executed before instruction 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which takes tens of cycles (see memory barrier).
Second, without pipelining you pay the full latency of the operation: read the data from memory, modify it, and write it back. This latency always exists, but for regular memory accesses you can do other work during that time, which largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and should there be a big warning sign when using, say, long and double, because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only the specific synchronization instructions are expensive.
Synchronization always has a cost. And the cost increases with contention: threads wake up, fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by synchronizing at a much finer-grained level, as with a CAS (compare-and-swap) operation in the CPU, or with a memory barrier when reading a volatile variable (see the CAS sketch after the example below). An even better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code will take several seconds to execute because it is doing I/O, and it therefore runs a high chance of creating contention among other threads wanting to execute the same block. That duration is enough to build up a massive queue of waiting threads on a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their work (which we'd otherwise do inside a much larger synchronized block) with the minimum amount of synchronization.
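As a small C++ sketch of the fine-grained CAS approach mentioned above (counter here is just a hypothetical shared variable):

#include <atomic>

std::atomic<int> counter{0};

// Lock-free increment: retry the compare-and-swap until no other thread has
// modified the value between our read and our write.
void increment() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // on failure, expected is reloaded with the current value; just retry
    }
}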
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
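A tiny C++ sketch of the "temporary, invalid state" point, using a hypothetical two-field structure:

#include <mutex>

struct Rect { int width = 0; int height = 0; };   // hypothetical shared structure

Rect shared;
std::mutex m;

void resize(int w, int h) {                 // thread T
    std::lock_guard<std::mutex> lock(m);
    shared.width  = w;                      // between these two writes the structure is
    shared.height = h;                      // in a temporary, half-updated state
}

int area() {                                // thread U
    std::lock_guard<std::mutex> lock(m);
    return shared.width * shared.height;    // never sees the half-updated state
}

The same lock/unlock pair also covers the cache-consistency half: the mutex implementation issues the hardware instructions described above on entering and leaving the protected block.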
According to wikipedia: A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Usually, articles talk about something like the following (I will use monitors instead of membars):
class ReadWriteExample {

    int A = 0;
    int Another = 0;

    // thread1 runs this method
    void writer() {
        lock monitor1;    // a new value will be stored
        A = 10;           // stores 10 to memory location A
        unlock monitor1;  // a new value is ready for reader to read
        Another = 20;     // #see my question
    }

    // thread2 runs this method
    void reader() {
        lock monitor1;    // a new value will be read
        assert A == 10;   // loads from memory location A
        print Another;    // #see my question
        unlock monitor1;  // a new value was just read
    }
}
But I wonder: is it possible that the compiler or the CPU shuffles things around in such a way that the code prints 20? I don't need a guarantee.
I.e., by definition, operations issued prior to the barrier can't be pushed down below it, but is it possible that operations issued after the barrier are occasionally seen before it? (Just as a possibility.)
Thanks
My answer below only addresses Java's memory model. The answer really can't be made for all languages as each may define the rules differently.
But I wonder: is it possible that the compiler or the CPU shuffles things around in such a way that the code prints 20? I don't need a guarantee.
Your question seems to be: "Is it possible for the store Another = 20 to be re-ordered above the monitor unlock?"
The answer is yes, it can be. If you look at the JSR 166 Cookbook, the first grid shown explains how these re-orderings work.
In your writer case, the first operation would be a MonitorExit and the second operation would be a NormalStore. The grid shows that, yes, this sequence is permitted to be re-ordered.
This is known as Roach Motel ordering, that is, memory accesses can be moved into a synchronized block but cannot be moved out (a C++ sketch of this one-way rule follows the quoted rules below).
What about another language? Well, this question is too broad to answer for every language, as each may define the rules differently. If that is what you need, you should refine your question.
In Java there is the concept of happens-before. You can read all the details about it in the Java Language Specification. A Java compiler or runtime engine can re-order code, but it must abide by the happens-before rules. These rules are important for a Java developer who wants detailed control over how their code is re-ordered. I myself have been burnt by re-ordering: it turned out I was referencing the same object via two different variables, and the runtime engine re-ordered my code, not realizing that the operations were on the same object. If I had either a happens-before relationship (between the two operations) or used the same variable, the re-ordering would not have occurred.
Specifically:
It follows from the above definitions that:
- An unlock on a monitor happens-before every subsequent lock on that monitor.
- A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
- A call to start() on a thread happens-before any actions in the started thread.
- All actions in a thread happen-before any other thread successfully returns from a join() on that thread.
- The default initialization of any object happens-before any other actions (other than default-writes) of a program.
Short answer: yes. This is very compiler and CPU-architecture dependent. You have here the definition of a race condition. The scheduling quantum won't end mid-instruction (you can't have two writes to the same location at once). However, the quantum could end between instructions, and how they are executed out of order in the pipeline is architecture dependent (outside of the monitor block).
Now come the "it depends" complications. The CPU guarantees little (see race condition). You might also look at NUMA (ccNUMA): it is a method to scale CPU and memory access by grouping CPUs (nodes) with local RAM and a group owner, plus a special bus between nodes.
The monitor doesn't prevent the other thread from running. It only prevents it from entering the code between the monitors. Therefore, when the writer exits the monitor section, it is free to execute the next statement, regardless of the other thread being inside the monitor. Monitors are gates that block access. Also, the quantum could interrupt the second thread after the assert A == 10 statement, allowing Another to change value. Again, the quantum won't interrupt mid-instruction. Always think of threads as executing in perfect parallel.
How do you apply this? I'm a bit out of date with current Intel processors and how their pipelines work (hyperthreading etc.); sorry, it's C#/Java these days. Years ago I worked with a processor called MIPS, and it had (through compiler instruction ordering) the ability to execute instructions that occurred serially after a branch instruction (the delay slot). On that CPU/compiler combination, yes, what you describe could happen. If Intel offers the same, then yes, it could happen, especially with NUMA (both Intel and AMD have it; I'm most familiar with AMD's implementation).
My point: if the threads were running across NUMA nodes and accessing a common memory location, it could occur. Of course the OS tries hard to schedule operations within the same node.
You might be able to simulate this. I know C++ on MS allows access to NUMA technology (I've played with it). See if you can allocate memory across two nodes (placing A on one and Another on the other) and schedule the threads to run on specific nodes.
What happens in this model is that there are two pathways to RAM. I suppose this isn't what you had in mind; you probably assumed a single path/node model, in which case I go back to the MIPS model I described above.
I assumed a processor that interrupts; there are others that use a yield model.
Let's say I have two threads reading and modifying a bool / int "state". The reads and writes are guaranteed to be atomic by the processor.
Thread 1:
if (state == ENABLED)
{
    Process_Data()
}
Thread 2:
state = DISABLED
In this case, yes, thread 1 can read the state, enter its if, and call Process_Data(), and then thread 2 can change the state. But it isn't incorrect at that point to still go on to Process_Data(). Yes, if we peek under the hood, there is an inconsistency: state is DISABLED while we are inside Process_Data(). But after it has executed, the next time thread 1 runs it will read state == DISABLED and will not call Process_Data().
My question is: do I still need a lock in both threads to make thread 1's check-state-and-process atomic and thread 2's write atomic (with respect to thread 1)?
You've addressed the atomicity concerns. However, in modern processors, you have to worry not just about atomicity, but also memory visibility.
For example, thread 1 is executing on one processor, and reads ENABLED from state - from its processor's cache.
Meanwhile, thread 2 is executing on a different processor, and writes DISABLED to state on its processor's cache.
Without further code (in some languages, for example, declaring state as volatile), the DISABLED value may not get flushed to main memory for a long time. It may never get flushed to main memory at all if thread 2 eventually changes the value back to ENABLED.
Meanwhile, even if the DISABLED value is flushed to main memory, thread 1 may never pick it up, instead continuing to use its cached value of ENABLED indefinitely.
Generally if you want to share values between threads, it's better to do so explicitly using the appropriate mechanisms for the programming language and environment that you're using.
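As one concrete example, in C++ the shared flag could be made a std::atomic, which addresses the visibility concern; a sketch reusing the OP's names (Process_Data and the ENABLED/DISABLED states are the OP's):

#include <atomic>

enum class State { ENABLED, DISABLED };

std::atomic<State> state{State::ENABLED};

void thread1() {                        // reader
    if (state.load(std::memory_order_acquire) == State::ENABLED)
        Process_Data();                 // may still run once after the flag flips, as the OP accepts
}

void thread2() {                        // writer
    state.store(State::DISABLED, std::memory_order_release);
}

This only addresses visibility; it does not make the check-and-process sequence atomic, which matches what the OP says they can tolerate.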
There's no way to answer your question generically. If the specification for the language, compiler, threading library and/or platform you are using says you need protection, then you do. If it says you don't, then you don't. I believe every threading library or multi-threading implementation specifies rules for sane use and sharing of data. If yours doesn't, it's a piece of junk that is impossible to use reliably and you should get a better one.
Do not make the mistake of thinking, "This is safe because I can't think of any way it can go wrong." Or "I tested this, and I couldn't get it to fail, so it's safe." That kind of thinking produces fragile code that tends to fail when you change compiler options, upgrade your CPU, or run the program on a different platform. Follow the specifications for the tools you are using.
I've been reading the book Cracking the Coding Interview recently, but there's one paragraph on page 257 that confuses me a lot:
A thread is a particular execution path of a process; when one thread modifies a process resource, the change is immediately visible to sibling threads.
IIRC, if one thread makes a change to a variable, the change will first be saved in the CPU cache (say, the L1 cache), and is not guaranteed to become visible to other threads unless the variable is declared as volatile.
Am I right?
Nope, you're wrong. But this is a very common misunderstanding.
Every modern multi-core CPU has hardware cache coherence. The L1 and similar caches are invisible to software. CPU caches like the L1 cache have nothing to do with memory visibility.
Changes are visible immediately when a thread modifies a process resource. The issue is optimizations that cause process resources not to be modified in precisely the order the code specifies.
If your code has k = j; i = 4; if (j == 2) foo(); an optimizer might see that your first assignment reads the value of j. So it might not bother reading it again when you compare it to 2 since it "knows" that it can't have changed. However, another thread might have changed it. So optimizations of some kinds need to be disabled when synchronization between threads is required. That's what things like volatile do.
If compilers and CPUs made no optimizations and executed a program precisely as it was written, volatile would never be needed. Memory visibility is about optimizations in code (some done by the compiler, some by the CPU), not caches.
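As an illustration, here is a hedged C++ sketch of exactly that kind of optimization (std::atomic plays the role that volatile plays in Java here): with a plain flag the compiler may read it once and reuse the value, while an atomic flag must be re-read:

#include <atomic>
#include <thread>

bool plain_done = false;                // the compiler may keep this in a register
std::atomic<bool> atomic_done{false};   // every load really goes back to the variable

void spin_plain() {
    while (!plain_done) { }             // may be compiled as if plain_done were read only once
}

void spin_atomic() {
    while (!atomic_done.load(std::memory_order_acquire)) { }   // re-read on every iteration
}

int main() {
    std::thread t(spin_atomic);
    atomic_done.store(true, std::memory_order_release);   // the spinning thread will see this
    t.join();
}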
I think the text you are quoting is incorrect. The whole idea of the Java Memory Model is to deal with the complex optimizations performed by modern software and hardware, so that programmers can determine which writes are visible to the respective reads in other threads.
Unless a program in Java is properly synchronized, you can't guarantee that changes by one thread are immediately visible to other threads. Maybe the text refers to a very specific (and weak) memory model.
Usage of volatile variables is just one way to synchronize threads, and it's not suitable for all scenarios.
--Edit--
I think I understand the confusion now... I agree with David Schwartz, assuming that:
1) "modifies a process resource" means the actual change of the resource, not just the execution of a write instruction written in some high level computer language.
2) "is immediately visible to sibling threads" means that other threads are able to see it; it doesn't mean that a thread in your program will necessarily see it. You may still need to use synchronization tools in order to disable optimizations that bypass the actual access to the resource.
Background
I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.
Assumptions
The following pseudo-code is executed on a two-processor machine:
int sharedVar = 0;

myThread()
{
    print(sharedVar);
}

main()
{
    sharedVar = 1;
    spawnThread(myThread);
    sleep(-1);
}
main() executes on processor 1 (P1), while myThread() executes on P2.
Initially, sharedVar exists in the caches of both P1 and P2 with the initial value of 0 (due to some "warm-up code" that isn't shown above.)
Question
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.
References
These are the related articles and books I've been reading:
Shared Memory Consistency Models: A Tutorial
Memory Barriers: a Hardware View for Software Hackers
Linux Kernel Memory Barriers
Computer Architecture: A Quantitative Approach
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
Theoretically, it can print either 0 or 1, even on x86, since stores can move after loads on almost any architecture.
In practice, it would be hard to make myThread() print 0.
Spawning a thread will most likely function as an implicit store/release memory barrier, since it would probably:
- have at least one instruction along the execution path that results in a memory barrier (interlocked instructions, explicit memory-barrier instructions, etc.),
- or the store would simply be retired/drained from the store buffer by the time myThread() is called, since setting up a new thread results in executing many instructions, among them many stores.
I'll speak only to Java here: myThread() is guaranteed to print 1, due to the happens-before definition from the Java Language Specification (Section 17.4.5).
The write to sharedVar in main() happens before spawning the thread with function myThread() because the variable assignment comes first in program order. Next, spawning a thread happens before any actions in the thread being started. By the transitivity of the definition in Section 17.4.5 (hb(x, y) and hb(y, z) implies hb(x, z)), writing to the variable sharedVar happens before print() reads sharedVar in myThread().
You might also enjoy reading Brian Goetz's article Java theory and practice: Fixing the Java Memory Model, Part 2 covering this subject, as well as his book Java Concurrency in Practice.
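For comparison, C++11 gives an analogous guarantee: the completion of the std::thread constructor synchronizes with the start of the thread function, so this C++ rendering of the pseudo-code (a sketch; sleep(-1) replaced by join()) is also guaranteed to print 1:

#include <iostream>
#include <thread>

int sharedVar = 0;

void myThread() {
    // The write in main() happens-before the start of this thread,
    // so this is guaranteed to print 1.
    std::cout << sharedVar << '\n';
}

int main() {
    sharedVar = 1;
    std::thread t(myThread);
    t.join();
}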