If you have two threads in the same processor, you can have a torn read/write.
For example, on a 32 bit system with thread 1 and thread 2 running on the same core:
Thread 1 assigns a 64 bit int 0xffffffffffffffff to a global variable X, which is initially zero.
The first 32 bits is set to the first 32 bits is set in X, now X is 0xffffffff00000000
Thread 2 reads X as 0xffffffff00000000
Thread 1 writes the last 32 bits.
The torn read happens in step 3.
But what if the following conditions are met:
Thread 1 and Thread 2 are pinned to different cores
The system uses MESI protocol to achieve cache coherence
In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?
Yes, you can have tearing.
A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.
Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half. (Because it received and RFO (read for ownership) from the writer core.) The first read could see the old value, the 2nd read could see the new value.
The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.
(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)
Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?
By delaying response to MESI requests to share or invalidate a line when it's in the middle of doing a pair of loads or pair of stores done with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation / low overall throughput progress is a problem.
A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.
But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
Further reading:
Atomicity on x86 - how exactly a single load or store can be atomic
Why is integer assignment on a naturally aligned variable atomic on x86?
Can num++ be atomic for 'int num'? - how atomic RMWs interact with MESI. (For x86-style single instructions like lock add [mem], eax. LL/SC machines just detect that they lost control of the cache line in there somewhere and report failure.)
how compiler makes sure that the red zone is not clobbered? Is there any overallocation of space?
And what factors lead to choosing 128 byte as the size of red zone?
The compiler doesn't, it just takes advantage of the guarantee that space below RSP won't be asynchronously clobbered (e.g. by signal handlers). Making a function call will of course synchronously clobber it.
In fact, on Linux only signal handlers run asynchronously in user-space code. (The kernel stack gets interrupts: Why can't kernel code use a Red Zone)
The kernel implements the red-zone when delivering signals to user-space. I think that's about it; it's really pretty easy to implement.
The other thing that's relevant is when a debugger runs a function when you do something like print foo(123) in GDB. GDB will actually run that function using the stack of the current thread. In an ABI with a red-zone, GDB (or any other debugger) has to respect it when invoking that function by doing rsp -= 128 after saving register state to restore when the user doing continue or single-step.
In i386 System V, print foo(123) will use space right below the current ESP, stepping on whatever was below ESP. (I think; not tested).
And what factors lead to choosing 128 byte as the size of red zone?
A signed byte displacement in an addressing mode like [rsp - 128] can reach that far. IIRC, the amd64.org mailing archive I was looking through while answering Why does Windows64 use a different calling convention from all other OSes on x86-64? actually included a message citing that as the reason for that specific choice.
You want it to be large enough that many simple leaf functions don't need to move RSP. e.g. at least 16 or 32 bytes, like the 32-byte shadow space in MS's Windows x64 calling convention.
You want it to be small enough that skipping over it to invoke a signal handler doesn't need to touch huge amounts more space, like new pages. So much less than 4kB.
A leaf function that needs more than 128 bytes of locals is probably big enough that moving RSP is a drop in the bucket. And then the +-disp8 addressing mode benefit comes into play, giving access to a whole 256 bytes of space with compact addressing modes from byte [rsp+127] to byte [rsp-128] or in dword/qword chunks.
Further reading
Reading why it's not safe to use space below ESP on Windows, or Linux without a red-zone, is illuminating.
Raymond Chen's blog: Why do we even need to define a red zone? Can’t I just use my stack for anything?
Also my SO answer covers some of the same ground: Is it valid to write below ESP? (but with more guesswork and less interesting Windows details than Raymond.)
Reading Intel's SDM about Memory protection keys (MPK) doesn't suggest wrpkru instruction as being a serializing, or enforcing memory ordering implicitly.
First, it is surprising if it is not enforcing some sort of ordering, as one would suspect the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(only StoreLoad reordering is allowed to be visible by other threads. Internally a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load.
This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(). How does a mutex lock and unlock functions prevents CPU reordering?.
The earlier part of this answer is about ordering in assembly.
I am studying about how CPU changes from user mode to kernel mode in linux. I came across two different methods: Interrupts and using sysenter.
I could not understand how sysenter works. Could someone please explain what exactly happens in the cpu when the sysenter instruction is run?
The problem that a program faces when it wants to get into the kernel (aka "making syscalls") is that user programs cannot access anything kernel-related, yet the program has to somehow switch the CPU into "kernel mode".
On an interrupt, this is done by the hardware.
It also happens automatically when a (CPU-, not C++) exception occurs, like accessing memory that doesn't exist, a divison by zero, or invoking a privileged instruction in user code. Or trying to execute an unimplemented instruction. This last thing is actually a decent way to implement a "call the kernel" interface: CPU runs on an instruction that the CPU doesn't know, so it raises an exception which drops the CPU into kernel mode and into the kernel. The kernel code could then check whether the "correct" unmiplemented instruction was used and perform the syscall stuff if it was, or just kill the process if it was any other unimplemented instruction.
Of course, doing something like this isn't, well, "clean". It's more like a dirty hack, abusing what should be an error to implement a perfectly valid control flow change. Hence, CPUs do tend to have actual instructions to do essentially the same thing, just in a more "defined" way. The main purpose of anything like a "sysenter" instruction is still the same: it changes the CPU into "kernel mode", saves the position where the "sysenter" was called, and continues execution somewhere in the kernel.
As for the difference between a "software interrupt" and "sysenter": "sysenter" is specifically optimized for this kind of use case. For example, it doesn't get the kernel address to call from memory like a (software-)interrupt does, but instead uses a special register to get the address from, which saves the memory address lookup. It might also have additional optimizations internally, based on the fact that software-interrupts might be handled more like interrupts, and the sysenter instruction doesn't actually need that. I don't know the precise details of the implementations of these instructions on the CPUs, you would probably have to read the Intel manuals to really get into such details.
This question already has answers here:
Can num++ be atomic for 'int num'?
(13 answers)
Closed 2 years ago.
Today I came across this question:
you have a code
static int counter = 0;
void worker() {
for (int i = 1; i <= 10; i++)
counter++;
}
If worker would be called from two different threads, what value will counter have after both of them are finished?
I know that actually it could be anything. But my internal guts tells me, that counter++ will most likely be translated into single assembler instruction, and if both threads are execute on the same core, counter will be 20.
But what if those threads are run on different cores or processors, could there be a race condition in their microcode? Is one assembler instruction could always be viewed as atomic operation?
Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:
inc counter
This translates into the following micro operations:
load counter to a hidden register on the CPU
increment the register
store the updated register in counter
This is essentially the same as:
mov eax, counter
inc eax
mov counter, eax
Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).
If you want to guarantee that this inc is atomic, use the lock prefix:
lock inc counter
lock guarantees that nobody can update counter between the load and the store.
Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
The answer is: it depends!
Here is some confusion around, what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The excemption is when you use macros -- but you should be aware of that.
That said, the question boils down is one machine instruction atomic?
In the good old days, it was. But today, with complex CPUs, long running instructions, hyperthreading, ... it is not. Some CPUs guarantee that some increment/decrement instructions are atomic. The reason is, that they are neat for very simple syncronizing.
Also some CPU commands are not so problematic. When you have a simple fetch (of one piece of data that the processor can fetch in one piece) -- the fetch itself is of course atomic, because there is nothing to be divided at all. But when you have unaligned data, it becomes complicated again.
The answer is: It depends. Carefully read the machine instruction manual of the vendor. In doubt, it is not!
Edit:
Oh, I saw it now, you also ask for ++counter. The statement "most likely to be translated" can not be trusted at all. This largely depends also on the compiler of course! It gets more difficult when the compiler is making different optimizations.
Not always - on some architectures one assembly instruction is translated into one machine code instruction, while on others it does not.
In addition - you can never assume that the program language you are using is compiling a seemingly simple line of code into one assembly instruction. Moreover, on some architectures, you cannot assume one machine code will execute atomically.
Use proper synchronization techniques instead, dependent on the language you are coding in.
Increment/decrement operations on 32-bit or less integer variables on a single 32-bit processor with no Hyper-Threading Technology are atomic.
On a processor with Hyper-Threading Technology or on a multi-processor system, the increment/decrement operations are NOT guaranteed to be executed atomicaly.
Invalidated by Nathan's comment:
If I remember my Intel x86 assembler correctly, the INC instruction only works for registers and does not directly work for memory locations.
So a counter++ would not be a single instruction in assembler (just ignoring the post-increment part). It would be at least three instructions: load counter variable to register, increment register, load register back to counter. And that is just for x86 architecture.
In short, don't rely on it being atomic unless it is specified by the language specification and that the compiler that you are using supports the specifications.
Another issue is that if you don't declare the variable as volatile, the code generated would probably not update the memory at every loop iteration, only at the end of the loop the memory would be updated.
No you cannot assume this. Unless it clearly stated in compiler specification. And moreover no one can guarantee that one single assembler instruction indeed atomic. In practice each assembler instruction is translated to number of microcode operation - uops.
Also the issue of race condition is tightly coupled with memory model(coherence, sequential, release coherence and etc.), for each one the answer and result could be different.
In most cases, no. In fact, on x86, you can perform the instruction
push [address]
which, in C, would be something like:
*stack-- = *address;
This performs two memory transfers in one instruction.
That's basically impossible to do in 1 clock cycle, not the least because one memory transfer is also not possible in one cycle!
Might not be an actual answer to your question, but (assuming this is C#, or another .NET language) if you want counter++ to really be multi-threaded atomic, you could use System.Threading.Interlocked.Increment(counter).
See other answers for actual information on the many different ways why/how counter++ could not be atomic. ;-)
On many other processors, the seperation between memory system and processor is bigger. (often these processor can be little or big-endian depending on memory system, like ARM and PowerPC), this also has consequences for atomic behaviour if the memory system can reorder reads and writes.
For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier)
So in short, while atomic instructions are enough on intel (with the relevant lock prefixes), more must be done on non-intel, since the memory I/O might not be in the same order.
This is a known problem when porting "lock-free" solutions from Intel to other architectures.
(Note that multiprocessor (not multicore) systems on x86 also seem to need memory barriers, at least in 64-bit mode.
I think that you'll get a race condition on access.
If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.