Does one assembler instruction always execute atomically? [duplicate] - multithreading

This question already has answers here:
Can num++ be atomic for 'int num'?
Today I came across this question:
You have this code:
static int counter = 0;
void worker() {
    for (int i = 1; i <= 10; i++)
        counter++;
}
If worker were called from two different threads, what value would counter have after both of them finished?
I know that in practice it could be anything. But my gut tells me that counter++ will most likely be translated into a single assembler instruction, and that if both threads execute on the same core, counter will be 20.
But what if those threads run on different cores or processors? Could there be a race condition in their microcode? Can a single assembler instruction always be viewed as an atomic operation?

Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:
inc counter
This translates into the following micro operations:
load counter to a hidden register on the CPU
increment the register
store the updated register in counter
This is essentially the same as:
mov eax, counter
inc eax
mov counter, eax
Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).
If you want to guarantee that this inc is atomic, use the lock prefix:
lock inc counter
lock guarantees that nobody can update counter between the load and the store.
Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
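If you are writing in C or C++ rather than assembly, the usual way to get that lock-prefixed increment is to let the compiler emit it for you. Here is a minimal sketch of the question's worker using std::atomic, assuming C++11 or later (the thread setup in main is mine, added only to make the example self-contained):

#include <atomic>
#include <thread>

static std::atomic<int> counter{0};

void worker() {
    for (int i = 1; i <= 10; i++)
        counter.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write; on x86 typically a `lock add`
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return counter.load();  // now guaranteed to be 20
}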

The answer is: it depends!
There is some confusion about what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The exception is when you use macros -- but you should be aware of that.
That said, the question boils down to: is one machine instruction atomic?
In the good old days, it was. But today, with complex CPUs, long-running instructions, hyper-threading, ... it is not. Some CPUs do guarantee that certain increment/decrement instructions are atomic, because these are handy for very simple synchronization.
Some CPU operations are also not so problematic. When you have a simple fetch (of one piece of data that the processor can fetch in one piece), the fetch itself is of course atomic, because there is nothing to be divided at all. But when you have unaligned data, it becomes complicated again.
The answer is: it depends. Read the vendor's machine instruction manual carefully. When in doubt, it is not atomic!
Edit:
Oh, I see now that you also asked about counter++ specifically. The statement "most likely to be translated" cannot be trusted at all. This also depends heavily on the compiler, of course, and it gets even harder to predict once the compiler applies different optimizations.

Not always - on some architectures one assembly instruction is translated into one machine-code instruction, while on others it is not.
In addition, you can never assume that the programming language you are using compiles a seemingly simple line of code into a single assembly instruction. Moreover, on some architectures you cannot assume that a single machine-code instruction executes atomically.
Use proper synchronization techniques instead, dependent on the language you are coding in.
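As one example of such a technique, here is a hedged sketch in C++ using a std::mutex; most languages offer an equivalent lock:

#include <mutex>

static int counter = 0;
static std::mutex counter_mutex;

void worker() {
    for (int i = 1; i <= 10; i++) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // serialize access to counter
        counter++;
    }
}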

Increment/decrement operations on integer variables of 32 bits or fewer, on a single 32-bit processor with no Hyper-Threading Technology, are atomic.
On a processor with Hyper-Threading Technology, or on a multi-processor system, the increment/decrement operations are NOT guaranteed to be executed atomically.

Invalidated by Nathan's comment:
If I remember my Intel x86 assembler correctly, the INC instruction only works for registers and does not directly work for memory locations.
So a counter++ would not be a single instruction in assembler (just ignoring the post-increment part). It would be at least three instructions: load the counter variable into a register, increment the register, store the register back to counter. And that is just for the x86 architecture.
In short, don't rely on it being atomic unless it is specified by the language specification and the compiler you are using supports that specification.

Another issue is that if you don't declare the variable as volatile, the generated code would probably not update memory on every loop iteration; memory might only be updated at the end of the loop.

No, you cannot assume this unless it is clearly stated in the compiler specification. Moreover, no one can guarantee that a single assembler instruction is indeed atomic. In practice, each assembler instruction is translated into a number of micro-operations (uops).
The issue of race conditions is also tightly coupled with the memory model (cache coherence, sequential consistency, release consistency, etc.); for each of these the answer and the result could be different.

In most cases, no. In fact, on x86, you can perform the instruction
push [address]
which, in C, would be something like:
*--stack = *address;
This performs two memory transfers in one instruction.
That's basically impossible to do in one clock cycle, not least because even a single memory transfer cannot complete in one cycle!

Might not be an actual answer to your question, but (assuming this is C#, or another .NET language) if you want counter++ to really be atomic across threads, you could use System.Threading.Interlocked.Increment(ref counter).
See other answers for actual information on the many different ways why/how counter++ could not be atomic. ;-)

On many other processors, the separation between the memory system and the processor is bigger. (Often these processors can be little- or big-endian depending on the memory system, like ARM and PowerPC.) This also has consequences for atomic behaviour if the memory system can reorder reads and writes.
For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier).
So in short, while atomic instructions are enough on Intel (with the relevant lock prefixes), more must be done on non-Intel architectures, since the memory I/O might not happen in the same order.
This is a known problem when porting "lock-free" solutions from Intel to other architectures.
(Note that multiprocessor (not just multicore) systems on x86 also seem to need memory barriers, at least in 64-bit mode.)
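To make that concrete, here is a small sketch (C++11 assumed, names are mine) of the publish/consume pattern the barriers are there for. On x86 the release store compiles to a plain store; on ARM or PowerPC the same source makes the compiler emit the necessary barrier or ordered-store instructions:

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);   // earlier writes may not sink below this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire))  // later reads may not hoist above this
        ;                                           // spin until published
    return payload;                                 // guaranteed to observe 42
}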

I think that you'll get a race condition on access.
If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.

Related

Can you have torn reads/writes between two threads pinned to different processors, if the system is cache coherent?

If you have two threads on the same processor, you can have a torn read/write.
For example, on a 32-bit system with thread 1 and thread 2 running on the same core:
1. Thread 1 assigns the 64-bit int 0xffffffffffffffff to a global variable X, which is initially zero.
2. The first 32 bits are written, so X is now 0xffffffff00000000.
3. Thread 2 reads X as 0xffffffff00000000.
4. Thread 1 writes the last 32 bits.
The torn read happens in step 3.
But what if the following conditions are met:
Thread 1 and Thread 2 are pinned to different cores
The system uses the MESI protocol to achieve cache coherence
In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?
Yes, you can have tearing.
A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.
Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half (because it received an RFO (read for ownership) from the writer core). The first read could see the old value, the 2nd read could see the new value.
The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.
(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)
Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?
By delaying the response to MESI requests to share or invalidate a line while it's in the middle of doing a pair of loads or a pair of stores with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding; otherwise starvation or poor overall forward progress becomes a problem.
A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.
But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
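In portable code, the way to rule tearing out on either side is to make both the load and the store a single atomic access. A minimal C++ sketch (my example, not from the question); on a 32-bit target the library may implement it with a load-pair/store-pair, LL/SC, or even a lock, and std::atomic::is_lock_free() tells you which:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> X{0};

void writer() {
    X.store(0xffffffffffffffffULL, std::memory_order_relaxed);  // one atomic 64-bit store
}

uint64_t reader() {
    return X.load(std::memory_order_relaxed);  // sees all zeros or all ones, never a mix
}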
Further reading:
Atomicity on x86 - how exactly a single load or store can be atomic
Why is integer assignment on a naturally aligned variable atomic on x86?
Can num++ be atomic for 'int num'? - how atomic RMWs interact with MESI. (For x86-style single instructions like lock add [mem], eax. LL/SC machines just detect that they lost control of the cache line in there somewhere and report failure.)

Memory Protection Keys Memory Reordering

Reading Intel's SDM about Memory Protection Keys (MPK) doesn't suggest that the wrpkru instruction is serializing or that it enforces memory ordering implicitly.
First, it would be surprising if it did not enforce some sort of ordering; one would suspect the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(only StoreLoad reordering is allowed to be visible by other threads. Internally a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load.
This is what the Memory Order Buffer does.)
In C/C++, of course, you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, as for pthread_mutex_lock(); see How do mutex lock and unlock functions prevent CPU reordering?
The earlier part of this answer is about ordering in assembly.
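If the PKRU write does get inlined and you only need to stop the compiler (not the CPU) from moving memory accesses across it, a compiler-only barrier is the usual tool. A hedged sketch for GCC/Clang on x86; check your compiler and libc for a supported intrinsic or wrapper rather than copying this:

#include <cstdint>

static inline void write_pkru(uint32_t value) {
    // EAX holds the new PKRU value; ECX and EDX must be zero.
    // The "memory" clobber stops the compiler from reordering or caching
    // memory accesses across this statement; it emits no CPU fence.
    asm volatile("wrpkru" : : "a"(value), "c"(0), "d"(0) : "memory");
}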

multicore x86 read or write of misaligned data

On an x86, suppose I have a misaligned data item that spans a cache line boundary, say addresses 0x1fff through 0x2003 containing the little-endian 32-bit value 0x11223344. If thread A on core A does a write of 0x55667788 to that address, and thread B on core B "simultaneously" does a read of the same address, can that thread B potentially read a mix of the old and new value?
In other words, since A's misaligned write is going to be broken up by the processor into a one-byte write of 0x88 to address 0x1fff and a three-byte write of 0x556677 to address 0x2000, is it possible that B's read might happen in the middle of that misaligned write, and wind up reading 0x11223388 (or, if the write is split up in the reverse order, 0x55667711)? Obviously the desirable behavior is for the read to return either the old value or the new one, and I don't care which, but not a mixture.
Ideally I'm looking for not just an answer to the question, but an authoritative citation of specific supporting statements in the Intel or AMD architecture manuals.
I'm writing a simulator for a multiprocessor system which had an exotic processor architecture, and in that system there are strong guarantees of memory access atomicity even for misaligned data, so the scenario I describe can't happen. If I simulate each CPU as a separate thread on the x86, I need to ensure that it can't happen on the x86 either. The information I've read about memory access ordering guarantees on the x86 doesn't explicitly cover misaligned cases.
I posed the question because my attempt at testing it myself didn't turn up any instances in which the mixed read occurred. However, that turns out to be due to a bug in my test program, and once I fixed that, it happens all the time on an AMD FX-8350. On the other hand, if the misaligned data does not cross a cache line boundary, the problem does not seem to occur.
It appears that guaranteeing atomicity of misaligned reads and writes in my simulator will require either explicit locking or transactional memory (e.g., Intel's RTM).
My test program source code in C using pthreads is at:
https://gist.github.com/brouhaha/62f2178d12ec04a81078
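For reference, here is a condensed sketch of the same experiment (not the author's gist; it uses C++ threads instead of pthreads, and the misaligned access through a cast is deliberate and technically undefined behavior, which is exactly what the test is probing):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

alignas(64) static unsigned char buf[128];          // spans two cache lines
static volatile uint32_t* const word =
    reinterpret_cast<volatile uint32_t*>(buf + 62); // 32-bit value straddling the boundary
static std::atomic<bool> stop_flag{false};

int main() {
    *word = 0x11223344u;
    std::thread writer([] {
        while (!stop_flag.load(std::memory_order_relaxed))
            *word = (*word == 0x11223344u) ? 0x55667788u : 0x11223344u;
    });
    long torn = 0;
    for (long i = 0; i < 100000000; ++i) {
        uint32_t v = *word;                         // misaligned, line-crossing read
        if (v != 0x11223344u && v != 0x55667788u)   // a mix of old and new bytes
            ++torn;
    }
    stop_flag.store(true, std::memory_order_relaxed);
    writer.join();
    std::printf("torn reads observed: %ld\n", torn);
}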

Assumption on machine instruction in the sense of multi-thread

Can I assume that each instruction executes atomically? For example,
mov dword ptr [eax], 0
The move either completes successfully or doesn't happen at all; there is no interruption in the middle of the instruction's execution.
Is my assumption right?
I know that current processors can execute instructions out of order, and compilers will reorder code when optimizing, so the move may be executed in an order different from what I wrote. But that doesn't matter; what I am concerned with is whether a single instruction, once it executes, can be interrupted.
EDIT:
What I am concerned with is the atomicity of any single instruction, not one special instruction or a particular class of read-write instructions; I just use mov as an example.
Any time a core of the processor executes an instruction (add, mov, shift, etc.):
Can the execution be interrupted?
Is there any indeterminate state in the register or in (machine-word-sized) memory?
Or, what is the smallest unit for which the hardware can provide atomicity?
NO. You generally should not assume that instructions are atomic. With regard to loading a register with a constant, why would that matter anyway? Are you asking if the register can end up in an indeterminate state? The answer to that is no; otherwise interrupts wouldn't work. The register would either be loaded or not loaded from the viewpoint of a program running on the same core.
The LOCK prefix in x86 is there to ensure atomicity.
EDIT: Question has been edited to show storing a constant into memory.
And my answer is still generally no. There may be some situations where, if the memory is aligned and the CPU makes that guarantee, the store will be atomic, but I wouldn't rely on it, as it could get you into trouble.
Also see here:
Read/Write an int on x86 machine without lock
You've tagged this question C and C++, but neither language has a mov instruction or any notion of individual instructions. If your question is really about x86 assembly, then yes, mov instructions are atomic, at least as long as the memory operand is aligned, and probably even if it's not. Note, however, that this has nothing to do with whether assignment to C or C++ variables is atomic.
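If the real goal is a store that the language itself guarantees to be atomic, rather than a guess about what mov does, here is a short C++ sketch of the same store through std::atomic (my names, not from the question):

#include <atomic>

std::atomic<int> flag{0};

void clear_flag() {
    // On x86 this typically still compiles to a plain mov, but the atomicity
    // is now promised by the language rather than assumed from the instruction.
    flag.store(0, std::memory_order_relaxed);
}

int read_flag() {
    return flag.load(std::memory_order_relaxed);
}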

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write does not depend on the previous value? Say I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and taking a read lock each time is a little expensive. In this situation I don't care if a reader gets the value just before the write or just after, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write, or is writing a 32-bit integer an atomic operation? Will it depend on the OS or the hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
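For the single-writer timestamp-offset case in the question, here is a hedged C++11 sketch (identifiers are mine) that avoids both torn values and locks on the reader side:

#include <atomic>
#include <cstdint>

std::atomic<int32_t> g_offset{0};   // 32-bit offset: a single atomic store, no tearing

void writer_tick(int32_t new_offset) {
    g_offset.store(new_offset, std::memory_order_relaxed);  // called once per second
}

int64_t timestamp(int64_t local_time) {
    // Readers pay no lock cost; on x86 this is an ordinary aligned load.
    return local_time + g_offset.load(std::memory_order_relaxed);
}

If the offset had to be 64 bits on a 32-bit system, std::atomic<int64_t> would still be correct, though it might be implemented with an internal lock; is_lock_free() will tell you.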
In addition to the above comments, beware of registers in a slightly more general setting. You may end up updating only a CPU register and not actually writing the value back to main memory right away, or the other way around, where you use a cached register copy while the original value in memory has already been updated. Some languages have a volatile keyword to mark a variable as "always read from memory, never cache locally in a register".
The memory model of your language is also important. It describes exactly under what conditions a given value is shared among several threads. Either these are the rules of the CPU architecture you are executing on, or they are determined by a virtual machine in which the language is running. Java, for instance, has a separate memory model you can look at to figure out exactly what to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on 486 and later), or if it is unaligned but within a cache line (on P6 and later). Most compilers will guarantee that stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index = ...;  // find a free entry in the global_data array.
    global_data[new_index] = blah;
    WriteBarrier();   // write-release
    global_index = new_index;
}

read()
{
    read_index = global_index;
    ReadBarrier();    // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
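For comparison, here is a rough C++11 rendering of the same idea with two buffers (my sketch, not the answer's exact code; it keeps the reuse/ABA caveat mentioned above):

#include <atomic>

struct Sample { long long offset; };   // hypothetical payload

Sample           global_data[2];
std::atomic<int> global_index{0};

void write(const Sample& blah) {
    int new_index = 1 - global_index.load(std::memory_order_relaxed);  // the slot not currently published
    global_data[new_index] = blah;
    global_index.store(new_index, std::memory_order_release);          // write-release
}

Sample read() {
    int read_index = global_index.load(std::memory_order_acquire);     // read-acquire
    return global_data[read_index];
}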
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example) - see the Interlocked* APIs on Windows.
This can avoid the use of a heavier weight lock for threadsafe variable or member access, but should not be mixed up with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.
