Hazards of not protection shared variables in a threaded environment - multithreading

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write is not dependent on the previous value. So I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events and getting a read lock for each time is a little expensive. In this situation I don't care if the reader gets the value just before the write or just after, as long as the reader don't get garbage (that is an offset that was never set).
Say that the variable is a 32 bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or are writing a 32 bit integer an atomic operation? Will it depend on the Os or hardware? What a about a 64 bit integer on a 32 bit system?
What about shared memory instead of threading?

Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.

In additions to the above comments, beware the register bank in a slightly more general setting. You may end up updating only the cpu register and not really write it back to main memory right away. Or the other way around where you use a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "read-always-and-never-locally-register-cache".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.

An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to it's size (on 486 and later) and unaligned but within a cache line (on P6 and later). Most compilers will guarantee stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)

It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.

The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
new_index= ...; // find a free entry in the global_data array.
global_data[new_index]= blah;
WriteBarrier(); // write-release
global_index= new_index;
}
read()
{
read_index= global_index;
ReadBarrier(); // read-acquire
return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.

Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit,as in your example) - see the Interlocked* APIs on Windows.
This can avoid the use of a heavier weight lock for threadsafe variable or member access, but should not be mixed up with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.

Related

Can you have torn reads/writes between two threads pinned to different processors, if the system is cache coherent?

If you have two threads in the same processor, you can have a torn read/write.
For example, on a 32 bit system with thread 1 and thread 2 running on the same core:
Thread 1 assigns a 64 bit int 0xffffffffffffffff to a global variable X, which is initially zero.
The first 32 bits is set to the first 32 bits is set in X, now X is 0xffffffff00000000
Thread 2 reads X as 0xffffffff00000000
Thread 1 writes the last 32 bits.
The torn read happens in step 3.
But what if the following conditions are met:
Thread 1 and Thread 2 are pinned to different cores
The system uses MESI protocol to achieve cache coherence
In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?
Yes, you can have tearing.
A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.
Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half. (Because it received and RFO (read for ownership) from the writer core.) The first read could see the old value, the 2nd read could see the new value.
The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.
(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)
Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?
By delaying response to MESI requests to share or invalidate a line when it's in the middle of doing a pair of loads or pair of stores done with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation / low overall throughput progress is a problem.
A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.
But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
Further reading:
Atomicity on x86 - how exactly a single load or store can be atomic
Why is integer assignment on a naturally aligned variable atomic on x86?
Can num++ be atomic for 'int num'? - how atomic RMWs interact with MESI. (For x86-style single instructions like lock add [mem], eax. LL/SC machines just detect that they lost control of the cache line in there somewhere and report failure.)

Understanding how bootmem works

I have been studying OS concepts and decided to look in how these stuff are actually implemented in Linux. But I am having problem understanding some thing in relation with memory management during boot process before page_allocator is turned on, more precisely how bootmem works. I do not need the exact workings of it, but just an understanding how some things are/can be solved.
So obviously, bootmem cannot use dynamic memory, meaning that size he has must be known before runtime, so appropriate steps can be taken, i.e. the maximum size of his bitmap must be known in advance. From what I understand, this is most likely solved by simply mapping enough memory during kernel initialization, if architecture changes, simply change the size of the mapped memory. Obviously, there is probably a lot more going on, but I guess I got the general idea? However, what really makes no sense to me is NUMA architecture. Everywhere I read, it says that pg_data_t is created for each memory node. This pg_data is put into a list(how can it know the size of the list? Or is the size fixed for specific arch?) and for each node, bitmap is allocated. So, basically, it sounds like it can create undefined number of these pg_data, each of which has their memory bitmap of arbitrary size. How? What am I missing?
EDIT: Sorry for not including reference. Here is bootmem code, it can also be found in mm/bootmem.c: http://lxr.free-electrons.com/source/mm/bootmem.c
It is architecture-dependent. On the x86 architecture, early on in the boot process the kernel does issue one BIOS call - the 0xe820 function of the trap at Interrupt Vector 0x15. This returns a memory map that the kernel can use to build it's memory tables, including holes for non-memory (PCI or ISA) devices, etc. Bootloaders (before the kernel) will do the same.
See: Detecting Memory
After looking into this more, I think it works this way: basically, all the necessary things are statically allocated, i.e. by using preprocessor DEFINES it is ensured certain sections of bootmem (as well as other parts of the kernel) code either exist or do not exist in the compiled code for specific architecture (even though the code itself is architecture-independent). These DEFINES are specified in architecture-dependent sources codes found under arch/ (e.g. arch/i386, arch/arm/, etc.). For NUMA architectures there is a define called MAX_NUMNODES, ensuring that list of structs (more specifically, list of pg_data_t structures) representing nodes is allocated as a static array (which is then treated as a list). The bitmaps representing memory map are obviously relatively small, since each page is represented as only one bit, taking up KBs, or maybe MBs. Whatever the case, architecture dependent head.S sets-up all the necessary structures needed for system functioning (like page-tables) and ensure enough physical memory is mapped to virtual so these bitmaps can fit in it without causing page-fault (in case of x86 arch, initial 8MB of RAM is mapped, which is more then enough for both kernel and additional structures, like bitmaps).

multicore x86 read or write of misaligned data

On an x86, suppose I have a misaligned data item that spans a cache line boundary, say addresses 0x1fff through 0x2003 containing the little-endian 32-bit value 0x11223344. If thread A on core A does a write of 0x55667788 to that address, and thread B on core B "simultaneously" does a read of the same address, can that thread B potentially read a mix of the old and new value?
In other words, since A's misaligned write is going to be broken up by the processor into a one-byte write of 0x88 to address 0x1fff and a three-byte write of 0x556677 to address 0x2000, is it possible that B's read might happen in the middle of that misaligned write, and wind up reading 0x11223388 (or, if the write is split up in the reverse order, 0x55667711)? Obviously the desirable behavior is for the read to return either the old value or the new one, and I don't care which, but not a mixture.
Ideally I'm looking for not just an answer to the question, but an authoritative citation of specific supporting statements in the Intel or AMD architecture manuals.
I'm writing a simulator for a multiprocessor system which had an exotic processor architecture, and in that system there are strong guarantees of memory access atomicity even for misaligned data, so the scenario I describe can't happen. If I simulate each CPU as a separate thread on the x86, I need to ensure that it can't happen on the x86 either. The information I've read about memory access ordering guarantees on the x86 doesn't explicitly cover misaligned cases.
I posed the question because my attempt at testing it myself didn't turn up any instances in which the mixed read occurred. However, that turns out to be due to a bug in my test program, and once I fixed that, it happens all the time on an AMD FX-8350. On the other hand, if the misaligned data does not cross a cache line boundary, the problem does not seem to occur.
It appears that guaranteeing atomicity of misaligned reads and writes in my simulator will require either explicit locking or transactional memory (e.g., Intel's RTM).
My test program source code in C using pthreads is at:
https://gist.github.com/brouhaha/62f2178d12ec04a81078

Writing to adjacent array elements from different threads?

Are there any modern, common CPUs where it is unsafe to write to adjacent elements of an array concurrently from different threads? I'm especially interested in x86. You may assume that the compiler doesn't do anything obviously ridiculous to increase memory granularity, even if it's technically within the standard.
I'm interested in the case of writing arbitrarily large structs, not just native types.
Note:
Please don't mention the performance issues with regard to false sharing. I'm well aware of these, but they're of no practical importance for my use cases. I'm also aware of visibility issues with regard to data written from threads other than the reader. This is addressed in my code.
Clarification: This issue came up because on some processors (for example, old DEC Alphas) memory could only be addressed at word level. Therefore, writing to memory in non-word size increments (for example, single bytes) actually involved read-modify-write of the byte to be written plus some adjacent bytes under the hood. To visualize this, think about what's involved in writing to a single bit. You read the byte or word in, perform a bitwise operation on the whole thing, then write the whole thing back. Therefore, you can't safely write to adjacent bits concurrently from different threads.
It's also theoretically possible, though utterly silly, for a compiler to implement memory writes this way when the hardware doesn't require it. x86 can address single bytes, so it's mostly not an issue, but I'm trying to figure out if there's any weird corner case where it is. More generally, I want to know if writing to adjacent elements of an array from different threads is still a practical issue or mostly just a theoretical one that only applies to obscure/ancient hardware and/or really strange compilers.
Yet another edit: Here's a good reference that describes the issue I'm talking about:
http://my.safaribooksonline.com/book/programming/java/0321246780/threads-and-locks/ch17lev1sec6
Writing a native sized value (i.e. 1, 2, 4, or 8 bytes) is atomic (well, 8 bytes is only atomic on 64-bit machines). So, no. Writing a native type will always write as expected.
If you're writing multiple native types (i.e. looping to write an array) then it's possible to have an error if there's a bug in the operating system kernel or an interrupt handler that doesn't preserve the required registers.
Yes, definitely, writing a mis-aligned word that straddles the CPU cache line boundary is not atomic.

Does one assembler instruction always execute atomically? [duplicate]

This question already has answers here:
Can num++ be atomic for 'int num'?
(13 answers)
Closed 2 years ago.
Today I came across this question:
you have a code
static int counter = 0;
void worker() {
for (int i = 1; i <= 10; i++)
counter++;
}
If worker would be called from two different threads, what value will counter have after both of them are finished?
I know that actually it could be anything. But my internal guts tells me, that counter++ will most likely be translated into single assembler instruction, and if both threads are execute on the same core, counter will be 20.
But what if those threads are run on different cores or processors, could there be a race condition in their microcode? Is one assembler instruction could always be viewed as atomic operation?
Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:
inc counter
This translates into the following micro operations:
load counter to a hidden register on the CPU
increment the register
store the updated register in counter
This is essentially the same as:
mov eax, counter
inc eax
mov counter, eax
Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).
If you want to guarantee that this inc is atomic, use the lock prefix:
lock inc counter
lock guarantees that nobody can update counter between the load and the store.
Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
The answer is: it depends!
Here is some confusion around, what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The excemption is when you use macros -- but you should be aware of that.
That said, the question boils down is one machine instruction atomic?
In the good old days, it was. But today, with complex CPUs, long running instructions, hyperthreading, ... it is not. Some CPUs guarantee that some increment/decrement instructions are atomic. The reason is, that they are neat for very simple syncronizing.
Also some CPU commands are not so problematic. When you have a simple fetch (of one piece of data that the processor can fetch in one piece) -- the fetch itself is of course atomic, because there is nothing to be divided at all. But when you have unaligned data, it becomes complicated again.
The answer is: It depends. Carefully read the machine instruction manual of the vendor. In doubt, it is not!
Edit:
Oh, I saw it now, you also ask for ++counter. The statement "most likely to be translated" can not be trusted at all. This largely depends also on the compiler of course! It gets more difficult when the compiler is making different optimizations.
Not always - on some architectures one assembly instruction is translated into one machine code instruction, while on others it does not.
In addition - you can never assume that the program language you are using is compiling a seemingly simple line of code into one assembly instruction. Moreover, on some architectures, you cannot assume one machine code will execute atomically.
Use proper synchronization techniques instead, dependent on the language you are coding in.
Increment/decrement operations on 32-bit or less integer variables on a single 32-bit processor with no Hyper-Threading Technology are atomic.
On a processor with Hyper-Threading Technology or on a multi-processor system, the increment/decrement operations are NOT guaranteed to be executed atomicaly.
Invalidated by Nathan's comment:
If I remember my Intel x86 assembler correctly, the INC instruction only works for registers and does not directly work for memory locations.
So a counter++ would not be a single instruction in assembler (just ignoring the post-increment part). It would be at least three instructions: load counter variable to register, increment register, load register back to counter. And that is just for x86 architecture.
In short, don't rely on it being atomic unless it is specified by the language specification and that the compiler that you are using supports the specifications.
Another issue is that if you don't declare the variable as volatile, the code generated would probably not update the memory at every loop iteration, only at the end of the loop the memory would be updated.
No you cannot assume this. Unless it clearly stated in compiler specification. And moreover no one can guarantee that one single assembler instruction indeed atomic. In practice each assembler instruction is translated to number of microcode operation - uops.
Also the issue of race condition is tightly coupled with memory model(coherence, sequential, release coherence and etc.), for each one the answer and result could be different.
In most cases, no. In fact, on x86, you can perform the instruction
push [address]
which, in C, would be something like:
*stack-- = *address;
This performs two memory transfers in one instruction.
That's basically impossible to do in 1 clock cycle, not the least because one memory transfer is also not possible in one cycle!
Might not be an actual answer to your question, but (assuming this is C#, or another .NET language) if you want counter++ to really be multi-threaded atomic, you could use System.Threading.Interlocked.Increment(counter).
See other answers for actual information on the many different ways why/how counter++ could not be atomic. ;-)
On many other processors, the seperation between memory system and processor is bigger. (often these processor can be little or big-endian depending on memory system, like ARM and PowerPC), this also has consequences for atomic behaviour if the memory system can reorder reads and writes.
For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier)
So in short, while atomic instructions are enough on intel (with the relevant lock prefixes), more must be done on non-intel, since the memory I/O might not be in the same order.
This is a known problem when porting "lock-free" solutions from Intel to other architectures.
(Note that multiprocessor (not multicore) systems on x86 also seem to need memory barriers, at least in 64-bit mode.
I think that you'll get a race condition on access.
If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.

Resources