multicore x86 read or write of misaligned data - multithreading

On an x86, suppose I have a misaligned data item that spans a cache line boundary, say addresses 0x1fff through 0x2002, containing the little-endian 32-bit value 0x11223344. If thread A on core A does a write of 0x55667788 to that address, and thread B on core B "simultaneously" does a read of the same address, can thread B potentially read a mix of the old and new values?
In other words, since A's misaligned write is going to be broken up by the processor into a one-byte write of 0x88 to address 0x1fff and a three-byte write of 0x556677 to address 0x2000, is it possible that B's read might happen in the middle of that misaligned write and wind up reading 0x11223388 (or, if the write is split up in the reverse order, 0x55667744)? Obviously the desirable behavior is for the read to return either the old value or the new one, and I don't care which, but not a mixture.
Ideally I'm looking for not just an answer to the question, but an authoritative citation of specific supporting statements in the Intel or AMD architecture manuals.
I'm writing a simulator for a multiprocessor system which had an exotic processor architecture, and in that system there are strong guarantees of memory access atomicity even for misaligned data, so the scenario I describe can't happen. If I simulate each CPU as a separate thread on the x86, I need to ensure that it can't happen on the x86 either. The information I've read about memory access ordering guarantees on the x86 doesn't explicitly cover misaligned cases.

I posed the question because my attempt at testing it myself didn't turn up any instances in which the mixed read occurred. However, that turns out to be due to a bug in my test program, and once I fixed that, it happens all the time on an AMD FX-8350. On the other hand, if the misaligned data does not cross a cache line boundary, the problem does not seem to occur.
It appears that guaranteeing atomicity of misaligned reads and writes in my simulator will require either explicit locking or transactional memory (e.g., Intel's RTM).
My test program source code in C using pthreads is at:
https://gist.github.com/brouhaha/62f2178d12ec04a81078
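For reference, the core of such a test looks roughly like the following. This is a minimal sketch of the same idea, not the code in the gist; the buffer layout, constants and loop count are my own assumptions.

/* Minimal sketch of a torn-read test. Assumptions: Linux, GCC, 64-byte cache
 * lines. A 32-bit value is placed at offset 62 of a 64-byte-aligned buffer so
 * that it straddles a cache-line boundary. The writer alternates between two
 * values; the reader flags any read that is a mix of the two. The misaligned
 * access itself is technically undefined behavior in C; on x86 with GCC it
 * compiles to a plain 32-bit load/store, which is exactly what we want to test.
 * Build with: gcc -O2 -pthread torn.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define OLD_VAL 0x11223344u
#define NEW_VAL 0x55667788u

static _Alignas(64) volatile unsigned char buf[128];
static volatile uint32_t *p;            /* will point at buf + 62 */
static volatile int done;

static void *writer(void *arg)
{
    (void)arg;
    while (!done) {
        *p = OLD_VAL;
        *p = NEW_VAL;
    }
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L && !done; i++) {
        uint32_t v = *p;
        if (v != OLD_VAL && v != NEW_VAL) {
            printf("torn read: 0x%08x\n", v);
            break;
        }
    }
    done = 1;
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    p = (volatile uint32_t *)(buf + 62);  /* spans the cache-line boundary at buf + 64 */
    *p = OLD_VAL;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    return 0;
}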

Related

Can you have torn reads/writes between two threads pinned to different processors, if the system is cache coherent?

If you have two threads in the same processor, you can have a torn read/write.
For example, on a 32 bit system with thread 1 and thread 2 running on the same core:
1. Thread 1 assigns a 64-bit int 0xffffffffffffffff to a global variable X, which is initially zero.
2. The first 32 bits are written first; X is now 0xffffffff00000000.
3. Thread 2 reads X as 0xffffffff00000000.
4. Thread 1 writes the last 32 bits.
The torn read happens in step 3.
But what if the following conditions are met:
Thread 1 and Thread 2 are pinned to different cores
The system uses MESI protocol to achieve cache coherence
In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?
Yes, you can have tearing.
A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.
Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half (because it received an RFO (Read For Ownership) from the writer core). The first read could see the old value, the 2nd read could see the new value.
The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.
(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)
Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?
By delaying response to MESI requests to share or invalidate a line when it's in the middle of doing a pair of loads or a pair of stores done with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation / lack of forward progress becomes a problem.
A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.
But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
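To tie this back to C: on a 32-bit x86 target, a plain 64-bit store typically compiles to two 32-bit mov instructions and can tear exactly as described, while a C11 atomic store is required to be indivisible. A minimal sketch of my own, under that 32-bit assumption:

/* Sketch: on a 32-bit x86 target, the plain store typically becomes two 32-bit
 * stores and can tear; the _Atomic store must not tear (the compiler may use
 * cmpxchg8b, an SSE/x87 8-byte store, or a libatomic helper to achieve that). */
#include <stdatomic.h>
#include <stdint.h>

uint64_t         plain_x;
_Atomic uint64_t atomic_x;

void writer(void)
{
    plain_x = 0xffffffffffffffffULL;               /* may be written as two 32-bit halves */
    atomic_store_explicit(&atomic_x, 0xffffffffffffffffULL,
                          memory_order_relaxed);   /* guaranteed not to tear */
}

uint64_t reader(void)
{
    return atomic_load_explicit(&atomic_x, memory_order_relaxed);
}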
Further reading:
Atomicity on x86 - how exactly a single load or store can be atomic
Why is integer assignment on a naturally aligned variable atomic on x86?
Can num++ be atomic for 'int num'? - how atomic RMWs interact with MESI. (For x86-style single instructions like lock add [mem], eax. LL/SC machines just detect that they lost control of the cache line in there somewhere and report failure.)

Memory Protection Keys Memory Reordering

Reading Intel's SDM on Memory Protection Keys (MPK), I don't see the wrpkru instruction described as serializing or as implicitly enforcing memory ordering.
First, it would be surprising if it did not enforce some sort of ordering, as one would expect that the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(only StoreLoad reordering is allowed to be visible by other threads. Internally a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load.
This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(). How does a mutex lock and unlock functions prevents CPU reordering?.
The earlier part of this answer is about ordering in assembly.
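If you roll your own inlineable wrapper instead of relying on a non-inline call, a "memory" clobber on the asm statement provides that compile-time barrier. A sketch (not glibc's actual code; the function name is mine):

/* Sketch of a WRPKRU wrapper.  WRPKRU takes the new PKRU value in EAX and
 * requires ECX = EDX = 0.  The "memory" clobber is purely a compile-time
 * barrier: it keeps the compiler from moving memory accesses across the
 * protection-key change even if this function is inlined, and it emits no
 * extra instructions. */
#include <stdint.h>

static inline void write_pkru(uint32_t pkru)
{
    __asm__ volatile ("wrpkru"
                      : /* no outputs */
                      : "a" (pkru), "c" (0), "d" (0)
                      : "memory");
}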

How to decide when to use memory barrier

As part of writing driver code, I have come across code which uses memory barriers (fencing). After reading and searching Google, I learnt why they are used and how they help on SMP. Thinking this through, in multi-threaded programming we would find many instances where there are memory races, and putting a barrier in all those places would cost CPU time. I was wondering:
I know the specific code paths which use shared memory to access data; do I need a memory barrier in all these places?
Is there any specific technique or tip which will help me identify these pitfalls?
This is a very generic question, but I wanted to get insight from others' experiences and any tips that would help identify such pitfalls.
Often device hardware is sensitive to the order in which device registers are written. Modern systems are weakly ordered and typically have write-combining hardware between the CPU and memory.
Suppose you write a single byte of a 32-bit object. What is in the write-combining hardware is now A _ _ _. Instead of immediately initiating a read-modify-write cycle to update the A byte, the hardware sets a timer. The hope is that the CPU will send the B, C, and D bytes before the timer expires. When the timer expires, the data in the write-combining register gets dumped into memory.
Setting a barrier causes the write-combining hardware to use what it has. If only slot A is occupied then only slot A gets written.
Now suppose the hardware expected the bytes to be written in the strict order A, C, B, D. Without precautions, the hardware registers get written in the wrong order. The result is what you expect: Ready! Fire! Aim!
Barriers should be placed judiciously because their incorrect use can seriously impede performance. Not every device write needs a barrier; judgement is called for.
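As a sketched illustration of that judgement in Linux driver terms (the device, descriptor layout and register offset below are invented, not from any real driver): put the barrier between the in-memory descriptor writes and the doorbell write, rather than after every store.

/* Sketch only: hypothetical device, hypothetical doorbell offset. */
#include <linux/io.h>
#include <linux/types.h>

struct my_desc {
    u64 dma_addr;
    u32 len;
    u32 flags;
};

static void my_dev_kick(void __iomem *regs, struct my_desc *desc,
                        dma_addr_t addr, u32 len)
{
    desc->dma_addr = addr;      /* plain stores to DMA-visible memory */
    desc->len      = len;
    desc->flags    = 1;         /* "descriptor valid" (made up) */

    wmb();                      /* descriptor must be visible before the doorbell */

    writel(1, regs + 0x10);     /* hypothetical doorbell register */
}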

How do modern CPUs handle cross-page unaligned access?

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I get that I might run into problems with UMA ranging from performance degradation to a CPU fault. And I read about posix_memalign and cache lines.
What I cannot find is how the modern systems/hardware handle the situation when my request exceeds page boundaries?
Here is an example:
I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not one after another in physical memory).
With movq (offset + $0xffc), %rax I request 8 bytes starting at the 4092nd byte, meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.
Physical memory:
---|---------------|---------------|-->
|... 4b| | |4b ...|-->
I need 8 bytes that are split at the page boundaries.
How do the MMUs on x86-64 and ARM handle this? Are there any mechanisms in the kernel MM to somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?
I mean, to complete such a request the MMU has to translate one virtual address into two physical addresses. How does it handle that?
Should I care about such things if I'm a software programmer and why?
I'm reading a lot of links from Google, SO, Drepper's cpumemory.pdf and Gorman's Linux VMM book at the moment. But it's an ocean of information. It would be great if you could at least provide me with some pointers or keywords that I could use.
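A minimal sketch that constructs this situation (using mmap rather than malloc, and assuming Linux with a 4 KB page size) maps two anonymous pages and makes an 8-byte access straddling the boundary between them; the memcpy typically compiles to a single unaligned 8-byte load/store on x86-64:

/* Sketch: map two pages and make an 8-byte access that straddles them.
 * Assumes Linux and a 4096-byte page size. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    unsigned char *buf = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    uint64_t v = 0x1122334455667788ULL;
    memcpy(buf + 4092, &v, 8);      /* last 4 bytes of page 0, first 4 of page 1 */

    uint64_t r;
    memcpy(&r, buf + 4092, 8);      /* the load the movq above would perform */
    printf("0x%016llx\n", (unsigned long long)r);

    munmap(buf, 2 * 4096);
    return 0;
}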
I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":
An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.
So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses[1], which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.
Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.
I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:
[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.
so I'd be surprised if it wasn't broadly the same (i.e. one translation per aligned access).
Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.
[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations actually optimise this into larger aligned accesses where appropriate.
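As an illustration of the structure-packing culprit mentioned above, a small sketch (GCC/Clang attribute syntax; the struct itself is made up):

/* Sketch: the packed attribute removes padding, so 'v' lives at offset 1 and
 * every access to it is unaligned.  Because the compiler knows this, it emits
 * whatever the target needs: a plain unaligned load on x86, possibly a series
 * of byte loads on strict-alignment ARM configurations. */
#include <stdint.h>

struct __attribute__((packed)) wire_header {
    uint8_t  type;
    uint32_t v;          /* offset 1: misaligned for a 4-byte access */
};

uint32_t read_v(const struct wire_header *h)
{
    return h->v;
}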
So the architecture doesn't really matter, other than that x86 has traditionally not told you not to do this, where MIPS and ARM traditionally generate a data abort rather than trying to just make it work.
Where it doesn't matter is that all processors have a fixed number of pins, a fixed-size (maximum) data bus and a fixed-size (maximum) address bus. "Modern processors" tend to have data busses more than 8 bits wide, but the unit for addresses is still an 8-bit byte, so the opportunity for unaligned accesses exists. Anything larger than one byte in a particular transfer has the opportunity of being unaligned if the architecture allows it.
Transfers are typically in some number of bytes and/or bus widths. On an ARM AMBA/AXI bus, for example, the length field is in units of bus widths: 32 or 64 bits, 4 or 8 bytes. And no, it is not going to be in units of 4 Kbytes.
(Yes, this is elementary; I assume you understand all of this.)
Whether it is 16 bits or 128 bits, the penalty for unaligned accesses comes from the additional bus cycles, which these days is an extra bus clock per transfer. So for an ARM 16-bit unaligned transfer (which ARM will support on its newer cores without faulting), that means you need to read 128 bits instead of 64; 64 bits to get 16 is not a penalty, as 64 is the smallest size for a bus transfer. Each transfer, whether it is a single width of the data bus or multiple, has multiple clock cycles associated with it; let's say there are 6 clock cycles to do an aligned 16-bit read, then ideally it is 7 cycles to do an unaligned one. It seems small, but it does add up.
Caches help a lot because the DRAM side of the cache will be set up to use multiples of the bus width and will always do aligned accesses for cache fetches and evictions. Non-cached accesses follow the same pain, except the DRAM side is not a handful of clocks but dozens to hundreds of clocks of overhead.
For random access, a single 16-bit read that not only spans a bus-width boundary but also happens to cross a cache line boundary will not just incur the one additional clock on the processor side; worst case it can incur an additional cache line fetch, which is dozens to hundreds of additional clock cycles. If you were walking through an array of things that happen to not be aligned (structures/unions may be an example, depending on the compiler and code), that additional cache line fetch would have happened anyway; if the array of things hangs over a little on one or both ends, then you might still incur one or two more cache line fetches that you would have avoided had the array been aligned.
That is really the key to this: on reads, before or after an aligned area you might have to incur an extra transfer for each side you spill into.
Writes are both good and bad. Random reads are slower because the transaction has to stall until the answer comes back. For a random write the memory controller has all the information it needs: the address, data, byte mask, transfer type, etc. So it is fire and forget; the processor has done its job, can call the transaction complete from its perspective, and move on. Naturally, queue up too many of these, or do a read on something just written, and then the processor stalls due to the completion of a prior write in addition to the current transaction.
An unaligned 16-bit write, for example, not only incurs the additional read cycle, but (assuming a 32- or 64-bit wide bus) that is one byte per location, so you have to do a read-modify-write on whatever that closest memory is (cache or DRAM). So depending on how the processor and then the memory controller implement it, it can be two individual read-modify-write transactions (unlikely, since that incurs twice the overhead), or a double-width read, modify both parts, and a double-width write, incurring two additional clocks over and above the overhead; the overhead is doubled as well. If it had been an aligned bus-width write then no read-modify-write is required; you save the read. Now if this read-modify-write is in the cache then it is pretty fast, but still noticeable, up to a few clocks, depending on what is queued up that you have to wait on.
I am also most familiar with ARM. ARM traditionally would punish an unaligned access with an abort; you could turn that off, and you would instead get a rotation of the bus rather than it spilling over, which made for some nice freebie endian swaps. The more modern ARM cores will tolerate and implement an unaligned transfer. Understand, for example, that a store multiple of say 4 or more registers against a non-64-bit-aligned address is not considered an unaligned access, even though it is a 128-bit write to an address that is neither 64- nor 128-bit aligned. What the processor does in that case is break it into 3 writes: an aligned 32-bit write, an aligned 64-bit write and an aligned 32-bit write. The memory controller does not have to deal with the unaligned stuff. That is for legal things like store multiple. The core I am familiar with won't do a write length of more than 2 anyway; an 8-register store multiple is not a single length-of-4 write, it is 2 separate length-of-2 writes. But a load multiple of 8 registers, so long as it is aligned on a 64-bit address, is a single length-of-4 transaction. I am pretty sure that since there is no masking on the bus side for a read, everything is in units of bus width, and there is no reason to break say a 4-register load multiple on an address that is not 64-bit aligned into 3 transactions; simply do a length-of-3 read. When the processor reads a single byte you can't tell that from the bus; all you see is a 64-bit read, AFAIK. The processor strips the byte lane out. If the processor/bus does care, be it ARM, x86, MIPS, etc., then sure, you will hopefully see separate transfers.
Does everyone do this? No; older processors (not thinking of ARM or x86 specifically) would put more burden on the memory controller. I don't know what modern x86 and MIPS and such do.
Your malloc example: first off, you are not going to see single bus transfers of 4 Kbytes; that 4K will be broken up into digestible bits anyway. To start with, the system has to do one to many bus cycles against the memory management unit to find the physical address and other properties (those answers can get cached to make them faster, but sometimes they have to go all the way out to slow DRAM). So for that example, the only transfer that matters is an aligned transfer that splits the 4K boundary, say a 16-bit transfer; for the MMU system to work at all, the only way for that to be supported is for it to be turned into two separate 8-bit transfers that happen in those physical address spaces, and yes, that literally doubles everything: the MMU lookup cycles, the cache/DRAM bus cycles, etc. Other than that boundary, there is nothing special about your 8K being split. The bulk of your cycles will be within one of the two 4K pages, so it looks like any other random access, with of course repetitive/sequential accesses gaining the benefit of caching.
The short answer is that no matter what platform you are on, either 1) the platform will abort an unaligned transfer, or 2) somewhere in the path there are one or more additional cycles (up to dozens or hundreds) as a result of the unaligned access compared to an aligned access.
It doesn't matter whether the physical pages are adjacent or not. Modern CPUs use caches. Data is transferred to/from DRAM a full cache-line at a time. Thus, DRAM will never see a multi-byte read or write that crosses a 64B boundary, let alone a page boundary.
Stores that cross a page boundary are still slow (on modern x86). I assume the hardware handles the page-split case by detecting it at some later pipeline stage, and triggering a re-do that does two TLB checks. IDK if Intel designs insert extra uops into the pipeline to handle it, or what. (i.e. impact on latency, throughput of page-splits, throughput of all memory accesses, throughput of other (e.g. non-memory) uops).
Normally there's no penalty at all for unaligned accesses within a cache-line (since about Nehalem), and a small penalty for cache-line splits that aren't page-splits. An even split is apparently cheaper than others. (e.g. a 16B load that takes 8B from one cache line and 8B from another).
Anyway, DRAM will never see an unaligned access directly. AFAIK, no sane modern design has only write-through caches, so DRAM only sees writes when a cache-line is flushed, at which point the fact that one unaligned access dirtied two cache lines is not available. Caches don't even record which bytes are dirty; they just burst-write the whole 64B to the next level down (or last-level to DRAM) when needed.
There are probably some CPU designs that don't work this way, but Intel and AMD's designs are also this way.
Caveat: loads/stores to uncacheable memory regions might produce smaller stores, but probably still only within a single cache line. (On x86, this probably applies to MOVNT non-temporal stores that use write-combining store buffers but otherwise bypass the cache.)
Uncacheable unaligned stores that cross a page boundary are probably still split into separate stores (because each part needs a separate TLB translation).
Caveat 2: I didn't fact-check this. I'm certain about the whole-cache-line aligned access to DRAM for "normal" loads/stores to "normal" memory regions, though.

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write is not dependent on the previous value? So I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and taking a read lock for each read is a little expensive. In this situation I don't care if the reader gets the value just before the write or just after, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say that the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Will it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
In addition to the above comments, beware of the register bank in a slightly more general setting. You may end up updating only the CPU register and not actually writing it back to main memory right away. Or the other way around, where you use a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "always read, never cache locally in a register".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on the 486 and later), or unaligned but contained within a cache line (on the P6 and later). Most compilers will guarantee that stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
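If you'd rather check than guess whether 64-bit loads/stores come out as a single lock-free operation on your target, C11 atomics can report it directly; a small sketch:

/* Sketch: ask the implementation whether 64-bit atomic loads/stores are
 * lock-free (i.e. implemented as plain hardware accesses, not a hidden lock)
 * on this particular target. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    _Atomic uint64_t x = 0;
    printf("64-bit atomics lock-free here: %s\n",
           atomic_is_lock_free(&x) ? "yes" : "no");
    return 0;
}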
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index = ...;              // find a free entry in the global_data array
    global_data[new_index] = blah;
    WriteBarrier();               // write-release
    global_index = new_index;
}

read()
{
    read_index = global_index;
    ReadBarrier();                // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
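For completeness, a sketch of the same publish/consume pattern expressed with C11 atomics (the entry count and payload type are placeholders, and the ABA/reclamation caveat above still applies):

/* Sketch of the publish/consume pattern above using C11 atomics. */
#include <stdatomic.h>

#define NENTRIES 4
typedef struct { long offset; } data_t;   /* placeholder payload */

static data_t      global_data[NENTRIES];
static atomic_uint global_index;

void publish(data_t blah, unsigned new_index)   /* new_index: a free slot */
{
    global_data[new_index] = blah;
    atomic_store_explicit(&global_index, new_index,
                          memory_order_release);             /* write-release */
}

data_t consume(void)
{
    unsigned i = atomic_load_explicit(&global_index,
                                      memory_order_acquire); /* read-acquire */
    return global_data[i];
}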
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example) - see the Interlocked* APIs on Windows.
This can avoid the use of a heavier weight lock for threadsafe variable or member access, but should not be mixed up with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.

Resources