How to decide when to use a memory barrier - Linux

As part of writing driver code, I have come across code that uses memory barriers (fencing). After reading up on the topic, I learned why they are used and how they help on SMP systems. Thinking it through, multi-threaded programs have many places where memory races can occur, and putting barriers everywhere would cost CPU time. My questions:
I know the specific code paths that use shared memory to access data; do I need a memory barrier in all of these places?
Is there any specific technique or tip that will help me identify these pitfalls?
This is a very generic question, but I wanted to get insight from others' experiences and any tips that would help identify such pitfalls.

Often device hardware is sensitive to the order in which device registers are written. Modern systems are weakly ordered and typically have write-combining hardware between the CPU and memory.
Suppose you write a single byte of a 32-bit object. The write-combining hardware now holds A _ _ _. Instead of immediately initiating a read/modify/write cycle to update the A byte, the hardware sets a timer, in the hope that the CPU will send the B, C, and D bytes before it expires. When the timer expires, the data in the write-combining register gets dumped into memory.
Setting a barrier causes the write-combining hardware to use what it has. If only slot A is occupied then only slot A gets written.
Now suppose the hardware expects the bytes to be written in the strict order A, C, B, D. Without precautions, the device registers get written in the wrong order. The result is what you would expect: Ready! Fire! Aim!
Barriers should be placed judiciously because their incorrect use can seriously impede performance. Not every device write needs a barrier; judgement is called for.
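As a concrete illustration, here is a minimal sketch of ordering two device-register writes in a Linux driver. The register offsets and the start_device() helper are invented for the example; note that the kernel's plain writel() already orders MMIO accesses, so the explicit wmb() is what makes the _relaxed accessors safe here:

#include <linux/io.h>

/* hypothetical registers for a device that must see CTRL before GO */
#define REG_CTRL 0x00
#define REG_GO   0x04

static void start_device(void __iomem *base, u32 ctrl)
{
    writel_relaxed(ctrl, base + REG_CTRL); /* makes no ordering promise */
    wmb();                                 /* commit CTRL before GO */
    writel_relaxed(1, base + REG_GO);
}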

Related

Why memory reordering is not a problem on single core/processor machines?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out-of-order execution, it only does this when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail, and it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine or with green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 ISAs, which had cache-control instructions designed in from the start, do require the OS to be aware of them: i.e. to make sure the cache is invalidated before reading in new contents from disk, and that data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
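If you want to see that StoreLoad reordering for yourself, a litmus test along the following lines can catch it (this is a sketch; you may need many iterations, and per-iteration thread creation makes the window small). With acquire/release ordering, which is what plain x86 loads and stores give you, both r1 and r2 can end up 0; with memory_order_seq_cst they cannot:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2;

static void *t1(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_release);
    r1 = atomic_load_explicit(&y, memory_order_acquire);
    return NULL;
}

static void *t2(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    r2 = atomic_load_explicit(&x, memory_order_acquire);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        x = y = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)   /* forbidden under seq_cst */
            printf("StoreLoad reordering observed at iteration %d\n", i);
    }
    return 0;
}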
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer for older stores to the same location. This makes sure loads take their data from the last program-order write to any given memory location (see footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and would kill performance) if you had to put memory barriers after every store just for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address of an older store isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-address part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of a narrow reload from part of a wider store, even unaligned and maybe spanning a cache-line boundary...

For emulating the Gameboy, why does it matter that the memory is broken up into different areas?

So I'm writing a Gameboy emulator, and I'm not 100% sure why other projects took the time to break the memory up into proper categories. I don't know if there is a major technical dilemma I'm missing (maybe handling illegal parameters in instructions?), but it seems like the only thing that matters is that the address given by a write instruction is retrievable by the proper read instruction. As a sub-question: if I'm working under the assumption that the assembly is perfectly legal (meaning nothing tries to read/write where it can't), can I just make one big array and read and write to it?
Note that this is a conceptual question and that I am aware a big array would be a memory hog, I'm not necessarily looking for the best way to do it, simply trying to learn how it works and why other emulator developers did it the way they did.
You are going to have read-only memory, read/write memory, and memory-mapped I/O (peripherals etc.). So you need to decode the address to some extent to break it into the major categories; then, for the peripherals, you have to emulate each one, so you have to get very detailed in your address decoding.
For the peripherals you will need to detect a read/write to some address, which you cannot do by simply landing the writes in an array (two writes of the same value, for example, make a difference; you can't just scan some array looking for changes, you have to trigger on reads and writes and perform the hardware action).
If you wish to be cycle-accurate you will also need to know the timings of the RAMs and ROMs in order to mimic them; depending on how many banks of each there are, or whether timing depends on that, you may need to decode the address even further.
Hardware decodes these addresses to the same level so if you are emulating hardware then you need to...emulate hardware...and do the same amount of address decoding.
I'm going to be Gameboy-specific here. Look at the Gameboy's address-space map. The address space itself is divided; it's not something emulators invented. The hardware itself operates that way.
Here are some of the regions that can't be implemented as just an array (a sketch of the resulting address decoding follows the list):
0x0000-0x3FFF. First bank of the ROM. It's read-only, but not quite; read the next item.
0x4000-0x7FFF. Switchable ROM bank; it's also not quite read-only. Cartridges that don't fit into the Gameboy's address space contain a memory bank controller. The ROM will write to specific read-only ROM regions to select which ROM bank is mapped into the 0x4000-0x7FFF address range, so you have to detect these writes and then redirect reads into the selected ROM bank.
0xA000-0xBFFF. Switchable RAM bank. Same as the switchable ROM banks, but for RAM. Cartridges may contain additional RAM that is mapped into the Gameboy's address space. Which RAM bank is mapped is controlled, again, by writes to specific read-only regions.
0xFF00-0xFF4B. IO ports. Here you have hardware registers mapped into the address space. The Gameboy has several hardware components, each with its own registers and even memory (display controller, sound processor, timers, etc.). To control that hardware, the ROM reads and writes the IO ports. You obviously have to detect these writes so you can emulate the hardware they correspond to. It's not just the CPU and memory you have to emulate; I would even say those are the smallest and easiest part of it. For example, it is much harder to get the display controller and sound channels right. They have complicated logic, bugs, and very tricky behaviour that is not documented very well but is crucial to accurate emulation. The wave channel in particular gave me a hard time.
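To make that concrete, here is a minimal sketch of a read/write dispatcher. The bank-select logic is a simplified MBC1-style placeholder, and io_read()/io_write() are hypothetical stubs standing in for the hardware emulation:

#include <stdint.h>

#define ROM_BANK_SIZE 0x4000
#define NUM_ROM_BANKS 32                 /* depends on the cartridge */

static uint8_t rom[NUM_ROM_BANKS][ROM_BANK_SIZE];
static uint8_t wram[0x2000];
static unsigned cur_bank = 1;            /* bank mapped at 0x4000-0x7FFF */

static uint8_t io_read(uint16_t addr)          { (void)addr; return 0xFF; } /* stub */
static void io_write(uint16_t addr, uint8_t v) { (void)addr; (void)v; }     /* stub */

uint8_t mem_read(uint16_t addr)
{
    if (addr < 0x4000) return rom[0][addr];                     /* fixed bank 0 */
    if (addr < 0x8000) return rom[cur_bank][addr - 0x4000];     /* switchable bank */
    if (addr >= 0xC000 && addr < 0xE000) return wram[addr - 0xC000];
    if (addr >= 0xFF00 && addr <= 0xFF4B) return io_read(addr); /* hardware regs */
    /* ... VRAM, cartridge RAM, echo RAM, OAM, HRAM ... */
    return 0xFF;
}

void mem_write(uint16_t addr, uint8_t val)
{
    if (addr >= 0x2000 && addr < 0x4000) {       /* "write to ROM" = bank select */
        unsigned b = val & 0x1F;                 /* simplified MBC1 behaviour */
        cur_bank = b ? b : 1;
        return;
    }
    if (addr >= 0xC000 && addr < 0xE000) { wram[addr - 0xC000] = val; return; }
    if (addr >= 0xFF00 && addr <= 0xFF4B) { io_write(addr, val); return; } /* triggers hardware */
    /* writes to 0x0000-0x7FFF never land in ROM; they program the controller */
}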

multicore x86 read or write of misaligned data

On an x86, suppose I have a misaligned data item that spans a cache line boundary, say addresses 0x1fff through 0x2003 containing the little-endian 32-bit value 0x11223344. If thread A on core A does a write of 0x55667788 to that address, and thread B on core B "simultaneously" does a read of the same address, can that thread B potentially read a mix of the old and new value?
In other words, since A's misaligned write is going to be broken up by the processor into a one-byte write of 0x88 to address 0x1fff and a three-byte write of 0x556677 to address 0x2000, is it possible that B's read might happen in the middle of that misaligned write, and wind up reading 0x11223388 (or, if the write is split up in the reverse order, 0x55667711)? Obviously the desirable behavior is for the read to return either the old value or the new one, and I don't care which, but not a mixture.
Ideally I'm looking for not just an answer to the question, but an authoritative citation of specific supporting statements in the Intel or AMD architecture manuals.
I'm writing a simulator for a multiprocessor system that had an exotic processor architecture with strong guarantees of memory-access atomicity even for misaligned data, so the scenario I describe can't happen there. If I simulate each CPU as a separate thread on the x86, I need to ensure that it can't happen on the x86 either. The information I've read about memory-access ordering guarantees on the x86 doesn't explicitly cover misaligned cases.
I posed the question because my attempt at testing it myself didn't turn up any instances of the mixed read. However, that turned out to be due to a bug in my test program, and once I fixed it, the mixed read happens all the time on an AMD FX-8350. On the other hand, if the misaligned data does not cross a cache-line boundary, the problem does not seem to occur.
It appears that guaranteeing atomicity of misaligned reads and writes in my simulator will require either explicit locking or transactional memory (e.g., Intel's RTM).
My test program source code in C using pthreads is at:
https://gist.github.com/brouhaha/62f2178d12ec04a81078
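For reference, a minimal sketch of such a test (hypothetical names; the GCC alignment attribute and the misaligned pointer cast are x86-specific assumptions) places a 32-bit value two bytes before a cache-line boundary and watches for torn reads:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE 64                               /* assumed cache-line size */
static char buf[3 * LINE] __attribute__((aligned(LINE)));
static volatile uint32_t *p;                  /* will straddle a line boundary */

static void *writer(void *arg)
{
    (void)arg;
    for (;;) {                                /* alternate between two values */
        *p = 0x11223344;
        *p = 0x55667788;
    }
    return NULL;
}

int main(void)
{
    p = (volatile uint32_t *)(buf + LINE - 2);  /* bytes 62..65: spans the boundary */
    *p = 0x11223344;

    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    for (long i = 0; i < 100000000L; i++) {
        uint32_t v = *p;
        if (v != 0x11223344 && v != 0x55667788) {
            printf("torn read: 0x%08x\n", v);  /* mix of old and new bytes */
            exit(1);
        }
    }
    puts("no torn read observed");
    return 0;
}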

Userspace vs kernel space driver

I am looking to write a PWM driver. I know that there are two ways we can write a driver to control hardware:
User space driver.
Kernel space driver
If, in general (do not consider the PWM driver case), we have to decide whether to go for a user space or a kernel space driver, what factors should we take into consideration apart from these?
A user space driver can directly mmap() /dev/mem memory into its virtual address space and needs no context switching.
A userspace driver cannot have interrupt handlers implemented (it has to poll for interrupts).
A userspace driver cannot perform DMA (as DMA-capable memory can only be allocated from kernel space).
Of the three factors you have listed, only the first one is actually correct. As for the rest, not really. It is possible for user space code to perform DMA operations; no problem with that. There are many hardware appliance companies that employ this technique in their products. It is also possible to have an interrupt-driven user-space application, even when all of the I/O is done with a full kernel bypass. Of course, it is not as easy as simply doing an mmap() on /dev/mem.
You would have to have a minimal portion of your driver in the kernel; that is needed in order to provide your user space code with the bare minimum it needs from the kernel (because if you think about it, /dev/mem is also backed by a character device driver).
For DMA, it is actually too darn easy: all you have to do is handle the mmap request and map a DMA buffer into user space. For interrupts, it is a little more tricky: the interrupt must be handled by the kernel no matter what; however, the kernel may do no work and simply wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process, as done by DOSEMU, but that is very slow and is not recommended.
As for your actual question, one factor you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications, and there is nothing you cannot do in user space, go for user space. You will probably save tons of time during the development cycle, as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources), chances are that you will spend a tremendous amount of time making that possible; just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently, etc. And after all, IPC is generally done through the kernel, so if applications needed to start "talking" to each other, the performance might degrade greatly. This is still done in real life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. That's not a problem to do in user space. However, if you do that, then you'd need to write a full network stack too, as it won't be possible to use Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, keeping in mind that one day it might be necessary to move the code into the kernel. In fact, it is common practice to have the same code compiled for both user space and kernel space depending on whether some macro is defined, because testing in user space is a lot more pleasant. A small sketch of that trick follows.
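Here is a minimal sketch of the macro trick; the drv_alloc()/drv_free() wrapper names are invented for the example:

#ifdef __KERNEL__
#include <linux/slab.h>
#define drv_alloc(n) kmalloc((n), GFP_KERNEL)
#define drv_free(p)  kfree(p)
#else
#include <stdlib.h>
#define drv_alloc(n) malloc(n)
#define drv_free(p)  free(p)
#endif

/* the core driver logic below uses only drv_alloc()/drv_free(), so the
   same file builds as a kernel module and as a user-space test harness */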
Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
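For illustration, waiting for an interrupt through UIO can be as small as the following sketch. The device node /dev/uio0 depends on your system; per the UIO howto, each blocking read() returns the cumulative interrupt count, and some UIO drivers also require a write() to the same fd to re-enable the interrupt:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDONLY);   /* hypothetical UIO device node */
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        uint32_t irq_count;
        /* blocks until the kernel-side handler signals an interrupt */
        if (read(fd, &irq_count, sizeof irq_count) == sizeof irq_count)
            printf("interrupt #%u\n", irq_count);
    }
}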
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you have only one writer (and lots of readers) and the write does not depend on the previous value? I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and taking a read lock every time is a little expensive. In this situation I don't care if a reader gets the value just before or just after the write, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Will it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different from the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
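For the one-writer/many-readers timestamp offset in the question, a C11 sketch (assuming a compiler with stdatomic.h) gives tear-free access without a lock; relaxed ordering is enough here because readers only need some recent, fully-written value. Note that on a 32-bit platform a 64-bit atomic may be implemented with a hidden lock by the runtime:

#include <stdatomic.h>
#include <stdint.h>

/* the shared offset: one writer, many readers */
static _Atomic int64_t ts_offset;

void update_offset(int64_t new_offset)       /* writer, once per second */
{
    atomic_store_explicit(&ts_offset, new_offset, memory_order_relaxed);
}

int64_t read_offset(void)                    /* any reader thread */
{
    return atomic_load_explicit(&ts_offset, memory_order_relaxed);
}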
In addition to the above comments, beware of CPU registers in a slightly more general setting. You may end up updating only a CPU register and not actually writing the value back to main memory right away, or, the other way around, using a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "always read from memory, never use a locally register-cached copy".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on the 486 and later) or unaligned but within a cache line (on the P6 and later). Most compilers will guarantee stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index= ...; // find a free entry in the global_data array.
    global_data[new_index]= blah;
    WriteBarrier(); // write-release
    global_index= new_index;
}

read()
{
    read_index= global_index;
    ReadBarrier(); // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
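A common fix for that is a sequence lock. Here is a C11 sketch protecting a 64-bit value on a 32-bit machine (one writer, many readers): the writer makes the sequence number odd while it updates the two halves, and readers retry whenever the number is odd or changed mid-read, so a torn pair of halves is never returned:

#include <stdatomic.h>
#include <stdint.h>

static atomic_uint seq;            /* even = stable, odd = write in progress */
static _Atomic uint32_t lo, hi;    /* the two halves of the 64-bit value */

void write64(uint64_t v)           /* single writer only */
{
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);      /* mark busy */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);      /* mark stable */
}

uint64_t read64(void)
{
    unsigned s0, s1;
    uint32_t l, h;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        l  = atomic_load_explicit(&lo, memory_order_relaxed);
        h  = atomic_load_explicit(&hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);       /* retry on in-progress or torn read */
    return ((uint64_t)h << 32) | l;
}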
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example); see the Interlocked* APIs on Windows.
This can avoid the use of a heavier-weight lock for thread-safe variable or member access, but it should not be mixed with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and Interlocked* to modify or read it in another.
