memory barrier and atomic_t on linux

Recently, I have been reading some Linux kernel-space code, and I saw this:
uint64_t used;
uint64_t blocked;
used = atomic64_read(&g_variable->used); //#1
barrier(); //#2
blocked = atomic64_read(&g_variable->blocked); //#3
What are the semantics of this code snippet? Does #2 make sure that #1 executes before #3?
But I am a little bit confused, because:
#A On 64-bit platforms, the atomic64_read macro is expanded to
used = (&g_variable->used)->counter // where counter is volatile.
On 32-bit platforms, it is converted to use lock cmpxchg8b. I assume these two have the same semantics, and for the 64-bit version, I think it means:
all-or-nothing: we can exclude the case where the address is unaligned or the word size is larger than the CPU's native word size.
no optimization: it forces the CPU to read from the memory location.
atomic64_read doesn't have any semantics for preserving read ordering! (see this)
#B the barrier macro is defined as
/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")
According to the wiki, this just prevents the gcc compiler from reordering reads and writes.
What I am confused about is how it disables reordering optimization by the CPU. In addition, can I consider the barrier macro a full fence?

32-bit x86 processors don't provide simple atomic read operations for 64-bit types. The only atomic operation on 64-bit types on such CPUs that deals with "normal" registers is LOCK CMPXCHG8B, which is why it is used here. The alternative is to use MOVQ and MMX/XMM registers, but that requires knowledge of the FPU state/registers, and requires that all operations on that value are done with the MMX/XMM instructions.
On 64-bit x86_64 processors, aligned reads of 64-bit types are atomic, and can be done with a MOV instruction, so only a plain read is required --- the use of volatile is just to ensure that the compiler actually does a read, and doesn't cache a previous value.
As for the read ordering, the inline assembler you quote ensures that the compiler emits the instructions in the right order, and this is all that is required on x86/x86_64 CPUs, provided the writes are correctly sequenced. LOCKed writes on x86 have a total ordering; plain MOV writes provide "causal consistency", so if thread A does x=1 then y=2, if thread B reads y==2 then a subsequent read of x will see x==1.
On IA-64, PowerPC, SPARC, and other processors with a more relaxed memory model there may well be more to atomic64_read() and barrier().
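To make this concrete, here is a minimal user-space sketch (not the kernel's actual code; the struct and function names are made up) of the same pattern: two volatile 64-bit reads separated by a compiler barrier, which on x86/x86_64 is enough to keep the loads in program order.
#include <stdint.h>
struct stats {
    volatile int64_t used;
    volatile int64_t blocked;
};
#define compiler_barrier() __asm__ __volatile__("" : : : "memory")
void snapshot(struct stats *s, int64_t *used, int64_t *blocked)
{
    *used = s->used;        /* #1: plain (but volatile) 64-bit load */
    compiler_barrier();     /* #2: stops the compiler reordering #1 and #3 */
    *blocked = s->blocked;  /* #3: emitted after #1 in the instruction stream */
}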

x86 CPUs don’t do read-after-read reordering, so it is sufficient to prevent the compiler from doing any reordering. On other platforms such as PowerPC, things will look a lot different.
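For comparison, a hedged sketch of how portable (non-kernel) code usually expresses the same requirement: C11 acquire loads compile to plain loads on x86 but emit the required hardware barrier (e.g. lwsync on PowerPC) on more weakly ordered CPUs. The variable names below are illustrative only.
#include <stdatomic.h>
#include <stdint.h>
_Atomic int64_t used, blocked;
void snapshot(int64_t *u, int64_t *b)
{
    *u = atomic_load_explicit(&used, memory_order_acquire);     /* load #1 */
    /* acquire on #1 already forbids #2 from being hoisted above it */
    *b = atomic_load_explicit(&blocked, memory_order_acquire);  /* load #2 */
}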


Do the instruction cache and data cache sync with each other? [duplicate]

I like examples, so I wrote a bit of self-modifying code in C...
#include <stdio.h>
#include <sys/mman.h> // linux
int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register rax (000) which holds the return value
                       // according to linux x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}
...which works, apparently:
>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all consecutive calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.
I guess the cpu compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know if this behavior can be considered predictable (barring any differences of hardware and os), and relied on?
(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)
What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) do the job of maintaining i/d cache coherency for you, as the manual points out (Manual 3A, System Programming):
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the
processor causes the associated cache line (or lines) to be invalidated.
But this assertion is valid as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders since they don't run in the same address-space:
Applications that include self-modifying code use the same
linear address for modifying and fetching the instruction. Systems software, such as
a debugger, that might possibly modify an instruction using a different linear address
than that used to fetch the instruction, will execute a serializing operation, such as a
CPUID instruction, before the modified instruction is executed, which will automatically
resynchronize the instruction cache and prefetch queue.
For instance, serializing operations are always required on many other architectures, such as PowerPC, where they must be done explicitly (E500 Core Manual):
3.3.1.2.1 Self-Modifying Code
When a processor modifies any memory location that can contain an instruction, software must
ensure that the instruction cache is made consistent with data memory and that the modifications
are made visible to the instruction fetching mechanism. This must be done even if the cache is
disabled or if the page is marked caching-inhibited.
It is interesting to notice that PowerPC requires the issue of a context-synchronizing instruction even when caches are disabled; I suspect it enforces a flush of deeper data processing units such as the load/store buffers.
The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and therefore likely to fail.
Hope this helps.
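As a hedged, portable follow-up to this answer: on architectures without i/d-cache snooping, the usual fix is to flush the modified range explicitly before executing it. GCC and Clang expose this as the __builtin___clear_cache builtin, which compiles to nothing on x86 but to the required cache-maintenance and context-synchronizing instructions on PowerPC/ARM; the function below assumes the 7-byte buffer from the question and the helper name is made up.
/* patch the immediate operand, sync the instruction cache, then execute */
static int patch_and_call(unsigned char *code)
{
    code[2]++;                                                /* modify the code   */
    __builtin___clear_cache((char *)code, (char *)code + 7);  /* make I$ see it    */
    return ((int (*)(void))code)();                           /* now safe anywhere */
}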
It's pretty simple; the write to an address that's in one of the cache lines in the instruction cache invalidates it from the instruction cache. No "synchronization" is involved.
The CPU handles cache invalidation automatically, you don't have to do anything manually. Software can't reasonably predict what will or will not be in CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU saw that you modified data, it updated its various caches accordingly.
By the way, many x86 processors (that I worked on) snoop not only the instruction cache but also the pipeline and instruction window - the instructions that are currently in flight. So self-modifying code will take effect by the very next instruction. But you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be executed.
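A hedged sketch of the serialization that answer recommends: executing CPUID between writing the new instructions and jumping to them drains the pipeline and prefetch queue on x86/x86_64 (the helper name below is made up).
static inline void serialize_cpu(void)
{
    unsigned int eax = 0;                                    /* CPUID leaf 0 */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax)
                         :
                         : "ebx", "ecx", "edx", "memory");   /* CPUID clobbers these */
}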
I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!
Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache coherency protocol do this trick for you. The flags "PROT_READ|PROT_WRITE|PROT_EXEC" ask mmap() to set the iTLB and dTLB of the L1 cache and the TLB of the L2 cache for this physical page correctly. This low-level, architecture-specific kernel code does this differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
This is just for explanation purposes.
Assume that your system is not doing much and there are no process switches between "a[0]=0b01000000;" and the start of "printf("\n"):"...
Also, assume that you have 1K of L1 iCache and 1K of dCache in your processor and some L2 cache in the core. (Nowadays these are on the order of a few MBs.)
mmap() sets up your virtual address space and the iTLB1, dTLB1 and TLB2s.
"a[0]=0b01000000;" will actually trap (H/W magic) into kernel code, your physical address will be set up, and all processor TLBs will be loaded by the kernel. Then you will be back in user mode and your processor will actually load 16 bytes (H/W magic, a[0] to a[3]) into the L1 dCache and L2 cache. The processor will really go to memory again only when you reference a[4] and so on (ignore prefetching for now!). By the time you complete "a[7]=0b11000011;", your processor has done 2 burst READs of 16 bytes each on the external bus. Still no actual WRITEs into physical memory. All WRITEs are happening within the L1 dCache (H/W magic, the processor knows) and L2 cache so far, and the DIRTY bit is set for the cache line.
"a[3]++;" will have a STORE instruction in the assembly code, but the processor will store that only in the L1 dCache & L2, and it will not go to physical memory.
Let's come to the function call "a()". Again, the processor does the instruction fetch from the L2 cache into the L1 iCache and so on.
The result of this user-mode program will be the same on any Linux system on any processor, thanks to the correct implementation of the low-level mmap() syscall and the cache coherency protocol!
If you write this code in an embedded processor environment without the OS assistance of the mmap() syscall, you will find the problem you are expecting. This is because you are not using either the H/W mechanism (TLBs) or a software mechanism (memory barrier instructions).

Prohibit unaligned memory accesses on x86/x86_64

I want to emulate a system with prohibited unaligned memory accesses on x86/x86_64.
Is there some debugging tool or special mode to do this?
I want to run many (CPU-intensive) tests on several x86/x86_64 PCs when working with software (C/C++) designed for SPARC or some other similar CPU. But my access to SPARC is limited.
As far as I know, SPARC always checks that memory reads and writes are naturally aligned (a byte can be read from any address, but a 4-byte word may only be read when the address is divisible by 4).
Maybe Valgrind or PIN has such a mode? Or a special compiler mode?
I'm searching for a non-commercial Linux tool, but Windows tools are allowed too.
Or maybe there is a secret CPU flag in EFLAGS?
I've just read question Does unaligned memory access always cause bus errors? which linked to Wikipedia article Segmentation Fault.
In the article, there's a wonderful reminder of rather uncommon Intel processor flags AC aka Alignment Check.
And here's how to enable it (from Wikipedia's Bus Error example, with a red-zone clobber bug fixed for x86-64 System V so this is safe on Linux and MacOS, and converted from basic asm, which is never a good idea inside functions: you want changes to AC to be ordered wrt. memory accesses).
#if defined(__GNUC__)
# if defined(__i386__)
/* Enable Alignment Checking on x86 */
__asm__("pushf\n orl $0x40000,(%%esp)\n popf" ::: "memory");
# elif defined(__x86_64__)
/* Enable Alignment Checking on x86_64 */
__asm__("add $-128, %%rsp \n" // skip past the red-zone, in case there is one and the compiler has local vars there.
"pushf\n"
"orl $0x40000,(%%rsp)\n"
"popf \n"
"sub $-128, %%rsp" // and restore the stack pointer.
::: "memory"); // ordered wrt. other mem access
# endif
#endif
Once enabled, it works a lot like the ARM alignment settings in /proc/cpu/alignment; see the answer to How to trap unaligned memory access? for examples.
Additionally, if you're using GCC, I suggest you enable -Wcast-align warnings. When building for a target with strict alignment requirements (ARM for example), GCC will report locations that might lead to unaligned memory access.
But note that libc's handwritten asm for memcpy and other functions will still make unaligned accesses, so setting AC is often not practical on x86 (including x86-64). GCC will sometimes emit asm that makes unaligned accesses even if your source doesn't, e.g. as an optimization to copy or zero two adjacent array elements or struct members at once.
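For illustration, a hedged little test (x86_64 Linux, GCC, compile with -O0 so the load isn't optimised away) that enables AC as above and then performs a deliberately misaligned 32-bit load; with alignment checking on it is expected to die with SIGBUS, though as noted above libc itself may trip the check first.
#include <stdio.h>
int main(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("add $-128, %%rsp \n"
            "pushf \n"
            "orl $0x40000,(%%rsp) \n"
            "popf \n"
            "sub $-128, %%rsp"
            ::: "memory");                  /* set EFLAGS.AC, red-zone safe */
#endif
    _Alignas(8) char buf[8] = {0};
    int *p = (int *)(buf + 1);              /* deliberately misaligned pointer */
    printf("%d\n", *p);                     /* expected: SIGBUS once AC is set */
    return 0;
}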
It's tricky and I haven't done it personally, but I think you can do it in the following way:
x86_64 CPUs (specifically I've checked Intel Core i7, but I guess others as well) have a performance counter MISALIGN_MEM_REF, which counts misaligned memory references.
So first of all, you can run your program and use the "perf" tool under Linux to get a count of the number of misaligned accesses your code has done.
A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow and gets it to overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.
This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.
This won't be simple though - besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.
Both GCC and Clang have UndefinedBehaviorSanitizer built in. One of those checks, alignment, can be enabled with -fsanitize=alignment. It'll emit code to check pointer alignment at runtime and abort if unaligned pointers are dereferenced.
See online documentation at:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
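A hedged usage example (the file name and build line are illustrative): compile with the sanitizer enabled and dereference a misaligned pointer; UBSan then reports the misaligned load at run time instead of letting it pass silently.
/* build: cc -fsanitize=alignment -g misalign.c -o misalign && ./misalign */
#include <stdio.h>
int main(void)
{
    _Alignas(8) char buf[8] = {0};
    int *p = (int *)(buf + 1);   /* buf+1 cannot be 4-byte aligned */
    printf("%d\n", *p);          /* UBSan: load of misaligned address */
    return 0;
}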
Perhaps you could somehow compile to SSE, with all aligned moves. Unaligned accesses with movaps are illegal and would probably behave like illegal unaligned accesses on other architectures.

On x86 if [mem] is not 32-bit aligned, can "lock inc [mem]" still work fine?

On x86, if mem is 32-bit aligned, the mov operation is guaranteed to be atomic.
If [mem] is not 32-bit aligned, can lock inc [mem] still work fine?
"Work fine" meaning: provide atomicity and not get a partial value.
The Intel Instruction Set Reference for x86 and x64 mentions nothing about alignment requirements for the INC instruction. All it says in reference to LOCK is:
This instruction can be used with a LOCK prefix
to allow the instruction to be executed atomically.
The LOCK prefix documentation states:
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
The lock prefix will provide atomicity for unaligned memory access. On QPI systems this could be very slow. See this post on Intel website:
How to solve bugs of simultaneously misaligned memory accesses
http://software.intel.com/en-us/forums/showthread.php?t=75386
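A hedged illustration of the point (GCC/Clang on x86/x86_64; the names are made up): a packed struct knocks a 32-bit counter off its natural boundary, yet a lock-prefixed increment on it remains atomic. Expect it to be slow, especially if the field straddles a cache line, and expect -Waddress-of-packed-member warnings.
#include <stdint.h>
struct __attribute__((packed)) oddly_packed {
    uint8_t  pad;       /* pushes 'counter' off its natural 4-byte boundary */
    uint32_t counter;
};
static void locked_inc(volatile uint32_t *p)
{
    __asm__ __volatile__("lock incl %0" : "+m"(*p) : : "memory");  /* atomic RMW */
}
void bump(struct oddly_packed *s)
{
    locked_inc(&s->counter);    /* still atomic, just slower when misaligned */
}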
While the hardware might be fine with unaligned accesses, the code implementation might be relying on stealing the low 2 or 3 bits of the pointer (always zero for 32 or 64 bit aligned pointers respectively).
For example, the (Win32) InterlockedPushSList function doesn't store the low 2 or 3 bits of the pointer, so any attempt to push or pop an unaligned object will not work as intended. It is common for lock-free code to cram extra information into a pointer sized object. Most of the time this is not an issue though.
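A hedged sketch of the pointer-tagging idiom that answer alludes to (all names invented here): because a pointer to an 8-byte-aligned node always has its low three bits clear, lock-free containers often reuse those bits for flags or a small counter, and an unaligned node silently breaks the scheme.
#include <stdint.h>
#include <assert.h>
#define TAG_MASK ((uintptr_t)0x7)   /* low 3 bits are free on 8-byte-aligned pointers */
static uintptr_t pack(void *node, unsigned tag)
{
    assert(((uintptr_t)node & TAG_MASK) == 0);      /* alignment is the whole trick */
    return (uintptr_t)node | (tag & TAG_MASK);
}
static void *unpack_ptr(uintptr_t v) { return (void *)(v & ~TAG_MASK); }
static unsigned unpack_tag(uintptr_t v) { return (unsigned)(v & TAG_MASK); }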
Intel's processors have always had excellent misaligned access performance. On the Nehalem (Core I7), they went all the way: any misaligned access fully within a cache line has no penalty, and misaligned accesses that cross a cache line boundary have an average 4.5-cycle penalty - very small.

InterlockedExchange and memory alignment

I am confused: Microsoft says memory alignment is required for InterlockedExchange; however, Intel documentation says that memory alignment is not required for LOCK.
Am I missing something, or what?
Thanks.
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads
InterlockedExchange
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from Intel Software Developer’s Manual;
LOCK instruction
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
•Any boundary for an 8-bit access (locked or otherwise).
•16-bit boundary for locked word accesses.
•32-bit boundary for locked doubleword accesses.
•64-bit boundary for locked quadword accesses.
Once upon a time, Microsoft supported WindowsNT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in their spec to ensure that these primitives would be portable to different architectures.
Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment, if the OS has enabled alignment checking then the cmpxchg will raise an alignment check exception (AC) when executed with unaligned operands. Check the docs for the cmpxchg and similar, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.
Hey, I answered a few questions related to this; also keep in mind:
There is NO byte-level InterlockedExchange; there IS a 16-bit short InterlockedExchange, however.
The documentation discrepancy you refer to is probably just a documentation oversight.
If you want to do byte/bit-level atomic access, there ARE plenty of ways to do this with the existing intrinsics: Interlocked[And8|Or8|Xor8].
Any operation where you're doing high-perf locking (using the machine code like you discuss) should not be operating unaligned (a performance anti-pattern).
xchg (an optimized instruction with an implicit LOCK prefix, optimized due to its ability to cache-lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
I nearly forgot: Intel's TBB has Load8/Store8 operations defined without the use of implicit or explicit locking (in some cases):
.code
ALIGN 4
PUBLIC c __TBB_machine_load8
__TBB_machine_Load8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne load_slow
; Load within a cache line
sub esp,12
fild qword ptr [ecx]
fistp qword ptr [esp]
mov eax,[esp]
mov edx,4[esp]
add esp,12
ret
EXTRN __TBB_machine_store8_slow:PROC
.code
ALIGN 4
PUBLIC c __TBB_machine_store8
__TBB_machine_Store8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
fild qword ptr 8[esp]
fistp qword ptr [ecx]
ret
end
Anyhow, hope that clears at least some of this up for you.
I don't understand where your Intel information is coming from.
To me, it's pretty clear that Intel cares A LOT about alignment and/or spanning cache lines.
For example, on a Core i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3A, System Programming, for x86/x64 Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
•Reading or writing a byte
•Reading or writing a word aligned on a 16-bit boundary
•Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
•Reading or writing a quadword aligned on a 64-bit boundary
•16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
•Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
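A hedged sketch of how code usually stays inside the guarantees quoted above: give shared fields their natural alignment, and give hot lock-free data a whole cache line, so that no access can split a line (C11; the struct and field names are illustrative).
#include <stdint.h>
#include <stdalign.h>
struct shared_state {
    alignas(8)  uint64_t hits;       /* naturally aligned: 64-bit access is atomic */
    alignas(64) uint64_t hot_field;  /* own cache line: no split access, no false sharing */
};
_Static_assert(alignof(struct shared_state) >= 64, "unexpected layout");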

Does one assembler instruction always execute atomically? [duplicate]

Today I came across this question:
you have this code:
static int counter = 0;
void worker() {
    for (int i = 1; i <= 10; i++)
        counter++;
}
If worker were called from two different threads, what value would counter have after both of them are finished?
I know that it could actually be anything. But my gut tells me that counter++ will most likely be translated into a single assembler instruction, and if both threads execute on the same core, counter will be 20.
But what if those threads run on different cores or processors, could there be a race condition in their microcode? Can one assembler instruction always be viewed as an atomic operation?
Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:
inc counter
This translates into the following micro operations:
load counter to a hidden register on the CPU
increment the register
store the updated register in counter
This is essentially the same as:
mov eax, counter
inc eax
mov counter, eax
Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).
If you want to guarantee that this inc is atomic, use the lock prefix:
lock inc counter
lock guarantees that nobody can update counter between the load and the store.
Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
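A hedged sketch of the difference as seen from C (GCC/Clang): the plain increment may well compile to a single inc instruction and still lose updates, while the atomic builtin compiles to a lock-prefixed instruction on x86.
static int counter;
void plain_inc(void)  { counter++; }                                           /* may be 'inc counter' - not atomic */
void atomic_inc(void) { __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST); }   /* 'lock add' on x86 */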
The answer is: it depends!
There is some confusion around what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The exception is when you use macros -- but you should be aware of that.
That said, the question boils down to: is one machine instruction atomic?
In the good old days, it was. But today, with complex CPUs, long-running instructions, hyperthreading, ... it is not. Some CPUs guarantee that certain increment/decrement instructions are atomic. The reason is that they are handy for very simple synchronization.
Also, some CPU instructions are not so problematic. When you have a simple fetch (of one piece of data that the processor can fetch in one piece), the fetch itself is of course atomic, because there is nothing to be divided at all. But when you have unaligned data, it becomes complicated again.
The answer is: it depends. Carefully read the vendor's machine instruction manual. When in doubt, it is not!
Edit:
Oh, I see it now, you also ask about ++counter. The statement "most likely to be translated" cannot be trusted at all. This also largely depends on the compiler, of course! It gets more difficult when the compiler applies different optimizations.
Not always - on some architectures one assembly instruction is translated into one machine code instruction, while on others it is not.
In addition, you can never assume that the programming language you are using compiles a seemingly simple line of code into one assembly instruction. Moreover, on some architectures, you cannot assume one machine instruction will execute atomically.
Use proper synchronization techniques instead, dependent on the language you are coding in.
Increment/decrement operations on integer variables of 32 bits or less on a single 32-bit processor with no Hyper-Threading Technology are atomic.
On a processor with Hyper-Threading Technology or on a multi-processor system, the increment/decrement operations are NOT guaranteed to be executed atomically.
Invalidated by Nathan's comment:
If I remember my Intel x86 assembler correctly, the INC instruction only works for registers and does not directly work for memory locations.
So a counter++ would not be a single instruction in assembler (just ignoring the post-increment part). It would be at least three instructions: load the counter variable into a register, increment the register, store the register back to counter. And that is just for the x86 architecture.
In short, don't rely on it being atomic unless it is specified by the language specification and the compiler you are using supports it.
Another issue is that if you don't declare the variable as volatile, the generated code would probably not update memory on every loop iteration; memory would only be updated at the end of the loop.
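To illustrate that last point, a hedged sketch of what an optimising compiler is allowed to produce when counter is neither volatile nor atomic: roughly as if the loop had been rewritten like this.
static int counter;
void worker(void)
{
    int tmp = counter;              /* read memory once                */
    for (int i = 1; i <= 10; i++)
        tmp++;                      /* often folded to tmp += 10       */
    counter = tmp;                  /* single store back at loop exit  */
}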
No, you cannot assume this, unless it is clearly stated in the compiler specification. Moreover, no one can guarantee that a single assembler instruction is indeed atomic. In practice, each assembler instruction is translated into a number of microcode operations - uops.
Also, the issue of race conditions is tightly coupled with the memory model (coherence, sequential consistency, release consistency, etc.); for each one the answer and result could be different.
In most cases, no. In fact, on x86, you can perform the instruction
push [address]
which, in C, would be something like:
*--stack = *address;
This performs two memory transfers in one instruction.
That's basically impossible to do in 1 clock cycle, not the least because one memory transfer is also not possible in one cycle!
Might not be an actual answer to your question, but (assuming this is C#, or another .NET language) if you want counter++ to really be multi-threaded atomic, you could use System.Threading.Interlocked.Increment(counter).
See other answers for actual information on the many different ways why/how counter++ could not be atomic. ;-)
On many other processors, the separation between the memory system and the processor is bigger (often these processors can be little- or big-endian depending on the memory system, like ARM and PowerPC). This also has consequences for atomic behaviour if the memory system can reorder reads and writes.
For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier)
So in short, while atomic instructions are enough on Intel (with the relevant lock prefixes), more must be done on non-Intel, since the memory I/O might not be in the same order.
This is a known problem when porting "lock-free" solutions from Intel to other architectures.
(Note that multiprocessor (not multicore) systems on x86 also seem to need memory barriers, at least in 64-bit mode.)
I think that you'll get a race condition on access.
If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.
