InterlockedExchange and memory alignment - multithreading

I am confused: Microsoft says memory alignment is required for InterlockedExchange, yet the Intel documentation says that memory alignment is not required for LOCK.
Am I missing something?
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from Intel Software Developer’s Manual;
LOCK instruction
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:
•Any boundary for an 8-bit access (locked or otherwise).
•16-bit boundary for locked word accesses.
•32-bit boundary for locked doubleword accesses.
•64-bit boundary for locked quadword accesses.
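To make the alignment recommendation above concrete, here is a minimal sketch. It assumes C11 atomics as a portable stand-in for InterlockedExchange (the Win32 call itself is not used), and the helper names is_naturally_aligned and exchange_aligned are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is aligned to `size` bytes.
   `size` must be a power of two. */
static int is_naturally_aligned(const volatile void *p, size_t size)
{
    return ((uintptr_t)p & (size - 1)) == 0;
}

/* Stand-in for InterlockedExchange (an assumption, not the Win32 API):
   atomically stores `value` and returns the previous contents. */
static int32_t exchange_aligned(_Atomic int32_t *target, int32_t value)
{
    /* The MSDN text asks for a 32-bit boundary; check it before the exchange. */
    assert(is_naturally_aligned((const volatile void *)target, sizeof(int32_t)));
    return atomic_exchange(target, value);
}
```

A variable declared as _Atomic int32_t is normally placed on a 4-byte boundary by the compiler, so the check only fires for pointers produced by casts or packed structures.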

Once upon a time, Microsoft supported Windows NT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in their spec to ensure that these primitives would be portable to different architectures.

Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment, if the OS has enabled alignment checking then the cmpxchg will raise an alignment check exception (AC) when executed with unaligned operands. Check the docs for the cmpxchg and similar, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.

I answered a few questions related to this. Also keep in mind:
There is NO byte-level InterlockedExchange; there IS a 16-bit (short) InterlockedExchange, however.
The documentation discrepancy you refer to is probably just a documentation oversight.
If you want to do byte/bit-level atomic access, there ARE plenty of ways to do this with the existing intrinsics: Interlocked[And8|Or8|Xor8].
Any operation where you're doing high-performance locking (using the machine code like you discuss) should not be operating unaligned (performance anti-pattern).
xchg with a memory operand (an instruction with an implicit LOCK prefix, optimized due to its ability to cache-lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
I nearly forgot: Intel's TBB defines 8-byte load/store routines without the use of implicit or explicit locking (in some cases):
PUBLIC c __TBB_machine_load8
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne load_slow
; Load within a cache line
sub esp,12
fild qword ptr [ecx]
fistp qword ptr [esp]
mov eax,[esp]
mov edx,4[esp]
add esp,12
EXTRN __TBB_machine_store8_slow:PROC
PUBLIC c __TBB_machine_store8
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
fild qword ptr 8[esp]
fistp qword ptr [ecx]
Anyhow, I hope that clears at least some of this up for you.

I don't understand where your Intel information is coming from.
To me, it's pretty clear that Intel cares A LOT about alignment and/or spanning cache lines.
For example, on a Core i7 processor, you STILL have to make sure your data does not span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3-I, System Programming, for x86/x64, Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
Reading or writing a quadword aligned on a 64-bit boundary
16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries
are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core
Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic; however, nonaligned data accesses will
seriously impact the performance of the processor and should be avoided.
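The "fits within a cache line" condition in the quoted text can be checked with simple address arithmetic. This is a sketch only: the 64-byte line size is an assumption (common on current x86 parts, but not stated in the quote), and crosses_cache_line is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Query the CPU if this matters in practice. */
#define CACHE_LINE_SIZE 64u

/* True if an access of `size` bytes starting at `addr` touches more than
   one cache line, i.e. the case the manual does not guarantee to be atomic. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

With these assumptions, an 8-byte access at offset 56 stays inside one line, while the same access at offset 60 is split across two.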


Can you have torn reads/writes between two threads pinned to different processors, if the system is cache coherent?

If you have two threads in the same processor, you can have a torn read/write.
For example, on a 32 bit system with thread 1 and thread 2 running on the same core:
1. Thread 1 assigns a 64-bit int 0xffffffffffffffff to a global variable X, which is initially zero.
2. The first 32 bits are written to X; now X is 0xffffffff00000000.
3. Thread 2 reads X as 0xffffffff00000000.
4. Thread 1 writes the last 32 bits.
The torn read happens in step 3.
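The interleaving just described can be replayed deterministically in a single thread. This is only an illustration of the ordering, not a real race: the struct split64 and both function names are hypothetical, and the 64-bit value is modelled as two explicit 32-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value kept as two 32-bit halves, the way a 32-bit machine
   without a single 64-bit store would write it. */
struct split64 {
    uint32_t hi;
    uint32_t lo;
};

static uint64_t read64(const struct split64 *x)
{
    return ((uint64_t)x->hi << 32) | x->lo;
}

/* Replays the steps in order and returns what thread 2 observes in step 3. */
static uint64_t replay_torn_read(struct split64 *x)
{
    assert(read64(x) == 0);      /* step 1 starts from X == 0 */
    x->hi = 0xffffffffu;         /* step 2: first half of the store */
    uint64_t seen = read64(x);   /* step 3: the other thread's read lands here */
    x->lo = 0xffffffffu;         /* step 4: second half of the store */
    return seen;
}
```

The returned value is a mix of old and new halves, which is exactly the torn value from the question.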
But what if the following conditions are met:
Thread 1 and Thread 2 are pinned to different cores
The system uses MESI protocol to achieve cache coherence
In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?
Yes, you can have tearing.
A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.
Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half. (Because it received an RFO (read for ownership) from the writer core.) The first read could see the old value, the 2nd read could see the new value.
The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.
(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)
Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.
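In portable code, one way to ask for a single untorn 64-bit access is a C11 atomic object. The sketch below assumes a C11 compiler; shared_x, publish and observe are hypothetical names. How the compiler implements it (an FPU/SIMD 8-byte move, lock cmpxchg8b, or a plain mov on 64-bit targets) is its own choice and is not checked here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_x;   /* hypothetical shared variable */

/* Store all 64 bits as one atomic access; no ordering beyond atomicity is requested. */
static void publish(uint64_t v)
{
    /* The hardware guarantee discussed above only covers 8-byte-aligned accesses. */
    assert(((uintptr_t)(void *)&shared_x % 8) == 0);
    atomic_store_explicit(&shared_x, v, memory_order_relaxed);
}

/* Load all 64 bits as one atomic access. */
static uint64_t observe(void)
{
    return atomic_load_explicit(&shared_x, memory_order_relaxed);
}
```

A reader calling observe() concurrently with publish() sees either the old or the new value, never a mix of halves.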
If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?
By delaying response to MESI requests to share or invalidate a line when it's in the middle of doing a pair of loads or pair of stores done with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation / low overall throughput progress is a problem.
A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.
But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)
Further reading:
Atomicity on x86 - how exactly a single load or store can be atomic
Why is integer assignment on a naturally aligned variable atomic on x86?
Can num++ be atomic for 'int num'? - how atomic RMWs interact with MESI. (For x86-style single instructions like lock add [mem], eax. LL/SC machines just detect that they lost control of the cache line in there somewhere and report failure.)

Why doesn't copy_user_enhanced_fast_string use AVX if it is available?

While analyzing the profiling results of my (I/O-heavy) application, I found copy_user_enhanced_fast_string to be one of the hottest regions. It is called when copying between user and kernel space. The implementation on x86 looks like this:
cmpl $64,%edx
jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
movl %edx,%ecx
1: rep
movsb
xorl %eax,%eax
.section .fixup,"ax"
12: movl %ecx,%edx /* ecx is zerorest also */
jmp .Lcopy_user_handle_tail
_ASM_EXTABLE_UA(1b, 12b)
Why weren't vmovaps/vmovups used for that? Has it been shown that AVX has no performance advantage for copying, even where it is available?
Kernel code can only safely use FPU / SIMD between kernel_fpu_begin() / kernel_fpu_end() to trigger an xsave (and xrstor before returning to user-space). Or xsaveopt or whatever.
That's a lot of overhead, and isn't worth it outside of a few rare cases (like md RAID5 / RAID6 parity creation / use.)
Unfortunately this means only GP-integer registers are available for most kernel code. The difference between an AVX memcpy loop and a rep movsb memcpy is not worth an xsave/xrstor on every system call.
Context switch vs. just entering the kernel background:
In user-space, the kernel handles state save/restore on context switches between user-space tasks. In the kernel, you want to avoid a heavy FPU save/restore every time you enter the kernel (e.g. for a system call) when you're about to return to the same user-space, so you just save the GP-integer regs.
For known-size copies, not having SSE/AVX is not too bad, especially on CPUs with the ERMSB feature (which is when this copy function is used, hence the enhanced_fast_string in the name). For medium to large aligned copies, rep movsb is nearly as fast on Intel CPUs at least, and hopefully also AMD. See Enhanced REP MOVSB for memcpy. Or without ERMSB, at least with rep movsq + cleanup.
In a 64-bit kernel, GP integer regs are half the size of XMM regs. For small copies (below the kernel's 64-byte threshold), 8x GP-integer 8-byte load and 8-byte store should be pretty efficient compared to the overhead of a system call in general. 4x XMM load/store would be nice, but it's a tradeoff against saving FPU state.
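As a user-space illustration of the small-copy case above (not the kernel's actual code), a copy built from 8-byte integer loads and stores might look like the following. The function name copy_small_gp is hypothetical, and the fixed-size memcpy calls are just the portable way to express an unaligned 8-byte load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes using 8-byte integer loads/stores plus a byte-wise tail.
   Compilers typically lower each fixed-size memcpy to a single mov. */
static void copy_small_gp(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n < 64);   /* mirrors the 64-byte threshold mentioned above */
    while (n >= 8) {
        uint64_t word;
        memcpy(&word, s, sizeof word);   /* 8-byte load into a GP register */
        memcpy(d, &word, sizeof word);   /* 8-byte store */
        d += 8;
        s += 8;
        n -= 8;
    }
    while (n > 0) {                      /* remaining 0..7 bytes */
        *d++ = *s++;
        n--;
    }
}
```

No vector registers are involved, so nothing like this would need FPU state to be saved.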
Not having SIMD is significantly worse for strlen/strcpy where pcmpeqb is very good vs. a 4 or 8-byte at a time bithack. And SSE2 is baseline for x86-64, so an x86-64 kernel could depend on that without dynamic dispatch if not for the problem of saving FPU state.
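For reference, the 8-byte-at-a-time bithack mentioned above is usually some form of the classic zero-byte test. This is a sketch with hypothetical names (has_zero_byte, first_zero_byte); it is not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if any of the 8 bytes in v is zero. Subtracting 1 from each byte
   sets its high bit only when the byte was 0x00 or >= 0x81; the & ~v term
   removes the >= 0x81 cases. */
static int has_zero_byte(uint64_t v)
{
    return ((v - ONES) & ~v & HIGHS) != 0;
}

/* Index (0 = least significant byte) of the first zero byte in v. */
static unsigned first_zero_byte(uint64_t v)
{
    assert(has_zero_byte(v));
    unsigned i = 0;
    while ((v & 0xff) != 0) {
        v >>= 8;
        i++;
    }
    return i;
}
```

A strlen built on this tests one 64-bit word per iteration, compared with 16 bytes per pcmpeqb on an XMM register.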
You could in theory eat the SSE/AVX transition penalty and do like some bad Windows drivers and just manually save/restore the low 128 of a vector reg with legacy SSE instructions. (This is why legacy SSE instructions don't zero the upper bytes of the full YMM / ZMM). IDK if anyone's benchmarked doing that for a kernel-mode strcpy or strlen, or memcpy.

Does `xchg` encompass `mfence` assuming no non-temporal instructions?

I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception. Load operations that reference weakly
ordered memory types (such as the WC memory type) may not be serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior the MFENCE
instruction. This serializing operation guarantees that every load and
store instruction that precedes the MFENCE instruction in program
order becomes globally visible before any load or store instruction
that follows the MFENCE instruction. The MFENCE instruction is
ordered with respect to all load and store instructions, other MFENCE
instructions, any LFENCE and SFENCE instructions, and any serializing
instructions (such as the CPUID instruction). MFENCE does not
serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?
Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker than mfence (on Pentium4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
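In C11 terms, the two ways of writing a seq-cst store look like this. It is a sketch with a hypothetical flag variable; which instructions a given compiler emits (xchg, or mov followed by mfence) is its own choice and is not verified here.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag;   /* hypothetical shared flag, starts at 0 */

/* Sequentially consistent store; the compiler picks the instruction sequence. */
static void store_seq_cst(int v)
{
    atomic_store_explicit(&flag, v, memory_order_seq_cst);
}

/* Explicit exchange; on x86 this is typically an xchg with a memory operand.
   Returns the previous value. */
static int exchange_seq_cst(int v)
{
    return atomic_exchange_explicit(&flag, v, memory_order_seq_cst);
}

/* Single-threaded check that both forms leave the expected value behind. */
static void demo(void)
{
    store_seq_cst(1);
    assert(exchange_seq_cst(2) == 1);
    assert(atomic_load(&flag) == 2);
}
```

Looking at the generated assembly for store_seq_cst with your own compiler is the easiest way to see which of the two sequences it prefers.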
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there was some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.

On x86 if [mem] is not 32-bit aligned, can "lock inc [mem]" still work fine?

On x86, if [mem] is 32-bit aligned, the mov operation is guaranteed to be atomic.
If [mem] is not 32-bit aligned, can lock inc [mem] still work fine?
By "work fine" I mean: provide atomicity and never expose a partial value.
The Intel Instruction Set Reference for x86 and x64 mentions nothing about alignment requirements for the INC instruction. All it says in reference to LOCK is:
This instruction can be used with a LOCK prefix
to allow the instruction to be executed atomically.
The LOCK prefix documentation states:
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
The lock prefix will provide atomicity for unaligned memory access. On QPI systems this could be very slow. See this post on Intel website:
How to solve bugs of simultaneously misaligned memory accesses
While the hardware might be fine with unaligned accesses, the code implementation might be relying on stealing the low 2 or 3 bits of the pointer (always zero for 32 or 64 bit aligned pointers respectively).
For example, the (Win32) InterlockedPushSList function doesn't store the low 2 or 3 bits of the pointer, so any attempt to push or pop an unaligned object will not work as intended. It is common for lock-free code to cram extra information into a pointer sized object. Most of the time this is not an issue though.
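A sketch of the bit-stealing idea follows, to show why a misaligned pointer breaks it. The names pack, unpack_ptr and unpack_tag are hypothetical, and this is not how InterlockedPushSList is actually implemented; it only assumes that 4-byte-aligned pointers have two zero low bits.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x3)   /* low 2 bits are free for 4-byte-aligned pointers */

/* Stores a small tag in the low bits of an aligned pointer. */
static uintptr_t pack(void *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);   /* a misaligned pointer would be corrupted */
    assert(tag <= TAG_MASK);
    return (uintptr_t)p | tag;
}

/* Recovers the pointer by clearing the tag bits. */
static void *unpack_ptr(uintptr_t packed)
{
    return (void *)(packed & ~TAG_MASK);
}

/* Recovers the tag. */
static unsigned unpack_tag(uintptr_t packed)
{
    return (unsigned)(packed & TAG_MASK);
}
```

If a pointer with nonzero low bits slips through, unpack_ptr silently returns a different address, which is the failure mode described above.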
Intel's processors have always had excellent misaligned access performance. On Nehalem (Core i7), they went all the way: any misaligned access fully within a cache line has no penalty, and misaligned accesses that cross a cache line boundary have an average 4.5-cycle penalty - very small.

memory barrier and atomic_t on linux

Recently, while reading some Linux kernel-space code, I came across this:
uint64_t used;
uint64_t blocked;
used = atomic64_read(&g_variable->used); //#1
barrier(); //#2
blocked = atomic64_read(&g_variable->blocked); //#3
What are the semantics of this code snippet? Does #2 make sure that #1 executes before #3?
But I am a little bit confused, because:
#A In 64 bit platform, atomic64_read macro is expanded to
used = (&g_variable->used)->counter // where counter is volatile.
On a 32-bit platform, it is converted to use lock cmpxchg8b. I assume these two have the same semantics, and for the 64-bit version, I think it means:
all-or-nothing; we can exclude the cases where the address is unaligned or the word size is larger than the CPU's native word size.
no optimization; the read is forced to go to the memory location.
atomic64_read doesn't have semantics for preserving read ordering!!! see this
#B the barrier macro is defined as
/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")
According to the wiki, this just prevents the gcc compiler from reordering reads and writes.
What confuses me is: how does it prevent reordering by the CPU? In addition, can I think of the barrier macro as a full fence?
32-bit x86 processors don't provide simple atomic read operations for 64-bit types. The only atomic operation on 64-bit types on such CPUs that deals with "normal" registers is LOCK CMPXCHG8B, which is why it is used here. The alternative is to use MOVQ and MMX/XMM registers, but that requires knowledge of the FPU state/registers, and requires that all operations on that value are done with the MMX/XMM instructions.
On 64-bit x86_64 processors, aligned reads of 64-bit types are atomic, and can be done with a MOV instruction, so only a plain read is required --- the use of volatile is just to ensure that the compiler actually does a read, and doesn't cache a previous value.
As for the read ordering, the inline assembler you quote ensures that the compiler emits the instructions in the right order, and this is all that is required on x86/x86_64 CPUs, provided the writes are correctly sequenced. LOCKed writes on x86 have a total ordering; plain MOV writes provide "causal consistency", so if thread A does x=1 then y=2, if thread B reads y==2 then a subsequent read of x will see x==1.
On IA-64, PowerPC, SPARC, and other processors with a more relaxed memory model there may well be more to atomic64_read() and barrier().
x86 CPUs don’t do read-after-read reordering, so it is sufficient to prevent the compiler from doing any reordering. On other platforms such as PowerPC, things will look a lot different.
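For code that has to be correct on those other platforms too, the x=1 / y=2 pattern can be written with C11 release/acquire atomics, which compile to plain loads and stores on x86 and to the needed barriers elsewhere. This is a user-space sketch with hypothetical names, not the kernel's atomic64_read()/barrier() API.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int x, y;   /* hypothetical shared variables, both start at 0 */

/* Thread A: write x, then publish it by writing y with release ordering. */
static void writer(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_store_explicit(&y, 2, memory_order_release);
}

/* Thread B: returns -1 if y has not been published yet, otherwise the value of x. */
static int reader(void)
{
    if (atomic_load_explicit(&y, memory_order_acquire) != 2)
        return -1;
    int seen = atomic_load_explicit(&x, memory_order_relaxed);
    assert(seen == 1);   /* guaranteed by the release/acquire pairing */
    return seen;
}
```

The acquire load of y plays the role of the compiler barrier between the two reads, and additionally constrains the CPU on weakly ordered machines.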
