Does `xchg` encompass `mfence` assuming no non-temporal instructions?

I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception. Load operations that reference weakly
ordered memory types (such as the WC memory type) may not be
serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior the MFENCE
instruction. This serializing operation guarantees that every load and
store instruction that precedes the MFENCE instruction in program
order becomes globally visible before any load or store instruction
that follows the MFENCE instruction. The MFENCE instruction is
ordered with respect to all load and store instructions, other MFENCE
instructions, any LFENCE and SFENCE instructions, and any serializing
instructions (such as the CPUID instruction). MFENCE does not
serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?

Assuming you're not writing a device driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes, xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker than mfence (on Pentium 4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock; it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, it also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
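For example, here's a minimal C++ sketch (the ready flag name is mine, and the exact code-gen depends on the compiler version) of what "use xchg for the seq-cst store" means in practice:
#include <atomic>
std::atomic<int> ready{0};
// A sequentially-consistent store. Compilers typically emit either
// `xchg` with a memory operand (clang, MSVC, newer gcc) or
// `mov` + `mfence` (older gcc); both are full barriers on WB memory.
void publish() {
    ready.store(1, std::memory_order_seq_cst);   // same as: ready = 1;
}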
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.

Related

Memory Protection Keys Memory Reordering

Reading Intel's SDM about Memory Protection Keys (MPK), nothing suggests that the wrpkru instruction is serializing or that it enforces memory ordering implicitly.
First, it would be surprising if it did not enforce some sort of ordering, since one would expect that the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core 2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(Only StoreLoad reordering is allowed to be visible to other threads. Internally, a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load. This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, as for pthread_mutex_lock(); see How does a mutex lock and unlock functions prevents CPU reordering?
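As a sketch of what that looks like for wrpkru specifically (my own code, not glibc's exact implementation, and it assumes GNU-style inline asm), the compile-time barrier is just the "memory" clobber; there's no runtime fence, matching what Linux/glibc do:
#include <stdint.h>
// Hypothetical wrapper: WRPKRU writes EAX into PKRU and requires
// ECX = EDX = 0. The "memory" clobber stops the compiler from moving
// loads/stores across the PKRU write at compile time.
static inline void write_pkru(uint32_t pkru) {
    asm volatile("wrpkru" : /* no outputs */ : "a"(pkru), "c"(0), "d"(0) : "memory");
}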
The earlier part of this answer is about ordering in assembly.

Do locked instructions provide a barrier between weakly-ordered accesses?

On x86, lock-prefixed instructions such as lock cmpxchg provide barrier semantics in addition to their atomic operation: for normal memory access on write-back memory regions, reads and writes are not re-ordered across lock-prefixed instructions, per section 8.2.2 of Volume 3 of the Intel SDM:
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
This section applies only to write-back memory types. In the same list, you find an exception where it notes that weakly ordered stores are not ordered:
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following exceptions:
— streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
— string operations (see Section 8.2.4.1).
Note that there is no exception made for non-temporal instructions in any of the other items in the list, e.g., in the item referring to lock-prefixed instructions.
In various other sections of the guide, it is mentioned that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instruction as an alternative.
All that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) instructions on WB memory? The same question applies again but to any type of access on WC memory.
On all 64-bit AMD processors, MFENCE is a fully serializing instruction and the Lock-prefixed instructions are not. However, both serialize all memory accesses according to the AMD manual V2 7.4.2:
All previous loads and stores complete to memory or I/O space before a
memory access for an I/O, locked or serializing instruction is issued.
All loads and stores associated with the I/O and locked instructions
complete to memory (no buffered stores) before a load or store from a
subsequent instruction is issued.
There are no exceptions or erratum related to the serialization properties of these instructions.
It's clear from the Intel manual and documents that both serialize all stores with no exceptions or related errata. MFENCE also serializes all loads, with one erratum documented for most processors based on the Skylake, Kaby Lake, and Coffee Lake microarchitectures, which states that MOVNTDQA from WC memory may pass earlier MFENCE instructions. In addition, many processors based on the Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, and Silvermont microarchitectures have an erratum that says that MOVNTDQA from WC memory may pass earlier locked instructions. Processors based on the Core, Westmere, Sunny Cove, and Goldmont microarchitectures don't have this erratum.
The quote from Necrolis's answer says that the lock prefix may not serialize load operations that reference weakly ordered memory types on the Pentium 4 processors. My understanding is that this is effectively a bug in the Pentium 4 processors and doesn't apply to any other processors, although it's worth noting that it's not documented in the Pentium 4 specification update documents.
@PeterCordes's experiments show that, on Skylake, locked instructions don't seem to block ALU instructions from executing out of order, while mfence does serialize ALU instructions (potentially behaving like lfence plus the store-buffer flush that a locked instruction also performs). However, I think this is an implementation detail.
Bus locks (via the LOCK opcode prefix) produce a full fence*; however, on WC memory they don't provide the load fence. This is documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.1.2:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for
them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load
operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
*See Intel's 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.2.3.9 for an example
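To make the pattern behind the question concrete, here's a hedged C++ sketch (the buffer / data_ready names are mine): a producer writes with NT stores and then publishes a flag with a locked RMW. Per the quotes above, the locked instruction already serializes the earlier stores on both Intel and AMD, but the sfence is shown as the conservative, explicitly documented way to order the NT stores first:
#include <immintrin.h>
#include <atomic>
int buffer[1024];
std::atomic<int> data_ready{0};
void produce() {
    for (int i = 0; i < 1024; ++i)
        _mm_stream_si32(&buffer[i], i);   // weakly-ordered NT stores
    _mm_sfence();                         // conservative: order NT stores before the flag
    data_ready.exchange(1);               // locked RMW (xchg) publishes the flag
}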

On x86 if [mem] is not 32-bit aligned, can "lock inc [mem]" still work fine?

On x86, if mem is 32-bit aligned, the mov operation is guaranteed to be atomic.
If [mem] is not 32-bit aligned, can lock inc [mem] still work fine?
"Work fine" meaning: provide atomicity and not read a partial value.
The Intel Instruction Set Reference for x86 and x64 mentions nothing about alignment requirements for the INC instruction. All it says in reference to LOCK is:
This instruction can be used with a LOCK prefix
to allow the instruction to be executed atomically.
The LOCK prefix documentation states:
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
The lock prefix will provide atomicity for unaligned memory access. On QPI systems this could be very slow. See this post on the Intel website:
How to solve bugs of simultaneously misaligned memory accesses
http://software.intel.com/en-us/forums/showthread.php?t=75386
While the hardware might be fine with unaligned accesses, the code implementation might be relying on stealing the low 2 or 3 bits of the pointer (always zero for 32 or 64 bit aligned pointers respectively).
For example, the (Win32) InterlockedPushEntrySList function doesn't store the low 2 or 3 bits of the pointer, so any attempt to push or pop an unaligned object will not work as intended. It is common for lock-free code to cram extra information into a pointer-sized object; most of the time this is not an issue, though.
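A minimal sketch of that pointer-tagging idiom (my own example, not the actual SList implementation): it only works because aligned pointers have known-zero low bits.
#include <cassert>
#include <cstdint>
struct Node { Node* next; int value; };
// Stash a 2-bit tag in the low bits of a 4-byte-aligned pointer.
// An unaligned Node would make untag() reconstruct the wrong address.
inline uintptr_t tag_ptr(Node* p, unsigned tag) {
    assert((reinterpret_cast<uintptr_t>(p) & 0x3) == 0);   // alignment assumed
    return reinterpret_cast<uintptr_t>(p) | (tag & 0x3);
}
inline Node*    untag(uintptr_t v)  { return reinterpret_cast<Node*>(v & ~uintptr_t(0x3)); }
inline unsigned tag_of(uintptr_t v) { return unsigned(v & 0x3); }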
Intel's processors have always had excellent misaligned-access performance. On Nehalem (Core i7), they went all the way: any misaligned access fully within a cache line has no penalty, and misaligned accesses that cross a cache-line boundary have an average 4.5-cycle penalty, which is very small.

On a multicore x86, is a LOCK necessary as a prefix to XCHG?

If mem is a shared memory location, do I need:
XCHG EAX,mem
or:
LOCK XCHG EAX,mem
to do the exchange atomically?
Googling this yields both yes and no answers. Does anyone know this definitively?
Intel's documentation seems pretty clear that it is redundant.
IA-32 Intel® Architecture
Software Developer’s Manual
Volume 3A:
System Programming Guide, Part 1
7.1.2.1 says:
The operations on which the processor automatically follows the LOCK semantics are as
follows:
When executing an XCHG instruction that references memory.
Similarly,
Intel® 64 and IA-32 Architectures
Software Developer’s Manual
Volume 2B:
Instruction Set Reference, N-Z
XCHG:
If a memory operand is referenced, the processor’s locking protocol is automatically
implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL.
Note that this doesn't actually mean that the LOCK# signal is asserted whether or not the LOCK prefix is used; 7.1.4 describes how, on later processors, locking semantics are preserved without asserting LOCK# if the memory location is cached. Clever, and definitely over my head.
Since the 386 days, xchg asserts the LOCK# signal whether or not you put the lock prefix on it. Intel's documentation covers this quite clearly in the IA-32 instruction set reference, N-Z.
As per the 80386 instruction set manual, BUS LOCK is asserted for the duration of the exchange, regardless of the presence of the LOCK prefix or of the value of the I/O Privilege Level.
My suggestion: since the documentation states that BUS LOCK is asserted regardless of the presence of the LOCK prefix, plain XCHG EAX, mem is already atomic, and writing LOCK XCHG EAX, mem is safe but redundant. When in doubt, add the LOCK.
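This is also what compilers rely on. A small C++ sketch (the variable name is mine): an atomic exchange typically compiles to a plain xchg with a memory operand, with no explicit lock prefix, because the locking is implicit:
#include <atomic>
std::atomic<int> shared_word{0};
// Typically compiles to `xchg eax, dword ptr [shared_word]` (or similar):
// no explicit LOCK prefix is emitted because xchg with memory locks implicitly.
int swap_in(int new_value) {
    return shared_word.exchange(new_value);
}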

InterlockedExchange and memory alignment

I am confused: Microsoft says memory alignment is required for InterlockedExchange; however, Intel's documentation says that memory alignment is not required for LOCK.
Am I missing something?
Thanks.
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads
InterlockedExchange
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from Intel Software Developer’s Manual;
LOCK instruction
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
•Any boundary for an 8-bit access (locked or otherwise).
•16-bit boundary for locked word accesses.
•32-bit boundary for locked doubleword accesses.
•64-bit boundary for locked quadword accesses.
Once upon a time, Microsoft supported Windows NT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in the spec to ensure that these primitives would be portable to different architectures.
Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment, if the OS has enabled alignment checking then the cmpxchg will raise an alignment check exception (AC) when executed with unaligned operands. Check the docs for the cmpxchg and similar, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.
Hey, I answered a few questions related to this; also keep in mind:
There is NO byte-level InterlockedExchange; there IS a 16-bit short InterlockedExchange, however.
The documentation discrepancy you refer to is probably just an oversight.
If you want to do byte/bit-level atomic access, there ARE plenty of ways to do this with the existing intrinsics: Interlocked[And8|Or8|Xor8].
Any operation where you're doing high-performance locking (using machine code like you discuss) should not operate on unaligned data (a performance anti-pattern).
xchg (an instruction with an implicit LOCK prefix, optimized due to its ability to use a cache lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
I nearly forgot: Intel's TBB defines 8-byte (64-bit) load/store routines without the use of implicit or explicit locking (in some cases):
.code
ALIGN 4
PUBLIC c __TBB_machine_load8
__TBB_machine_Load8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne load_slow ; misaligned: take the slow path (not shown in this excerpt)
; Load within a cache line
sub esp,12
fild qword ptr [ecx]
fistp qword ptr [esp]
mov eax,[esp]
mov edx,4[esp]
add esp,12
ret
EXTRN __TBB_machine_store8_slow:PROC
.code
ALIGN 4
PUBLIC c __TBB_machine_store8
__TBB_machine_Store8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
fild qword ptr 8[esp]
fistp qword ptr [ecx]
ret
end
Anyhow, hope that clears at least some of this up for you.
I don't understand where your Intel information is coming from.
To me, it's pretty clear that Intel cares A LOT about alignment and about spanning cache lines.
For example, on a Core i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3A (System Programming Guide), for x86/x64, Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
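In practice, the way to stay inside these guarantees from C or C++ is simply to keep the target naturally aligned so it can never straddle a cache line or page boundary; a small sketch (struct and field names are mine):
#include <cstdint>
// Naturally aligned fields can't split a cache line, so plain loads/stores
// (and any Interlocked / lock-prefixed RMW on them) stay within the
// guarantees quoted above.
struct Shared {
    alignas(4) volatile int32_t flag;      // 32-bit aligned, as MSDN requires
    alignas(8) volatile int64_t counter;   // 64-bit aligned quadword
};
static_assert(alignof(Shared) >= 8, "keep interlocked targets naturally aligned");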
