I'm using a memory-mapped file and I need to do an atomic store in Go. I would use StoreUint64() if I were working on regularly allocated memory, but I'm not sure how atomic operations behave on memory-mapped files.
Is it safe to use StoreUint64() on memory-mapped files?
It's safe. For example, on amd64, it uses the XCHG instruction.
func StoreUint64
func StoreUint64(addr *uint64, val uint64)
StoreUint64 atomically stores val into *addr.
src/sync/atomic/asm_amd64.s:
TEXT ·StoreUint64(SB),NOSPLIT,$0-16
MOVQ addr+0(FP), BP  // load the destination address
MOVQ val+8(FP), AX   // load the value to store
XCHGQ AX, 0(BP)      // exchange is implicitly LOCKed: an atomic store plus a full barrier
RET
Intel 64 and IA-32 Architectures Software Developer's Manual
XCHG—Exchange Register/Memory with Register
Description
Exchanges the contents of the destination (first) and
source (second) operands. The operands can be two general-purpose
registers or a register and a memory location. If a memory operand is
referenced, the processor’s locking protocol is automatically
implemented for the duration of the exchange operation, regardless of
the presence or absence of the LOCK prefix or of the value of the
IOPL.
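To make that concrete, here is a minimal sketch of the scenario in the question, assuming linux/amd64; the file name and 8-byte size are made up for illustration. A page-aligned mapping returned by syscall.Mmap is always 8-byte aligned, which is all StoreUint64 needs.
package main

import (
	"os"
	"sync/atomic"
	"syscall"
	"unsafe"
)

func main() {
	// Back the mapping with a small file (illustrative name and size).
	f, err := os.OpenFile("shared.bin", os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := f.Truncate(8); err != nil {
		panic(err)
	}

	mem, err := syscall.Mmap(int(f.Fd()), 0, 8,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	// Treat the first 8 bytes of the mapping as a uint64 and store atomically.
	p := (*uint64)(unsafe.Pointer(&mem[0]))
	atomic.StoreUint64(p, 42)
}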
Related
I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception. Load operations that reference weakly
ordered memory types (such as the WC memory type) may not be
serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior the MFENCE
instruction. This serializing operation guarantees that every load and
store instruction that precedes the MFENCE instruction in program
order becomes globally visible before any load or store instruction
that follows the MFENCE instruction. The MFENCE instruction is
ordered with respect to all load and store instructions, other MFENCE
instructions, any LFENCE and SFENCE instructions, and any serializing
instructions (such as the CPUID instruction). MFENCE does not
serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?
Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker than mfence (on Pentium 4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.
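Tying this back to the Go question at the top: because the store side is a full barrier (XCHGQ on amd64), an atomic store/load pair is enough to publish plain writes. A hedged sketch; the names are illustrative, not from the discussion above.
package main

import (
	"runtime"
	"sync/atomic"
)

var payload [64]byte // plain (non-atomic) data being published
var ready uint64     // publication flag, accessed only via sync/atomic

func publish() {
	copy(payload[:], "hello")
	atomic.StoreUint64(&ready, 1) // seq-cst store: the payload writes become visible first
}

func consume() []byte {
	for atomic.LoadUint64(&ready) == 0 { // plain MOVQ load on amd64; ordering comes from the atomic pair
		runtime.Gosched()
	}
	return payload[:5]
}

func main() {
	go publish()
	_ = consume()
}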
I'm using the sys_brk syscall to dynamically allocate memory on the heap. I noticed that when acquiring the current break location I usually get a value similar to this:
mov rax, 0x0C   ; sys_brk is syscall number 12 on x86-64
mov rdi, 0x00   ; brk(0) only queries the current break
syscall
results in
rax 0x401000
The value is usually 512-byte aligned, so I would like to ask: are there any alignment requirements on the break value, or can we misalign it any way we want?
The kernel does track the break with byte granularity. But don't use it directly for small allocations if you care at all about performance.
There was some discussion in comments about the kernel rounding the break to a page boundary, but that's not the case. The implementation of sys_brk uses this (with my comments added so it makes sense out of context):
newbrk = PAGE_ALIGN(brk); // the syscall arg
oldbrk = PAGE_ALIGN(mm->brk); // the current break
if (oldbrk == newbrk)
goto set_brk; // no need to map / unmap any pages, just update mm->brk
This checks if the break moved to a different page, but eventually mm->brk = brk; sets the current break to the exact arg passed to the system call (if it's valid). If the current break was always page aligned, the kernel wouldn't need PAGE_ALIGN() on it.
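You can observe the byte granularity directly. A hedged demo in Go (for consistency with the question at the top of this page), assuming linux/amd64 where syscall.SYS_BRK is available; don't mix this with a real allocator:
package main

import (
	"fmt"
	"syscall"
)

func main() {
	cur, _, _ := syscall.Syscall(syscall.SYS_BRK, 0, 0, 0)      // brk(0): query the current break
	got, _, _ := syscall.Syscall(syscall.SYS_BRK, cur+10, 0, 0) // move it by 10 bytes
	fmt.Printf("old break %#x, new break %#x\n", cur, got)      // the new break is not aligned at all
}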
Of course, memory protection has at least page granularity (and maybe hugepage, if the kernel chooses to use anonymous hugepages for this mapping). So you can access memory out to the end of the page containing the break without faulting. This is why the kernel code is just checking if the break moved to a different page to skip the map / unmap logic, but still updates the actual brk.
AFAIK, nothing will ever use that mapped memory above the break as scratch space, so it's not like memory below the stack pointer that can be clobbered asynchronously.
brk is just a simple memory-management system built into the kernel. System calls are expensive, so if you care about performance you should keep track of things in user-space and only make a system call at all when you need a new page. Using sys_brk directly for tiny allocations is terrible for performance, especially in kernels with Meltdown + Spectre mitigation enabled (making system calls much more expensive: tens of thousands of clock cycles plus TLB and branch-prediction invalidation, instead of hundreds of clock cycles).
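As a sketch of that user-space bookkeeping (toy code, again in Go for consistency with the rest of this page; assumes linux/amd64, a single thread, and nothing else moving the break):
package main

import "syscall"

const pageSize = 4096

// brkArena hands out bytes from the program break and only issues a
// sys_brk call when the currently mapped region is exhausted.
type brkArena struct {
	next, end uintptr
}

func (a *brkArena) alloc(n uintptr) uintptr {
	if a.next == 0 { // first use: find the current break
		cur, _, _ := syscall.Syscall(syscall.SYS_BRK, 0, 0, 0)
		a.next, a.end = cur, cur
	}
	if a.next+n > a.end { // grow by whole pages; the only place we pay for a syscall
		want := (a.next + n + pageSize - 1) &^ (pageSize - 1)
		got, _, _ := syscall.Syscall(syscall.SYS_BRK, want, 0, 0)
		if got < want {
			return 0 // kernel refused to move the break (e.g. RLIMIT_DATA)
		}
		a.end = got
	}
	p := a.next
	a.next += n
	return p
}

func main() {
	var a brkArena
	_ = a.alloc(10) // only the first call (and page crossings) hit the kernel
	_ = a.alloc(10)
}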
On x86, if mem is 32-bit aligned, the mov operation is guaranteed to be atomic.
If [mem] is not 32-bit aligned, can lock inc [mem] still work fine?
"Work fine" meaning: provide atomicity, so no reader ever observes a partial value.
The Intel Instruction Set Reference for x86 and x64 mentions nothing about alignment requirements for the INC instruction. All it says in reference to LOCK is:
This instruction can be used with a LOCK prefix
to allow the instruction to be executed atomically.
The LOCK prefix documentation states:
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
The lock prefix will provide atomicity for unaligned memory access. On QPI systems this could be very slow. See this post on the Intel website:
How to solve bugs of simultaneously misaligned memory accesses
http://software.intel.com/en-us/forums/showthread.php?t=75386
While the hardware might be fine with unaligned accesses, the code implementation might be relying on stealing the low 2 or 3 bits of the pointer (always zero for 32- or 64-bit aligned pointers, respectively).
For example, the (Win32) InterlockedPushSList function doesn't store the low 2 or 3 bits of the pointer, so any attempt to push or pop an unaligned object will not work as intended. It is common for lock-free code to cram extra information into a pointer-sized object. Most of the time this is not an issue, though.
Intel's processors have always had excellent misaligned access performance. On Nehalem (Core i7), they went all the way: any misaligned access fully within a cache line has no penalty, and misaligned accesses that cross a cache-line boundary have an average 4.5-cycle penalty - very small.
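For the Go flavor of this caveat: sync/atomic (unlike a LOCK-prefixed instruction on x86) does require natural alignment, so 64-bit fields used atomically must be kept 8-byte aligned, notably on 32-bit targets. A hedged sketch; the struct and names are illustrative:
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

type stats struct {
	flag uint32
	_    uint32 // padding keeps hits 8-byte aligned even on 386/ARM
	hits uint64
}

func bump(s *stats) uint64 {
	if uintptr(unsafe.Pointer(&s.hits))%8 != 0 {
		panic("hits must be 8-byte aligned for atomic use")
	}
	return atomic.AddUint64(&s.hits, 1)
}

func main() {
	s := new(stats) // allocated structs start 64-bit aligned, so hits is aligned
	fmt.Println(bump(s))
}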
I am confused: Microsoft says memory alignment is required for InterlockedExchange; however, Intel's documentation says that memory alignment is not required for LOCK.
Am I missing something?
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads
InterlockedExchange
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from the Intel Software Developer's Manual:
LOCK instruction
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
•Any boundary for an 8-bit access (locked or otherwise).
•16-bit boundary for locked word accesses.
•32-bit boundary for locked doubleword accesses.
•64-bit boundary for locked quadword accesses.
Once upon a time, Microsoft supported Windows NT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in their spec to ensure that these primitives would be portable to different architectures.
Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment either, if the OS has enabled alignment checking then cmpxchg will raise an alignment-check exception (#AC) when executed with unaligned operands. Check the docs for cmpxchg and similar instructions, looking at the list of protected-mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.
Hey, I answered a few questions related to this; also keep in mind:
There is NO byte-level InterlockedExchange; there IS a 16-bit (short) InterlockedExchange, however.
The documentation discrepancy you refer to is probably just an oversight.
If you want to do byte/bit-level atomic access, there ARE plenty of ways to do it with the existing intrinsics: Interlocked[And8|Or8|Xor8].
Any operation where you're doing high-performance locking (using the machine code like you discuss) should not be operating unaligned (a performance anti-pattern).
xchg (an instruction with an implicit LOCK prefix, optimized thanks to its ability to take a cache lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
I nearly forgot: Intel's TBB has 8-byte (64-bit) load/store defined without implicit or explicit locking (in some cases):
.code
ALIGN 4
PUBLIC c __TBB_machine_load8
__TBB_machine_load8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne load_slow ; slow path (defined elsewhere in the file) handles misaligned locations
; Load within a cache line
sub esp,12
fild qword ptr [ecx]
fistp qword ptr [esp]
mov eax,[esp]
mov edx,4[esp]
add esp,12
ret
EXTRN __TBB_machine_store8_slow:PROC
.code
ALIGN 4
PUBLIC c __TBB_machine_store8
__TBB_machine_store8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
fild qword ptr 8[esp]
fistp qword ptr [ecx]
ret
end
Anyhow, hope that clears at least some of this up for you.
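On the byte-level point: Go's sync/atomic has no 8-bit operations either, but the same effect can be had with a CAS loop on the aligned 32-bit word containing the byte. A hedged sketch (the function is illustrative and assumes a little-endian target such as amd64):
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// atomicOrByte ORs mask into *p by retrying a compare-and-swap on the
// aligned 32-bit word that contains the byte.
func atomicOrByte(p *byte, mask byte) {
	shift := (uintptr(unsafe.Pointer(p)) & 3) * 8
	word := (*uint32)(unsafe.Pointer(uintptr(unsafe.Pointer(p)) &^ 3))
	for {
		old := atomic.LoadUint32(word)
		if atomic.CompareAndSwapUint32(word, old, old|uint32(mask)<<shift) {
			return
		}
	}
}

func main() {
	var backing uint32 // 4-byte aligned, so the containing word stays in bounds
	flags := (*[4]byte)(unsafe.Pointer(&backing))
	atomicOrByte(&flags[1], 0x80)
	fmt.Printf("%#x\n", backing) // 0x8000 on little-endian
}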
I don't understand where your Intel information is coming from.
To me, it's pretty clear that Intel cares A LOT about alignment and/or spanning cache lines.
For example, on a Core i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3A (System Programming Guide), Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
•Reading or writing a byte
•Reading or writing a word aligned on a 16-bit boundary
•Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
•Reading or writing a quadword aligned on a 64-bit boundary
•16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
•Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
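In other words, natural alignment is enough to stay within one cache line for these sizes: an 8-byte value aligned to 8 bytes can never straddle a 64-byte line. A small hedged check in Go (the 64-byte line size is an assumption that holds on current x86 CPUs):
package main

import (
	"fmt"
	"unsafe"
)

const cacheLine = 64 // assumed line size on current x86 CPUs

// withinOneCacheLine reports whether all 8 bytes of *p fall inside a single cache line.
func withinOneCacheLine(p *uint64) bool {
	a := uintptr(unsafe.Pointer(p))
	return a/cacheLine == (a+7)/cacheLine
}

func main() {
	var x uint64 // Go aligns uint64 values to 8 bytes on amd64
	fmt.Println(withinOneCacheLine(&x)) // true
}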