using tbb atomic operations in Xeon Phi - tbb

I'm using tbb compare_and_swap operation in Xeon Phi in a lock-free algorithm. Since Xeon Phi is an in order machine, it doesn't support sfence operation. So will the atomic operations work correctly on Xeon Phi?

Yes, they definitely work correctly, most of TBB itself is based on the atomic operations. And sfence is not required for atomic operations to work correctly, it's a standalone memory barrier while atomic operations imply memory barriers themselves. TBB doesn't use sfence even on the regular Xeons, it uses mfence for full memory fence instead. For Xeon Phi, it is substituted by no-op atomic operation, e.g. mic_common.h of TBB contains the following definitions:
/** Intel(R) Many Integrated Core Architecture does not support mfence and pause instructions **/
#define __TBB_full_memory_fence() __asm__ __volatile__("lock; addl $0,(%%rsp)":::"memory")
#define __TBB_Pause(x) _mm_delay_32(16*(x))

Related

Does `xchg` encompass `mfence` assuming no non-temporal instructions?

I have already seen this answer and this answer, but neither appears to clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception. Load operations that reference weakly
ordered memory types (such as the WC memory type) may not be
serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior the MFENCE
instruction. This serializing operation guarantees that every load and
store instruction that precedes the MFENCE instruction in program
order becomes globally visible before any load or store instruction
that follows the MFENCE instruction. 1 The MFENCE instruction is
ordered with respect to all load and store instructions, other MFENCE
instructions, any LFENCE and SFENCE instructions, and any serializing
instructions (such as the CPUID instruction). MFENCE does not
serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?
Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker then mfence (on Pentium4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there was some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.

Do locked instructions provide a barrier between weakly-ordered accesses?

On x86, lock-prefixed instructions such as lock cmpxchg provide barrier semantics in addition to their atomic operation: for normal memory access on write-back memory regions, reads and writes are not re-ordered across lock-prefixed instructions, per section 8.2.2 of Volume 3 of the Intel SDM:
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
This section applies only to write-back memory types. In the same list, you find an exception where it notes that weakly ordered stores are not ordered:
Reads are not reordered with other reads.
Writes are not reordered
with older reads.
Writes to memory are not reordered with other
writes, with the following exceptions: —
streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and —
string operations (see Section 8.2.4.1).
Note that there is no exception made for non-temporal instructions in any other items in the list, e.g., in the item referring to lock-prefixed instructions.
In various other sections of the guide, it is mentioned that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instruction as an alternative.
All that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) instructions on WB memory? The same question applies again but to any type of access on WC memory.
On all 64-bit AMD processors, MFENCE is a fully serializing instruction and the Lock-prefixed instructions are not. However, both serialize all memory accesses according to the AMD manual V2 7.4.2:
All previous loads and stores complete to memory or I/O space before a
memory access for an I/O, locked or serializing instruction is issued.
All loads and stores associated with the I/O and locked instructions
complete to memory (no buffered stores) before a load or store from a
subsequent instruction is issued.
There are no exceptions or erratum related to the serialization properties of these instructions.
It's clear from the Intel manual and documents that both serialize all stores with no exceptions or related erratum. MFENCE also serializes all loads, with one errata documented for most processors based on Skylake, Kaby Lake, and Coffee Lake microarchitectures, which states that MOVNTDQA from WC memory may passs earlier MFENCE instructions. In addition, many processors based on the Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, and Silvermont microarchitectures have an errata that says that MOVNTDQA from WC memory may passs earlier locked instructions. Processors based on the Core, Westmere, Sunny Cove, and Goldmont microarchitectures don't have this errata.
The quote from Necrolis's answer says that the lock prefix may not serialize load operations that reference weakly ordered memory types on the Pentium 4 processors. My understanding is that this looks like a bug in the Pentium 4 processors and it doesn't apply to any other processors. Although it's worth noting that it's not documented in the spec update documents of the Pentium 4 processors.
#PeterCordes's experiments show that, on Skylake, locking instructions don't seem to block ALU instructions from being executed out-of-order while mfence does serialize ALU instructions (potentially behaving identically to lfence + a store-buffer flush like a locked instruction). However, I think this is an implementation detail.
Bus locks (via the LOCK opcode prefix) produce a full fence*, however, on WC memory they don't provide the load fence, this is documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.1.2:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for
them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load
operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
*See Intel's 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.2.3.9 for an example

AVX float4/double4 struct

I am looking for a AVX-256/512 code for float4 / double4 struct that overloads the basic operations *,+,/,-,scale by scalar, etc to get a quick performance boost from vector operations in a code written using float4/double4. OpenCL has these data types as intrinsics but c++ code running on the XeonPhi needs new implementations taking advantage of the 512-bit SIMD units.
What you are seeking is Agner Fog's Vector Class Library(VCL). I have used this mostly replace the vector types in OpenCL.
With the VCL float4 is Vec4f and double4 is Vec4d. Like OpenCL you don't need to worry about AVX vs AVX512. If you use Vec8d and compile for AVX it will emulate AVX512 using two AVX registers.
The VCL has all the operations you want such as *,+,/,-,+=,-=,/=,*=, multiply and divide by scalar and many more features.
The main difference with OpenCL and the VCL is that OpenCL basically creates a CPU dispatcher. Whereas with the VCL you have to write a CPU dispatcher yourself (it includes some example code to do this with documentation). The VCL has optimized functions for SSE2 through AVX512 so you can target several different instruction sets. There is even a special version of the VCL for the Knights Corner Xeon Phi.
Another feature from OpenCL that I miss is the syntax for permuting. In OpenCL to reverse the order of the components of float4 you could do v.wzyx whereas with the VCL you would do permute4f<3,2,1,0>(v). I might be possible to create this syntax with C++ but I am not sure.
Using the VCL, OpenMP, and a custom CPU dispatcher I have largely replaced OpenCL on the CPU.

memory barrier and atomic_t on linux

Recently, I am reading some Linux kernel space codes, I see this
uint64_t used;
uint64_t blocked;
used = atomic64_read(&g_variable->used); //#1
barrier(); //#2
blocked = atomic64_read(&g_variable->blocked); //#3
What is the semantics of this code snippet? Does it make sure #1 executes before #3 by #2.
But I am a litter bit confused, becasue
#A In 64 bit platform, atomic64_read macro is expanded to
used = (&g_variable->used)->counter // where counter is volatile.
In 32 bits platform, it was converted to use lock cmpxchg8b. I assume these two have the same semantic, and for 64 bits version, I think it means:
all-or-nothing, we can exclude case where address is unaligned and word size large than CPU's native word size.
no optimization, force CPU read from memory location.
atomic64_read doesn't have semantic for preserve read ordering!!! see this
#B the barrier macro is defined as
/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")
From the wiki this just prevents gcc compiler from reordering read and write.
What i am confused is how does it disable reorder optimization for CPU? In addition, can i think barrier macro is full fence?
32-bit x86 processors don't provide simple atomic read operations for 64-bit types. The only atomic operation on 64-bit types on such CPUs that deals with "normal" registers is LOCK CMPXCHG8B, which is why it is used here. The alternative is to use MOVQ and MMX/XMM registers, but that requires knowledge of the FPU state/registers, and requires that all operations on that value are done with the MMX/XMM instructions.
On 64-bit x86_64 processors, aligned reads of 64-bit types are atomic, and can be done with a MOV instruction, so only a plain read is required --- the use of volatile is just to ensure that the compiler actually does a read, and doesn't cache a previous value.
As for the read ordering, the inline assembler you quote ensures that the compiler emits the instructions in the right order, and this is all that is required on x86/x86_64 CPUs, provided the writes are correctly sequenced. LOCKed writes on x86 have a total ordering; plain MOV writes provide "causal consistency", so if thread A does x=1 then y=2, if thread B reads y==2 then a subsequent read of x will see x==1.
On IA-64, PowerPC, SPARC, and other processors with a more relaxed memory model there may well be more to atomic64_read() and barrier().
x86 CPUs don’t do read-after-read reordering, so it is sufficient to prevent the compiler from doing any reordering. On other platforms such as PowerPC, things will look a lot different.

InterlockedExchange and memory alignment

I am confused that Microsoft says memory alignment is required for InterlockedExchange however, Intel documentation says that memory alignment is not required for LOCK.
Am i missing something, or whatever?
thanks
from Microsoft MSDN Library
Platform SDK: DLLs, Processes, and Threads
InterlockedExchange
The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
from Intel Software Developer’s Manual;
LOCK instruction
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families
Locked instructions have a total order.
Software Controlled Bus Locking
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:
•Any boundary for an 8-bit access (locked or otherwise).
•16-bit boundary for locked word accesses.
•32-bit boundary for locked doubleword accesses.
•64-bit boundary for locked quadword accesses.
Once upon a time, Microsoft supported WindowsNT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in their spec to ensure that these primitives would be portable to different architectures.
Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment, if the OS has enabled alignment checking then the cmpxchg will raise an alignment check exception (AC) when executed with unaligned operands. Check the docs for the cmpxchg and similar, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.
Hey, I answered a few questions related to this, also keep in mind;
There is NO byte level InterlockedExchange there IS a 16 bit short InterlockedExchange however.
The documentation discrepency you refer, is probably just some documentation oversight.
If you want todo Byte/Bit level atomic access, there ARE pleanty of ways todo this with the existing intrinsics, Interlocked[And8|Or8|Xor8]
Any operation where your doing high-perf locking (using the machiene code like you discuss), should not be operating un-aligned (performance anti-pattern)
xchg (optimized instruction with implicit LOCK prefix, optimized due to ability to cache lock and avoid a full bus lock to main memory). CAN do 8bit interlocked operations.
I nearly forgot, from Intel's TBB, they have Load/Store 8bit's defined w/o the use of implicit or explicit locking (in some cases);
.code
ALIGN 4
PUBLIC c __TBB_machine_load8
__TBB_machine_Load8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne load_slow
; Load within a cache line
sub esp,12
fild qword ptr [ecx]
fistp qword ptr [esp]
mov eax,[esp]
mov edx,4[esp]
add esp,12
ret
EXTRN __TBB_machine_store8_slow:PROC
.code
ALIGN 4
PUBLIC c __TBB_machine_store8
__TBB_machine_Store8:
; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
mov ecx,4[esp]
test ecx,7
jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
fild qword ptr 8[esp]
fistp qword ptr [ecx]
ret
end
Anyhow, hope that clears at leat some of this up for you.
I don't understand where your Intel information is coming from.
To me, its pretty clear that Intel cares A LOT about alignment and/or spanning cache-lines.
For example, on a Core-i7 processor, you STILL have to make sure your data doesn't not span over cache-lines, or else the operation is NOT guaranteed to be atomic.
On Volume 3-I, System Programming, For x86/x64 Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
Reading or writing a quadword aligned on a 64-bit boundary
16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Accesses to cacheable memory that are split across cache lines and page boundaries
are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core
Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors.
The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon,
and P6 family processors provide bus control signals that permit external memory
subsystems to make split accesses atomic; however, nonaligned data accesses will
seriously impact the performance of the processor and should be avoided.

Resources