Do locked instructions provide a barrier between weakly-ordered accesses?

On x86, lock-prefixed instructions such as lock cmpxchg provide barrier semantics in addition to their atomic operation: for normal memory access on write-back memory regions, reads and writes are not re-ordered across lock-prefixed instructions, per section 8.2.2 of Volume 3 of the Intel SDM:
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
This section applies only to the write-back memory type. In the same list, one item carves out an exception noting that weakly ordered stores are not ordered with other writes:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with the following exceptions:
— streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
— string operations (see Section 8.2.4.1).
Note that no exception is made for non-temporal instructions in any of the other items in the list, e.g. in the item referring to lock-prefixed instructions.
In various other sections of the manual, it is mentioned that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instructions as an alternative.
All of that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) accesses on WB memory? The same question applies to accesses of any type on WC memory.
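To make the scenario concrete, here is a minimal sketch (C++ with SSE intrinsics; the buffer name and the choice of a locked RMW to publish the flag are mine, purely for illustration) of the pattern being asked about: weakly ordered non-temporal stores followed by a lock-prefixed instruction, with the open question being whether the commented-out sfence is still required first.

```cpp
#include <immintrin.h>
#include <atomic>

alignas(16) float buffer[1024];   // ordinary WB memory
std::atomic<int> ready{0};

void produce(const float* src) {
    for (int i = 0; i < 1024; i += 4)
        _mm_stream_ps(&buffer[i], _mm_loadu_ps(src + i));  // weakly ordered NT stores
    // _mm_sfence();   // needed here, or is the locked RMW below a sufficient barrier?
    ready.exchange(1); // compiles to an (implicitly locked) xchg on x86
}
```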

On all 64-bit AMD processors, MFENCE is a fully serializing instruction and lock-prefixed instructions are not. However, both serialize all memory accesses according to the AMD manual, Volume 2, Section 7.4.2:
All previous loads and stores complete to memory or I/O space before a memory access for an I/O, locked or serializing instruction is issued.
All loads and stores associated with the I/O and locked instructions complete to memory (no buffered stores) before a load or store from a subsequent instruction is issued.
There are no exceptions or errata related to the serialization properties of these instructions.
It's clear from the Intel manual and documents that both serialize all stores with no exceptions or related errata. MFENCE also serializes all loads, with one erratum documented for most processors based on the Skylake, Kaby Lake, and Coffee Lake microarchitectures, which states that MOVNTDQA from WC memory may pass earlier MFENCE instructions. In addition, many processors based on the Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, and Silvermont microarchitectures have an erratum that says that MOVNTDQA from WC memory may pass earlier locked instructions. Processors based on the Core, Westmere, Sunny Cove, and Goldmont microarchitectures don't have this erratum.
The quote from Necrolis's answer says that the lock prefix may not serialize load operations that reference weakly ordered memory types on the Pentium 4 processors. My understanding is that this is a bug in the Pentium 4 processors and doesn't apply to any other processors, although it's worth noting that it's not documented in the specification update documents for the Pentium 4 processors.
@PeterCordes's experiments show that, on Skylake, locked instructions don't seem to block ALU instructions from being executed out of order, while mfence does serialize ALU instructions (potentially behaving like lfence plus the store-buffer flush of a locked instruction). However, I think this is an implementation detail.

Bus locks (via the LOCK opcode prefix) produce a full fence*; however, on WC memory they don't provide the load fence. This is documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, Section 8.1.2:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
*See Intel's 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, 8.2.3.9 for an example

Related

Does `xchg` encompass `mfence` assuming no non-temporal instructions?

I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.
The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
The mfence documentation claims the following.
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?
Assuming you're not writing a device-driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes xchg is as strong as mfence.
NT stores are fine.
I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).
From your quote:
(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
That's the one thing that makes xchg [mem] weaker than mfence (on Pentium 4? Probably also on Sandybridge-family).
mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (Are loads and stores the only instructions that gets reordered?, and also the answer you linked on Does lock xchg have the same behavior as mfence?)
NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).
It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.
I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.
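To illustrate the seq-cst-store point above, here is a minimal C++ sketch (the variable names are mine, purely for illustration). On x86-64, mainstream compilers emit an implicitly locked xchg for the seq_cst store rather than mov + mfence (GCC historically used mov + mfence, as noted above), and a plain mov for the seq_cst load, since x86 loads already have acquire semantics.

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void publish(int value) {
    data.store(value, std::memory_order_relaxed); // plain mov store
    flag.store(1, std::memory_order_seq_cst);     // typically xchg on x86-64, not mov + mfence
}

int consume() {
    if (flag.load(std::memory_order_seq_cst))     // plain mov load
        return data.load(std::memory_order_relaxed);
    return -1;
}
```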

Memory Protection Keys Memory Reordering

Reading Intel's SDM about Memory Protection Keys (MPK) doesn't suggest that the wrpkru instruction is serializing or that it enforces memory ordering implicitly.
First, it would be surprising if it did not enforce some sort of ordering, as one would suspect the programmer doesn't want memory accesses around a wrpkru to be executed out of order.
Second, does that mean wrpkru needs to be surrounded by lfence?
Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?
I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.
Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.
It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)
All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.
e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't: Is a mov to a segmentation register slower than a mov to a general purpose register?.
I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?
(Only StoreLoad reordering is allowed to be visible to other threads. Internally, a CPU can execute loads earlier than they're "supposed to", but verify that the cache line wasn't invalidated before the point where it was architecturally allowed to load. This is what the Memory Order Buffer does.)
In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(). How does a mutex lock and unlock functions prevents CPU reordering?.
The earlier part of this answer is about ordering in assembly.
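As a hedged sketch of that compile-time-barrier point (this is not glibc's actual wrapper, and the "memory" clobber is just one way to get the compiler barrier), a wrpkru wrapper might look like this:

```cpp
#include <cstdint>

// WRPKRU takes the new PKRU value in EAX and requires ECX = 0 and EDX = 0.
static inline void wrpkru(uint32_t pkru) {
    asm volatile("wrpkru"
                 :                        // no outputs
                 : "a"(pkru), "c"(0), "d"(0)
                 : "memory");             // stop the compiler moving memory accesses across it
}
```

The CPU itself already performs accesses before the wrpkru with the old PKRU and accesses after it with the new PKRU, as described above; the clobber only constrains the compiler.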

What will be used for data exchange between threads executing on one core with HT?

Hyper-Threading Technology is a form of simultaneous multithreading technology introduced by Intel.
These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a stalled logical processor to borrow resources from the other one.
In an Intel CPU with Hyper-Threading, one CPU core (with several ALUs) can execute instructions from two threads in the same clock cycle, and the two threads share the store buffer, the L1/L2 caches, and the system bus.
But if two threads execute simultaneously on one core, and thread 1 stores an atomic value while thread 2 loads this value, what will be used for the exchange: the shared store buffer, the shared L1/L2 cache, or the L3 cache as usual?
What happens if both threads are from one and the same process (the same virtual address space), and what if they are from two different processes (different virtual address spaces)?
Sandy Bridge Intel CPU, L1 cache:
32 KB - cache size
64 B - cache line size
512 - lines (512 = 32 KB / 64 B)
8-way set associative
64 - number of sets (64 = 512 lines / 8 ways)
bits [11:6] of the virtual address (the index) select the current set
4 KB - addresses that are the same modulo 4 KB compete for the same set (32 KB / 8-way)
low 12 bits - significant for determining the current set number
4 KB - standard page size
low 12 bits - the same in virtual and physical addresses for each address
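As a small illustration of that index arithmetic (the constants assume the 32 KB, 8-way, 64 B-line L1d listed above), bits [11:6] of an address select one of the 64 sets, so two addresses 4 KB apart land in the same set:

```cpp
#include <cstdint>
#include <cstdio>

constexpr unsigned LINE_BITS = 6;   // 64-byte lines
constexpr unsigned NUM_SETS  = 64;  // 32 KB / 64 B / 8 ways

unsigned l1_set(uintptr_t addr) {
    return (addr >> LINE_BITS) & (NUM_SETS - 1);  // bits [11:6] of the address
}

int main() {
    std::printf("%u\n", l1_set(0x7ffd1234));          // prints 8
    std::printf("%u\n", l1_set(0x7ffd1234 + 4096));   // prints 8 again: 4 KB apart, same set
}
```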
I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)
Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.
Update: See this Q&A for some experimental testing of throughput and latency.
A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.
Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. it retires. But the store buffer also de-couples retirement from the ROB (the ReOrder Buffer in the out-of-order core) vs. commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.
Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
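Here is a compact sketch in the spirit of that test (assuming Linux and x86-64; the choice of logical CPUs 0 and 4 as hyperthread siblings is an assumption — check your topology with lscpu or /proc/cpuinfo). Each thread does a relaxed store then a relaxed load of the other thread's variable; an iteration where both loads read 0 is StoreLoad reordering caught in the act.

```cpp
#include <atomic>
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <cstdio>

std::atomic<int> X{0}, Y{0};
int r1, r2;
sem_t begin1, begin2, done;

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* worker1(void*) {
    pin(0);                                      // assumed HT sibling of CPU 4
    for (;;) {
        sem_wait(&begin1);
        X.store(1, std::memory_order_relaxed);
        r1 = Y.load(std::memory_order_relaxed);  // CPU may perform this load before the store commits
        sem_post(&done);
    }
}

static void* worker2(void*) {
    pin(4);
    for (;;) {
        sem_wait(&begin2);
        Y.store(1, std::memory_order_relaxed);
        r2 = X.load(std::memory_order_relaxed);
        sem_post(&done);
    }
}

int main() {
    sem_init(&begin1, 0, 0);
    sem_init(&begin2, 0, 0);
    sem_init(&done, 0, 0);
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker1, nullptr);
    pthread_create(&t2, nullptr, worker2, nullptr);
    for (int i = 0; i < 1000000; ++i) {
        X.store(0, std::memory_order_relaxed);
        Y.store(0, std::memory_order_relaxed);
        sem_post(&begin1);
        sem_post(&begin2);
        sem_wait(&done);
        sem_wait(&done);
        if (r1 == 0 && r2 == 0)
            std::printf("StoreLoad reordering observed at iteration %d\n", i);
    }
}
```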
An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)
Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.
I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
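A rough way to check that (again assuming Linux, with logical CPUs 0 and 4 as hypothetical siblings of one physical core) is to time two threads doing locked increments of one shared counter, then repeat with the threads pinned to different physical cores and compare:

```cpp
#include <atomic>
#include <thread>
#include <pthread.h>
#include <sched.h>
#include <cstdio>

std::atomic<long> counter{0};

static void hammer(int cpu, long iters) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    for (long i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // lock add [counter], 1
}

int main() {
    const long iters = 100000000;
    std::thread a(hammer, 0, iters);   // assumed siblings; compare against e.g. CPUs 0 and 1
    std::thread b(hammer, 4, iters);
    a.join();
    b.join();
    std::printf("%ld\n", counter.load());
}
```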
MOVNT (non-temporal) weakly-ordered stores bypass the cache and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill buffer. Then it's probably the same as a movnt store followed by a load inside a single thread (i.e. a round-trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.
Since the L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's two threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively shared by both hyperthreads without risk of false-positive cache hits, unlike the dTLB, which needs to tag its entries with a core ID.
A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.
Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.
There is a "shared mode":
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst microarchitecture that support Intel Hyper-Threading Technology.
It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.
The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.
It would be possible in theory to build an OoO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leave the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.
Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.
Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.
Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?.
As @BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because this would let the SMT sibling(s) see your store before it becomes globally visible to other cores.
TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.

With Hyper-Threading, through which level of cache (L1/L2/L3) do the threads of one physical core exchange data?

Does Hyper-Threading allow use of the L1 cache to exchange data between two threads that execute simultaneously on a single physical core, but in two virtual cores?
With the proviso that both belong to the same process, i.e. the same address space.
Page 85 (2-55) - Intel® 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
2.5.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
...
Deeper buffering and enhanced resource sharing/partition policies:
Replicated resource for HT operation: register state, renamed return stack buffer, large-page ITLB.
Partitioned resources for HT operation: load buffers, store buffers, re-order buffers, small-page ITLB are statically allocated between two logical processors.
Competitively-shared resource during HT operation: the reservation station, cache hierarchy, fill buffers, both DTLB0 and STLB.
Alternating during HT operation: front end operation generally alternates between two logical processors to ensure fairness.
HT unaware resources: execution units.
The Intel Architecture Software Optimization manual has a brief description of how processor resources are shared between HT threads on a core, in chapter 2.3.9. It is documented for the Nehalem architecture, which is getting stale but fairly likely to still be relevant for current ones, since the partitioning is logically consistent:
Duplicated for each HT thread: the registers, the return stack buffer, the large-page ITLB
Statically allocated for each HT thread: the load, store and re-order buffers, the small-page ITLB
Competitively shared between HT threads: the reservation station, the caches, the fill buffers, DTLB0 and STLB.
Your question matches the third bullet. In the very specific case of both HT threads executing code from the same process (a bit of an accident), you can generally expect L1 and L2 to contain data retrieved by one HT thread that can be useful to the other. Keep in mind that the unit of storage in the caches is a cache line, 64 bytes. Just in case: this is not otherwise a good reason to pursue a thread-scheduling approach that favors getting two HT threads to execute on the same core, assuming your OS supports that. An HT thread generally runs quite a bit slower than a thread that gets the core to itself; 30% is the usual number bandied about, YMMV.

Is it necessary for the programmer to flush write-combining memory explicitly?

I know that write-combining writes are buffered in the WC buffers and don't reach memory directly.
But is it necessary for the programmer to flush this memory explicitly before others can access it?
I got this question from graphics driver code. For example, the CPU fills a vertex buffer (mapped as WC), but before the GPU accesses it, I don't see any flush operation in the code.
Does the architecture (x86) already take care of this for us? Is there any more detailed documentation about this?
According to Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 (August 2012 version, but this should not have changed), Section 11.3.1, the buffer must be flushed:
The protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for system memory coherency. When using the WC memory type, software must be sensitive to the fact that the writing of data to system memory is being delayed and must deliberately empty the WC buffers when system memory coherency is required.
If the graphics drivers did not actually flush the write combining buffers, then they were depending on system specific timing and/or buffer sizes (while assuming that subsequent WC writes will be allocated to the buffer, this is not architecturally guaranteed). This may work (or appear to work) on existing systems under ordinary workloads, but it is not architecturally guaranteed to work.
Since a broad range of serializing events will flush the write combining buffers, it is quite possible that the flush operation/event is present but not obvious (as an SFENCE would be). From Intel® 64 and IA-32 Architectures Software Developer’s Manual (version 052, September 2014), Volume 3, Section 11.3 Methods of Caching Available:
If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.
For example, a write to a GPU register (if mapped to uncached memory) would flush the write combining buffer.
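As an illustrative sketch of the explicit-flush case (the buffer and doorbell pointers are hypothetical placeholders, not taken from any real driver), the producer side might look like this: fill the WC mapping with streaming stores, then drain the WC buffers before telling the device the data is ready.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Assumes wc_buf points into a WC mapping, is 16-byte aligned, and n_floats is a multiple of 4.
void fill_vertex_buffer(float* wc_buf, const float* src, size_t n_floats,
                        volatile uint32_t* doorbell /* assumed UC-mapped MMIO register */) {
    for (size_t i = 0; i < n_floats; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_stream_ps(wc_buf + i, v);  // non-temporal store into the WC mapping
    }
    _mm_sfence();                      // explicitly empty the WC buffers before the handoff
    *doorbell = 1;                     // the UC write is itself a serializing event per 11.3,
                                       // but the sfence makes the ordering explicit
}
```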

Resources