If write to the remote memory over PCIe which marked as WC(Write Combined), then do we have any consistency automatically? - multithreading

As we know on x86 architecture the acquire-release consistency provided automatically - i.e. all operations automatically ordered without any fences, exclude first store and next load operations from different locations. (As said Herb Sutter on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c )
And as we know when we write to the remote WC-marked memory over FSB, then CPU uses temporary buffer with size 64 bytes - WCB (Write Combined Buffer)/BIU (Bus Interface Unit). And "When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed." i.e. we have not the automatically acquire-release consistency - qutoted from If we marked memory as WC(Write Combined), then do we have any consistency automatically?
See "WCB FSB Transactions" on page 1080 for more information.
But what will happen if we write to the remote WC-marked memory over PCI Express, will we have the automatically acquire-release consistency, when we use MOV or SSE?

There is no such thing as reordering across different contexts since there's no original order across such writes (aside from anything explicitly maintained by synchronization methods).
In other words - if core1 and core2 each write a line, these lines can be observed in any order without braking consistency. The prohibition is on different cores observing different orders for these two lines (i.e. core3 sees the line from core1 first, and core4 sees core2 first). Even that is limited to other cores, cores1 and 2 may each see its own write ahead of the global order (this is a relaxation that x86 does compared to sequential consistency, to allow intra-core forwarding).
What can be potentially reordered are stores within a given program context. Here the order does matter of course, so a program doing -
thread 0 | thread 1
store [x] <-- 1 | load [y]
store [y] <-- 1 | load [x]
Under the normal x86 memory model (considered to be TSO-like) must preserve that a result of x==0 and y==1 is impossible (assume both were initially zero), since that implies that the stores were reordered. To avoid that, stores will be dispatched in the order maintained by the core's internal queues - even though the execution is done out-of-order, the store may only be seen by the outside world after it is committed (a stage where the reordering buffer restores the original program order). This also guarantees that the store will not be seen if an earlier instruction had an unexpected exception or a branch misprediction.
On the other hand, write-combining allows a more lenient memory ordering model, so stores may be combined and committed whenever the write-combining buffer has the full line. This reduced the bandwidth but allows stores to reorder, for e.g.
store [x] <-- ..
store [z] <-- ..
store [x+8] <-- ..
store [x+16] <-- ..
...
the 2nd store may be reordered ahead of the 1st, since the 1st will wait for the write-combining buffer to fill up. Once the buffer is full (although there's no enforced limit to that), the line is sent out to memory, regardless of any path it has to travel.
The comment about FSB in that other answer doesn't mean it's specific, it dates back to a Pentium 4 guide, so after passing the last level cache, they just assume you go on the FSB. The terms are different nowadays, but anyway - nobody out there cares about ordering any lines, and as I said - once you're no longer within the core, there's no notion of order, only coherency. They just meant that once the line is out it may be observed, and that's the point where the order breaking becomes visible.

Related

Can one CPU core observe others' modification immediately? [duplicate]

Under the total store order(TSO) memory consistency model, a x86 cpu will have a write buffer to buffer write requests and can serve reordered read requests from the write buffer. And it says that the write requests in the write buffer will exit and be issued toward cache hierarchy in FIFO order, which is the same as program order.
I am curious about:
To serve the write requests issued from the write buffer, does L1 cache controller handle the write requests, finish the cache coherence of the write requests and insert data into L1 cache in the same order as the issue order?
Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.
So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. Cache always maintains coherence (that's the point of MESI), not gets out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.
(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
I think you know some of this, but let me start from the basics.
L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).
Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:
It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.
Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.
All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.
Also note that modern Intel designs (before Skylake-AVX512) implement use a large shared inclusive L3 cache as a backstop for cache-coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what.
Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does).
Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.
Back to the actual question:
Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".
The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.
The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.
The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)
The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.
Yes in a model like x86-TSO stores are likely committed to the L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores if they are all going to the same cache line) to L1 before moving on.1
However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:
And why I am confused about this problem is that cache controller
could handle the requests in a non-blocking way. But, to conform to
the TSO and make sure data globally visible on a multi-core system,
should cache controller follow the store ordering? Because if there
are two variable A and B being updated sequentially on core 1 and core
2 get the updated B from core 1, then core 2 must also can see the
updated A. And to achieve this, I think the private cache hierarchy on
core 1 have to finishes the cache coherence of the variable A and B in
order and make them globally visible. Am I right?
The good news is that even though the store buffer might commit in a ordered way only the oldest store to L1, it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line in the E state in the local core even before the store first in line to commit to L1.
This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is L1 store misses that really matter anyways: stores hits in L1 can commit rapidly, at least 1 per cycle, so committing a bunch of hits doens't help much: but getting MLP on store misses is very important, especially for scattered stores the prefetcher can't deal with.
Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.
Still, nothing is perfect. There is some evidence that ordering concerns causes weird performance hiccups when stores are missing L1, even if they hit in L2.
1 It is certainly possible that it can commit stores out of order if in maintains the illusion of in-order commit, e.g., by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.

Why memory reordering is not a problem on single core/processor machines?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out of order execution, it only does this when it doesn't change the result. If it was allowed to do that, virtually every sequence of instructions would possibly fail. It would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far the CPU can see it is not re-ordered. As far the as the L1, L2, L3 cache and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core. e.g. a single-core machine or green-threads within on process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA so you don't have to actually flush back to DRAM, only make sure its globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 OSes that had cache-control instructions designed in from the start do requires OSes to be aware of it. i.e. to make sure cache is invalidated before reading in new contents from disk, and to make sure it's at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
so what makes sure the program order is preserved?
Hardware dependency tracking; loads snoop the store buffer to look for loads from locations that have recently been stored to. This makes sure loads take data from the last program-order write to any given memory location1.
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

What happens when different CPU cores write to the same RAM address without synchronization?

Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment of time (plus-minus eta), and without using any interlocked instructions or memory barriers. What happens in this case and what value will be written to the main RAM? The first one wins? The last one wins? Undetermined behavior?
x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two difference caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.
The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.
Loads that appear after both the stores in the global order will see the value stored by whichever store went second.
(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do 2nd would be the long-term effect.
We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned so everything in Why is integer assignment on a naturally aligned variable atomic on x86? applies.
Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).
This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned by not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.
If we cared about ordering of multiple changes to two different lines, then memory ordering an memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.
When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only microarchitectural1.
So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.
x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).
Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)
BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
IDK if anything similar happens when two threads are both just storing without loading, but probably not because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retirem from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch)
Footnotes:
Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.
They will wind up being sequenced, likely between the L1 caches. One write will come first and the other will come second. Whichever one comes second will be the result that subsequent reads will see.

Atomic write of nearby one-byte variables

Suppose, on a multiprocessor machine, there are two global variables A and B, each one byte in size, located near each other in memory, and two CPUs executing the following code.
CPU 1:
read A
calculate new value
write A
CPU 2:
read B
calculate new value
write B
Just looking at what would tend to physically happen, we would expect the above would be incorrect without any explicit locking because A and B could be in the same cache line, and CPU 1 needs to read the entire cache line, change the value of a single byte and write the line again; if CPU 2 does its read-modify-write of the cache line in between, the update to B could be lost. (I'm assuming it doesn't matter what order A and B are updated in, I'm only concerned with making sure neither update is lost.)
But x86 guarantees this code is okay. On x86, a write to a single variable only becomes non-atomic if that variable is misaligned or bigger than the CPU word size.
Does an x86 CPU automatically carry out extra locking on the front side bus in order to make such individual variable updates, work correctly without explicit locking?
This code is correct because of cache coherency protocol. When CPU1 modifies cache line, this line became Invalid in the cache of CPU 2, and CPU 2 can't write B and must wait (See https://en.wikipedia.org/wiki/MESIF_protocol for the state machine).
So no updates are lost, and no bus locks required.
The code is correct because the standard provides the following guarantee (1.7.3):
Two or more threads of execution can access separate memory locations without interfering with each other.
It is possible that the variables share the same cache line. That may lead to false sharing, i.e. each core invalidates the cache line upon a write and
other cores that access the same cache line will have to get their data from memory higher up in the chain.
That will slow things down, but from a correctness point of view, false sharing is irrelevant since separate memory locations can still be accessed without synchronization.

If we marked memory as WC(Write Combined), then do we have any consistency automatically?

As we know on x86 architecture the acquire-release consistency provided automatically - i.e. all operations automatically ordered without any fences, exclude first store and next load operations. (As said Herb Sutter on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c )
If we put MFENCE(LFENCE+SFENCE) between them, then store can't be reordered, and load can't be reordered - i.e. we provided sequential consistency.
But if we marked memory as WC(Write Combined), then do we have any consistency automatically without any fences, may be acquire-release?
Or if we use SSE instructions with WC-memory, then we have not any consistency, and if we use simple MOV instructions with WC-memory, then we have acquire-release consistency, isn't it?
As stated here: How MTRR registers implemented?
Stores to WC Memory: The WC memory type is well-suited to an area of memory (e.g., the video frame buffer) that has the following characteristics:
1. The processor does not cache from WC memory.
2. Speculative execution of loads from WC memory is permitted.
3. Stores to WC memory are deposited in the processor's Write Combining Buffers (WCBs).
4. Each WCB can hold one line (64 bytes of data).
5. As stores are performed to a line of WC memory space, the bytes are accumulated in the WCB assigned to record writes to that line of memory space.
6. A subsequent store to a location in a WCB can overwrite a byte that was deposited in that location by an earlier store to that location. In other words, multiple writes to the same location are collapsed so that the location reflects the last data byte written to that location.
7. When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed. The device being written to must tolerate this type of behavior (i.e., it must function correctly). See "WCB FSB Transactions" on page 1080 for more information.
I believe there is no "automatic consistency" for WC memory since the final writes to memory are "not necessarily written to memory in the same order in which the earlier programmatic stores were executed".
This is a bad idea, WC memory is very slow to read (20x slower) and requires using special SSE/AVX2 instructions to speed it up. Using MFENCE is significantly faster.
Coherency is not guaranteed as well.

Resources