Performance Counters for DRAM Accesses - performance-testing

I want to retrieve the number of DRAM accesses in my application. Precisely, I need to distinguish between data and code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on the Intel Software Developer's Manual, Volume 3 and Perf, I could find and categorize the following memory-access-related events:
(A)
LLC-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-stores [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram
mem_load_uops_retired.l3_miss
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response
offcore_response.all_code_rd.l3_miss.local_dram
offcore_response.all_data_rd.l3_miss.any_response
offcore_response.all_data_rd.l3_miss.local_dram
offcore_response.all_reads.l3_miss.any_response
offcore_response.all_reads.l3_miss.local_dram
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response
offcore_response.all_rfo.l3_miss.local_dram
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response
offcore_response.demand_rfo.l3_miss.local_dram
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response
My choices are as follows:
It seems that the sum of LLC-load-misses and LLC-store-misses
will return the whole DRAM accesses (equivalently, I could use
LLC-misses in Perf).
For data-only accesses, I used mem_load_uops_retired.l3_miss.
It does not include stores, but seems to be OK (because stores seem
to be much less frequent?!).
Simplistically, LLC-load-misses - mem_load_uops_retired.l3_miss =
DRAM Accesses for Code (As code is read-only).
Are these choices reasonable?
My other questions: (The 2nd one is the most important)
What are local_dram and any_response?
At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events was twice the number of LLC-load-misses.
Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?
Group (D) includes DRAM access events caused by Read for Ownership operations (for cache coherency protocols). It seems irrelevant to my problem.
Group (F) counts DRAM reads caused by the L2-cache prefetcher, which is also irrelevant to my problem.

Based on my understanding of the question, I recommend using the following two events on the specified processor:
OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM: This includes all cacheable data read and write transactions and all code fetch transactions, whether the transaction is initiated by an instruction (retired or not) or by a prefetch of any type. Each event represents exactly one 64-byte read request to the integrated memory controller (IMC).
OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM: This includes all the code fetch accesses to the IMC.
(I think neither of these events occurs for uncacheable code fetch requests, but I've not tested this and the documentation is not clear on it.)
The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.
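As a minimal sketch of that subtraction, assuming the two raw counts have already been read from the counters (the numbers below are made up for illustration, and the function name is hypothetical):

```python
CACHE_LINE_BYTES = 64  # each counted event is one 64-byte read request


def split_dram_reads(all_reads_local_dram, all_code_rd_local_dram):
    """Split core-originated DRAM reads into data and code parts.

    all_reads_local_dram:   count of OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM
    all_code_rd_local_dram: count of OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM
    """
    data = all_reads_local_dram - all_code_rd_local_dram
    return {
        "data_accesses": data,
        "code_accesses": all_code_rd_local_dram,
        "data_bytes": data * CACHE_LINE_BYTES,
        "code_bytes": all_code_rd_local_dram * CACHE_LINE_BYTES,
    }


# Made-up counts, only to show the arithmetic.
result = split_dram_reads(1_000_000, 50_000)
print(result)
```

The byte totals only cover the transactions these two events count; the other IMC traffic is not included.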
There are of course other transactions that do go to the IMC but are not counted by either of the two mentioned events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from IO devices. Depending on the workload, it's not unlikely that accesses of types (1), (3), and (4) constitute a significant fraction of total accesses to the IMC.
It seems that the sum of LLC-load-misses and LLC-store-misses will
return the whole DRAM accesses (equivalently, I could use LLC-misses
in Perf).
Note the following:
The event LLC-load-misses is a perf event mapped to the native event OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE.
The event LLC-store-misses is mapped to OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.ANY_RESPONSE.
These are not the events you want because:
The ANY_RESPONSE bit indicates that the event can occur for requests that target any unit, not just the IMC.
These events count L1 data prefetches and page walk requests, but not L2 data prefetches. In general, you'd want to count all prefetches that consume memory bandwidth.
For data-only accesses, I used mem_load_uops_retired.l3_miss. It does
not include stores, but seems to be OK (because stores seem to be much
less frequent?!).
There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:
There are cases where this event is unreliable, so it should be avoided if there are alternatives. Otherwise, the analysis methodology should take into account the potential unreliability of this event count.
The event only occurs for requests from retired loads and it omits speculative loads and all stores, which can be significant.
Doing arithmetic with this event and other events in a meaningful way is not easy. For example, your suggestion of doing "LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code" is incorrect.
What are local_dram and any_response?
Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.
At first, it seems that, group (C), is a higher resolution version of
the load events of group (A). But my tests show that the events in the
former group is much more frequent than the latter. For example, in a
simple benchmark, the number of
offcore_response.all_reads.l3_miss.any_response events were twice as
many as LLC-load-misses.
This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.
Group (E), pertains to demand reads (i.e., all non-prefetched reads).
Does this mean that, e.g.:
offcore_response.all_data_rd.l3_miss.any_response -
offcore_response.demand_data_rd.l3_miss.any_response = DRAM read
accesses caused by prefetching?
No, because:
the any_response bit as explained above,
this subtraction results in only the L2 data load prefetches, not all data load hardware and software prefetches.

Related

Is linux perf accurate for measuring cache misses for multithread C program?

Can linux perf measure cache misses for multithread program, or it can only report the result for master thread? I used it on a C program using pthread, it seemed the cache miss number was lower than the expected number.
Yes, perf stat is an accurate total across all threads. (Unless your CPU has an erratum where a certain PMU event over or under-counts. These do happen, more often than correctness bugs for actual architectural state, so check the errata sheet, aka "spec update" for Intel CPUs.)
Make sure you understand exactly what each cache event counts, though, e.g. L1d-misses counts l1d.replacement on a modern Intel like Skylake, so multiple misses on the same line are only one replacement. (How does Linux perf calculate the cache-references and cache-misses events).
Also note that HW prefetch can avoid a lot of misses for sequential access, if memory can keep up. Also related: L2 instruction fetch misses much higher than L1 instruction fetch misses
Also related: Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events goes into some detail about what exactly those specific events count.
Performance Counters for DRAM Accesses
What is the meaning of Perf events: dTLB-loads and dTLB-stores?
Hardware cache events and perf

Can one CPU core observe others' modification immediately?

Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer to buffer write requests, and it can serve reordered read requests from the write buffer. It is said that the write requests in the write buffer exit and are issued toward the cache hierarchy in FIFO order, which is the same as program order.
I am curious about:
To serve the write requests issued from the write buffer, does L1 cache controller handle the write requests, finish the cache coherence of the write requests and insert data into L1 cache in the same order as the issue order?
Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.
So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. Cache always maintains coherence (that's the point of MESI); it doesn't get out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.
(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
I think you know some of this, but let me start from the basics.
L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).
Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:
It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).
It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.
Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.
All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.
The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
See wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.
Also note that modern Intel designs (before Skylake-AVX512) use a large shared inclusive L3 cache as a backstop for cache-coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what).
(Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does.)
Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.
Back to the actual question:
Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
Instead of "finish the cache coherence", you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocol".
The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.
The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.
The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)
The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.
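The in-order commit rule can be sketched with a toy model. This is only an illustration under simplifying assumptions (all stores already retired, ownership given as a plain set of line addresses, hypothetical names throughout), not a description of any real store buffer:

```python
from collections import deque


def commit_order(stores, owned_lines):
    """Return the names of the stores that can commit to L1d right now.

    stores:      list of (name, cache_line) in program order, all retired.
    owned_lines: set of cache lines held in Exclusive/Modified state.

    Only the oldest store may commit. A store whose line is not yet owned
    blocks every younger store behind it, even if their lines are owned.
    """
    buffer = deque(stores)
    committed = []
    while buffer and buffer[0][1] in owned_lines:
        committed.append(buffer.popleft()[0])
    return committed


stores = [("A", 0x100), ("B", 0x140), ("C", 0x100)]

# Line 0x140 is still waiting for its RFO: A commits, B blocks C.
blocked = commit_order(stores, {0x100})

# RFOs for both lines have completed: everything commits in program order.
all_done = commit_order(stores, {0x100, 0x140})

print(blocked, all_done)
```

Requesting ownership of 0x140 early, while A is still at the head, is what lets B commit right after A instead of starting its miss only then.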
Yes, in a model like x86-TSO, stores are likely committed to the L1 in program order, and Peter's answer covers it well. That is, the store buffer is maintained in program order, and the core will commit only the oldest store (or perhaps several consecutive oldest stores if they are all going to the same cache line) to L1 before moving on. [1]
However, you mention in the comments your concern that this might impact performance by essentially making the store buffer commit a blocking (serialized) process:
And why I am confused about this problem is that cache controller
could handle the requests in a non-blocking way. But, to conform to
the TSO and make sure data globally visible on a multi-core system,
should cache controller follow the store ordering? Because if there
are two variable A and B being updated sequentially on core 1 and core
2 get the updated B from core 1, then core 2 must also can see the
updated A. And to achieve this, I think the private cache hierarchy on
core 1 have to finishes the cache coherence of the variable A and B in
order and make them globally visible. Am I right?
The good news is that even though the store buffer might commit to L1 in an ordered way (only the oldest store at a time), it can still get plenty of parallelism with respect to the rest of the memory subsystem by looking ahead in the store buffer and making prefetch RFO requests: trying to get the line in the E state in the local core even before the store is first in line to commit to L1.
This approach doesn't violate ordering, since the stores are still written in program order, but it allows full parallelism when resolving L1 store misses. It is L1 store misses that really matter anyway: store hits in L1 can commit rapidly, at least 1 per cycle, so parallelism among a bunch of hits doesn't help much. But getting MLP (memory-level parallelism) on store misses is very important, especially for scattered stores the prefetcher can't deal with.
Do x86 chips actually use a technique like this? Almost certainly. Most convincingly, tests of a long series of random writes show a much better average latency than the full memory latency, implying MLP significantly better than one. You can also find patents like this one or this one where Intel describes pretty much exactly this method.
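To see why an average latency well below the full memory latency implies MLP greater than one, here is an idealized back-of-the-envelope model. The miss count, latency and MLP values are assumptions chosen for illustration, not measurements:

```python
import math


def total_miss_cycles(num_misses, miss_latency_cycles, mlp):
    """Idealized cycles to resolve store misses with `mlp` RFOs in flight.

    Assumes every miss costs the same latency and that misses are
    resolved in perfectly overlapping batches of size `mlp`.
    """
    return math.ceil(num_misses / mlp) * miss_latency_cycles


# Hypothetical numbers: 100 store misses, 200 cycles each.
serial = total_miss_cycles(100, 200, mlp=1)       # one miss at a time
overlapped = total_miss_cycles(100, 200, mlp=10)  # ten RFOs in flight

print(serial / 100, overlapped / 100)  # average cycles per store
```

In this model the average cost per store falls from the full miss latency to a tenth of it, which is the kind of gap such tests look for.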
Still, nothing is perfect. There is some evidence that ordering concerns cause weird performance hiccups when stores miss in L1, even if they hit in L2.
[1] It is certainly possible that it can commit stores out of order if it maintains the illusion of in-order commit, e.g., by not relinquishing ownership of cache lines written out of order until order is restored, but this is prone to deadlocks and other complicated cases, and I have no evidence that x86 does so.

What happens when different CPU cores write to the same RAM address without synchronization?

Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment of time (plus-minus eta), and without using any interlocked instructions or memory barriers. What happens in this case and what value will be written to the main RAM? The first one wins? The last one wins? Undetermined behavior?
x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two different caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.
The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.
Loads that appear after both the stores in the global order will see the value stored by whichever store went second.
(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do 2nd would be the long-term effect.)
We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned so everything in Why is integer assignment on a naturally aligned variable atomic on x86? applies.
Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).
This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned but not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.
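The rule that only one core can change a line at a time can be illustrated with a toy MESI tracker for a single cache line. This is a simplified sketch: the class and method names are hypothetical, and write-backs, timing and the extra MESIF/MOESI states are left out:

```python
class Line:
    """Toy MESI state tracking for one cache line across several cores."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores  # per-core state: M, E, S or I
        self.value = None               # the single coherent value

    def read(self, core):
        if self.state[core] == "I":
            others = [c for c in range(len(self.state))
                      if c != core and self.state[c] != "I"]
            for c in others:
                self.state[c] = "S"     # M/E holders downgrade to Shared
            self.state[core] = "S" if others else "E"
        return self.value

    def write(self, core, value):
        if self.state[core] not in ("M", "E"):
            # RFO: invalidate every other copy before modifying the line.
            for c in range(len(self.state)):
                if c != core:
                    self.state[c] = "I"
        self.state[core] = "M"
        self.value = value


line = Line(num_cores=2)
line.write(0, 1)   # core 0 gains ownership and stores 1
line.write(1, 2)   # core 1's RFO invalidates core 0's copy, then stores 2
states_after_writes = list(line.state)
seen_by_core0 = line.read(0)  # core 0 re-reads and sees the second store
print(states_after_writes, seen_by_core0, line.state)
```

Whichever write runs second leaves its value in the line, matching the "second store wins" outcome for unsynchronized stores.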
If we cared about ordering of multiple changes to two different lines, then memory ordering and memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.
When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only microarchitectural [1].
So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.
x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).
Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)
BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
IDK if anything similar happens when two threads are both just storing without loading, but probably not because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retire from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch.)
Footnotes:
[1] Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.
They will wind up being sequenced, likely between the L1 caches. One write will come first and the other will come second. Whichever one comes second will be the result that subsequent reads will see.

What will be used for data exchange between threads executing on one core with HT?

Hyper-Threading Technology is a form of simultaneous multithreading
technology introduced by Intel.
These resources include the execution engine, caches, and system bus
interface; the sharing of resources allows two logical processors to
work with each other more efficiently, and allows a stalled logical
processor to borrow resources from the other one.
In an Intel CPU with Hyper-Threading, one CPU core (with several ALUs) can execute instructions from 2 threads in the same clock cycle. Both threads share the store buffer, the L1/L2 caches, and the system bus.
But if two threads execute simultaneously on one core, where thread 1 stores an atomic value and thread 2 loads this value, what will be used for this exchange: the shared store buffer, the shared L1/L2 cache, or, as usual, the L3 cache?
What will happen if both threads are from the same process (the same virtual address space), and what if they are from two different processes (different virtual address spaces)?
Sandy Bridge Intel CPU - cache L1:
32 KB - cache size
64 B - cache line size
512 - lines (512 = 32 KB / 64 B)
8-way
64 - number of sets (64 = 512 lines / 8 ways)
6 bits [11:6] - of the virtual address (the index) select the current set
4 KB - addresses that are equal modulo 4 KB compete for the same set (32 KB / 8 ways)
low 12 bits - significant for determining the current set number
4 KB - standard page size
low 12 bits - the same in virtual and physical addresses for each address
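The arithmetic in the list above can be checked with a short sketch. The function names are hypothetical; the sizes are the ones given in the question:

```python
def l1_geometry(cache_bytes, line_bytes, ways):
    """Derive line count, set count and bit-field widths of a cache."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return lines, sets, offset_bits, index_bits


def set_index(addr, line_bytes=64, sets=64):
    """Set number selected by address bits [11:6] for this geometry."""
    return (addr // line_bytes) % sets


lines, sets, offset_bits, index_bits = l1_geometry(32 * 1024, 64, 8)
print(lines, sets, offset_bits, index_bits)

# Two addresses exactly 4 KB apart land in the same set.
print(set_index(0x1040), set_index(0x2040))
```

Offset bits [5:0] plus index bits [11:6] add up to the 12 low bits, which is exactly the 4 KB page offset.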
I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)
Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.
Update: See this Q&A for some experimental testing of throughput and latency.
A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.
Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. it retires. But the store buffer also de-couples retirement from the ROB (the ReOrder Buffer in the out-of-order core) vs. commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.
Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
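The reordering can be illustrated with a toy simulation of two private store buffers. This is a hedged sketch with hypothetical names: it models a single interleaving of the classic store-then-load litmus test, not real hardware timing:

```python
def run_litmus(drain_before_load):
    """Each thread stores 1 to its own variable, then loads the other's.

    Each thread has a private store buffer. Loads forward only from the
    thread's own buffer, never from the sibling's buffer.
    """
    memory = {"x": 0, "y": 0}
    store_buffer = {0: [], 1: []}

    def store(thread, addr, value):
        store_buffer[thread].append((addr, value))

    def drain(thread):
        # Commit buffered stores to shared memory in FIFO order.
        for addr, value in store_buffer[thread]:
            memory[addr] = value
        store_buffer[thread].clear()

    def load(thread, addr):
        for a, v in reversed(store_buffer[thread]):
            if a == addr:
                return v            # store-to-load forwarding, same thread
        return memory[addr]

    store(0, "x", 1)
    store(1, "y", 1)
    if drain_before_load:           # models a full fence after each store
        drain(0)
        drain(1)
    r0 = load(0, "y")
    r1 = load(1, "x")
    drain(0)
    drain(1)
    return r0, r1


print(run_litmus(drain_before_load=False))  # both loads miss the other store
print(run_litmus(drain_before_load=True))   # fence makes the stores visible
```

Without the drain both threads read the old value, an outcome no interleaving of the four operations in program order could produce.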
An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs do typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)
Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.
I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
movNT (Non-Temporal) weakly-ordered stores bypass the cache, and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill-buffer. Then probably it's the same as a movnt store followed by a load inside a single thread (i.e. a round-trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.
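A small sketch of why the index is unaffected by address translation, assuming 4 KB pages, 64-byte lines and 64 sets. The page-table entry and function names are made up for illustration:

```python
PAGE_BYTES = 4096


def set_index(addr, line_bytes=64, sets=64):
    """Set number selected by address bits [11:6]."""
    return (addr // line_bytes) % sets


def translate(vaddr, page_table):
    """Toy translation: swap the page number, keep the page offset."""
    vpn, offset = divmod(vaddr, PAGE_BYTES)
    return page_table[vpn] * PAGE_BYTES + offset


def index_within_page_offset(line_bytes=64, sets=64):
    """True when all index bits lie below the page-offset boundary."""
    return line_bytes * sets <= PAGE_BYTES


page_table = {0x7F123: 0x00ABC}          # hypothetical mapping
vaddr = 0x7F123 * PAGE_BYTES + 0x9C0
paddr = translate(vaddr, page_table)

print(set_index(vaddr), set_index(paddr), index_within_page_offset())
```

Because 64 sets of 64-byte lines span exactly one 4 KB page, indexing with the virtual address picks the same set the physical address would.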
Since L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's 2 threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively shared by both hyperthreads without risk of false-positive cache hits. This is unlike the dTLB, which needs to tag its entries with a core ID.
A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.
Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.
There is a "shared mode":
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the
logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache
can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this
reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology
It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.
The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.
It would be possible in theory to build an OOO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leave the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.
Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.
Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.
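The forwarding rule described above can be written down as a toy model. This is only a sketch of the hypothetical shared store queue, not of any real Intel design (which, as noted, partitions the queue statically). SharedStoreQueue, StoreEntry and their members are names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One entry in a hypothetical store queue shared by two SMT threads.
struct StoreEntry {
    int thread;            // hardware thread that executed the store (0 or 1)
    std::uint64_t addr;
    std::uint64_t value;
    bool retired;          // false while the store is still speculative
};

class SharedStoreQueue {
public:
    void push(const StoreEntry& e) {
        assert(e.thread == 0 || e.thread == 1);
        entries_.push_back(e);
    }

    // Mark every store from `thread` as retired (no longer speculative).
    void retire_all(int thread) {
        for (StoreEntry& e : entries_) {
            if (e.thread == thread) e.retired = true;
        }
    }

    // Forwarding rule from the text: a load always sees its own thread's
    // stores, but sees the sibling's stores only once they have retired.
    // Scans youngest to oldest; returns false if nothing may be forwarded,
    // in which case the load would read L1d instead.
    bool forward(int loading_thread, std::uint64_t addr,
                 std::uint64_t& value_out) const {
        for (auto it = entries_.rbegin(); it != entries_.rend(); ++it) {
            if (it->addr != addr) continue;
            if (it->thread == loading_thread || it->retired) {
                value_out = it->value;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<StoreEntry> entries_;
};
```

The extra per-entry check on thread and retirement state is the "extra logic" the paragraph refers to: every load has to filter matching entries by owner before it can forward.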
Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. See: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
As @BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because this would let the SMT sibling(s) see your store before it becomes globally visible to other cores.
TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.

If we write to remote memory over PCIe that is marked as WC (Write Combined), do we get any consistency automatically?

As we know, the x86 architecture provides acquire-release consistency automatically, i.e. all operations are automatically ordered without any fences, except for a store followed by a load from a different location. (As Herb Sutter says on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c )
And as we know, when we write to remote WC-marked memory over the FSB, the CPU uses a temporary 64-byte buffer: the WCB (Write Combining Buffer) / BIU (Bus Interface Unit). And "When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed.", i.e. we do not automatically get acquire-release consistency (quoted from: If we marked memory as WC(Write Combined), then do we have any consistency automatically?).
See "WCB FSB Transactions" on page 1080 for more information.
But what happens if we write to remote WC-marked memory over PCI Express: do we automatically get acquire-release consistency when we use MOV or SSE?
There is no such thing as reordering across different contexts since there's no original order across such writes (aside from anything explicitly maintained by synchronization methods).
In other words - if core1 and core2 each write a line, these lines can be observed in any order without breaking consistency. The prohibition is on different cores observing different orders for these two lines (i.e. core3 sees the line from core1 first, and core4 sees the line from core2 first). Even that is limited to other cores: cores 1 and 2 may each see their own write ahead of the global order (this is a relaxation that x86 makes compared to sequential consistency, to allow intra-core forwarding).
What can be potentially reordered are stores within a given program context. Here the order does matter of course, so a program doing -
thread 0 | thread 1
store [x] <-- 1 | load [y]
store [y] <-- 1 | load [x]
The normal x86 memory model (considered to be TSO-like) must guarantee that thread 1 observing y==1 and then x==0 is impossible (assuming both were initially zero), since that would imply that the stores were reordered. To avoid that, stores are dispatched in the order maintained by the core's internal queues: even though execution is out-of-order, a store may only be seen by the outside world after it is committed (a stage where the reorder buffer restores the original program order). This also guarantees that the store will not be seen if an earlier instruction had an unexpected exception or a branch misprediction.
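To see why that outcome implies a store reordering, one can enumerate every interleaving of the two threads by brute force. The sketch below is a toy model, not a hardware test: it assumes each operation takes effect atomically on a single shared memory and that loads stay in program order. The function name outcomes and the flag stores_reordered are made up for illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>

// What thread 1 observed: (value loaded from y, value loaded from x).
using Outcome = std::pair<int, int>;

// Enumerate all interleavings of
//   thread 0: store x <- 1 ; store y <- 1
//   thread 1: load y       ; load x
// If stores_reordered is true, thread 0's stores become visible in the
// opposite order (y first), which is what TSO forbids.
std::set<Outcome> outcomes(bool stores_reordered) {
    std::set<Outcome> seen;
    // A 4-bit mask picks which 2 of the 4 global time slots go to thread 0.
    for (unsigned mask = 0; mask < 16; ++mask) {
        int t0_slots = 0;
        for (int i = 0; i < 4; ++i) t0_slots += static_cast<int>((mask >> i) & 1u);
        if (t0_slots != 2) continue;

        int x = 0, y = 0;                  // both initially zero
        int t0_step = 0, t1_step = 0;
        int loaded_y = -1, loaded_x = -1;
        for (int slot = 0; slot < 4; ++slot) {
            if ((mask >> slot) & 1u) {
                // Thread 0: first visible store is x, or y if reordered.
                bool first = (t0_step++ == 0);
                if (first != stores_reordered) x = 1; else y = 1;
            } else {
                // Thread 1: load y, then load x, in program order.
                if (t1_step++ == 0) loaded_y = y; else loaded_x = x;
            }
        }
        assert(t0_step == 2 && t1_step == 2);
        seen.emplace(loaded_y, loaded_x);
    }
    return seen;
}
```

With in-order stores the set never contains (1, 0); it appears only when stores_reordered is true, which matches the argument in the text.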
On the other hand, write-combining allows a more lenient memory ordering model, so stores may be combined and committed whenever the write-combining buffer has the full line. This reduces the bandwidth consumed but allows stores to reorder, e.g.
store [x] <-- ..
store [z] <-- ..
store [x+8] <-- ..
store [x+16] <-- ..
...
The 2nd store may be reordered ahead of the 1st, since the 1st will wait for the write-combining buffer to fill up. Once the buffer is full (although there's no enforced limit to that), the line is sent out to memory, regardless of any path it has to travel.
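The store sequence above can be replayed against a toy write-combining buffer to make the reordering visible. This is a deliberately simplified sketch: it assumes a single 64-byte buffer, 8-byte stores, and that the store to z lies outside the combined line and becomes visible immediately. The class ToyWriteCombiner and its methods are hypothetical names, not a model of any specific CPU.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

class ToyWriteCombiner {
public:
    static constexpr std::uint64_t kLineBytes = 64;
    static constexpr std::uint64_t kStoreBytes = 8;

    // 8-byte store to WC memory: held in the buffer until the line is full.
    void wc_store(std::uint64_t addr) {
        assert(addr % kStoreBytes == 0);
        const std::uint64_t line = addr / kLineBytes;
        if (has_line_ && line != line_) flush();  // evict a partial line
        has_line_ = true;
        line_ = line;
        filled_mask_ |= 1u << ((addr % kLineBytes) / kStoreBytes);
        if (filled_mask_ == 0xFFu) flush();       // all 8 slots written
    }

    // ASSUMPTION: a store outside the combined line is visible at once.
    void plain_store(const std::string& name) { visible_.push_back(name); }

    // Send the buffered line out; this is when its stores become visible.
    void flush() {
        if (!has_line_) return;
        visible_.push_back("WC line " + std::to_string(line_));
        has_line_ = false;
        filled_mask_ = 0;
    }

    // Order in which stores became visible to the outside world.
    const std::vector<std::string>& visible_order() const { return visible_; }

private:
    bool has_line_ = false;
    std::uint64_t line_ = 0;
    unsigned filled_mask_ = 0;
    std::vector<std::string> visible_;
};
```

Replaying store [x], store [z], store [x+8], ... with x at offset 0 yields "z" before "WC line 0", even though the first store to x was issued earlier in program order.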
The comment about the FSB in that other answer doesn't mean this behavior is FSB-specific. It dates back to a Pentium 4 guide, so after passing the last-level cache it just assumes you go out on the FSB. The terms are different nowadays, but either way nobody outside the core cares about ordering lines, and as I said, once you're no longer within the core there's no notion of order, only coherency. They just meant that once the line is out it may be observed, and that's the point where the order breaking becomes visible.

Resources