I'm reading about the cache control protocol as documented in the Intel Manual Vol. 3 (p11). What's unclear to me is snooping of memory accesses. Here is its description:
Beginning with the P6 family processors, if a processor detects
(through snooping) that another processor is trying to access a memory
location that it has modified in its cache, but has not yet written
back to system memory, the snooping processor will signal the other
processor (by means of the HITM# signal) that the cache line is held
in modified state and will perform an implicit write-back of the
modified data. The implicit write-back is transferred directly to the
initial requesting processor and snooped by the memory controller to
assure that system memory has been updated. Here, the processor with
the valid data may pass the data to the other processors without
actually writing it to system memory; however, it is the
responsibility of the memory controller to snoop this operation and
update memory.
Consider the following cache state (according to MESI):
CPU1 CPU2 CPU3
Line1 M I -
CPU2 writes to Line1.
This is how I see snooping:
CPU2 performs a write to memory that is cached in Line1.
CPU1 snoops the bus and detects the write performed by CPU2.
CPU1 signals the other CPUs with HITM# that Line1 is in the Modified state.
CPU1 performs a write-back of Line1's data, OR
since CPU1 has the valid data at the moment, the data is transferred to both CPU2 and CPU3, moving Line1 into the Shared (S) state.
CPU2 performs the actual write, putting Line1 in its cache into the Modified (M) state and Line1 in CPU1 and CPU3 into the Invalid (I) state.
So by snooping memory accesses, going out to system memory can be avoided while still maintaining cache coherency, even when some CPU writes to a memory location that it has cached in an Invalid line. Is that basically what they mean by snooping?
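To check my understanding, here is a toy model of the end state in C (just my own sketch, not how real hardware is implemented; I've collapsed the read+write into a single RFO-style ownership transfer, which I believe is what happens on a write miss):

/* Toy model of the MESI transitions for "CPU2 writes to Line1".
   Just a sketch of my understanding, not real hardware behaviour. */
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;   /* Modified, Exclusive, Shared, Invalid */
static const char *names[] = { "Modified", "Exclusive", "Shared", "Invalid" };

int main(void) {
    mesi_t line1[3] = { M, I, I };    /* CPU1=M, CPU2=I, CPU3=not present */
    int writer = 1;                   /* CPU2 (0-based) issues the write (RFO) */

    for (int cpu = 0; cpu < 3; cpu++) {
        if (cpu == writer)
            continue;
        if (line1[cpu] == M)          /* snooping CPU asserts HITM# and supplies
                                         the data; the memory controller snoops
                                         the implicit write-back and updates RAM */
            printf("CPU%d: HITM#, supplies modified Line1 data\n", cpu + 1);
        line1[cpu] = I;               /* all other copies are invalidated */
    }
    line1[writer] = M;                /* CPU2 now holds Line1 in Modified */

    for (int cpu = 0; cpu < 3; cpu++)
        printf("CPU%d: Line1 is %s\n", cpu + 1, names[line1[cpu]]);
    return 0;
}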
Related
Can memory address be loaded in a cache but not in the main memory? In other words, if cache wants to write data into the main memory, can that generate page fault in x86-64 with Linux?
No, cache is transparent (and in x86-64, caches are physically addressed). Only load or store instructions (and code-fetch) can page fault, and that happens synchronously (on the offending instruction), not some random time later.
Of course, the actual commit to L1d cache is delayed (by the store buffer) until after retirement from the out-of-order execution back-end. But checking for faults is done in the load/store execution unit (which, for stores, writes the data and address into the store buffer so the store will definitely happen and become visible some time later).
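If you want to see that in practice, here's a minimal Linux/x86-64 sketch (my own example, nothing from the manuals): the SIGSEGV is delivered for the store instruction itself, not at some later write-back time:

/* A store to an unwritable page faults right on the store itself.
   Linux, x86-64; compile with: gcc -O2 fault.c */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    /* si_addr is the address the faulting store tried to touch. */
    fprintf(stderr, "SIGSEGV at %p, reported on the store instruction\n", si->si_addr);
    _exit(0);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* A page we are not allowed to write to. */
    char *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    *p = 42;                  /* the fault is raised here, synchronously */
    puts("never reached");
    return 0;
}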
I understand Port I/O at a hardware abstraction level (i.e. the CPU asserts a pin that indicates to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address bus model), but I'm not really sure how it's implemented microarchitecturally on modern CPUs, and in particular how a Port I/O operation appears on the ring bus.
Firstly, where does the IN/OUT instruction get allocated to: the reservation station or the load/store buffer? My initial thoughts were that it would be allocated in the load/store buffer, and the memory scheduler recognises it and sends it to the L1d, indicating that it is a port-mapped operation. A line fill buffer is allocated and it gets sent to L2 and then to the ring. I'm guessing that the message on the ring has some port-mapped indicator which only the system agent accepts, and that it then checks its internal components and relays the port-mapped request to them; i.e. the PCIe root bridge would pick up CF8h and CFCh. I'm guessing the DMI controller is fixed to pick up all the standardised ports that will appear on the PCH, such as the one for the legacy DMA controller.
Yes, I assume the message over the ring bus has some kind of tag that flags it as being to I/O space, not a physical memory address, and that the system agent sorts this out.
If anyone knows more details, that might be interesting, but this simple mental model is probably fine.
I don't know how port I/O turns into PCIe messages, but I think PCIe devices can have I/O ports in I/O space, not just MMIO.
IN/OUT are pretty close to serializing (but not officially defined using that term for some reason; see How many memory barriers instructions does an x86 CPU have?). They do drain the store buffer before executing, and they are full memory barriers.
the reservation station or the load/store buffer?
Both. For normal loads/stores, the front-end allocates a load buffer entry for a load, or a store buffer entry for a store, and issues the uop into the ROB and RS.
For example, when the RS dispatches a store-address or store-data uop to port 4 (store-data) or p2/p3 (load or store-address), that execution unit will use the store-buffer entry as the place where it writes the data, or where it writes the address.
Having the store-buffer entry allocated by the issue/allocate/rename logic means that either store-address or store-data can execute first, whichever one has its inputs ready first, and free its space in the RS after completing successfully. The ROB entry stays allocated until the store retires. The store buffer entry stays allocated until some time after that, when the store commits to L1d cache. (Or for a store to uncacheable memory, it commits to an LFB or something to be sent out through the memory hierarchy, where the system agent will pick it up if it's to an MMIO region.)
Obviously IN/OUT are micro-coded as multiple uops, and all those uops are allocated in the ROB and reservation station as they issue from the front-end, like any other uop. (Well, some of them might not need a back-end execution unit, in which case they'd only be allocated in the ROB in an already-executed state. e.g. the uops for lfence are like this on Skylake.)
I'd assume they use the normal store buffer / load buffer mechanism for communicating off-core, but since they're more or less serializing there's no real performance implication to how they're implemented. (Later instructions can't start executing until after the "data phase" of the I/O transaction, and they drain the store buffer before executing.)
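If you just want to issue these instructions from user space to poke at this, a minimal Linux sketch looks like the following (ioperm() and <sys/io.h> are glibc/Linux-specific, and port 0x80 is just the traditional POST/delay port; treat this as an illustration, not a recommendation):

/* Issuing IN/OUT from user space on Linux (x86). Needs root or
   CAP_SYS_RAWIO for ioperm(). Compile with: gcc -O2 portio.c */
#include <stdio.h>
#include <sys/io.h>     /* ioperm(), inb(), outb() - glibc, x86 only */

int main(void) {
    /* Ask the kernel for access to port 0x80 (1 port, enable). */
    if (ioperm(0x80, 1, 1) != 0) { perror("ioperm"); return 1; }

    outb(0x00, 0x80);               /* OUT: the classic POST-code / delay port */
    unsigned char v = inb(0x80);    /* IN: read back whatever the port returns */

    printf("in al, 0x80 returned 0x%02x\n", v);
    return 0;
}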
The execution of the IN and OUT instructions depends on the operating mode of the processor. In real mode, no permissions need to be checked to execute the instructions. In all other modes, the IOPL field of the Flags register and the I/O permission map associated with the current hardware task need to be checked to determine whether the IN/OUT instruction is allowed to execute. In addition, the IN/OUT instruction has serialization properties that are stronger than LFENCE but weaker than a fully serializing instruction. According to Section 8.2.5 of the Intel manual volume 3:
Memory mapped devices and other I/O devices on the bus are often
sensitive to the order of writes to their I/O buffers. I/O
instructions (the IN and OUT instructions) can be used to impose
strong write ordering on such accesses as follows. Prior to executing
an I/O instruction, the processor waits for all previous instructions
in the program to complete and for all buffered writes to drain to
memory. Only instruction fetch and page table walks can pass I/O
instructions. Execution of subsequent instructions do not begin until
the processor determines that the I/O instruction has been completed.
This description suggests that an IN/OUT instruction completely blocks the allocation stage of the pipeline until all previous instructions are executed and the store buffer and WCBs are drained and then the IN/OUT instruction retires. To implement these serialization properties and to perform the necessary operating mode and permission checks, the IN/OUT instruction needs to be decoded into many uops. For more information on how such an instruction can be implemented, refer to: What happens to software interrupts in the pipeline?.
Older versions of the Intel optimization manual did provide latency and throughput numbers for the IN and OUT instructions. All of them seem to say that the worst-case latency is 225 cycles and the throughput is exactly 40 cycles per instruction. However, these numbers don't make much sense to me because I think the latency depends on the I/O device being read from or written to. And because these instructions are basically serialized, the latency essentially determines throughput.
I've tested the in al, 80h instruction on Haswell. According to @MargaretBloom, it's safe to read a byte from port 0x80 (which, according to osdev.org, is mapped to some DMA controller register). Here is what I found:
The instruction is counted as a single load uop by MEM_UOPS_RETIRED.ALL_LOADS. It's also counted as a load uop that misses the L1D. However, it's not counted as a load uop that hits the L1D or misses or hits the L2 or L3 caches.
The distribution of uops is as follows: p0:16.4, p1:20, p2:1.2, p3:2.9, p4:0.07, p5:16.2, p6:42.8, and finally p7:0.04. That's a total of 99.6 uops per in al, 80h instruction.
The throughput of in al, 80h is 3478 cycles per instruction. I think the throughput depends on the I/O device though.
According to L1D_PEND_MISS.PENDING_CYCLES, the I/O load request seems to be allocated in an LFB for one cycle.
When I add an IMUL instruction that is dependent on the result of in instruction, the total execution time does not change. This suggests that the in instruction does not completely block the allocation stage until all of its uops are retired and it may overlap with later instructions, in contrast to my interpretation of the manual.
I've tested the out dx, al instruction on Haswell for ports 0x3FF, 0x2FF, 0x3EF, and 0x2EF. The distribution of uops is as follows: p0:10.9, p1:15.2, p2:1, p3:1, p4:1, p5:11.3, p6:25.3, and finally p7:1. That's a total of 66.7 uops per instruction. The throughput of out to 0x2FF, 0x3EF, and 0x2EF is 1880c. The throughput of out to 0x3FF is 6644.7c. The out instruction is not counted as a retired store.
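A measurement of this kind can be reproduced with something like the sketch below (this is a simplified stand-in for the actual test harness behind the numbers above; it assumes Linux, ioperm() privileges, and raw rdtsc, so the results are TSC ticks rather than core cycles):

/* Rough throughput measurement of "in al, 0x80" using rdtsc.
   Linux x86; results vary with the platform and chipset.
   Compile with: gcc -O2 intime.c (run as root for ioperm). */
#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>      /* ioperm(), inb() */
#include <x86intrin.h>   /* __rdtsc() */

int main(void) {
    if (ioperm(0x80, 1, 1) != 0) { perror("ioperm"); return 1; }

    enum { N = 10000 };
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        (void)inb(0x80);              /* the serializing I/O read under test */
    uint64_t t1 = __rdtsc();

    printf("~%.1f TSC ticks per 'in al, 0x80'\n", (double)(t1 - t0) / N);
    return 0;
}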
Once the I/O load or store request reaches the system agent, it can determine what to do with the request by consulting its system I/O mapping table. This table depends on the chipset. Some I/O ports are mapped statically while others are mapped dynamically. See for example Section 4.2 of the Intel 100 Series Chipset datasheet, which is used for Skylake processors. Once the request is completed, the system agent sends a response back to the processor so that it can fully retire the I/O instruction.
Assume an x86 multi-core PC architecture...
Let's say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a memory bus.
Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?
I'm not talking about a case where the 2 instructions are accessing the same memory location. Even in the case where the 2 instructions are accessing completely different memory locations (and let's also assume that the memory contents for these locations are not in any cache), I would think that the single memory bus sitting between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:
CPU0 CPU1
mov eax,[1000] mov ebx,[2000]
Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized in the end?
Also, if this is true, what's the point of the LOCK prefix in x86, which is used for making a memory-access instruction atomic?
You need to check a few concepts of x86 architecture to answer that:
speculative execution (and out of order)
load store buffer
MESI protocol
load forwarding
memory barriers
NUMA
Basically, my guess is that your instructions will be executed fully in parallel, but the result in memory will come from one thread or the other, and the winner will be decided by the MESI hardware.
To extend on the answer: when you have multiple instruction streams and a single data stream (http://en.wikipedia.org/wiki/MISD), you need to expect serialization. Note that this can be mitigated if you access different memory addresses, notably on NUMA systems.
Opterons and newer i7s have NUMA hardware, but the OS needs to activate it, and it's not enabled by default. If you have NUMA, you can take advantage of one bus connecting one core to one memory zone. However, the core must be the owner of that zone, which should be the case if the core allocated its zone itself.
On all other hardware there will be serialization, but if the memory addresses are different they will not hurt write performance (no waiting for the write to finish), thanks to the store buffer and the intermediate L2 cache. L2 content is committed to RAM later, and L2 is per-core, so serialization happens but does not hold up CPU instructions, which can continue on ahead.
EDIT about the LOCK question:
The x86 LOCK prefix is about flushing the load/store buffers so that other cores can see the current values being operated on in the instruction pipeline. This is much closer to the CPU than the RAM-writing problem. LOCK ensures that cores are not working only on their local view of some variable's contents, because without it the CPU assumes whatever optimizations it can while considering only one thread, meaning it will often keep everything in registers and not rely on the cache. It can even go slightly beyond that when you consider load forwarding, or more precisely store-to-load forwarding.
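To make the LOCK discussion concrete, here is a small C11 sketch (my own addition): the lock-prefixed read-modify-write that atomic_fetch_add typically compiles to on x86 never loses an increment, while the plain ++ can:

/* What the LOCK prefix buys you: an atomic read-modify-write.
   Two threads increment a shared counter ITERS times each. The plain ++
   (load, add, store) can lose updates; atomic_fetch_add, which compilers
   typically turn into a lock-prefixed add on x86, never does.
   Compile with: gcc -O2 -pthread lock_demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static volatile long plain = 0;   /* non-atomic increment (racy)    */
static atomic_long locked = 0;    /* lock-prefixed increment on x86 */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        plain++;                          /* can lose counts under contention */
        atomic_fetch_add(&locked, 1);     /* single atomic RMW                */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("plain  = %ld (usually less than %d)\n", plain, 2 * ITERS);
    printf("locked = %ld (always %d)\n", (long)atomic_load(&locked), 2 * ITERS);
    return 0;
}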
I was reading about the MESI snooping cache coherence protocol, which I guess is the protocol that is used in modern multicore x86 processors (please correct me if I'm wrong). Now the article says this in one place:
A cache that holds a line in the Modified state must snoop (intercept) all
attempted reads (from all of the other caches in the system) of the
corresponding main memory location and insert the data that it holds. This is
typically done by forcing the read to back off (i.e. retry later), then writing
the data to main memory and changing the cache line to the Shared state.
Now what I don't understand is why the data needs to be written to main memory. Can't the cache coherence protocol just keep the contents of the caches synchronized without going to memory (unless the cache line is truly evicted, of course)? I mean, if one core is constantly reading and the other constantly writing, why not keep the data in the caches and just keep updating it there? Why incur the cost of writing back to main memory?
In other words, can't the cores reading the data directly read from the cache of the writing core and update their own caches accordingly?
Now what I don't understand is why the data needs to be written to main memory. Can't
the cache coherence protocol just keep the contents of the caches synchronized without
going to memory (unless the cache line is truly evicted, of course)?
This does happen.
I have a Core i5 in my laptop, which looks like this:
M
N
S
L3U
L2U L2U
L1D L1D
L1I L1I
P P
L L L L
M = Main memory
N = NUMA node
S = Socket
L3U = Level 3 Unified
L2U = Level 2 Unified
L1D = Level 1 Data
L1I = Level 1 Instruction
P = Processor
L = Logical core
When two logical cores are operating on the same data, they don't move out to main memory; they exchange over the L1 and L2 caches. Likewise, when cores in the two processors are working, they exchange data over the L3 cache. Main memory isn't used unless eviction occurs.
But a simpler CPU could indeed be less clever about things.
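You can see this cache-to-cache traffic from software as cache-line ping-pong. A rough sketch (my own, timings are machine-dependent): two threads incrementing counters on the same 64-byte line are much slower than on separate lines, because ownership of the line keeps bouncing between the cores' private caches; main memory is not part of that loop:

/* Cache-line ping-pong: two threads doing relaxed atomic increments, either
   on the same 64-byte cache line or on two separate lines.
   Compile with: gcc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 10000000L

/* a and b share one 64-byte line here ... */
static struct { _Alignas(64) atomic_long a; atomic_long b; } same_line;
/* ... and live on separate lines here. */
static struct { _Alignas(64) atomic_long a; _Alignas(64) atomic_long b; } two_lines;

static void *inc(void *p) {
    atomic_long *c = p;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
    return NULL;
}

static double run(atomic_long *x, atomic_long *y) {
    pthread_t t1, t2;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, inc, x);
    pthread_create(&t2, NULL, inc, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

int main(void) {
    printf("same cache line:      %.2f s\n", run(&same_line.a, &same_line.b));
    printf("separate cache lines: %.2f s\n", run(&two_lines.a, &two_lines.b));
    return 0;
}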
The MESI protocol doesn't allow more than one cache to keep the same cache line in the Modified state. So, if a cache line is modified and another processor wants to read it from its cache, it must first be written to main memory and then read, so that both processors' caches now share that line (Shared state).
Because caches typically aren't able to write directly into each other (as this would take more bandwidth).
what I don't understand is why the data needs to be written in the
main memory
Let's say processor A has that modified line. Now processor B is trying to read that same cache line (modified by A) from main memory. Since the content in main memory is stale now (because A modified it), A is snooping all other read attempts for that line. So, in order to allow processor B (and others) to read that line, A has to write it back to main memory.
I mean if one core is constantly reading and the other constantly writing,
why not keep the data in the cache memory, and keep updating the data in the cache.
You are right, and this is what is usually done. But here that's not the case: someone else (processor B in our example) is trying to read. So A has to write the line back and set its status to Shared, because both A and B are sharing the cache line now.
So actually I don't think the reading cache has to go to main memory.
In MESI, when a processor requests a block modified by one of its peers, it issues a read miss on the bus (or whatever the interconnect medium is), which is broadcast to every processor.
The processor which holds the block in the "Modified" state catches the request and issues a copy-back on the bus - carrying the block ID and the value - while changing its own copy's state to "Shared". The copy-back is received by the requesting processor, which writes the block into its local cache and tags it as "Shared".
Whether the copy-back issued by the owning processor also goes to main memory depends on the implementation.
Edit: note that the MOESI protocol adds an "Owned" state that is very similar to "Shared", and allows a processor holding a copy of a block in the Owned state to copy the value back onto the bus if it catches a broadcast read/write miss for this block.
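A toy sketch of the difference (my own illustration, not hardware pseudocode): on a snooped read miss, a MESI owner supplies the data and drops to Shared (with memory updated along the way), while a MOESI owner keeps the dirty line in Owned and keeps supplying it:

/* Sketch: a read miss snooped by the cache that holds the line Modified. */
#include <stdio.h>

typedef enum { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID } state_t;
static const char *names[] = { "M", "O", "E", "S", "I" };

/* What the owning cache does when it snoops a read miss for its line. */
static state_t snoop_read_miss(state_t mine, int moesi, int *supplies_data) {
    if (mine == MODIFIED || mine == OWNED) {
        *supplies_data = 1;                 /* copy-back on the bus */
        return moesi ? OWNED : SHARED;      /* plain MESI also updates memory */
    }
    *supplies_data = 0;
    return (mine == EXCLUSIVE) ? SHARED : mine;
}

int main(void) {
    for (int moesi = 0; moesi <= 1; moesi++) {
        state_t owner = MODIFIED, requester = INVALID;
        int supplies = 0;

        owner = snoop_read_miss(owner, moesi, &supplies);
        if (supplies)
            requester = SHARED;             /* requester fills its line as S */

        printf("%s: owner ends up %s, requester ends up %s\n",
               moesi ? "MOESI" : "MESI ", names[owner], names[requester]);
    }
    return 0;
}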
Assume there are two threads running on x86 CPU0 and CPU1 respectively. The thread running on CPU0 executes the following commands:
A=1
B=1
The cache line containing A is initially owned by CPU1, and the one containing B is owned by CPU0.
I have two questions:
If I understand correctly, both stores will be put into the CPU's store buffer. However, for the first store A=1, CPU1's copy of the cache line must be invalidated first, while the second store B=1 can be flushed immediately since CPU0 owns the cache line containing it. I know that x86 CPUs respect store order. Does that mean that B=1 will not be written to the cache before A=1?
Assume in CPU1 the following commands are executed:
while (B == 0);
print A
Is it enough to add only an lfence between the while and print commands in CPU1, without adding an sfence between A=1 and B=1 in CPU0, to always get 1 printed out on x86?
while (B == 0);
lfence
print A
In x86, writes by a single processor are observed in the same order by all processors. No need to fence in your example, nor in any normal program on x86. Your program:
while(B==0); // wait for B == 1 to become globally observable
print A; // now, A will always be 1 here
What exactly happens in cache is model specific. All kinds of tricks and speculative behavior can occur in cache, but the observable behavior always follows the rules.
See the Intel System Programming Guide, Volume 3, section 8.2.2 for the details on memory ordering.
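If you want to express the same pattern in portable code rather than rely on the x86 rules directly, a C11 sketch looks like this (my own example; the release store and acquire load compile to plain MOVs on x86, so here they mostly keep the compiler from reordering):

/* The question's pattern written with C11 atomics.
   Compile with: gcc -O2 -pthread mp.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A = 0, B = 0;

static void *cpu0(void *arg) {                            /* the writer */
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);   /* A = 1 */
    atomic_store_explicit(&B, 1, memory_order_release);   /* B = 1 */
    return NULL;
}

static void *cpu1(void *arg) {                            /* the reader */
    (void)arg;
    while (atomic_load_explicit(&B, memory_order_acquire) == 0)
        ;                                                  /* while (B == 0); */
    printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed)); /* always 1 */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, cpu0, NULL);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}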