How do Intel CPUs that use the ring bus topology decode and handle port I/O operations - io

I understand Port I/O from a hardware abstraction level (i.e. asserts a pin that indicates to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address bus model) but I'm not really sure how it's implemented on modern CPUs microarchitecturally but also particularly how the Port I/O operation appears on the ring bus.
Firstly. Where does the IN/OUT instruction get allocated to, the reservation station or the load/store buffer? My initial thoughts were that it would be allocated in the load/store buffer and the memory scheduler recognises it, sends it to the L1d indicating that it is a port-mapped operation. A line fill buffer is allocated and it gets sent to L2 and then to the ring. I'm guessing that the message on the ring has some port-mapped indicator which only the system agent accepts and then it checks its internal components and relays the port-mapped indicated request to them; i.e. PCIe root bridge would pick up CF8h and CFCh. I'm guessing the DMI controller is fixed to pick up all the standardised ports that will appear on the PCH, such as the one for the legacy DMA controller.

Yes, I assume the message over the ring bus has some kind of tag that flags it as being to I/O space, not a physical memory address, and that the system agent sorts this out.
If anyone knows more details, that might be interesting, but this simple mental model is probably fine.
I don't know how port I/O turns into PCIe messages, but I think PCIe devices can have I/O ports in I/O space, not just MMIO.
IN/OUT are pretty close to serializing (but not officially defined using that term for some reason How many memory barriers instructions does an x86 CPU have?). They do drain the store buffer before executing, and are full memory barriers.
the reservation station or the load/store buffer?
Both. For normal loads/stores, the front-end allocates a load buffer entry for a load, or a store buffer entry for a store, and issues the uop into the ROB and RS.
For example, when the RS dispatches a store-address or store-data uop to port 4 (store-data) or p2/p3 (load or store-address), that execution unit will use the store-buffer entry as the place where it writes the data, or where it writes the address.
Having the store-buffer entry allocated by the issue/allocate/rename logic means that either store-address or store-data can execute first, whichever one has its inputs ready first, and free its space in the RS after completing successfully. The ROB entry stays allocated until the store retires. The store buffer entry stays allocated until some time after that, when the store commits to L1d cache. (Or for a store to uncacheable memory, commits to an LFB or something to be send out the memory hierarchy where the system agent will pick it up if it's to a MMIO region.)
Obviously IN/OUT are micro-coded as multiple uops, and all those uops are allocated in the ROB and reservation station as they issue from the front-end, like any other uop. (Well, some of them might not need a back-end execution unit, in which case they'd only be allocated in the ROB in an already-executed state. e.g. the uops for lfence are like this on Skylake.)
I'd assume they use the normal store buffer / load buffer mechanism for communicating off-core, but since they're more or less serializing there's no real performance implication to how they're implemented. (Later instructions can't start executing until after the "data phase" of the I/O transaction, and they drain the store buffer before executing.)

The execution of the IN and OUT instructions depends on the operating mode of the processor. In real mode, no permissions need to be checked to execute the instructions. In all other modes, the IOPL field of the Flags register and the I/O permission map associated with the current hardware task need to be checked to determine whether the IN/OUT instruction is allowed to execute. In addition, the IN/OUT instruction has serialization properties that are stronger than LFENCE but weaker than a fully serializing instruction. According to Section 8.2.5 of the Intel manual volume 3:
Memory mapped devices and other I/O devices on the bus are often
sensitive to the order of writes to their I/O buffers. I/O
instructions can be used to (the IN and OUT instructions) impose
strong write ordering on such accesses as follows. Prior to executing
an I/O instruction, the processor waits for all previous instructions
in the program to complete and for all buffered writes to drain to
memory. Only instruction fetch and page tables walks can pass I/O
instructions. Execution of subsequent instructions do not begin until
the processor determines that the I/O instruction has been completed.
This description suggests that an IN/OUT instruction completely blocks the allocation stage of the pipeline until all previous instructions are executed and the store buffer and WCBs are drained and then the IN/OUT instruction retires. To implement these serialization properties and to perform the necessary operating mode and permission checks, the IN/OUT instruction needs to be decoded into many uops. For more information on how such an instruction can be implemented, refer to: What happens to software interrupts in the pipeline?.
Older versions of the Intel optimization manual did provide latency and throughput numbers for the IN and OUT instructions. All of them seem to say that the worst-case latency is 225 cycles and the throughput is exactly 40 cycles per instruction. However, these numbers don't make much sense to me because I think the latency depends on the I/O device being read from or written to. And because these instructions are basically serialized, the latency essentially determines throughput.
I've tested the in al, 80h instruction on Haswell. According to #MargaretBloom, it's safe to read a byte from the port 0x80 (which according to osdev.org is mapped to some DMA controller register). Here is what I found:
The instruction is counted as a single load uop by MEM_UOPS_RETIRED.ALL_LOADS. It's also counted as a load uop that misses the L1D. However, it's not counted as a load uop that hits the L1D or misses or hits the L2 or L3 caches.
The distribution of uops is as follows: p0:16.4, p1:20, p2:1.2, p3:2.9, p4:0.07, p5:16.2, p6:42.8, and finally p7:0.04. That's a total of 99.6 uops per in al, 80h instruction.
The throughput of in al, 80h is 3478 cycles per instruction. I think the throughput depends on the I/O device though.
According to L1D_PEND_MISS.PENDING_CYCLES, the I/O load request seems to be allocated in an LFB for one cycle.
When I add an IMUL instruction that is dependent on the result of in instruction, the total execution time does not change. This suggests that the in instruction does not completely block the allocation stage until all of its uops are retired and it may overlap with later instructions, in contrast to my interpretation of the manual.
I've tested the out dx, al instruction on Haswell for ports 0x3FF, 0x2FF, 0x3EF, and 0x2EF. The distribution of uops is as follows: p0:10.9, p1:15.2, p2:1, p3:1, p4:1, p5:11.3, p6:25.3, and finally p7:1. That's a total of 66.7 uops per instruction. The throughput of out to 0x2FF, 0x3EF, and 0x2EF is 1880c. The throughput of out to 0x3FF is 6644.7c. The out instruction is not counted as a retired store.
Once the I/O load or store request reaches the system agent, it can determine what to do with the request by consulting its system I/O mapping table. This table depends on the chipset. Some I/O ports are mapped statically while other are mapped dynamically. See for example Section 4.2 of the Intel 100 Series Chipset datasheet, which is used for Skylake processors. Once the request is completed, the system agent sends a response back to the processor so that it can fully retire the I/O instruction.

Related

What is the benefit and micro-ops of ENQCMD instruction?

ENQCMD and MOVDIR64B are two instructions in Intel DSA.
MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The ENQCMD instruction allows software to write commands to enqueue registers, which are special device registers accessed using memory-mapped I/O (MMIO).
My question is - what is the aim of designing those two instructions?
Based on my understanding, setting up the memory-mapped IO area (the register) requires OS support, i.e. the device driver. After setting up the MMIO area, we could access it using write() system call, which is also implemented in the device driver. For general architectures, Linux supports iowrite64() to write 8-byte values at a time. Hence, if we want to write 64 bytes, needs to call iowrite64() 8 times.
With the help of MOVDIR64B, for Intel DSA, a new API is created - __iowrite512() which writes 64 bytes atomically.
I agree that the latter one is at least more efficient than the previous one, but I am confused about the time it requires to transfer data.
Consider the following case: if we are given a device (Intel DSA) that supports MOVDIR64B and ENQCMD, suppose we want to transfer 64 bytes of data from memory to MMIO register. There are two options: iowrite64() 8 times (using a loop); or __iowrite512() once. Will the latter one be 8 times faster than the previous one?
My thoughts is that it is less likely to be 8 times difference, but the latter one will be faster. May I know how faster it would be? Is it documented anywhere? I do not have Intel DSA, so I am not sure how to test it.
Besides, what other benefits do ENQCMD have? Will it be broken up into several micro operations? If yes, then what are the micro operations that does ENQCMD?
iowrite64 uses a UC access to MMIO space, so writes are serialized, not pipelined. That is, only one UC write can be in flight at a time from a single CPU thread, and the CPU doesn't continue execution until the MMIO write is complete.
MOVDIR64B has the potential to be faster than even a single iowrite64, because it uses the WC memory type instead of UC (even if the destination address is UC). After the write is issued by the CPU, it can continue execution. Multiple direct stores can be streamed to the device. That means that multiple direct stores can be in flight at one time from a single CPU thread. MOVDIRI also behaves this way.
As far as I know, the time to actually transfer the data to the destination is the same regardless of the size (between 1 and 64 bytes). Of course that is dependent on the width of the data path within the SoC, which could be different for different implementations.
The main advantage of MOVDIR64B is that the descriptor arrives at the device all at once instead of in pieces. The device doesn't have to worry about receiving a partial descriptor or receiving parts of two descriptors interleaved. In fact, Intel DSA ignores writes smaller than 64 bytes to a portal.
To realize the full benefit of streaming writes, the destination address for each MOVDIR64B from a single CPU thread should be different. Each Intel DSA portal is a 4096-byte page, so there are 64 unique addresses within each portal. Descriptor writes from a single CPU can be striped across the 64 addresses. (It doesn't matter whether writes from multiple CPUs use the same address or different addresses, but normally you would not expect multiple CPUs to be using the same dedicated WQ in DSA.)
ENQCMD allows the device to respond to software whether it accepted the descriptor or not. This allows multiple applications to use the same shared WQ without risk of a descriptor being lost because the shared WQ is full. Applications can submit descriptors without any driver involvement (after setup), and without any lock or communication between the applications.

How to generate a zero-length read on PCIE Bus using x86-64 and Linux?

In the PCI Express Base specification, section 2.2.5 "First/Last DW Byte Enables Rules", it says a zero length read can be used as a flush request. However, in the linux kernel documentation, most examples just use either a 1B or 4B read request:
Bus-Independent Device Accesses
How To Write Linux PCI Drivers
I'm wondering if it's possible the x86-64 architecture is capable of generating an instruction that causes a zero length read on PCI, and if it can, if there is some linux kernel function that creates that instruction.
The two examples you mentioned involve MMIO accesses or legacy I/O port accesses from the CPU to an I/O device, but the zero-length read implementation note from Section 2.2.5 of the PCIe specification is about accesses from an I/O device. The PCIe spec and the Intel/AMD64 x86 manuals are obviously different from each other and they use different terms, so I don't understand how you confused the two. No, there is no such thing as zero-length read in x86.
The code from the first link is the following:
WRT_REG_WORD(&reg->ictrl, 0);
/*
* The following read will ensure that the above write
* has been received by the device before we return from this
* function.
*/
RD_REG_WORD(&reg->ictrl);
There is a 16-bit MMIO write followed by a 16-bit MMIO read to the same address. The memory type of the target location is most probably UC, which ensures that all UC accesses appears on the system bus in program order. This means that it reaches the PCIe root complex (which is integrated on modern processors) in order. An MMIO write is translated by the processor's I/O unit to a posted write PCIe transaction and the read is translated to a non-posted read PCIe transaction. Both of these transactions would have traffic class and with relaxed ordering disabled. According to the transaction order rules, such a non-posted read cannot be reordered with any earlier posted write. The overall effect is that when the UC read gets back the result, the preceding UC write must have already completed at the target I/O device.
The second link you provided also includes an example of MMIO ordering that works exactly the same way. Issuing a read after a posted write is a commonly used technique to determine when the write has completed. A UC read is not a fully serializing operation in x86. If you don't want any later instructions (not UC accesses) to execute until the read completes, you need to add a fully serializing instruction after the read. The Linux kernel itself defines numerous MMIO barriers used in different situations.
The second link also mention that a legacy I/O write doesn't require a following read because "I/O Port space guarantees write transactions reach the PCI device before the CPU can continue." I/O instructions provide more ordering guarantees than UC accesses, but still they are not fully serializing. Among these guarantees include waiting for previous instructions to commit before executing an I/O instruction and not allowing later instructions to execute until the I/O instruction completes. These guarantees combined with the fact that I/O instructions are translated by the I/O controller to PCIe I/O transactions, where an I/O write transaction is a non-posted transaction, ensures that when the next instruction executes, it's guaranteed that the I/O write has completed at the target I/O device.
Zero-length reads can be used by an I/O device to determine that earlier writes have completed at the destination. This is how, for example, an I/O device can ensure that a write has reached the persistence domain on a platform that supports Asynchronous
DRAM Refresh (ADR) or that a write has become observable by the device driver.

Memory mapped IO - how does IO device know value has changed?

How does an IO device know that a value in memory pertaining to it has changed in memory mapped IO?
For example, let's say memory address 0 has been dedicated to hold the background color for a VGA device. How does the VGA device know when we change the value in memory[0]? Is the VGA device constantly polling the memory location? Or does the CPU somehow notify the device when it changes the value (and if so how?)?
An example architecture is MIPS. Given that the MIPS instruction set does not have in or out instructions, I don't understand how it could possibly communicate (on change) with the VGA device in the example. Another example is the ARM architecture.
In memory-mapped I/O, performing a memory read/write to the device's memory region will cause the CPU to perform a transaction with the device to fetch/store that value -- either directly through the CPU's memory bus, or through a secondary bus (such as AHB/APB on ARM systems). This memory transaction directly notifies the device that a value is being changed; no separate notification is necessary.
You're assuming that memory-mapped I/O is mapped by normal RAM. This is not the case. Indeed, these devices may behave in ways which are entirely unlike real memory! For instance, a typical UART or SPI device implementation may have a single data register which can be written to to transmit data, or read from to retrieve received data. Similarly, it's not uncommon for interrupt registers to have "clear on read" or "write 1 to clear" semantics.
For what it's worth: in practice, many framebuffer graphics implementations do actually behave as normal memory. What's different is that the memory is stored in a dual-ported RAM (or a time-multiplexed bus), and the video RAMDAC continuously reads through that memory to transmit its contents to an attached display.
A region of the physical address space that is designated as memory-mapped I/O (MMIO) is not mapped to main memory (system memory); it's mapped to I/O registers which are physically part of the I/O device.
To determine how to handle a memory access (read or write), the processor checks first the type of the region to which the target memory address belongs. In any MIPS processor, there are at least two types: Uncached and Cached. MMIO regions are always Uncached. An Uncached memory access request is directly sent to the main memory controller without examining or affecting any of the caches. However, an I/O Uncached memory access request is sent to an I/O controller, and eventually the request will reach the destination I/O device.
Now exactly how the CPU and the I/O device communicate with each other is completely specified by the I/O device itself. So an I/O device would have a specification that discusses how many I/O registers there are and how each of them should be used. An I/O register could be used to hold status flags, control flags, data to be read or written by the CPU, or some combination thereof. Note that since the I/O registers are physically part of the I/O device, then the I/O device can be designed so that it can detect when any of its registers are being read from or written to and take an action accordingly if required.
An I/O device can send an interrupt to the CPU to inform it that some data is available or maybe it wants attention for whatever reason. The CPU can also frequently poll the I/O device by checking some status flag(s) and then take some action accordingly.

What will be used for data exchange between threads are executing on one Core with HT?

Hyper-Threading Technology is a form of simultaneous multithreading
technology introduced by Intel.
These resources include the execution engine, caches, and system bus
interface; the sharing of resources allows two logical processors to
work with each other more efficiently, and allows a stalled logical
processor to borrow resources from the other one.
In the Intel CPU with Hyper-Threading, one CPU-Core (with several ALUs) can execute instructions from 2 threads at the same clock. And both 2 threads share: store-buffer, caches L1/L2 and system bus.
But if two thread execute simultaneous on one Core, thread-1 stores atomic value and thread-2 loads this value, what will be used for this exchange: shared store-buffer, shared cache L1 / L2 or as usual cache L3?
What will be happen if both 2 threads from one the same process (the same virtual address space) and if from two different processes (the different virtual address space)?
Sandy Bridge Intel CPU - cache L1:
32 KB - cache size
64 B - cache line size
512 - lines (512 = 32 KB / 64 B)
8-way
64 - number sets of ways (64 = 512 lines / 8-way)
6 bits [11:6] - of virtual address (index) defines current set number (this is tag)
4 K - each the same (virtual address / 4 K) compete for the same set (32 KB / 8-way)
low 12 bits - significant for determining the current set number
4 KB - standard page size
low 12 bits - the same in virtual and physical addresses for each address
I think you'll get a round-trip to L1. (Not the same thing as store->load forwarding within a single thread, which is even faster than that.)
Intel's optimization manual says that store and load buffers are statically partitioned between threads, which tells us a lot about how this will work. I haven't tested most of this, so please let me know if my predictions aren't matching up with experiment.
Update: See this Q&A for some experimental testing of throughput and latency.
A store has to retire in the writing thread, and then commit to L1 from the store buffer/queue some time after that. At that point it will be visible to the other thread, and a load to that address from either thread should hit in L1. Before that, the other thread should get an L1 hit with the old data, and the storing thread should get the stored data via store->load forwarding.
Store data enters the store buffer when the store uop executes, but it can't commit to L1 until it's known to be non-speculative, i.e. it retires. But the store buffer also de-couples retirement from the ROB (the ReOrder Buffer in the out-of-order core) vs. commitment to L1, which is great for stores that miss in cache. The out-of-order core can keep working until the store buffer fills up.
Two threads running on the same core with hyperthreading can see StoreLoad re-ordering if they don't use memory fences, because store-forwarding doesn't happen between threads. Jeff Preshing's Memory Reordering Caught in the Act code could be used to test for it in practice, using CPU affinity to run the threads on different logical CPUs of the same physical core.
An atomic read-modify-write operation has to make its store globally visible (commit to L1) as part of its execution, otherwise it wouldn't be atomic. As long as the data doesn't cross a boundary between cache lines, it can just lock that cache line. (AFAIK this is how CPUs do typically implement atomic RMW operations like lock add [mem], 1 or lock cmpxchg [mem], rax.)
Either way, once it's done the data will be hot in the core's L1 cache, where either thread can get a cache hit from loading it.
I suspect that two hyperthreads doing atomic increments to a shared counter (or any other locked operation, like xchg [mem], eax) would achieve about the same throughput as a single thread. This is much higher than for two threads running on separate physical cores, where the cache line has to bounce between the L1 caches of the two cores (via L3).
movNT (Non-Temporal) weakly-ordered stores bypass the cache, and put their data into a line-fill buffer. They also evict the line from L1 if it was hot in cache to start with. They probably have to retire before the data goes into a fill buffer, so a load from the other thread probably won't see it at all until it enters a fill-buffer. Then probably it's the same as an movnt store followed by a load inside a single thread. (i.e. a round-trip to DRAM, a few hundred cycles of latency). Don't use NT stores for a small piece of data you expect another thread to read right away.
L1 hits are possible because of the way Intel CPUs share the L1 cache. Intel uses virtually indexed, physically tagged (VIPT) L1 caches in most (all?) of their designs. (e.g. the Sandybridge family.) But since the index bits (which select a set of 8 tags) are below the page-offset, it behaves exactly like a PIPT cache (think of it as translation of the low 12 bits being a no-op), but with the speed advantage of a VIPT cache: it can fetch the tags from a set in parallel with the TLB lookup to translate the upper bits. See the "L1 also uses speed tricks that wouldn't work if it was larger" paragraph in this answer.
Since L1d cache behaves like PIPT, and the same physical address really means the same memory, it doesn't matter whether it's 2 threads of the same process with the same virtual address for a cache line, or whether it's two separate processes mapping a block of shared memory to different addresses in each process. This is why L1d can be (and is) competitively by both hyperthreads without risk of false-positive cache hits. Unlike the dTLB, which needs to tag its entries with a core ID.
A previous version of this answer had a paragraph here based on the incorrect idea that Skylake had reduced L1 associativity. It's Skylake's L2 that's 4-way, vs. 8-way in Broadwell and earlier. Still, the discussion on a more recent answer might be of interest.
Intel's x86 manual vol3, chapter 11.5.6 documents that Netburst (P4) has an option to not work this way. The default is "Adaptive mode", which lets logical processors within a core share data.
There is a "shared mode":
In shared mode, the L1 data cache is competitively shared between logical processors. This is true even if the
logical processors use identical CR3 registers and paging modes.
In shared mode, linear addresses in the L1 data cache can be aliased, meaning that one linear address in the cache
can point to different physical locations. The mechanism for resolving aliasing can lead to thrashing. For this
reason, IA32_MISC_ENABLE[bit 24] = 0 is the preferred configuration for processors based on the Intel NetBurst
microarchitecture that support Intel Hyper-Threading Technology
It doesn't say anything about this for hyperthreading in Nehalem / SnB uarches, so I assume they didn't include "slow mode" support when they introduced HT support in another uarch, since they knew they'd gotten "fast mode" to work correctly in netburst. I kinda wonder if this mode bit only existed in case they discovered a bug and had to disable it with microcode updates.
The rest of this answer only addresses the normal setting for P4, which I'm pretty sure is also the way Nehalem and SnB-family CPUs work.
It would be possible in theory to build an OOO SMT CPU core that made stores from one thread visible to the other as soon as they retired, but before they leaves the store buffer and commit to L1d (i.e. before they become globally visible). This is not how Intel's designs work, since they statically partition the store queue instead of competitively sharing it.
Even if the threads shared one store-buffer, store forwarding between threads for stores that haven't retired yet couldn't be allowed because they're still speculative at that point. That would tie the two threads together for branch mispredicts and other rollbacks.
Using a shared store queue for multiple hardware threads would take extra logic to always forward to loads from the same thread, but only forward retired stores to loads from the other thread(s). Besides transistor count, this would probably have a significant power cost. You couldn't just omit store-forwarding entirely for non-retired stores, because that would break single-threaded code.
Some POWER CPUs may actually do this; it seems like the most likely explanation for not all threads agreeing on a single global order for stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?.
As #BeeOnRope points out, this wouldn't work for an x86 CPU, only for an ISA that doesn't guarantee a Total Store Order, because this this would let the SMT sibling(s) see your store before it becomes globally visible to other cores.
TSO could maybe be preserved by treating data from sibling store-buffers as speculative, or not able to happen before any cache-miss loads (because lines that stay hot in your L1D cache can't contain new stores from other cores). IDK, I haven't thought this through fully. It seems way overcomplicated and probably not able to do useful forwarding while maintaining TSO, even beyond the complications of having a shared store-buffer or probing sibling store-buffers.

Can 2 instructions be truly simultaneous on a multi-core CPU

Assume x86 multi-core PC architecture...
Lets say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a memory bus.
Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?
I'm not talking about a case where the 2 instructions are accessing the same memory location. Even in the case where the 2 instructions are accessing completely different memory locations (and lets also assume that the memory contents for these locations are not in any cache), I would think that the single memory bus sitting in between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:
CPU0 CPU1
mov eax,[1000] mov ebx,[2000]
Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized at the end?
Also, if this is true, whats the point of the LOCK prefix in x86 which is used for making a memory-access instruction atomic?
You need to check a few concepts of x86 architecture to answer that:
speculative execution (and out of order)
load store buffer
MESI protocol
load forwarding
memory barriers
NUMA
basically, my guess is your instructions will be absolutely parallel executed but the result in memory will be one or the other of the thread and the election will be decided by MESI hardware.
to extend on the answer, when you have multiple flow and single data (http://en.wikipedia.org/wiki/MISD) you need to expect serialization. Note that this can be mitigated if you access different memory adresses, notably on NUMA systems.
Opterons and new i7 has NUMA hardware, but the OS need to activate them, and its not by default. if you have NUMA, you can use the advantage of one bus to connect one core to one memory zone. however the core must be the owner of that zone, which should be verified if the core allocated its zone itself.
In all other hardware there will be serialization, but if the memory addresses are different they will not hinder on the write performance (no wait before end of write) thanks to the store buffer, and L2 intermediate caching. L2 content is commited to RAM later and L2 is by core so serialization happens but do not hinder CPU instructions that can continue on ahead.
EDIT about the LOCK question:
lock x86 instruction is about flushing load store buffers so that other cores can obtain visibility on the current values operated on in the instruction pipeline. this is much closer to the CPU than the RAM writing problem. LOCK allows that cores are not working on their local view of some variable content because without it, the CPU assumes any optimization it can considering only one thread, meaning it will often keep everything in registers and not rely on cache. It can ever go slightly ahead of that, when you consider load fowarding, or more preciselly called store to load forwarding.

Resources