Take an SMP (symmetric multiprocessing) machine that has two separate processors in it.
Two threads are running on the two different processors and share some data.
That shared data ends up in both processors' caches for the two running threads.
So, if thread 1 (running on the first processor) updates the value of the shared data in its cache, will the thread running on the second processor be unable to see the updated data?
Is this understanding correct?
How is this problem solved?
It's solved with cache coherency hardware.
Modern multi-core processors use sophisticated cache coherency protocols. While understanding the intricacies of these protocols is probably not useful, understanding the basic concepts behind them is extremely valuable. Understanding the MESI protocol is a great starting point.
Generally, before a processor can perform a cached write to a chunk of cacheable memory, its cache must hold that memory exclusively. That is, no other processors may cache it.
So if the first processor goes to write to an area of memory, it will acquire that memory exclusively in its cache using the inter-cache protocol. If the second processor then tries to read that area of memory, it will miss in its own cache and use the inter-cache protocol to obtain the latest version, which the two caches then share.
If the second processor wants to write to that area of memory, it will use the inter-cache protocol to invalidate the other processor's cached copy. That will force the first processor to re-acquire the (possibly modified) version before it can read it.
The actual details can vary depending on the hardware specifics. No modern multi-core CPU requires writing data back to main memory to make it visible to other cores.
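As a minimal illustration of the visible effect (not of the protocol itself): one thread publishes a value and another polls for it; the coherence traffic described above is what makes the write show up, with no explicit cache flush in the program. The C11 atomics in this sketch only provide the ordering the language requires:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int shared_data;          /* the shared chunk of memory */
    static atomic_int ready;         /* published flag             */

    static void *writer(void *arg)
    {
        shared_data = 42;            /* the line becomes Modified in this core's cache */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *reader(void *arg)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                        /* coherence shares/invalidates the lines as needed */
        printf("%d\n", shared_data); /* prints 42, no explicit flush required */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }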
Related
My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To provide my reasoning as to why: I am developing software that can be used to prove correctness of programs written for non-volatile memory. As the CPU Cache is volatile, the CPU's write-back cache will still be volatile and the arbitrary nature of how they are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers of how to proceed. I do not mind digging into the Linux kernel, as in fact I am doing that now, nor do I mind modifying it, I just need a little guidance in the right direction.
I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
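For reference, an SSE4.1 non-temporal load (MOVNTDQA) via intrinsics looks like the sketch below; on ordinary WB memory it is still expected to be coherent, since the non-temporal hint only really matters on WC mappings:

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

    /* buf must be 16-byte aligned; compile with -msse4.1. */
    static __m128i nt_load16(void *buf)
    {
        return _mm_stream_load_si128((__m128i *)buf);
    }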
@MargaretBloom commented:
IIRC, Intel warns developers about mapping the same page with different cache types, which may indeed be useful in this case:
maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. For backwards compatibility with the 386 and earlier cache-less CPUs, the first x86 CPUs with caches made DMA cache-coherent; cache-control instructions were only introduced in later generations, since existing OSes didn't use them anyway. In modern systems, memory controllers are built into the CPU, so on Intel CPUs the system agent can snoop L3 tags to see if a line is cached anywhere on-chip, in parallel with sending the request to the memory controller. A Xeon can even DMA right into L3 cache without the data having to bounce through DRAM, which is good for high-bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up
you could wbinvd (write-back+invalidate) to flush all caches
then run some code under test
then invd and see what made it to DRAM
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
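In kernel code that sequence would look roughly like the sketch below; it is deliberately dangerous, only makes sense with every other core already offlined, and run_code_under_test() is just a placeholder for whatever store sequence you want to observe:

    #include <linux/irqflags.h>

    /* Sketch: all other cores are offline, this runs on the remaining core. */
    unsigned long flags;

    local_irq_save(flags);                 /* no interrupt may run between the two flushes     */

    asm volatile("wbinvd" ::: "memory");   /* write back + invalidate all caches               */

    run_code_under_test();                 /* placeholder: the code whose NVM stores you watch */

    asm volatile("invd" ::: "memory");     /* drop caches WITHOUT write-back; later reads see
                                              only what actually reached DRAM / the NVDIMM     */

    local_irq_restore(flags);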
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.
Each processor core can have its own cache. The cache is write-through and read-through. If two threads are running on different cores and are synchronized by semaphores, can it happen that on a read of a memory location the caches hold different versions of that location, or are they somehow transparently synchronized by the processor? I am interested in x86 and RISC architectures.
Every SMP machine you are likely to use has cache coherency implemented in hardware.
According to Linux documentation (https://www.kernel.org/doc/Documentation/memory-barriers.txt):
for while the caches are expected to be coherent, there's no guarantee that that coherency will be ordered. This means that whilst changes made on one CPU will eventually become visible on all CPUs, there's no guarantee that they will become apparent in the same order on those other CPUs.
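A C11 sketch of what "coherent but not ordered" can mean: with relaxed atomics, the two readers below are allowed to disagree about which store happened first (one sees x==1, y==0 while the other sees y==1, x==0). Whether that outcome is actually observable depends on the architecture; x86 happens to forbid it, and seq_cst atomics or the appropriate barriers forbid it everywhere:

    #include <stdatomic.h>

    static atomic_int x, y;

    /* Two writers on two different CPUs. */
    void writer_a(void) { atomic_store_explicit(&x, 1, memory_order_relaxed); }
    void writer_b(void) { atomic_store_explicit(&y, 1, memory_order_relaxed); }

    /* Two readers on two further CPUs, each sampling the stores in opposite order. */
    void reader_1(int *r1, int *r2)
    {
        *r1 = atomic_load_explicit(&x, memory_order_relaxed);
        *r2 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void reader_2(int *r3, int *r4)
    {
        *r3 = atomic_load_explicit(&y, memory_order_relaxed);
        *r4 = atomic_load_explicit(&x, memory_order_relaxed);
    }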
Assume x86 multi-core PC architecture...
Let's say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a memory bus.
Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?
I'm not talking about a case where the 2 instructions access the same memory location. Even when the 2 instructions access completely different memory locations (and let's also assume that the contents of these locations are not in any cache), I would think that the single memory bus sitting between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:
CPU0 CPU1
mov eax,[1000] mov ebx,[2000]
Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized at the end?
Also, if this is true, what's the point of the LOCK prefix in x86, which is used to make a memory-access instruction atomic?
You need to check a few concepts of x86 architecture to answer that:
speculative execution (and out of order)
load store buffer
MESI protocol
load forwarding
memory barriers
NUMA
Basically, my guess is that your instructions will execute fully in parallel, but the result in memory will be that of one thread or the other, and the choice will be made by the MESI hardware.
To extend the answer: when you have multiple instruction streams and a single data path (http://en.wikipedia.org/wiki/MISD) you should expect serialization. Note that this can be mitigated if you access different memory addresses, notably on NUMA systems.
Opterons and newer i7s have NUMA hardware, but the OS needs to activate it, and it is not enabled by default. With NUMA you can take advantage of one bus connecting one core to one memory zone; however, the core must be the owner of that zone, which should be the case if the core allocated the zone itself.
On all other hardware there will be serialization, but if the memory addresses are different it will not hurt write performance (there is no waiting for a write to complete), thanks to the store buffer and the intermediate L2 cache. L2 contents are committed to RAM later, and L2 is per-core, so serialization happens but does not hold up the CPU instructions, which can continue ahead.
EDIT about the LOCK question:
The x86 LOCK prefix is about flushing the load/store buffers so that other cores can obtain visibility of the values the instruction operates on; this is much closer to the CPU than the RAM-write problem. LOCK ensures cores are not working only on their local view of a variable's contents: without it, the CPU applies whatever optimizations it can while assuming a single thread, which often means keeping values in registers and not going through the cache at all. It can even go slightly further than that when you consider load forwarding, or more precisely, store-to-load forwarding.
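For a concrete example of the LOCK prefix, here is a sketch of an atomic increment built on lock xadd (a compiler's __atomic_fetch_add would typically emit the same instruction):

    #include <stdint.h>

    /* Atomically add 1 to *p and return the old value; the LOCK prefix makes
     * the read-modify-write indivisible and globally visible before later
     * operations. */
    static inline uint32_t locked_inc(volatile uint32_t *p)
    {
        uint32_t old = 1;
        __asm__ volatile("lock xaddl %0, %1"
                         : "+r"(old), "+m"(*p)
                         :
                         : "memory", "cc");
        return old;
    }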
It is possible to pin a process to a specific set of CPU cores using sched_setaffinity() call. The manual page says:
Restricting a process to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a process ceases to execute on one CPU and then recommences execution on a different CPU.
Which is almost an obvious thing (or not?). What is not that obvious to me is this -
Does pinning LWPs to a specific CPU or an SMP node reduce cache-coherency bus traffic? For example, since the process runs pinned, other CPUs should not modify its private memory, so only CPUs that are part of the same SMP node should need to stay cache-coherent.
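(For reference, the pinning itself is just a few lines; a minimal sketch using the glibc sched_setaffinity() wrapper:)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);    /* pin the calling process to CPU 2 */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* from here on the scheduler keeps this process on CPU 2 */
        return 0;
    }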
There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.
You can measure this for yourself. Xeon processors implement a large variety of cache statistic counters. You can read the counters in your own code with the rdpmc instruction or just use a product like VTune. FYI, using rdpmc is very precise, but a little tricky since you have to initially set a bit in CR4 to allow using this instruction in user mode.
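A minimal sketch of reading a counter with rdpmc (this assumes CR4.PCE has been set as described and the counter index has already been programmed, e.g. through perf or VTune):

    #include <stdint.h>

    /* Read performance-monitoring counter 'counter'; the index goes in ECX
     * and the 64-bit value comes back in EDX:EAX. */
    static inline uint64_t read_pmc(uint32_t counter)
    {
        uint32_t lo, hi;
        __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }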
-- EDIT --
My answer above is outdated for the 55xx series of CPUs which use QPI links. These links interconnect CPU sockets directly without an intervening chipset, as in:
http://ark.intel.com/products/37111/Intel-Xeon-Processor-X5570-%288M-Cache-2_93-GHz-6_40-GTs-Intel-QPI%29
However, since the L3 cache in each CPU is inclusive, snoops over the QPI links only occur when the local L3 cache indicates the line is nowhere in the local socket. Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either.
So, the inclusive L3 caches should minimize inter-socket coherency overhead, it's just not due to a chipset snoop filter in your case.
If you run on a NUMA system (like, Opteron server or Itanium), it makes sense, but you must be sure to bind a process to the same NUMA node that it allocates memory from. Otherwise, this is an anti-optimization. It should be noted that any NUMA-aware operating system will try to keep execution and memory in the same node anyway, if you don't tell it anything at all, to the best of its abilities (some elderly versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
If you don't run on a NUMA system, binding a process to a particular core is about the most counterproductive thing you can do. The OS will not bounce processes between CPUs for fun, and if a process must be moved to another CPU, that is not ideal, but the world does not end either. It happens rarely, and when it does, you will hardly be able to tell.
On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.
My question is this: how can I determine when it is safe to disable cache snooping when I am correctly using [pci_]dma_sync_single_for_{cpu,device} in my device driver?
I'm working on a device driver for a device which writes directly to RAM over PCI Express (DMA), and am concerned about managing cache coherence. There is a control bit I can set when initiating DMA to enable or disable cache snooping during DMA, clearly for performance I would like to leave cache snooping disabled if at all possible.
In the interrupt routine I call pci_dma_sync_single_for_cpu() and ..._for_device() as appropriate, when switching DMA buffers, but on 32-bit Linux 2.6.18 (RHEL 5) it turns out that these commands are macros which expand to nothing ... which explains why my device returns garbage when cache snooping is disabled on this kernel!
I've trawled through the history of the kernel sources, and it seems that up until 2.6.25 only 64-bit x86 had hooks for DMA synchronisation. From 2.6.26 there seems to be a generic unified indirection mechanism for DMA synchronisation (currently in include/asm-generic/dma-mapping-common.h) via fields sync_single_for_{cpu,device} of dma_map_ops, but so far I've failed to find any definitions of these operations.
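(For context, the buffer-swap path in my interrupt routine looks roughly like the sketch below; struct my_buf and its fields are hypothetical, only the pci_dma_sync_* calls are the real API:)

    #include <linux/pci.h>

    struct my_buf {                 /* hypothetical per-buffer bookkeeping */
        void       *cpu_addr;
        dma_addr_t  dma_addr;
        size_t      size;
    };

    static void swap_buffers(struct pci_dev *pdev,
                             struct my_buf *done, struct my_buf *next)
    {
        /* Hand the just-filled buffer back to the CPU before reading it. */
        pci_dma_sync_single_for_cpu(pdev, done->dma_addr, done->size,
                                    PCI_DMA_FROMDEVICE);

        /* Give the next buffer to the device for the following transfer. */
        pci_dma_sync_single_for_device(pdev, next->dma_addr, next->size,
                                       PCI_DMA_FROMDEVICE);
    }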
I'm really surprised no one has answered this, so here we go on a non-Linux specific answer (I have insufficient knowledge of the Linux kernel itself to be more specific) ...
Cache snooping simply tells the DMA controller to send cache invalidation requests to all CPUs for the memory being DMAed into. This obviously adds load to the cache coherency bus, and it scales particularly badly with additional processors as not all CPUs will have a single hop connection with the DMA controller issuing the snoop. Therefore, the simple answer to "when it is safe to disable cache snooping" is when the memory being DMAed into either does not exist in any CPU cache OR its cache lines are marked as invalid. In other words, any attempt to read from the DMAed region will always result in a read from main memory.
So how do you ensure reads from a DMAed region will always go to main memory?
Back in the day before we had fancy features like DMA cache snooping, what we used to do was to pipeline DMA memory by feeding it through a series of broken up stages as follows:
Stage 1: Add "dirty" DMA memory region to the "dirty and needs to be cleaned" DMA memory list.
Stage 2: Next time the device interrupts with fresh DMA'ed data, issue an async local CPU cache invalidate for DMA segments in the "dirty and needs to be cleaned" list for all CPUs which might access those blocks (often each CPU runs its own lists made up of local memory blocks). Move said segments into a "clean" list.
Stage 3: Next DMA interrupt (which of course you're sure will not occur before the previous cache invalidate has completed), take a fresh region from the "clean" list and tell the device that its next DMA should go into that. Recycle any dirty blocks.
Stage 4: Repeat.
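A rough, self-contained sketch of that pipeline in driver-style C follows; the structure names and the cache_invalidate_range() stub are invented for illustration, not a real kernel API:

    #include <stddef.h>

    struct dma_region {
        void   *vaddr;
        size_t  len;
        struct dma_region *next;
    };

    static struct dma_region *dirty_list;   /* DMA'd into, caches not yet invalidated */
    static struct dma_region *clean_list;   /* invalidated, safe to reuse for DMA     */

    static void cache_invalidate_range(void *vaddr, size_t len)
    {
        /* stub: the platform's (ideally asynchronous) invalidate primitive */
        (void)vaddr; (void)len;
    }

    /* Called from the DMA-complete interrupt, per the stages above. */
    struct dma_region *dma_irq(struct dma_region *just_filled)
    {
        /* Stage 3: take a region whose invalidation completed on an earlier
         * interrupt; the caller programs the device with it. */
        struct dma_region *next = clean_list;
        if (next)
            clean_list = next->next;

        /* Stage 2: invalidate cache lines for everything on the dirty list,
         * then move those regions onto the clean list. */
        for (struct dma_region *r = dirty_list; r; r = r->next)
            cache_invalidate_range(r->vaddr, r->len);
        while (dirty_list) {
            struct dma_region *r = dirty_list;
            dirty_list = r->next;
            r->next = clean_list;
            clean_list = r;
        }

        /* Stage 1: the region the device just wrote becomes dirty. */
        just_filled->next = dirty_list;
        dirty_list = just_filled;

        return next;
    }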
Although this is more work, it has several major advantages. Firstly, you can pin DMA handling to a single CPU (typically the primary CPU0) or a single SMP node, which means only a single CPU/node needs to worry about cache invalidation. Secondly, you give the memory subsystem much more opportunity to hide memory latencies by spacing out operations over time and spreading out load on the cache coherency bus. The key for performance is generally to make DMA occur on a CPU as close to the relevant DMA controller as possible, and into memory as close to that CPU as possible.
If you always hand off newly DMAed into memory to user space and/or other CPUs, simply inject freshly acquired memory in at the front of the async cache invalidating pipeline. Some OSs (not sure about Linux) have an optimised routine for preordering zeroed memory, so the OS basically zeros memory in the background and keeps a quick satisfy cache around - it will pay you to keep new memory requests below that cached amount because zeroing memory is extremely slow. I'm not aware of any platform produced in the past ten years which uses hardware offloaded memory zeroing, so you must assume that all fresh memory may contain valid cache lines which need invalidating.
I appreciate this only answers half your question, but it's better than nothing. Good luck!
Niall
Maybe a bit overdue, but:
If you disable cache snooping, hardware will no longer take care of cache coherency; the kernel then needs to do this itself. Over the past few days I've spent some time reviewing the x86 variants of [pci_]dma_sync_single_for_{cpu,device}, and I've found no indication that they make any effort to maintain coherency. This seems consistent with the fact that cache snooping is turned on by default in the PCI(e) spec.
Hence, if you are turning off cache snooping, you will have to maintain coherency yourself, in your driver, possibly by calling clflush_cache_range() (x86) or similar.
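A sketch of what that could look like in an x86 driver with snooping disabled; my_sync_for_cpu/my_sync_for_device are hypothetical helper names, clflush_cache_range() is the real kernel function:

    #include <asm/cacheflush.h>

    /* Before the CPU reads a buffer the device has DMA'd into: flush and drop
     * the CPU's (possibly stale) cached copy so the reads go to RAM. */
    static void my_sync_for_cpu(void *buf, unsigned int len)
    {
        clflush_cache_range(buf, len);
    }

    /* Before handing a buffer to the device: write back any dirty CPU cache
     * lines so the device sees the CPU's latest data. */
    static void my_sync_for_device(void *buf, unsigned int len)
    {
        clflush_cache_range(buf, len);
    }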
Refs:
http://lkml.indiana.edu/hypermail/linux/kernel/0709.0/1329.html