Threads on same core accessing same cache line - multithreading

I am working in a bare-metal environment and thus evaluating performance at a low level. How should I expect two threads on the same core to perform when writing to different sections of the same cache line?
I am somewhat new to multicore/multithread architectures. I understand that when different cores write to the same cache line, locks or atomic operations are required to avoid race conditions. At the same time, sharing a cache line between cores also sets one up for performance issues such as false sharing.
However, do I need to worry about similar things when the two threads are on the same core? I'm unsure, since they share the same cache and there are multiple load-store units. For example, say thread1 writes to section1 of the cache line at the same time that thread2 wants to write to section2 of the cache line. Does each thread just modify its own section of the cache line, or do they read the full line, modify their section, and write the full line back into the cache? If it's the latter, do I need to worry about race conditions or performance delays?

You are over-complicating this.
There are different layers of caches. The details depend very specifically on the CPU you are using, not just generically x86 or ARM but the particular architecture version/generation. Typically you have an L1 cache intimately connected to each individual core, and L2 is where the cores come together on the way to the shared memory/address space.
All a cache does, at whatever layer, is sit on the memory bus and watch transactions go by. If a transaction is tagged as cacheable, the cache examines its tags to see whether there is a hit or a miss and acts accordingly. The cache does not know, cannot know, and does not care who or what caused that transaction: which instruction it came from, which task/program/thread issued it, whether it was a prefetch or a DMA engine. It just sees a transaction like any other and follows the rules: pass it through if it is not cacheable; if it is cacheable, look for a hit and handle the hit or miss.
So if you have more than one core/CPU hitting a shared cache, and they happen to be accessing memory close enough together that it falls in the same cache line, the cache simply reacts accordingly.
If you have two threads on the same CPU, the whole "at the same time" notion doesn't really apply (and it doesn't truly apply with separate cores sharing a cache either); the accesses could be one clock apart, but it is a shared bus, generally not dual- or multi-ported at this level. Regardless, the cache acts per its design: ignore and pass through if the access is marked as not cacheable, or search for a hit if it is cacheable and act accordingly.
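To make the performance side of this concrete, below is a minimal C++ sketch (the struct names, iteration counts, and the 64-byte line size are illustrative assumptions) that times two threads incrementing counters that share a cache line versus counters padded onto separate lines. On two separate cores the padded version is typically much faster, because the cores stop fighting over ownership of one line; on two hardware threads of the same core the gap is usually far smaller, since they share the same L1 cache. Note that each thread only ever touches its own counter, so there is no race to guard against in either case; the cost is purely a cache-line ownership effect.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters deliberately packed into one cache line (64 bytes is assumed).
struct SameLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same counters, each padded onto its own (assumed 64-byte) line.
struct SeparateLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long time_ms(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    SameLine s;
    SeparateLines p;
    std::printf("same cache line:      %lld ms\n", time_ms(s));
    std::printf("separate cache lines: %lld ms\n", time_ms(p));
}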

Related

Does a mutex lock guarantee that a thread will always store updated values into the main memory?

a. Does accessing a memory location with a mutex lock mean that whatever the critical code does to the mutexed variables will end up in main memory, and not only be updated inside the thread's cache or registers with no fresh copy of the values in main memory?
b. If that's the case, aren't we effectively running the critical code as if we didn't have a cache (at least no cache locations for the mutex-locked variables)?
c. And if that is the case, isn't the critical code then heavyweight code that needs to be as small as possible, considering the continued need to read from and write to main memory at least at the beginning and end of the mutex-locked session?
a. Does accessing a memory location with a mutex lock mean that whatever the critical code does to the mutexed variables will end up in main memory, and not only be updated inside the thread's cache or registers with no fresh copy of the values in main memory?
A correctly implemented mutex guarantees that previous writes are visible to other agents (e.g. other CPUs) when the mutex is released. On systems with cache coherency (e.g. 80x86), modifications are visible as soon as they're in a cache, and it doesn't matter whether the modifications have reached main memory.
Essentially (over-simplified), for cache coherency, when another CPU wants the modified data it broadcasts a request (like "Hey, I want the data at address 123456"); if the data is in another CPU's cache, that CPU responds with "Here's the data you wanted", and if the data isn't in any cache, the memory controller responds with "Here's the data you wanted". Either way the CPU gets the most recent version of the data, regardless of where it was or what responded to the request. In practice it's a lot more complex; I'd recommend reading about the MESI cache control protocol if you're interested ( https://en.wikipedia.org/wiki/MESI_protocol ).
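As a rough, portable illustration of that guarantee (the variable and function names below are made up for the example): everything the critical section wrote before the unlock is visible to whichever thread takes the lock next, whether or not the data has reached main memory.

#include <cassert>
#include <mutex>
#include <thread>

std::mutex m;
int shared_value = 0;
bool ready = false;

void producer() {
    std::lock_guard<std::mutex> g(m);
    shared_value = 42;   // may only reach this core's cache for now
    ready = true;        // releasing the mutex publishes both writes
}

void consumer() {
    for (;;) {
        std::lock_guard<std::mutex> g(m);
        if (ready) {
            // Guaranteed to observe 42: the unlock in producer()
            // happens-before this lock acquisition.
            assert(shared_value == 42);
            return;
        }
    }
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}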
b. If that's the case, aren't we effectively running the critical code as if we didn't have a cache (at least no cache locations for the mutex-locked variables)?
If that were the case (e.g. if there were no cache coherency), something (the code to release a mutex) would have to ensure that modified data is written back to RAM before the mutex can be acquired by something else. This doesn't prevent the cache from being used inside the critical section (e.g. the critical section could write to the cache, and the modified data could be sent from cache to RAM afterwards).
The cost would depend on various factors (CPU speed, cache speed and memory speed, and whether the cache is "write back" or "write through", and how much data is modified). For some cases (relatively slow CPU with write-through caches) the cost may be almost nothing.
c. And if that is the case, isn't the critical code then heavyweight code that needs to be as small as possible, considering the continued need to read from and write to main memory at least at the beginning and end of the mutex-locked session?
It's not as heavy as not using caches.
Synchronizing access (regardless of how it's done) is always going to be more expensive than not synchronizing access (and crashing because all your data got messed up). ;-)
One of the challenges of multi-threaded code is finding a good compromise between the cost of synchronization and parallelism - a small number of locks (or a single global lock) reduces the cost of synchronization but limits parallelism (threads getting nothing done waiting to acquire a lock); and a large number of locks increases the cost of synchronization (e.g. acquiring more locks is more expensive than acquiring one) but allows more parallelism.
Of course parallelism is also limited by the number of CPUs you have; which means that a good compromise for one system (with few CPUs) may not be a good compromise on another system (with lots of CPUs).
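A small C++ sketch of that compromise (the bucket count and data structure are arbitrary choices for illustration): one global lock is simple but serializes every caller, while per-bucket "striped" locks add a little locking machinery yet let threads touching different buckets run in parallel.

#include <array>
#include <mutex>
#include <unordered_map>

// Coarse-grained: one lock for the whole table.
class CoarseCounters {
    std::mutex m_;
    std::unordered_map<int, long> counts_;
public:
    void add(int key) {
        std::lock_guard<std::mutex> g(m_);  // every thread queues here
        ++counts_[key];
    }
};

// Finer-grained: one lock per bucket (lock striping).
class StripedCounters {
    static constexpr unsigned kBuckets = 16;
    struct Bucket {
        std::mutex m;
        std::unordered_map<int, long> counts;
    };
    std::array<Bucket, kBuckets> buckets_;
public:
    void add(int key) {
        Bucket& b = buckets_[static_cast<unsigned>(key) % kBuckets];
        std::lock_guard<std::mutex> g(b.m);  // only contends within one bucket
        ++b.counts[key];
    }
};

With only a few CPUs the coarse version is often good enough; with many CPUs the striped version tends to scale better, which is exactly the compromise described above.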

Why doesn't the instruction reorder issue occur on a single CPU core?

From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The following example is taken from Memory Reordering Caught in the Act (the original question shows the two threads' code and the recorded result as pictures, omitted here).
I think the recorded instructions can also cause an issue on a single CPU, because in the recorded result neither r1 nor r2 is 1.
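For readers without the pictures, here is a minimal C++ sketch of that kind of test (the names X, Y, r1, r2 follow the article's convention; the relaxed atomics stand in for the article's plain assembly stores and loads, and the article additionally uses semaphores and random delays to make the reordering easier to catch):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    r1 = Y.load(std::memory_order_relaxed);   // may effectively run before the store
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);   // may effectively run before the store
}

int main() {
    for (int i = 0; i < 100000; ++i) {
        X = 0; Y = 0;
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)               // both loads ran "ahead of" the stores
            std::printf("reordering observed on iteration %d\n", i);
    }
}

The question is essentially whether the r1 == 0 && r2 == 0 outcome could also show up if both functions were time-sliced onto a single core; the answers below argue that it cannot.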
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows its own reordering and can do clever tricks to pretend it doesn't. Thus things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for it. Imagine CPU #1 waits for a write to variableA and then reads from variableB. If CPU #2 writes to variableB and then to variableA, as its code says, no problem occurs. If CPU #2 reorders and writes to variableA first, CPU #1 doesn't know that and tries to read from variableB before it has a value. This can cause crashes or any kind of "random" behavior. (Intel chips have stronger store-ordering guarantees that prevent this particular case.)
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in: if they're reordered, they're both in progress, and the CPU won't switch threads until both are written, at which point they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to rearrange the two instructions from process 1, be interrupted by process 2, and execute process 2 between the two instructions. But if it is interrupted between them, it knows it has to abort both of them, since it knows about its own reordering and knows it's in a dangerous state. So it will either do them in order, do both before switching to process 2, or do neither before switching to process 2. All of these avoid the reordering problem.
There are multiple effects at work here, but they are modeled as just one effect because that makes it easier to reason about them. Yes, a modern core already reorders instructions by itself. But it maintains the logical flow between them: if two instructions have an inter-dependency, they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon; it would be next to impossible to write a program if that weren't the case. But the same guarantee cannot be provided by the memory controller. It has the unenviable job of giving multiple processors access to the same shared memory.
First is the prefetcher: it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes, so the core won't stall waiting for the read to complete. The problem is that, because memory was read early, it might be a stale value that was changed by another core between the time the prefetch was done and the time the read instruction executes. To an outside observer it looks like the instruction executed early.
Then there is the store buffer: it takes the data of a write instruction and writes it lazily to memory, later, after the instruction has executed, so the core won't stall waiting for the memory bus write cycle to complete. To an outside observer it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.
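To tie this back to the earlier example: on multiple cores, what defeats the store-buffer effect is a full (StoreLoad) barrier between each store and the following load. Below is a hedged C++ sketch reusing the X/Y/r1/r2 names from the earlier snippet (this is not the article's exact code).

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1_fenced() {
    X.store(1, std::memory_order_relaxed);
    // Full barrier: orders the store above before the load below
    // (typically an mfence or a locked instruction on x86).
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2_fenced() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}
// With both fences in place, the outcome r1 == 0 && r2 == 0 can no longer occur.

On a single core the fences change nothing observable, which is precisely the point: the core already resolves its own reordering.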

Memory barriers force cache coherency?

I was reading this question about using a bool for thread control and got intrigued by this answer by @eran:
Using volatile is enough only on single cores, where all threads use the same cache. On multi-cores, if stop() is called on one core and run() is executing on another, it might take some time for the CPU caches to synchronize, which means two cores might see two different views of isRunning_.
If you use synchronization mechanisms, they will ensure all caches get the same values, in the price of stalling the program for a while. Whether performance or correctness is more important to you depends on your actual needs.
I have spent over an hour searching for some statement that says synchronization primitives force cache coherency but have failed. The closest I have come is Wikipedia:
The keyword volatile does not guarantee a memory barrier to enforce cache-consistency.
Which suggests that memory barriers do force cache consistency, and since some synchronization primitives are implemented using memory barriers (again from Wikipedia) this is some "evidence".
But I don't know enough to be certain whether to believe this or not, and be sure that I'm not misinterpreting it.
Can someone please clarify this?
Short Answer: Cache coherency works most of the time, but not always. You can still read stale data. If you don't want to take chances, just use a memory barrier.
Long Answer: The CPU core is no longer directly connected to the main memory; all loads and stores have to go through the cache. The fact that each CPU has its own private cache causes new problems. If more than one CPU is accessing the same memory, it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet to main memory) and a second processor tries to read the same memory location, the read operation cannot just go out to main memory; instead, the content of the first processor's cache line is needed. The question now is when this cache line transfer has to happen, and that is pretty easy to answer: when one processor needs a cache line which is dirty in another processor's cache, for reading or writing. But how can a processor determine whether a cache line is dirty in another processor's cache? Assuming it is dirty just because the line is loaded by another processor would be suboptimal (at best); usually the majority of memory accesses are reads and the resulting cache lines are not dirty. This is where cache coherency protocols come in: CPUs maintain data consistency across their caches via MESI or some other cache coherence protocol.
With cache coherency in place, shouldn't we always see the latest value of a cache line, even if it was modified by another CPU? After all, that is the whole purpose of cache coherency protocols. Usually, when a cache line is modified, the corresponding CPU sends an "invalidate cache line" request to all other CPUs. It turns out that CPUs can acknowledge an invalidate request immediately but defer the actual invalidation of the cache line to a later point in time. This is done via invalidation queues. If we are unlucky enough to read the cache line within this short window (between the CPU acknowledging the invalidation request and actually invalidating the cache line), then we can read a stale value. Why would a CPU do such a horrible thing? The simple answer is PERFORMANCE. So let's look at different scenarios where invalidation queues can improve performance.
Scenario 1: CPU1 receives an invalidation request from CPU2 while it also has a lot of stores and loads queued up for the cache. Invalidating the requested cache line then takes time, and CPU2 would be stalled waiting for the acknowledgment.
Scenario 2: CPU1 receives a lot of invalidation requests in a short amount of time, and it takes CPU1 a while to work through all of the cache lines to invalidate.
Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. So invalidation queues are the reason why we may not see the latest value even when doing a simple read of a single variable.
Now the keen reader might be thinking: when the CPU wants to read a cache line, it could scan the invalidation queue first before reading from the cache. That should avoid the problem. However, the CPU and the invalidation queue are physically placed on opposite sides of the cache, which prevents the CPU from directly accessing the invalidation queue. (The invalidation queues of one CPU's cache are populated by cache coherency messages from other CPUs via the system bus, so it makes sense for the invalidation queues to sit between the cache and the system bus.) So in order to actually see the latest value of any shared variable, we should empty the invalidation queue; a read memory barrier usually does exactly that.
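In portable C++, the write-barrier/read-barrier pairing described above corresponds to a release store paired with an acquire load. A minimal sketch (the flag and payload names are invented for the example):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // ordinary, non-atomic data
std::atomic<bool> published{false}; // guard flag

void writer() {
    payload = 123;
    // Write barrier: the payload write is ordered before the flag write.
    published.store(true, std::memory_order_release);
}

void reader() {
    // Read barrier: once the flag is seen, a stale value of payload
    // (e.g. one hiding behind an invalidation queue) must not be used.
    while (!published.load(std::memory_order_acquire)) { /* spin */ }
    assert(payload == 123);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}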
I just talked about invalidation queues and read memory barriers. [1] is a good reference for understanding the need for read and write memory barriers and the details of the MESI cache coherency protocol.
[1] http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
As I understand it, synchronization primitives won't affect cache coherency at all. "Cache" is French for "hidden": it's not supposed to be visible to the user, and a cache coherency protocol should work without the programmer's involvement.
Synchronization primitives will affect the memory ordering, which is well defined and visible to the user through the processor's ISA.
A good source with detailed information is A Primer on Memory Consistency and Cache Coherence from the Synthesis Lectures on Computer Architecture collection.
EDIT: To clarify your doubt
The Wikipedia statement is slightly wrong. I think the confusion might come from the terms memory consistency and cache coherency. They don't mean the same thing.
The volatile keyword in C means that the variable is always read from memory (as opposed to a register) and that the compiler won't reorder loads/stores around it. It doesn't mean the hardware won't reorder the loads/stores. This is a memory consistency problem. When using weaker consistency models the programmer is required to use synchronization primitives to enforce a specific ordering. This is not the same as cache coherency. For example, if thread 1 modifies location A, then after this event thread 2 loads location A, it will receive an updated (consistent) value. This should happen automatically if cache coherency is used. Memory ordering is a different problem. You can check out the famous paper Shared Memory Consistency Models: A Tutorial for more information. One of the better known examples is Dekker's Algorithm which requires sequential consistency or synchronization primitives.
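As a hedged sketch of that last point (simplified Dekker-style flags; the names wants0/wants1 are invented, and the full algorithm also needs a turn variable and retry loop): sequential consistency is what stops each thread's check of the other's flag from being satisfied before its own flag store becomes visible.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> wants0{false}, wants1{false};
bool entered0 = false, entered1 = false;  // each written by one thread, read after join

void enter0() {
    wants0.store(true, std::memory_order_seq_cst);
    if (!wants1.load(std::memory_order_seq_cst))
        entered0 = true;   // would be the critical section
    // Note: in this simplified sketch neither thread may get in;
    // real Dekker adds a turn variable and a retry loop.
}

void enter1() {
    wants1.store(true, std::memory_order_seq_cst);
    if (!wants0.load(std::memory_order_seq_cst))
        entered1 = true;
}

int main() {
    std::thread t0(enter0), t1(enter1);
    t0.join();
    t1.join();
    // With seq_cst at most one of the two can have entered; with weaker
    // orderings (or plain/volatile variables) both checks could pass.
    assert(!(entered0 && entered1));
}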
EDIT2: I would like to clarify one thing. While my cache coherency example is correct, there is a situation where memory consistency might seem to overlap with it. This is when stores have been executed in the processor but are delayed on their way to the cache (they are sitting in a store queue/buffer). Since the processor's cache hasn't received the updated value, the other caches won't either. This may seem like a cache coherency problem, but in reality it is not; it is part of the memory consistency model of the ISA. In this case synchronization primitives can be used to flush the store queue to the cache. With this in mind, the Wikipedia text that you highlighted in bold is correct, but this other one is still slightly wrong: The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. It should say: The keyword volatile does not guarantee a memory barrier to enforce memory consistency.
What Wikipedia tells you is that volatile does not mean a memory barrier will be inserted to enforce cache consistency. A proper memory barrier will, however, enforce that memory access between multiple CPU cores is consistent; you may find reading the std::memory_order documentation helpful.
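Tying this back to the stop-flag code that prompted the question, the usual advice in portable C++ is to use std::atomic rather than volatile, since it gives both the compiler-level and hardware-level ordering guarantees. A minimal sketch reusing the isRunning_/run()/stop() names from the quoted answer (the surrounding class is otherwise invented):

#include <atomic>
#include <thread>

class Worker {
    std::atomic<bool> isRunning_{true};   // a volatile bool would not be enough in general
public:
    void run() {
        while (isRunning_.load(std::memory_order_acquire)) {
            // ... do a unit of work ...
        }
    }
    void stop() {
        isRunning_.store(false, std::memory_order_release);
    }
};

int main() {
    Worker w;
    std::thread t(&Worker::run, &w);
    w.stop();     // the worker thread is guaranteed to observe the store eventually
    t.join();
}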

When can the CPU ignore the LOCK prefix and use cache coherency?

I originally thought cache coherency protocols such as MESI could provide pseudo-atomicity, but only across individual memory load/store instructions. If I were performing a fetch, modify, write combination of instructions, MESI alone wouldn't be able to enforce atomicity from the first instruction to the last.
However, section 8 of the Intel reference manual Vol 3a says:
8.1.4 Effects of a LOCK Operation on Internal Processor Caches
For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
This seems to contradict my understanding by implying that the LOCK instruction doesn't need to be used, since cache coherency can be used instead?
There's a difference between locking as a concept and the actual bus LOCK# signal: the latter is one of the means of implementing the former. Cache locking is another one that is much simpler and more efficient.
The MESI protocol guarantees that if a line is held exclusively by a certain core (either modified or not), no one else has it. In this case you can perform multiple operations atomically by adding a simple flag in the cache that blocks external snoops until the operations are done. This has the same effect as the lock concept dictates, since no one else may change or even observe the intermediate values.
In more complicated cases, the line is not held by a single cache (e.g. it may be shared between several caches, or the access may be split between two cache lines with only one of them in your cache; the list of scenarios is usually implementation specific and probably not disclosed by the CPU manufacturer). In such cases you may have to resort to "heavier" cannons like the bus lock, which usually guarantees that no one can do anything on the shared bus. Obviously this has a huge impact on performance, so it is probably only used when there is no other choice. In most cases a simple cache-level lock should be enough. Note that newer schemes like Intel TSX seem to work in a similar manner, offering optimizations when you're working from within the cache.
By the way, your assumption about pseudo-atomicity for an individual instruction is also wrong. It would be correct if you referred to a single memory operation (load or store), since an instruction may include several of them (inc [addr], for example, would not be atomic without a lock). Another restriction, which also appears in your quote, is that the access needs to be contained in a cache line: split lines don't guarantee atomicity even within a single load or store (since they're usually implemented as two memory operations that are later merged).
Reading the excerpt you give, I don't find it contradictory to the use of LOCKed instructions. For example, consider the INC instruction. Without the LOCK, it can read the original value with its cache line in the SHARED state, which does not prevent other cores sharing the same cache from concurrently reading the same value before storing the same incremented result: a data race.
I interpret the quote as saying that data integrity is guaranteed at cache-line granularity, so additional care may not be necessary when the data fits in one cache line. But if the data crosses the boundary of two cache lines, it is necessary to assert that modifications to both of them are treated atomically.
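A short C++ illustration of the point about inc [addr] (the iteration counts are arbitrary): a plain ++ on a shared int is a separate load, add, and store and loses updates under contention, whereas std::atomic's fetch_add compiles to a LOCK-prefixed read-modify-write (e.g. lock xadd on x86) and relies on exactly the cache-locking mechanism the manual describes.

#include <atomic>
#include <cstdio>
#include <thread>

int plain = 0;                        // ++plain is a separate load, add, store
std::atomic<int> locked{0};           // fetch_add is a single locked read-modify-write

void hammer() {
    for (int i = 0; i < 1000000; ++i) {
        ++plain;                                        // data race: updates can be lost
        locked.fetch_add(1, std::memory_order_relaxed); // atomic even across cores
    }
}

int main() {
    std::thread t1(hammer), t2(hammer);
    t1.join();
    t2.join();
    std::printf("plain:  %d (often less than 2000000)\n", plain);
    std::printf("atomic: %d (always 2000000)\n", locked.load());
}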

What's the point of cache coherency?

On CPUs like x86, which provide cache coherency, how is this useful from a practical perspective? I understand that the idea is to make memory updates done on one core immediately visible on all other cores. This is a useful property. However, one can't rely too heavily on it if not writing in assembly language, because the compiler can store variable assignments in registers and never write them to memory. This means that one must still take explicit steps to make sure that stuff done in other threads is visible in the current thread. Therefore, from a practical perspective, what has cache coherency achieved?
The short story is that non-cache-coherent systems are exceptionally difficult to program, especially if you want to maintain efficiency, which is also the main reason even most NUMA systems today are cache-coherent.
If the caches weren't coherent, the "explicit steps" would have to enforce the coherency; explicit steps are usually things like critical sections/mutexes (volatile in C/C++ is rarely enough). It's quite hard, if not impossible, for services such as mutexes to keep track of only the memory that has changed and needs to be updated in all the caches; they would probably have to update all the memory, and that is if they could even track which cores have which pieces of that memory in their caches.
Presumably the hardware can do a much better and more efficient job of tracking the memory addresses/ranges that have changed and keeping them in sync.
Also, imagine a process running on core 1 that gets preempted and, when it gets scheduled again, is scheduled on core 2.
This would be pretty fatal if the caches weren't coherent, as otherwise there might be remnants of the process data in the cache of core 1 which don't exist in core 2's cache. For systems working that way, the OS would have to enforce cache coherency as threads are scheduled, which would probably be an "update all the memory in caches between all the cores" operation; or perhaps it could track dirty pages with the help of the MMU and only sync the memory pages that have changed. Again, the hardware likely keeps the caches coherent in a more fine-grained and efficient way.
There are some nuances not covered by the great responses from the other authors.
First off, consider that a CPU doesn't deal with memory byte-by-byte, but with cache lines. A line might have 64 bytes. Now, if I allocate a 2-byte piece of memory at location P, and another CPU allocates an 8-byte piece of memory at location P + 8, and both P and P + 8 live on the same cache line, observe that without cache coherence the two CPUs can't concurrently update P and P + 8 without clobbering each other's changes! Because each CPU does a read-modify-write on the cache line, they might both write out a copy of the line that doesn't include the other CPU's changes! The last writer would win, and one of your modifications to memory would have "disappeared"!
The other thing to bear in mind is the distinction between coherency and consistency. Because even x86-derived CPUs use store buffers, there aren't the guarantees you might expect that instructions which have already finished have modified memory in such a way that other CPUs can see those modifications, even if the compiler has decided to write the value back to memory (maybe because of volatile?). Instead the modifications may be sitting around in store buffers. Pretty much all CPUs in general use are cache coherent, but very few CPUs have a consistency model that is as forgiving as x86's. Check out, for example, http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html for more information on this topic.
Hope this helps, and BTW, I work at Corensic, a company that's building a concurrency debugger that you may want to check out. It helps pick up the pieces when assumptions about concurrency, coherence, and consistency prove unfounded :)
Imagine you do this:
lock(); //some synchronization primitive e.g. a semaphore/mutex
globalint = somevalue;
unlock();
If there were no cache coherence, that last unlock() would have to assure that globalint is now visible everywhere; with cache coherence all you need to do is write it to memory and let the hardware do the magic. A software solution would have to keep track of which memory exists in which caches, on which cores, and somehow make sure they're atomically in sync.
You'd win an award if you could find a software solution that keeps track of all the pieces of memory in the caches that need to be kept in sync and is more efficient than the current hardware solution.
Cache coherency becomes extremely important when you are dealing with multiple threads and are accessing the same variable from multiple threads. In that particular case, you have to ensure that all processors/cores do see the same value if they access the variable at the same time, otherwise you'll have wonderfully non-deterministic behaviour.
It's not needed for locking. The locking code would include cache flushing if that was needed. It's mainly needed to ensure that concurrent updates by different processors to different variables in the same cache line aren't lost.
Cache coherency is implemented in hardware so that the programmer doesn't have to worry about making sure all threads see the latest value of a memory location while operating in a multicore/multiprocessor environment. Cache coherence gives the abstraction that all cores/processors are operating on a single unified cache, even though every core/processor has its own individual cache.
It also makes sure that legacy multi-threaded code works as-is on new processor models/multiprocessor systems, without any code changes to ensure data consistency.

Resources