Is synchronization for variable change cheaper then for something else?

Is synchronization for variable change cheaper then for something else? - multithreading

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I never saw those things happen I assume that there is some system level synchronization when writing variables. I assume, that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a topic and not totally fictious from me.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?

Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads processes, OS, etc. It is not a matter of synchronization, just required to insure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 et locked when at 1.
The basic method to do that is
got_the_lock=0
while(!got_the_lock)
fetch lock value from memory
set lock value in memory to 1
got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one need atomic memory access. An atomic access is typically a read-modify-write cycle to a data in memory that cannot interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operation and it is realized thanks tp specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be a regular access. Atomic access is mostly use to synchronize threads to insure that only one thread is in a critical section.
So why are atomic access expensive ? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction order may be different from instruction program order, provided the semantic of the program is respected. This is heavily exploited to improve performances : compiler reorder instructions, processor execute them out-of-order, write-back caches write data in memory in any order, and memory write buffer do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffer that requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.

Synchronization always has a cost involved. And the cost increases with contention due to threads waking up, fighting for lock and only one gets it, and the rest go to sleep resulting in lot of context switches.
However, such contention can be kept at a minimum by using synchronization at a much granular level as in a CAS (compare and swap) operation by CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing a IO and therefore run high chance of creating a contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason the non-blocking algorithms like Treiber Stack Michael Scott exist. They do a their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.

isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, None of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

Related

Will atomic operations block other threads?

I am trying to make "atomic vs non atomic" concept settled in my mind. My first problem is I could not find "real-life analogy" on that. Like customer/restaurant relationship over atomic operations or something similar.
Also I would like to learn about how atomic operations places themselves in thread-safe programming.
In this blog post; http://preshing.com/20130618/atomic-vs-non-atomic-operations/
it is mentioned as:
An operation acting on shared memory is atomic if it completes in a
single step relative to other threads. When an atomic store is
performed on a shared variable, no other thread can observe the
modification half-complete. When an atomic load is performed on a
shared variable, it reads the entire value as it appeared at a single
moment in time. Non-atomic loads and stores do not make those
guarantees.
What is the meaning of "no other thread can observe the modification half-complete"?
That means thread will wait until atomic operation is done? How that thread know about that operation is atomic? For example in .NET I can understand if you lock the object you set a flag to block other threads. But what about atomic? How other threads know difference between atomic and non-atomic operations?
Also if above statement is true, do all atomic operations are thread-safe?

Let's clarify a bit what is atomic and what are blocks. Atomicity means that operation either executes fully and all it's side effects are visible, or it does not execute at all. So all other threads can either see state before the operation or after it. Block of code guarded by a mutex is atomic too, we just don't call it an operation. Atomic operations are special CPU instructions which conceptually are similar to usual operation guarded by a mutex (you know what mutex is, so I'll use it, despite the fact that it is implemented using atomic operations). CPU has a limited set of operations which it can execute atomically, but due to hardware support they are very fast.
When we discuss thread blocks we usually involve mutexes in conversation because code guarded by them can take quite a time to execute. So we say that thread waits on a mutex. For atomic operations situation is the same, but they are fast and we usually don't care for delays here, so it is not that likely to hear words "block" and "atomic operation" together.
That means thread will wait until atomic operation is done?
Yes it will wait. CPU will restrict access to a block of memory where the variable is located and other CPU cores will wait. Note that for performance reasons that blocks are held only between atomic operations themselves. CPU cores are allowed to cache variables for read.
How that thread know about that operation is atomic?
Special CPU instructions are used. It is just written in your program that particular operation should be performed in atomic manner.
Additional information:
There are more tricky parts with atomic operations. For example on modern CPUs usually all reads and writes of primitive types are atomic. But CPU and compiler are allowed to reorder them. So it is possible that you change some struct, set a flag that telling that it is changed, but CPU reorders writes and sets flag before the struct is actually committed to memory. When you use atomic operations usually some additional efforts are done to prevent undesired reordering. If you want to know more, you should read about memory barriers.
Simple atomic stores and writes are not that useful. To make maximal use of atomic operations you need something more complex. Most common is a CAS - compare and swap. You compare variable with a value and change it only if comparison was successful.

On typical modern CPUs, atomic operations are made atomic this way:
When an instruction is issued that accesses memory, the core's logic attempts to put the core's cache in the correct state to access that memory. Typically, this state will be achieved before the memory access has to happen, so there is no delay.
While another core is performing an atomic operation on a chunk of memory, it locks that memory in its own cache. This prevents any other core from acquiring the right to access that memory until the atomic operation completes.
Unless two cores happen to be performing accesses to many of the same areas of memory and many of those accesses are writes, this typically won't involve any delays at all. That's because the atomic operation is very fast and typically the core knows in advance what memory it will need access to.
So, say a chunk of memory was last accessed on core 1 and now core 2 wants to do an atomic increment. When the core's prefetch logic sees the modification to that memory in the instruction stream, it will direct the cache to acquire that memory. The cache will use the intercore bus to take ownership of that region of memory from core 1's cache and it will lock that region in its own cache.
At this point, if another core tries to read or modify that region of memory, it will be unable to acquire that region in its cache until the lock is released. This communication takes place on the bus that connects the caches and precisely where it takes place depends on which cache(s) the memory was in. (If not in cache at all, then it has to go to main memory.)
A cache lock is not normally described as blocking a thread both because it is so fast and because the core is usually able to do other things while it's trying to acquire the memory region that is locked in the other cache. From the point of view of the higher-level code, the implementation of atomics is typically considered an implementation detail.
All atomic operations provide the guarantee that an intermediate result will not be seen. That's what makes them atomic.

The atomic operations you describe are instructions within the processor and the hardware will make sure that a read cannot happen on a memory location until the atomic write is complete. This guarantees that a thread either reads the value before write or the value after the write operation, but nothing in-between - there's no chance of reading half of the bytes of the value from before the write and the other half from after the write.
Code running against the processor is not even aware of this block but it's really no different from using a lock statement to make sure that a more complex operation (made up of many low-level instructions) is atomic.
A single atomic operation is always thread-safe - the hardware guarantees that the effect of the operation is atomic - it'll never get interrupted in the middle.
A set of atomic operations is not atomic in the vast majority of cases (I'm not an expert so I don't want to make a definitive statement but I can't think of a case where this would be different) - this is why locking is needed for complex operations: the entire operation may be made up of multiple atomic instructions but the whole of the operation may still be interrupted between any of those two instructions, creating the possibility of another thread seeing half-baked results. Locking ensures that code operating on shared data cannot access that data until the other operation completes (possibly over several thread switches).
Some examples are shown in this question / answer but you find many more by searching.

Being "atomic" is an attribute that applies to an operation which is enforced by the implementation (either the hardware or the compiler, generally speaking). For a real-life analogy, look to systems requiring transactions, such as bank accounts. A transfer from one account to another involves a withdrawal from one account and a deposit to another, but generally these should be performed atomically - there is no time when the money has been withdrawn but not yet deposited, or vice versa.
So, continuing the analogy for your question:
What is the meaning of "no other thread can observe the modification half-complete"?
This means that no thread could observe the two accounts in a state where the withdrawal had been made from one account but it had not been deposited in another.
In machine terms, it means that an atomic read of a value in one thread will not see a value with some bits from before an atomic write by another thread, and some bits from after the same write operation. Various operations more complex than just a single read or write can also be atomic: for instance, "compare and swap" is a commonly implemented atomic operation that checks the value of a variable, compares it to a second value, and replaces it with another value if the compared values were equal, atomically - so for instance, if the comparison succeeds, it is not possible for another thread to write a different value in between the compare and the swap parts of the operation. Any write by another thread will either be performed wholly before or wholly after the atomic compare-and-swap.
The title to your question is:
Will atomic operations block other threads?
In the usual meaning of "block", the answer is no; an atomic operation in one thread won't by itself cause execution to stop in another thread, although it may cause a livelock situation or otherwise prevent progress.
That means thread will wait until atomic operation is done?
Conceptually, it means that they will never need to wait. The operation is either done, or not done; it is never halfway done. In practice, atomic operations can be implemented using mutexes, at a significant performance cost. Many (if not most) modern processors support various atomic primitives at the hardware level.
Also if above statement is true, do all atomic operations are thread-safe?
If you compose atomic operations, they are no longer atomic. That is, I can do one atomic compare-and-swap operation followed by another, and the two compare-and-swaps will individually be atomic, but they are divisible. Thus you can still have concurrency errors.

Atomic operation means the system performs an operation in its entirety or not at all. Reading or writing an int64 is atomic (64bits System & 64bits CLR) because the system read/write the 8 bytes in one single operation, readers do not see half of the new value being stored and half of the old value. But be carefull :
long n = 0; // writing 'n' is atomic, 64bits OS & 64bits CLR
long m = n; // reading 'n' is atomic
....// some code
long o = n++; // is not atomic : n = n + 1 is doing a read then a write in 2 separate operations
To make atomicity happens to the n++ you can use the Interlocked API :
long o = Interlocked.Increment(ref n); // other threads are blocked while the atomic operation is running

Deciding the critical section of kernel code

Hi I am writing kernel code which intends to do process scheduling and multi-threaded execution. I've studied about locking mechanisms and their functionality. Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking (mutex/semaphores/spinlocks)?
I know that where ever there is chance of concurrency in part of code, we require lock. But how do we decide, what if we miss and test cases don't catch them. Earlier I wrote code for system calls and file systems where I never cared about taking locks.

Is there a thumb rule regarding what sort of data structure in critical section should be protected by locking?
Any object (global variable, field of the structure object, etc.), accessed concurrently when one access is write access requires some locking discipline for access.
But how do we decide, what if we miss and test cases don't catch them?
Good practice is appropriate comment for every declaration of variable, structure, or structure field, which requires locking discipline for access. Anyone, who uses this variable, reads this comment and writes corresponded code for access. Kernel core and modules tend to follow this strategy.
As for testing, common testing rarely reveals concurrency issues because of their low probability. When testing kernel modules, I would advice to use Kernel Strider, which attempts to prove correctness of concurrent memory accesses or RaceHound, which increases probability of concurrent issues and checks them.

It is always safe to grab a lock for the duration of any code that accesses any shared data, but this is slow since it means only one thread at a time can run significant chunks of code.
Depending on the data in question though, there may be shortcuts that are safe and fast. If it is a simple integer ( and by integer I mean the native word size of the CPU, i.e. not a 64 bit on a 32 bit cpu ), then you may not need to do any locking: if one thread tries to write to the integer, and the other reads it at the same time, the reader will either get the old value, or the new value, never a mix of the two. If the reader doesn't care that he got the old value, then there is no need for a lock.
If however, you are updating two integers together, and it would be bad for the reader to get the new value for one and the old value for the other, then you need a lock. Another example is if the thread is incrementing the integer. That normally involves a read, add, and write. If one reads the old value, then the other manages to read, add, and write the new value, then the first thread adds and writes the new value, both believe they have incremented the variable, but instead of being incremented twice, it was only incremented once. This needs either a lock, or the use of an atomic increment primitive to ensure that the read/modify/write cycle can not be interrupted. There are also atomic test-and-set primitives so you can read a value, do some math on it, then try to write it back, but the write only succeeds if it still holds the original value. That is, if another thread changed it since the time you read it, the test-and-set will fail, then you can discard your new value and start over with a read of the value the other thread set and try to test-and-set it again.
Pointers are really just integers, so if you set up a data structure then store a pointer to it where another thread can find it, you don't need a lock as long as you set up the structure fully before you store its address in the pointer. Another thread reading the pointer ( it will need to make sure to read the pointer only once, i.e. by storing it in a local variable then using only that to refer to the structure from then on ) will either see the new structure, or the old one, but never an intermediate state. If most threads only read the structure via the pointer, and any that want to write do so either with a lock, or an atomic test-and-set of the pointer, this is sufficient. Any time you want to modify any member of the structure though, you have to copy it to a new one, change the new one, then update the pointer. This is essentially how the kernel's RCU ( read, copy, update ) mechanism works.

Ideally, you must enumerate all the resources available in your system , the related threads and communication, sharing mechanism during design. Determination of the following for every resource and maintaining a proper check list whenever change is made can be of great help :
The duration for which the resource will be busy (Utilization of resource) & type of lock
Amount of tasks queued upon that particular resource (Load) & priority
Type of communication, sharing mechanism related to resource
Error conditions related to resource
If possible, it is better to have a flow diagram depicting the resources, utilization, locks, load, communication/sharing mechanism and errors.
This process can help you in determining the missing scenarios/unknowns, critical sections and also in identification of bottlenecks.
On top of the above process, you may also need certain tools that can help you in testing / further analysis to rule out hidden problems if any :
Helgrind - a Valgrind tool for detecting synchronisation errors.
This can help in identifying data races/synchronization issues due
to improper locking, the lock ordering that can cause deadlocks and
also improper POSIX thread API usage that can have later impacts.
Refer : http://valgrind.org/docs/manual/hg-manual.html
Locksmith - For determining common lock errors that may arise during
runtime or that may cause deadlocks.
ThreadSanitizer - For detecting race condtion. Shall display all accesses & locks involved for all accesses.
Sparse can help to lists the locks acquired and released by a function and also identification of issues such as mixing of pointers to user address space and pointers to kernel address space.
Lockdep - For debugging of locks
iotop - For determining the current I/O usage by processes or threads on the system by monitoring the I/O usage information output by the kernel.
LTTng - For tracing race conditions and interrupt cascades possible. (A successor to LTT - Combination of kprobes, tracepoint and perf functionalities)
Ftrace - A Linux kernel internal tracer for analysing /debugging latency and performance related issues.
lsof and fuser can be handy in determining the processes having lock and the kind of locks.
Profiling can help in determining where exactly the time is being spent by the kernel. This can be done with tools like perf, Oprofile.
The strace can intercept/record system calls that are called by a process and also the signals that are received by a process. It shall show the order of events and all the return/resumption paths of calls.

How is atomicity implemented by the CPU?

I have been told/read online the cache coherency protocol MESI/MESIF:
http://en.wikipedia.org/wiki/MESI_protocol
also enforces atomicity- for example for a lock. However, this really really doesn't make sense to me for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI gurarantees atomicity, whats the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead- if they are implemented using the same cache coherency model as all other x86 instructions?
Generally-speaking could somebody please explain how the CPU implements locks at a low-level?

The LOCK prefix has one purpose, that is taking a lock on that address followed by instructing MESI to flush that cache line on all other processors followed so that reading or writing that address by all other processors (or hardware devices!) blocks until the lock is released (which it is at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration and the bus speed and latency is much lower than CPU speed.
General operation of LOCK instruction
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate part was erroneously done after establish lock) and so might be out of date.

As #voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is when 2 different agents (cores on a multicore chip, processors on a multi-socket one, etc..), access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent usage of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, on a cacheline granularity, which is part of the CPU charter on x86 (and many other architectures as well). It does more than that - it's a crucial part of memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread-safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of 2 distinct accesses to memory. without a lock in place, the CPU may receive a snoop in between, and must respond according to the MESI protocol. The following scenario is perfectly legal for e.g.:
core 0 | core 1
---------------------------------
y = read [x] |
increment y | store [x] <- z
|
store [x] <- y |
Meaning that your memory increment operation on core 0 didn't work as expected. If [x] holds a mutex for e.g, you may think it was free and that you managed to grab it, while core 1 already took it.
Having the read-modify-write operation on core 0 locked (and x86 provides many possible options, locked add/inc, locked compare-exchange, etc..), would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.

I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for your run of the mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write
queue, so that once the write has been dispatched, the core/cpu
continues processing instructions, while some other mechanics deals
with emptying the queue of pending writes -- negotiating with the
cache as required. On some processors the pending writes need not be
written away in the order they were put into the queue.
when reading from memory, if the required value is not immediately
available, the core/cpu may continue processing instructions, while
some other mechanics perform the required reads -- negotiating with
the cache as required.
all of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truely ghastly business of accessing real memory, via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.

How locking is implemented?

i have following code:
while(lock)
;
lock = 1;
// critical section
lock = 0;
As reading or changing lock value is in itself a multi-instruction
read lock
change value
write it
If it happens like:
1) One thread reads the lock and stops there
2) Another thread reads it and sees it is free; lock it and do something untill half
3) First thread wakes up and goes into CS
SO how would locking would be implmented in system ?
Placing variables over top of another variables is not right : it would be like Guarding the guard ?
Stopping other processors threads is also not right ?

It is 100% platform specific. Generally, the CPU provides some form of atomic operation such as exchange or compare and swap. A typical lock might work like this:
1) Create: Store 0 (unlocked) in the variable.
2) Lock: Atomically attempt to switch the value of the variable from 0 (unlocked) to 1 (locked). If we failed (because it wasn't unlocked to begin with), let the CPU rest a bit, and then retry. Use a memory barrier to ensure no future memory operations sneak behind this one.
3) Unlock: Use a memory barrier to ensure previous memory operations don't sneak past this one. Atomically write 0 (unlocked) to the variable.
Note that you really don't need to understand this unless you want to design your own synchronization primitives. And if you want to do that, you need to understand an awful lot more. It's certainly a good idea for every programmer to have a general idea of what he's making the hardware do. But this is an area filled with seriously heavy wizardry. There are so many, many ways this can go horribly wrong. So just use the locking primitives provided by the geniuses who made your platform, compiler, and threading library. Here be dragons.
For example, SMP Pentium Pro systems have an erratum that requires special handling in the unlock operation. A naive implementation of the lock algorithm will cause the branch prediction logic to expect the operation to keep spinning, incurring a massive performance penalty at the worst possible time -- when you first acquire the lock. A naive implementation of the lock algorithm may cause two cores each waiting for the same lock to saturate the bus, slowing the CPU that needs to get work done in order to release the lock to a crawl. These all require heavy wizardry and deep understanding of the hardware to deal with.

In a course I studied at Uni, a possible firmware solution for implementing locks was presented in the form of the "atomicity bit" associated to a memory operation initiated by a processor.
Basically, when locking, you'll notice that you have a sequence of operations that need to be executed atomically: test the value of the flag and, if not set, set it to locked, otherwise try again. This sequence can be made atomic by associating a bit with each memory request send by the CPU. The first N-1 operations will have the bit set, while the last one will have it unset, to mark the end of the atomic sequence.
When the memory module (there can be several modules) where the flag data is stored receives the request for the first operation in the sequence (whose bit is set), it will serve it and not take requests from any other CPU until the CPU that initiated the atomic sequence sends a request with an unset atomicity bit (since these transactions are usually short, a coarse-grain approach like this is acceptable). Note that this is usually made easier by the assembler providing specialized instructions of type "compare-and-set", that do exactly what I mentioned before.

Real World Examples of read-write in concurrent software

I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer,) but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.

What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
producer/consumer workflows. Web service receives http requests for data, places the request into an internal queue, worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread safe.
Data shared between threads with change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing the object. The memory management system (malloc/free) must be thread safe.
File system. This is almost always an OS service and already fully thread safe, but it's worth including in the list.
Reference counting. Releases the resource when the number of references drops to zero. The increment/decrement/test operations must be thread safe. These can usually be implemented using atomic primitives instead of full-stop kernal mutex locks.

Most real world, concurrent software, has some form of requirement for synchronization at some level. Often, better written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often do simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (ie: use of thread local state data, etc), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time, it's occuring on systems with 4-16 PEs, which means I'm usually restricting to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking between tens of elements instead of potentially millions.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string