When using atomics in Go (and other languages like C++), it's advised to use an atomic load operation for reading a concurrently written value.
If the definition (as I understand it) of an atomic write (be it a store or an integer increment) is that no thread can view a partial write, why is an atomic load required?
Would a plain load of the memory address always be safe from a torn view, if only atomic stores are used on that memory address?
This answer is mainly for C and C++ as I am not directly familiar with atomics in many other languages, but I suspect they are similar.
It's true that many actual machines work this way, in some cases. For instance, on x86-64, ordinary load instructions are atomic with respect to ordinary stores or locked read-modify-write instructions. So for types that can be loaded with a single instruction, you could in principle use ordinary assignment and avoid tearing.
But there are cases where this would not work. For instance:
Types which are not lock-free (e.g. structs of more than a couple of words). In this case, several instructions are needed to load or store, and so a lock must be taken around them, or tearing is entirely possible. The atomic load function knows to take the lock; an ordinary assignment wouldn't.
Types which can be lock-free but need special handling. For example, 64-bit long long int on x86-32. An ordinary load would execute two 32-bit integer load instructions (which are individually atomic), and so even if the store is atomic, it could happen in between. But the atomic load function can emit a 64-bit floating point or SIMD load, which is less efficient but does it in one atomic instruction. Example on godbolt.
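As a hedged sketch of that second case (not the exact godbolt example; the codegen noted in the comments is what one would typically see on x86-32, not a guarantee):

#include <atomic>
#include <cstdint>

int64_t plain_load(const int64_t *p)
{
    return *p;                                   // on x86-32 this may compile to two 32-bit loads, which can tear
}

int64_t atomic_load64(const std::atomic<int64_t> *p)
{
    return p->load(std::memory_order_relaxed);   // one 8-byte load (e.g. an SSE or x87 load) even on x86-32
}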
As such, the language promises atomicity only when the store and load both use the provided atomic functions, so your "definition" is not accurate for C or C++. By requiring the programmer to always use an atomic load, the language provides a "hook" where implementations can take appropriate action if needed. In cases where an ordinary load would suffice, the implementation can optimize accordingly and nothing is lost.
Another point is that the atomic load provides a place to put a memory barrier when one is wanted (any ordering except relaxed). Some architectures include load instructions with a built-in barrier (e.g. ARM64's ldar), and making the barrier part of the load at the language level makes it easier for the compiler to take advantage of this. If you had to do a regular assignment followed by a call to a barrier function, it would be harder for the compiler to figure out that it could optimize them into ldar.
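For illustration, a minimal C++ sketch of that difference (the ARM64 instructions mentioned in the comments are typical codegen, not a guarantee):

#include <atomic>

int load_acquire(const std::atomic<int> &x)
{
    return x.load(std::memory_order_acquire);             // ARM64: typically a single ldar
}

int load_then_fence(const std::atomic<int> &x)
{
    int v = x.load(std::memory_order_relaxed);             // plain ldr
    std::atomic_thread_fence(std::memory_order_acquire);   // separate dmb ish; usually not fused into ldar
    return v;
}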
Related
The std::sync::atomic module contains a number of atomic variants of primitive types, with the stated purpose that these types are now thread-safe. However, all the primitives that correspond to the atomic types already implement Send and Sync, and should therefore already be thread-safe. What's the reasoning behind the Atomic types?
Generally, non-atomic integers are safe to share across threads because they're immutable. If you attempt to modify the value, you implicitly create a new one in most cases because they're Copy. However, it isn't safe to share a mutable reference to a u32 across threads (or have both mutable and immutable references to the same value), which practically means that you won't be able to modify the variable and have another thread see the results. An atomic type has some additional behavior which makes it safe.
In the more general case, using non-atomic operations doesn't guarantee that a change made in one thread will be visible in another. Many architectures, especially RISC architectures, do not guarantee that behavior without additional instructions.
In addition, compilers often reorder accesses to memory in functions and in some cases, across functions, and an atomic type with an appropriate barrier is required to indicate to the compiler that such behavior is not wanted.
Finally, atomic operations are often required to logically update the contents of a variable. For example, I may want to atomically add 1 to a variable. On a load-store architecture such as ARM, I cannot modify the contents of memory with an add instruction; I can only perform arithmetic on registers. Consequently, an atomic add is multiple instructions, usually consisting of a load-linked, which loads a memory location, the add operation on the register, and then a store-conditional, which stores the value if the memory location has not changed. There's also a loop to retry if it has.
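To make that concrete, here is a hedged sketch of the same retry idea, written as a compare-exchange loop in C++ for consistency with the rest of this page (hardware load-linked/store-conditional behaves similarly, and an atomic add typically compiles to this kind of loop on load-store architectures without a native atomic add):

#include <atomic>

void atomic_add(std::atomic<int> &x, int delta)
{
    int old = x.load(std::memory_order_relaxed);             // the "load-linked" step
    while (!x.compare_exchange_weak(old, old + delta,
                                    std::memory_order_relaxed))
    {
        // the "store-conditional" failed because x changed;
        // 'old' now holds the current value, so retry the add
    }
}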
These are the reasons atomic operations are needed and generally useful across languages. So while one can use non-atomic operations in non-Rust languages, they don't generally produce useful results, and since one typically wants one's code to function correctly, atomic operations are desirable for correctness. Rust's atomic types guarantee this behavior by generating suitable instructions and therefore can be safely shared across threads.
I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:
If we have the following instructions:
instruction 1
full_memory_barrier
instruction 2
Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.
But what is the opposite of a full memory barrier? I mean, is there something like a "semi memory barrier" that only prevents the CPU from reordering instructions in one direction?
If there is such a memory barrier, I don't see its point. I mean, if we have the following instructions:
instruction 1
memory_barrier_below_to_above
instruction 2
Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:
instruction 2
instruction 1
memory_barrier_below_to_above
But the following will be allowed (which makes this type of memory barrier pointless):
memory_barrier_below_to_above
instruction 2
instruction 1
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.
On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
Barrier examples:
          strong     weak
x86       mfence     none needed unless you're using NT stores
ARM       dmb sy     isb, dmb st, dmb ish, etc.
POWER     hwsync     lwsync, isync, ...
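To tie the table to source code, a small hedged C++ example (the instructions in the comments are the typical mappings, not guarantees):

#include <atomic>

std::atomic<int> x{0};
std::atomic<int> y{0};

void full_barrier_example()
{
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);   // x86: mfence, ARM: dmb (ish or sy), POWER: hwsync
    int v = y.load(std::memory_order_relaxed);              // can't be reordered before the fence
    (void)v;
}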
ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.
ARM isb is an instruction synchronization barrier: it flushes the pipeline so later instructions are re-fetched. (ARM doesn't have coherent i-cache, so after writing code to memory you need cache maintenance plus an isb before you can reliably jump there and execute those bytes as instructions.)
POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.
A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.
Jeff Preshing has an article about this, too: Acquire and Release semantics
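A minimal hedged spinlock sketch in C++ of the kind of release-store discussed above (illustrative only, not production code):

#include <atomic>

struct Spinlock
{
    std::atomic<int> locked{0};

    void lock()
    {
        while (locked.exchange(1, std::memory_order_acquire) == 1)
        {
            // spin until the previous holder stores 0
        }
    }

    void unlock()
    {
        // release store: nothing from the critical section may appear after it,
        // but later loads/stores may still move up before it (one-way barrier)
        locked.store(0, std::memory_order_release);
    }
};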
The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.
This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.
You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P Operations could be reordered past it an unbounded distance in one direction, leaving them all free to reorder with each other.
What exactly does atomic_thread_fence(memory_order_release) do?
C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with @EOF in comments brought up the C11 version.)
This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says
3) A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
So in the writing thread, it's talking about code like this:
stuff // including any non-atomic loads/stores
atomic_thread_fence(mo_release) // A
M=X // X
// threads that see load(M, acquire) == X also see stuff
The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
This lets us say something about what is / isn't allowed:
It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.
It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).
It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
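Putting the writer and a matching reader together as real C++ (a hedged sketch; the variable names are made up for illustration):

#include <atomic>

int payload;                      // non-atomic "stuff"
std::atomic<int> M{0};

void writer()
{
    payload = 42;                                          // stuff
    std::atomic_thread_fence(std::memory_order_release);   // A
    M.store(1, std::memory_order_relaxed);                 // X
}

void reader()
{
    if (M.load(std::memory_order_acquire) == 1)            // B, syncs-with A
    {
        int v = payload;                                   // safe: no data-race UB
        (void)v;
    }
}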
As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.
Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).
CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? But with just barriers against reordering within a physical core, sequential consistency can be recovered.)
A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.
Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)
I understand that atomic read serializes the read operations that are performed by multiple threads.
What I don't understand is what is the use case?
More interestingly, I've found an implementation of atomic read, which is:
static inline int32_t ASMAtomicRead32(volatile int32_t *pi32)
{
    return *pi32;
}
Where the only distinction from a regular read is the volatile qualifier. Does it mean that an atomic read is the same as a volatile read?
I understand that atomic read serializes the read operations that are performed by multiple threads.
That's rather wrong. How can you ensure the order of reads if there is no write which stores a different value? Even when you have both a read and a write, they're not necessarily serialized unless correct memory semantics are used in conjunction with both operations, e.g. 'store-with-release' and 'load-with-acquire'. In your particular example, the memory semantics are relaxed. Though on x86, each load implies acquire semantics and each store implies release semantics (unless non-temporal stores are used).
What I don't understand is what is the use case?
An atomic read must ensure that the data is read in one shot and that no other thread can store a part of the data in between. Thus it usually requires that the atomic variable is aligned (since the read of an aligned machine word is atomic), or works around non-aligned cases using heavier instructions. And finally, it ensures that the read is neither optimized out by the compiler nor reordered across other operations in this thread (according to the memory semantics).
Does it mean that atomic read is the same as volatile read?
In a few words: volatile was not intended for this use case, but it can sometimes be abused for it when the other requirements are met as well. For your example, my analysis is the following:
int32_t is likely a machine word or less - ok.
usually, everything is aligned at least on a 4-byte boundary, though there is no guarantee in your example
volatile ensures the read is not optimized out
there is no guarantee it will not be reordered either by the processor (ok on x86) or by the compiler (bad)
Please refer to Arch's blog and Concurrency: Atomic and volatile in C++11 memory model for the details.
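A hedged C++ sketch of the contrast (the function names are made up for illustration): the volatile read only stops the compiler from caching or eliding the load, while the std::atomic load is additionally guaranteed tear-free and, with acquire semantics, also orders later operations.

#include <atomic>
#include <cstdint>

int32_t volatile_read(volatile int32_t *pi32)
{
    return *pi32;                                   // like ASMAtomicRead32: not optimized out, nothing more
}

int32_t atomic_read(const std::atomic<int32_t> *pi32)
{
    return pi32->load(std::memory_order_acquire);   // tear-free and ordered against later operations
}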
It is a general question but:
In a multithreaded program, is it safe for the compiler to use registers to temporarily store global variables?
I think it's not, since keeping global variables in registers could leave other threads seeing stale values.
And how about using registers to store local variables defined within a function?
I think it is OK, since no other thread will be able to access those variables.
Please correct me if I'm wrong.
Thank you!
Things are much more complicated than you think they are.
Even if the compiler stores a value to memory, the CPU generally does not immediately push the data out to RAM. It stores it in a cache (and some systems have 2 or 3 levels of caches between the processor and the memory).
To make things worse, the order of instructions that the compiler decides on may not be what actually gets executed, as many processors can reorder instructions (and even sub-parts of instructions) in their own pipelines.
In general, in a multithreaded environment you should personally take care to never access (either read or write) the same memory from two separate threads unless one of the following is true:
you are using one of several special atomic operations that ensure proper synchronization.
you have used one of several synchronization operations to "reserve" access to shared data and then to "relinquish" it. These do include the required memory barriers that also guarantee the data is what it's supposed to be.
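For instance, a hedged C++ sketch of the second option, using a mutex whose lock/unlock carry the required barriers:

#include <mutex>

int shared_value = 0;   // shared data, only touched while holding the mutex
std::mutex m;

void writer()
{
    std::lock_guard<std::mutex> guard(m);   // "reserve" access
    shared_value = 42;
}                                           // "relinquish" when guard unlocks

int reader()
{
    std::lock_guard<std::mutex> guard(m);
    return shared_value;
}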
You may want to read http://en.wikipedia.org/wiki/Memory_ordering#Memory_barrier_types and http://en.wikipedia.org/wiki/Memory_barrier
If you are ready for a little headache and want to see how complicated things can actually get, here is your evening lecture Memory Barriers: a Hardware View for Software Hackers.
'Safe' is not really the right word to use. Many higher-level languages (e.g. C) do not have a threading model, and so the language specification says nothing about multi-threaded interactions.
If you are not using any kind of locking primitives, then you have no guarantees whatsoever about how the different threads interact. So the compiler is within its rights to use registers for global variables.
Even if you are using locking, the behaviour can still be tricky: if you read a variable, then grab a lock and then read the variable again, the compiler has no way of knowing whether it has to read the variable from memory again or can reuse the earlier value it stored in a register.
In C/C++, declaring a variable as volatile will force the compiler to always reload the variable from memory, which solves this particular instance.
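A hedged C++ sketch of that register-caching problem (the busy-wait loops are only for illustration):

#include <atomic>

bool done_plain = false;                // compiler may keep this in a register: possible infinite loop
volatile bool done_volatile = false;    // must be reloaded from memory on every iteration
std::atomic<bool> done_atomic{false};   // reloaded and properly synchronized

void wait_plain()    { while (!done_plain) { } }
void wait_volatile() { while (!done_volatile) { } }
void wait_atomic()   { while (!done_atomic.load(std::memory_order_acquire)) { } }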
There are also 'Interlocked*' primitives on most systems that have guaranteed atomicity semantics which can be used to ensure certain operations are threadsafe. Locking primitives are typically built on these low level operations.
In a multithreaded program, you have one of two cases. If it's running on a uniprocessor (single core, single CPU), then switching between threads is handled like switching between processes (although it's not quite as much work, since the threads operate in the same virtual memory space): all registers of one thread are saved during the transition to another thread, so using registers for whatever purpose is fine. This is the job of the context-switch routines that the OS uses, and the register set is considered part of a thread's (or process's) context. If you have a multiprocessor system, either multiple CPUs or multiple cores on a single CPU, each processor has its own distinct set of registers, so again, using registers for storing things is fine. On top of that, of course, context switching on a particular CPU will save the registers of the old thread/process before switching to the new one, so everything is preserved.
That said, on some architectures and/or with some OSes, there might be specific exceptions to that, because certain registers are reserved by the ABI for specific uses by the OS or by the libraries that provide an interface to the OS, but your compiler(s) generally have that type of knowledge of your platform built in. You need to be aware of them, though, if you're doing inline assembly or certain other "low-level" things...
The Win32 API has a set of InterlockedXXX functions to atomically and synchronously manipulate simple variables; however, there doesn't seem to be any InterlockedRead function to simply retrieve the value of a variable. How come?
MSDN says that:
Simple reads and writes to properly-aligned 32-bit variables are atomic operations
but adds:
However, access is not guaranteed to be synchronized. If two threads are reading and writing from the same variable, you cannot determine if one thread will perform its read operation before the other performs its write operation.
Which means, as I understand it, that a simple read operation of a variable can take place while another operation, say an InterlockedAdd, is in progress. So why isn't there an interlocked function to read a variable?
I guess the value can be read as the result of InterlockedAdd-ing zero, but that doesn't seem like the right way to go.
The normal way of implementing this is to use a compare-exchange operation (e.g. InterlockedCompareExchange64) where both values are the same. I have a sneaking suspicion this can be performed more efficiently than an add of 0 for some reason, but I have no evidence to back this up.
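A hedged Win32/C++ sketch of that trick (the wrapper name is made up; InterlockedCompareExchange64 itself is a real API): exchanging a value with itself never modifies memory, but the return value is an atomic snapshot of the whole 64-bit variable.

#include <windows.h>

LONGLONG ReadInterlocked64(volatile LONGLONG *target)
{
    // If *target == 0 it is "replaced" with 0 (no change); either way the
    // original value is returned atomically.
    return InterlockedCompareExchange64(target, 0, 0);
}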
Interestingly, .NET's Interlocked class didn't gain a Read method until .NET 2.0. I believe that Interlocked.Read is implemented using Interlocked.CompareExchange. (Note that the documentation for Interlocked.Read strikes me as somewhat misleading - it talks about atomicity, but not volatility, which means something very specific on .NET. I'm not sure what the Win32 memory model guarantees about visibility of newly written values from a different thread, if anything.)
I think that your interpretation of "not synchronized" is wrong. Simple reads are atomic, but you have to take care of reordering and memory-visibility issues yourself. The former is handled by using fence instructions at appropriate places; the latter is a non-issue for a read (but a potential concurrent write has to ensure proper visibility, which Interlocked functions should do if they map to LOCK-prefixed asm instructions).
The crux of this whole discussion is proper alignment, which is defined in Partition I of xxx, in section '12.6.2 Alignment':
Built-in datatypes shall be properly aligned, which is defined as follows:
• 1-byte, 2-byte, and 4-byte data is properly aligned when it is stored at a 1-byte, 2-byte, or 4-byte boundary, respectively.
• 8-byte data is properly aligned when it is stored on the same boundary required by the underlying hardware for atomic access to a native int.
Basically, all 32-bit values have the required alignment, and on a 64-bit platform, 64-bit values also have the required alignment.
Note though: there are attributes to explicitly alter the layout of classes in memory, which may cause you to lose this alignment. These attributes exist specifically for this purpose, though, so unless you have set out to alter the layout, this should not apply to you.
With that out of the way, the purpose of the Interlocked class is to provide operations that (to paraphrase) can only be observed in their 'before' or 'after' state. Interlocked operations are normally only of concern when modifying memory (typically in some non-trivial compare-exchange type way). As the MSDN article you found indicates, read operations (when properly aligned) can be considered atomic at all times without further precautions.
There are however other considerations when dealing with read operations:
On modern CPUs, although the read may be atomic, it may still return a stale value (for example, one the compiler has cached in a register, or a recent write that has not yet become visible to this thread)... this is where you may need to make the field 'volatile' to get the behaviour you expect
If you are dealing with a 64-bit value on 32-bit hardware, you may need to use the Interlocked.Read operation to guarantee the whole 64-bit value is read in a single atomic operation (otherwise it may be performed as 2 separate 32-bit reads which can be from either side of a memory update)
Re-ordering of your reads / writes may cause you to not get the value you expected; in which case some memory barrier may be needed (either explicit, or through the use of the Interlocked class operations)
Short summary: as far as atomicity goes, it is very likely that what you are doing does not need any special instruction for the read... there may, however, be other things you need to be careful of, depending on what exactly you are doing.