Why use an AtomicU32 in Rust, given that u32 already implements Sync?

The std::sync::atomic module contains a number of atomic variants of primitive types, with the stated purpose that these types are thread-safe. However, all the primitives that correspond to the atomic types already implement Send and Sync, and should therefore already be thread-safe. What's the reasoning behind the Atomic types?

Generally, non-atomic integers are safe to share across threads because, behind a shared reference, they're immutable. If you attempt to modify the value, in most cases you implicitly create a new one because they're Copy. However, it isn't safe to share a mutable reference to a u32 across threads (or to have both mutable and immutable references to the same value), which in practice means that you won't be able to modify the variable and have another thread see the result. An atomic type provides interior mutability with well-defined concurrent behavior, which makes shared modification safe.
In the more general case, using non-atomic operations doesn't guarantee that a change made in one thread will be visible in another. Many architectures, especially RISC architectures, do not guarantee that behavior without additional instructions.
In addition, compilers often reorder accesses to memory in functions and in some cases, across functions, and an atomic type with an appropriate barrier is required to indicate to the compiler that such behavior is not wanted.
Finally, atomic operations are often required to logically update the contents of a variable. For example, I may want to atomically add 1 to a variable. On a load-store architecture such as ARM, I cannot modify the contents of memory with an add instruction; I can only perform arithmetic on registers. Consequently, an atomic add is multiple instructions, usually consisting of a load-linked, which loads a memory location, the add operation on the register, and then a store-conditional, which stores the value if the memory location has not changed. There's also a loop to retry if it has.
This is why atomic operations are needed, and why they are generally useful across languages. So while one can use non-atomic operations on shared data in some non-Rust languages, they don't generally produce useful results, and since one typically wants one's code to function correctly, atomic operations are desirable for correctness. Rust's atomic types guarantee this behavior by generating suitable instructions and can therefore be safely shared across threads.
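To make this concrete, here is a minimal Rust sketch (the names are illustrative): four threads bump a shared counter through AtomicU32::fetch_add, which compiles to exactly the kind of atomic read-modify-write described above. With a plain static u32 this would not compile at all, since a &mut u32 cannot be shared across threads.

use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// AtomicU32 allows mutation through a shared reference; a plain
// static u32 would be read-only without unsafe code.
static COUNTER: AtomicU32 = AtomicU32::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // One atomic read-modify-write: a LOCK-prefixed add on
                    // x86, or a load-linked/store-conditional loop on ARM.
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}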

Related

Why are loads required when using atomics

When using atomics in Go (and other languages like C++) it's advised to use an atomic load operation for reading a concurrently written value.
If the definition (as I understand it) of an atomic write (be it a store or an integer increment) is that no thread can view a partial write, why is an atomic load required?
Would a plain load of the memory address always be safe from a torn view, if only atomic stores are used on that memory address?
This answer is mainly for C and C++ as I am not directly familiar with atomics in many other languages, but I suspect they are similar.
It's true that many actual machines work this way, in some cases. For instance, on x86-64, ordinary load instructions are atomic with respect to ordinary stores or locked read-modify-write instructions. So for types that can be loaded with a single instruction, you could in principle use ordinary assignment and avoid tearing.
But there are cases where this would not work. For instance:
Types which are not lock-free (e.g. structs of more than a couple of words). In this case, several instructions are needed to load or store, and so a lock must be taken around them, or tearing is entirely possible. The atomic load function knows to take the lock; an ordinary assignment wouldn't.
Types which can be lock-free but need special handling. For example, 64-bit long long int on x86-32. An ordinary load would execute two 32-bit integer load instructions (which are individually atomic), and so even if the store is atomic, it could happen in between. But the atomic load function can emit a 64-bit floating point or SIMD load, which is less efficient but does it in one atomic instruction. Example on godbolt.
As such, the language promises atomicity only when the store and load both use the provided atomic functions; your "definition" is not accurate for C or C++. By requiring the programmer to always use an atomic load, the language provides a "hook" where implementations can take appropriate action if needed. In cases where an ordinary load would suffice, the implementation can optimize accordingly and nothing is lost.
Another point is that the atomic load provides a place to put a memory barrier when one is wanted (any ordering except relaxed). Some architectures include load instructions with a built-in barrier (e.g. ARM64's ldar), and making the barrier part of the load at the language level makes it easier for the compiler to take advantage of this. If you had to do a regular assignment followed by a call to a barrier function, it would be harder for the compiler to figure out that it could optimize them into ldar.
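The same idea can be sketched in Rust, which uses the C++11 memory model. Because the ordering is a parameter of the load call itself, the compiler is free to emit a single barrier-carrying instruction such as ldar; this is a rough illustration, not a claim about what any particular compiler emits:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release store: pairs with the Acquire load below. On AArch64
        // this can compile to stlr, and the matching load to ldar.
        READY.store(true, Ordering::Release);
    });
    // The Acquire load is the language-level "hook": it is untearable and
    // carries the barrier. A plain non-atomic read here would be a data
    // race even on hardware where loads happen to be atomic.
    while !READY.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
    producer.join().unwrap();
}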

What is a potentially shared memory location?

In the C11 and C++11 standards there appears the phrase "potentially shared memory location". What does this mean? Are all global variables potentially shared in a multithreaded environment?
I'm not very familiar with the C standard. In C++14, the phrase "potentially shared memory location" appears twice, in two non-normative notes:
[intro.multithread]/25 [ Note: Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. Reordering of atomic loads in cases in which the atomics in question may alias is also generally precluded, since this may violate the coherence rules. — end note ]
[intro.multithread]/26 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]
From context, it's pretty clear that "potentially shared memory location" is supposed to mean "a memory location for which the optimizer cannot rule out the possibility that other threads may access it, and therefore should proceed on the pessimistic assumption that they might." The two notes then discuss the legality of certain optimizations that may or may not be done under such an assumption.
Re: global variables. Yes, a global variable would generally be accessible to arbitrary threads. It is conceivable, in principle, that a sophisticated optimizer performing whole program optimization might be able to prove that a particular global variable is never accessed concurrently from multiple threads (I'm not aware of any actual compilers currently in existence capable of achieving such a feat, but then I wouldn't call myself a compiler expert by any stretch). Barring that, memory locations occupied by a global variable should be treated by an optimizer as potentially shared.
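To illustrate the first note, here is the classic transformation it forbids, written in Rust syntax purely for compactness (the note concerns C++, and both function names are invented for this example):

// What the source says: x is written only when cond holds.
fn original(x: &mut i32, cond: bool) {
    if cond {
        *x = 1;
    }
}

// An equivalent-looking rewrite that writes x unconditionally. For a
// potentially shared location this is invalid: it introduces a store the
// abstract machine never performs, which could overwrite another
// thread's write even though the original program had no data race.
fn transformed(x: &mut i32, cond: bool) {
    let tmp = *x;
    *x = 1;
    if !cond {
        *x = tmp;
    }
}

(In Rust the rewrite would be sound, because &mut guarantees exclusive access; that exclusivity guarantee is exactly what a C++ optimizer lacks for a potentially shared location.)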

What is the use-case of atomic read

I understand that atomic read serializes the read operations that are performed by multiple threads.
What I don't understand is what is the use case?
More interestingly, I've found some implementation of atomic read which is
static inline int32_t ASMAtomicRead32(volatile int32_t *pi32)
{
    return *pi32;
}
The only distinction from a regular read is volatile. Does that mean an atomic read is the same as a volatile read?
I understand that atomic read serializes the read operations that are performed by multiple threads.
That's rather wrong. How can you ensure the order of reads if there is no write that stores a different value? Even when you have both a read and a write, they are not necessarily serialized unless the correct memory semantics are used in conjunction with both operations, e.g. 'store-with-release' and 'load-with-acquire'. In your particular example, the memory semantics are relaxed. Though on x86, every ordinary load has implied acquire semantics and every ordinary store has implied release semantics (unless non-temporal stores are used).
What I don't understand is what is the use case?
Atomic reads must ensure that the data is read in one shot and that no other thread can store a part of the data in between. Thus an implementation usually ensures the alignment of the atomic variable (since a read of an aligned machine word is atomic) or works around non-aligned cases using heavier instructions. Finally, it ensures that the read is not optimized out by the compiler nor reordered across other operations in the same thread (according to the memory semantics).
Does it mean that atomic read is the same as volatile read?
In a few words: volatile was not intended for this use-case, but it can sometimes be abused for it when the other requirements are met as well. For your example, my analysis is the following:
int32_t is likely a machine word or less - ok.
usually, everything is aligned on at least a 4-byte boundary, though there is no guarantee of that in your example
volatile ensures the read is not optimized out
there is no guarantee it will not be reordered, either by the processor (OK on x86) or by the compiler (bad)
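For comparison, an atomic read in Rust (sketched here with illustrative names) provides all of these properties at once: alignment is guaranteed by the type, the load can neither tear nor be optimized out, and reordering is controlled by the Ordering parameter:

use std::sync::atomic::{AtomicI32, Ordering};

// Unlike the volatile C version above, this load cannot tear, cannot be
// elided by the compiler, and Acquire forbids later memory operations in
// this thread from being reordered before it.
fn atomic_read_32(v: &AtomicI32) -> i32 {
    v.load(Ordering::Acquire)
}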
Please refer to Arch's blog and Concurrency: Atomic and volatile in C++11 memory model for the details.

Can static arrays be safely accessed from multiple threads?

If each thread is guaranteed to only read/write to a specific subset of the array can multiple threads work on the same (static) array without resorting to critical sections, etc?
EDIT - This is for the specific case of arrays of non-reference-counted types and record/packed-records thereof.
If yes, any caveats?
My gut feeling is yes but my gut can sometimes be an unreliable source of information.
Suppose that:
You have a single instance of an array (static or dynamic), and
The elements of the array are pure value types (i.e. contain no references), and
Each thread operates on disjoint sub-arrays, and
Nothing else in the system writes to the array whilst the threads are operating on it.
With these conditions, which I believe are met by your data structure and threading pattern, all algorithms are thread-safe.
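As an aside, this exact pattern is expressible in safe Rust, where the compiler enforces the disjointness condition; a minimal sketch, for illustration only:

use std::thread;

fn main() {
    let mut data = vec![0u64; 1024];
    // Each spawned thread receives an exclusive, disjoint slice, so no
    // critical section is needed; the borrow checker verifies that the
    // sub-arrays cannot overlap.
    thread::scope(|s| {
        for (i, chunk) in data.chunks_mut(256).enumerate() {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x = i as u64;
                }
            });
        }
    });
    assert!(data[..256].iter().all(|&x| x == 0));
    assert!(data[768..].iter().all(|&x| x == 3));
}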
No, this is not thread-safe in all situations.
I see at least two reasons.
1. It will depend on the static array content.
If you use non-reference-counted types (like double, integer, byte, shortstring), there won't be any issue in most cases (at least if the data is read-only).
But if you use some reference-counted types (like string, interface, or a nested dynamic array), you'll have to take care of thread safety.
That is:
TMyType1: array[0..1] of integer; // thread-safe on reading
TMyType2: array[0..1] of string; // may be confusing
Additional note: if your string instances are in fact shared among some sub-parts of the static array, the reference count could become confused, unless you explicitly call UniqueString() for each one (inside a critical section, I suspect). For an array of double or integer, you won't have this issue.
2. It will depend on the access concurrency
Read access should be thread safe, even for reference counted type, but concurrent write may be confusing. For a string, you may have GPF issues in some random cases, especially on a multi-core CPU.
Some safe implementations may be:
Use critical sections (as small as possible, to reduce overhead) or other protection structures;
Use copy-on-write or a private per-thread copy of the content, to be sure.
A final note (not about safety, but performance): sharing an array among multiple CPU cores may lead to performance penalties due to cache synchronization between cores (false sharing). Performance is sometimes much better when you use separate arrays, ensuring their cache lines won't be shared among cores.
Be aware that such issues may be a nightmare to debug, on client side: multi-thread concurrency issues may occur randomly, and are very difficult to track. The safer, the better, unless you have explicit and proven performance issues.
Additional note: for your specific case of a static array of double, with each sub-part of the array accessed by one thread only, it is thread-safe. But there is no absolute rule of thread safety in all situations, even for a static array. As soon as you use reference-counted types, or pointers, you may have random issues.
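Regarding the performance note above: a common mitigation, sketched in Rust for illustration, is to pad each thread's slot out to a cache line (64 bytes is typical, though architecture-dependent):

// Force each counter onto its own cache line so that threads writing
// adjacent slots do not invalidate each other's caches (false sharing).
#[repr(align(64))]
struct Padded(u64);

fn make_counters(threads: usize) -> Vec<Padded> {
    (0..threads).map(|_| Padded(0)).collect()
}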

Where is InterlockedRead?

The Win32 API has a set of InterlockedXXX functions to atomically and synchronously manipulate simple variables; however, there doesn't seem to be any InterlockedRead function to simply retrieve the value of a variable. How come?
MSDN says that:
Simple reads and writes to properly-aligned 32-bit variables are atomic operations
but adds:
However, access is not guaranteed to be synchronized. If two threads are reading and writing from the same variable, you cannot determine if one thread will perform its read operation before the other performs its write operation.
Which means, as I understand it, that a simple read operation of a variable can take place while another operation, say InterlockedAdd, is in progress. So why isn't there an interlocked function to read a variable?
I guess the value can be read as the result of InterlockedAdd-ing zero, but that doesn't seem like the right way to go.
The normal way of implementing this is to use a compare-exchange operation (e.g. InterlockedCompareExchange64) where both values are the same. I have a sneaking suspicion this can be performed more efficiently than an add of 0 for some reason, but I have no evidence to back this up.
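The trick translates directly to other atomics libraries. For instance, a rough Rust equivalent (illustrative only, not what Interlocked itself does) exchanges the value with itself, so the returned previous value is an atomic read either way:

use std::sync::atomic::{AtomicI64, Ordering};

// Compare-exchange with identical comparand and replacement: whether the
// compare succeeds (stores 0 over 0) or fails, we get back the current
// contents, read atomically and with a full barrier.
fn interlocked_read(v: &AtomicI64) -> i64 {
    match v.compare_exchange(0, 0, Ordering::SeqCst, Ordering::SeqCst) {
        Ok(prev) | Err(prev) => prev,
    }
}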
Interestingly, .NET's Interlocked class didn't gain a Read method until .NET 2.0. I believe that Interlocked.Read is implemented using Interlocked.CompareExchange. (Note that the documentation for Interlocked.Read strikes me as somewhat misleading - it talks about atomicity, but not volatility, which means something very specific on .NET. I'm not sure what the Win32 memory model guarantees about visibility of newly written values from a different thread, if anything.)
I think that your interpretation of "not synchronized" is wrong. Simple reads are atomic, but you have to take care of reordering and memory-visibility issues yourself. The former is handled by using fence instructions at appropriate places; the latter is a non-issue with reads (but a potential concurrent write has to ensure proper visibility, which the Interlocked functions should do if they map to LOCK-prefixed asm instructions).
The crux of this whole discussion is proper alignment, which is defined in Partition I of ECMA-335 (the CLI specification), in section '12.6.2 Alignment':
Built-in datatypes shall be properly aligned, which is defined as follows:
• 1-byte, 2-byte, and 4-byte data is properly aligned when it is stored at a 1-byte, 2-byte, or 4-byte boundary, respectively.
• 8-byte data is properly aligned when it is stored on the same boundary required by the underlying hardware for atomic access to a native int.
Basically, all 32-bit values have the required alignment, and on a 64-bit platform, 64-bit values also have the required alignment.
Note though: there are attributes to explicitly alter the layout of classes in memory, which may cause you to lose this alignment. These attributes exist specifically for that purpose, so unless you have set out to alter the layout, this should not apply to you.
With that out of the way, the purpose of the Interlocked class is to provide operations that (to paraphrase) can only be observed in their 'before' or 'after' state. Interlocked operations are normally only of concern when modifying memory (typically in some non-trivial compare-exchange type way). As the MSDN article you found indicates, read operations (when properly aligned) can be considered atomic at all times without further precautions.
There are however other considerations when dealing with read operations:
On modern CPUs, although the read may be atomic, the compiler may keep the value in a register or hoist the access out of a loop, so you can still observe a stale value; this is where you may need to make the field 'volatile' to get the behaviour you expect
If you are dealing with a 64-bit value on 32-bit hardware, you may need to use the Interlocked.Read operation to guarantee the whole 64-bit value is read in a single atomic operation (otherwise it may be performed as two separate 32-bit reads, which can straddle a concurrent update)
Re-ordering of your reads / writes may cause you to not get the value you expected; in which case some memory barrier may be needed (either explicit, or through the use of the Interlocked class operations)
Short summary: as far as atomicity goes, it is very likely that what you are doing does not need any special instruction for the read... there may, however, be other things you need to be careful of, depending on what exactly you are doing.
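For the 64-bit-on-32-bit point above, here is the same guarantee in Rust terms, as a rough sketch: std only defines AtomicU64 on targets that can load 64 bits atomically (on x86-32 that may mean cmpxchg8b or an SSE load under the hood), so the load below can never be split into two 32-bit halves:

use std::sync::atomic::{AtomicU64, Ordering};

// Guaranteed untorn, unlike a plain u64 read, which may compile to two
// 32-bit loads on 32-bit hardware; SeqCst also supplies the barrier
// mentioned in the reordering bullet above.
fn read_whole(v: &AtomicU64) -> u64 {
    v.load(Ordering::SeqCst)
}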
