I understand that atomic read serializes the read operations performed by multiple threads.
What I don't understand is what is the use case?
More interestingly, I've found an implementation of atomic read, which is
static inline int32_t ASMAtomicRead32(volatile int32_t *pi32)
{
    return *pi32;
}
The only distinction from a regular read is volatile. Does it mean that atomic read is the same as volatile read?
I understand that atomic read serializes the read operations performed by multiple threads.
It's rather wrong. How can you ensure the order of reads if there is no write that stores a different value? Even when you have both a read and a write, they are not necessarily serialized unless correct memory semantics are used in conjunction with both the read and the write operations, e.g. 'store-with-release' and 'load-with-acquire'. In your particular example, the memory semantics are relaxed. Though on x86 one can imply acquire semantics for each load and release semantics for each store (unless non-temporal stores are used).
What I don't understand is what is the use case?
Atomic reads must ensure that the data is read in one shot and that no other thread can store a part of the data in between. Thus an atomic read usually ensures the alignment of the atomic variable (since the read of an aligned machine word is atomic) or works around unaligned cases using heavier instructions. Finally, it ensures that the read is not optimized out by the compiler nor reordered across other operations in the same thread (according to the memory semantics).
Does it mean that atomic read is the same as volatile read?
In a few words: volatile was not intended for such a use case, but it can sometimes be abused for it when the other requirements are met as well. For your example, my analysis is the following:
int32_t is likely a machine word or less - ok.
usually, everything is aligned at least on a 4-byte boundary, though there is no guarantee of that in your example
volatile ensures the read is not optimized out
there is no guarantee it will not be reordered either by the processor (ok for x86) or by the compiler (bad)
Please refer to Arch's blog and Concurrency: Atomic and volatile in C++11 memory model for the details.
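For contrast, here is a minimal sketch (my own, not from the question) of the same read expressed with C++11 atomics, where the guarantees are explicit rather than incidental:

#include <atomic>
#include <cstdint>

// The volatile version from the question: the compiler must re-read the
// value each time, but nothing guarantees alignment or forbids reordering.
static inline int32_t VolatileRead32(volatile int32_t *pi32)
{
    return *pi32;
}

// The C++11 equivalent: std::atomic guarantees a properly aligned,
// indivisible load, and the memory order parameter controls reordering.
static inline int32_t AtomicRead32(std::atomic<int32_t> *pi32)
{
    // relaxed matches the example above; use std::memory_order_acquire
    // for a load-with-acquire.
    return pi32->load(std::memory_order_relaxed);
}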
Related
When using atomics in Go (and other languages like C++) it's advised to use an atomic load operation for reading a concurrently written value.
If the definition (as I understand it) of an atomic write (be it a store or an integer increment) is that no thread can view a partial write, why is an atomic load required?
Would a plain load of the memory address always be safe from a torn view, if only atomic stores are used on that memory address?
This answer is mainly for C and C++ as I am not directly familiar with atomics in many other languages, but I suspect they are similar.
It's true that many actual machines work this way, in some cases. For instance, on x86-64, ordinary load instructions are atomic with respect to ordinary stores or locked read-modify-write instructions. So for types that can be loaded with a single instruction, you could in principle use ordinary assignment and avoid tearing.
But there are cases where this would not work. For instance:
Types which are not lock-free (e.g. structs of more than a couple words). In this case, several instructions are needed to load or store, and so a lock must be taken around them, or tearing is entirely possible. The atomic load function knows to take the lock, an ordinary assignment wouldn't.
Types which can be lock-free but need special handling. For example, 64-bit long long int on x86-32. An ordinary load would execute two 32-bit integer load instructions (which are individually atomic), and so even if the store is atomic, it could happen in between. But the atomic load function can emit a 64-bit floating point or SIMD load, which is less efficient but does it in one atomic instruction. Example on godbolt.
As such, the language promises atomicity only when the store and load both use the provided atomic functions - so your "definition" is not accurate for C or C++. By requiring the programmer to always use an atomic load, the language provides a "hook" where implementations can take appropriate action if needed. In cases where an ordinary load would suffice, the implementation can optimize accordingly and nothing is lost.
Another point is that the atomic load provides a place to put a memory barrier when one is wanted (any ordering except relaxed). Some architectures include load instructions with a built-in barrier (e.g. ARM64's ldar), and making the barrier part of the load at the language level makes it easier for the compiler to take advantage of this. If you had to do a regular assignment followed by a call to a barrier function, it would be harder for the compiler to figure out that it could optimize them into ldar.
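As an illustration of both points, here is a sketch (mine, not from the answer); the exact instructions emitted will of course depend on the target:

#include <atomic>

long long plain_value;                // plain 64-bit global
std::atomic<long long> atomic_value;  // atomic 64-bit global

// On x86-32 this may compile to two separate 32-bit loads and can tear.
long long read_plain() { return plain_value; }

// The atomic load must be indivisible (e.g. a single SSE load on x86-32),
// and the acquire ordering is what lets a compiler emit ldar on ARM64.
long long read_atomic() { return atomic_value.load(std::memory_order_acquire); }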
The std::sync::atomic module contains a number of atomic variants of primitive types, with the stated purpose that these types are now thread-safe. However, all the primitives that correspond to the atomic types already implement Send and Sync, and should therefore already be thread-safe. What's the reasoning behind the Atomic types?
Generally, non-atomic integers are safe to share across threads because they're immutable. If you attempt to modify the value, you implicitly create a new one in most cases because they're Copy. However, it isn't safe to share a mutable reference to a u32 across threads (or have both mutable and immutable references to the same value), which practically means that you won't be able to modify the variable and have another thread see the results. An atomic type has some additional behavior which makes it safe.
In the more general case, using non-atomic operations doesn't guarantee that a change made in one thread will be visible in another. Many architectures, especially RISC architectures, do not guarantee that behavior without additional instructions.
In addition, compilers often reorder accesses to memory in functions and in some cases, across functions, and an atomic type with an appropriate barrier is required to indicate to the compiler that such behavior is not wanted.
Finally, atomic operations are often required to logically update the contents of a variable. For example, I may want to atomically add 1 to a variable. On a load-store architecture such as ARM, I cannot modify the contents of memory with an add instruction; I can only perform arithmetic on registers. Consequently, an atomic add is multiple instructions, usually consisting of a load-linked, which loads a memory location, the add operation on the register, and then a store-conditional, which stores the value if the memory location has not changed. There's also a loop to retry if it has.
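For illustration, here is a sketch of such an atomic add (in C++ rather than Rust, for consistency with the rest of this page; Rust's fetch_add should lower to essentially the same loop):

#include <atomic>

std::atomic<int> counter{0};

void increment()
{
    // On a load-store architecture like ARM this typically compiles to a
    // retry loop: load-linked (ldrex/ldxr), an add on the register, then
    // store-conditional (strex/stxr), branching back if the store failed.
    counter.fetch_add(1, std::memory_order_relaxed);
}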
These are why atomic operations are needed and generally useful across languages. So while one can use non-atomic operations in non-Rust languages, they don't generally produce useful results, and since one typically wants one's code to function correctly, atomic operations are desirable for correctness. Rust's atomic types guarantee this behavior by generating suitable instructions and therefore can be safely shared across threads.
I am trying to get the "atomic vs non-atomic" concept settled in my mind. My first problem is that I could not find a "real-life analogy" for it, like a customer/restaurant relationship for atomic operations or something similar.
Also, I would like to learn how atomic operations place themselves in thread-safe programming.
In this blog post, http://preshing.com/20130618/atomic-vs-non-atomic-operations/, it is mentioned that:
An operation acting on shared memory is atomic if it completes in a single step relative to other threads. When an atomic store is performed on a shared variable, no other thread can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. Non-atomic loads and stores do not make those guarantees.
What is the meaning of "no other thread can observe the modification half-complete"?
Does that mean a thread will wait until the atomic operation is done? How does that thread know that the operation is atomic? For example, in .NET I can understand that if you lock an object you set a flag to block other threads. But what about atomics? How do other threads know the difference between atomic and non-atomic operations?
Also, if the above statement is true, are all atomic operations thread-safe?
Let's clarify a bit what atomicity is and what blocking is. Atomicity means that an operation either executes fully, with all its side effects visible, or it does not execute at all. So all other threads can see either the state before the operation or the state after it. A block of code guarded by a mutex is atomic too, we just don't call it an operation. Atomic operations are special CPU instructions which conceptually are similar to a usual operation guarded by a mutex (you know what a mutex is, so I'll use it, despite the fact that it is implemented using atomic operations). The CPU has a limited set of operations which it can execute atomically, but thanks to hardware support they are very fast.
When we discuss thread blocking we usually involve mutexes in the conversation, because code guarded by them can take quite some time to execute. So we say that a thread waits on a mutex. For atomic operations the situation is the same, but they are fast and we usually don't care about delays here, so it is not that likely to hear the words "block" and "atomic operation" together.
Does that mean a thread will wait until the atomic operation is done?
Yes, it will wait. The CPU will restrict access to the block of memory where the variable is located, and other CPU cores will wait. Note that for performance reasons these blocks are held only for the duration of the atomic operations themselves. CPU cores are allowed to cache variables for reading.
How does that thread know that the operation is atomic?
Special CPU instructions are used. It is simply written in your program that a particular operation should be performed in an atomic manner.
Additional information:
There are more tricky parts to atomic operations. For example, on modern CPUs usually all reads and writes of primitive types are atomic. But the CPU and compiler are allowed to reorder them. So it is possible that you change some struct and set a flag telling that it has changed, but the CPU reorders the writes and sets the flag before the struct is actually committed to memory. When you use atomic operations, usually some additional effort is made to prevent undesired reordering. If you want to know more, you should read about memory barriers.
Simple atomic loads and stores are not that useful. To make maximal use of atomic operations you need something more complex. The most common is CAS - compare and swap: you compare a variable with a value and change it only if the comparison was successful.
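A minimal CAS sketch in C++ (my own example, not from the answer above): atomically double a shared counter, retrying if another thread changed it first:

#include <atomic>

std::atomic<int> value{1};

void atomic_double()
{
    int expected = value.load();
    // Replace 'value' with twice its contents only if it still equals
    // 'expected'; on failure, 'expected' is reloaded and we retry.
    while (!value.compare_exchange_weak(expected, expected * 2))
    {
        // another thread won the race; 'expected' now holds the new value
    }
}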
On typical modern CPUs, atomic operations are made atomic this way:
When an instruction is issued that accesses memory, the core's logic attempts to put the core's cache in the correct state to access that memory. Typically, this state will be achieved before the memory access has to happen, so there is no delay.
While another core is performing an atomic operation on a chunk of memory, it locks that memory in its own cache. This prevents any other core from acquiring the right to access that memory until the atomic operation completes.
Unless two cores happen to be performing accesses to many of the same areas of memory and many of those accesses are writes, this typically won't involve any delays at all. That's because the atomic operation is very fast and typically the core knows in advance what memory it will need access to.
So, say a chunk of memory was last accessed on core 1 and now core 2 wants to do an atomic increment. When the core's prefetch logic sees the modification to that memory in the instruction stream, it will direct the cache to acquire that memory. The cache will use the intercore bus to take ownership of that region of memory from core 1's cache and it will lock that region in its own cache.
At this point, if another core tries to read or modify that region of memory, it will be unable to acquire that region in its cache until the lock is released. This communication takes place on the bus that connects the caches and precisely where it takes place depends on which cache(s) the memory was in. (If not in cache at all, then it has to go to main memory.)
A cache lock is not normally described as blocking a thread both because it is so fast and because the core is usually able to do other things while it's trying to acquire the memory region that is locked in the other cache. From the point of view of the higher-level code, the implementation of atomics is typically considered an implementation detail.
All atomic operations provide the guarantee that an intermediate result will not be seen. That's what makes them atomic.
The atomic operations you describe are instructions within the processor and the hardware will make sure that a read cannot happen on a memory location until the atomic write is complete. This guarantees that a thread either reads the value before write or the value after the write operation, but nothing in-between - there's no chance of reading half of the bytes of the value from before the write and the other half from after the write.
Code running against the processor is not even aware of this block but it's really no different from using a lock statement to make sure that a more complex operation (made up of many low-level instructions) is atomic.
A single atomic operation is always thread-safe - the hardware guarantees that the effect of the operation is atomic - it'll never get interrupted in the middle.
A set of atomic operations is not atomic in the vast majority of cases (I'm not an expert, so I don't want to make a definitive statement, but I can't think of a case where this would be different). This is why locking is needed for complex operations: the entire operation may be made up of multiple atomic instructions, but the whole operation may still be interrupted between any two of those instructions, creating the possibility of another thread seeing half-baked results. Locking ensures that code operating on shared data cannot access that data until the other operation completes (possibly over several thread switches).
Some examples are shown in this question / answer, but you can find many more by searching.
Being "atomic" is an attribute that applies to an operation which is enforced by the implementation (either the hardware or the compiler, generally speaking). For a real-life analogy, look to systems requiring transactions, such as bank accounts. A transfer from one account to another involves a withdrawal from one account and a deposit to another, but generally these should be performed atomically - there is no time when the money has been withdrawn but not yet deposited, or vice versa.
So, continuing the analogy for your question:
What is the meaning of "no other thread can observe the modification half-complete"?
This means that no thread could observe the two accounts in a state where the withdrawal had been made from one account but it had not been deposited in another.
In machine terms, it means that an atomic read of a value in one thread will not see a value with some bits from before an atomic write by another thread, and some bits from after the same write operation. Various operations more complex than just a single read or write can also be atomic: for instance, "compare and swap" is a commonly implemented atomic operation that checks the value of a variable, compares it to a second value, and replaces it with another value if the compared values were equal, atomically - so for instance, if the comparison succeeds, it is not possible for another thread to write a different value in between the compare and the swap parts of the operation. Any write by another thread will either be performed wholly before or wholly after the atomic compare-and-swap.
The title to your question is:
Will atomic operations block other threads?
In the usual meaning of "block", the answer is no; an atomic operation in one thread won't by itself cause execution to stop in another thread, although it may cause a livelock situation or otherwise prevent progress.
Does that mean a thread will wait until the atomic operation is done?
Conceptually, it means that they will never need to wait. The operation is either done, or not done; it is never halfway done. In practice, atomic operations can be implemented using mutexes, at a significant performance cost. Many (if not most) modern processors support various atomic primitives at the hardware level.
Also, if the above statement is true, are all atomic operations thread-safe?
If you compose atomic operations, they are no longer atomic. That is, I can do one atomic compare-and-swap operation followed by another, and the two compare-and-swaps will individually be atomic, but they are divisible. Thus you can still have concurrency errors.
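A sketch of that pitfall (a hypothetical example of mine): each step below is atomic, but the check-then-act sequence as a whole is not:

#include <atomic>

std::atomic<int> balance{100};

void withdraw(int amount)
{
    if (balance.load() >= amount)   // atomic read
        balance.fetch_sub(amount);  // atomic write, but 'balance' may have
                                    // changed since the check above
}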
Atomic operation means the system performs an operation in its entirety or not at all. Reading or writing an int64 is atomic (on a 64-bit system and 64-bit CLR) because the system reads/writes the 8 bytes in one single operation; readers do not see half of the new value being stored and half of the old value. But be careful:
long n = 0;   // writing 'n' is atomic, 64-bit OS & 64-bit CLR
long m = n;   // reading 'n' is atomic
// ... some code
long o = n++; // not atomic: n = n + 1 does a read then a write, in 2 separate operations
To make the n++ atomic you can use the Interlocked API:
long o = Interlocked.Increment(ref n); // other threads are blocked while the atomic operation is running
Some languages provide a volatile modifier that is described as performing a "read memory barrier" prior to reading the memory that backs a variable.
A read memory barrier is commonly described as a way to ensure that the CPU has performed the reads requested before the barrier before it performs a read requested after the barrier. However, using this definition, it would seem that a stale value could still be read. In other words, performing reads in a certain order does not seem to mean that the main memory or other CPUs must be consulted to ensure that subsequent values read actually reflect the latest in the system at the time of the read barrier or written subsequently after the read barrier.
So, does volatile really guarantee that an up-to-date value is read or just (gasp!) that the values that are read are at least as up-to-date as the reads before the barrier? Or some other interpretation? What are the practical implications of this answer?
There are read barriers and write barriers; acquire barriers and release barriers. And more (io vs memory, etc).
The barriers are not there to control "latest" value or "freshness" of the values. They are there to control the relative ordering of memory accesses.
Write barriers control the order of writes. Because writes to memory are slow (compared to the speed of the CPU), there is usually a write-request queue where writes are posted before they 'really happen'. Although they are queued in order, while inside the queue the writes may be reordered. (So maybe 'queue' isn't the best name...) Unless you use write barriers to prevent the reordering.
Read barriers control the order of reads. Because of speculative execution (the CPU looks ahead and loads from memory early), and because of the existence of the write buffer (the CPU will read a value from the write buffer instead of memory if it is there - ie the CPU thinks it just wrote X = 5, so why read it back, just see that it is still waiting to become 5 in the write buffer), reads may happen out of order.
This is true regardless of what the compiler tries to do with respect to the order of the generated code. ie 'volatile' in C++ won't help here, because it only tells the compiler to output code to re-read the value from "memory", it does NOT tell the CPU how/where to read it from (ie "memory" is many things at the CPU level).
So read/write barriers put up blocks to prevent reordering in the read/write queues (the read isn't usually so much of a queue, but the reordering effects are the same).
What kinds of blocks? - acquire and/or release blocks.
Acquire - eg read-acquire(x) will add the read of x into the read-queue and flush the queue (not really flush the queue, but add a marker saying don't reorder anything before this read, which is as if the queue was flushed). So later (in code order) reads can be reordered, but not before the read of x.
Release - eg write-release(x, 5) will flush (or marker) the queue first, then add the write-request to the write-queue. So earlier writes won't become reordered to happen after x = 5, but note that later writes can be reordered before x = 5.
Note that I paired the read with acquire and write with release because this is typical, but different combinations are possible.
Acquire and Release are considered 'half-barriers' or 'half-fences' because they only stop the reordering from going one way.
A full barrier (or full fence) applies both an acquire and a release - ie no reordering.
Typically for lock-free programming, or for C# or Java 'volatile', what you want/need is read-acquire and write-release, ie:
void threadA()
{
    foo->x = 10;
    foo->y = 11;
    foo->z = 12;
    write_release(foo->ready, true);
    bar = 13;
}

void threadB()
{
    w = some_global;
    ready = read_acquire(foo->ready);
    if (ready)
    {
        q = w * foo->x * foo->y * foo->z;
    }
    else
        calculate_pi();
}
So, first of all, this is a bad way to program threads. Locks would be safer. But just to illustrate barriers...
After threadA() is done writing foo, it needs to write foo->ready LAST, really last, else other threads might see foo->ready early and get the wrong values of x/y/z. So we use a write_release on foo->ready, which, as mentioned above, effectively 'flushes' the write queue (ensuring x,y,z are committed) then adds the ready=true request to the queue. And then adds the bar=13 request. Note that since we just used a release barrier (not a full) bar=13 may get written before ready. But we don't care! ie we are assuming bar is not changing shared data.
Now threadB() needs to know that when we say 'ready' we really mean ready. So we do a read_acquire(foo->ready). This read is added to the read queue, THEN the queue is flushed. Note that w = some_global may also still be in the queue. So foo->ready may be read before some_global. But again, we don't care, as it is not part of the important data that we are being so careful about.
What we do care about is foo->x/y/z. So they are added to the read queue after the acquire flush/marker, guaranteeing that they are read only after reading foo->ready.
Note also, that this is typically the exact same barriers used for locking and unlocking a mutex/CriticalSection/etc. (ie acquire on lock(), release on unlock() ).
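For reference, here is roughly how the same pattern reads in standard C++11 atomics (a sketch of mine; the names mirror the example above):

#include <atomic>

struct Foo { int x, y, z; };
Foo foo;
std::atomic<bool> ready{false};

void threadA()
{
    foo.x = 10;
    foo.y = 11;
    foo.z = 12;
    // write-release: the stores above cannot be reordered past this
    ready.store(true, std::memory_order_release);
}

void threadB()
{
    // read-acquire: the reads below cannot be reordered before this
    if (ready.load(std::memory_order_acquire))
    {
        int q = foo.x * foo.y * foo.z;  // guaranteed to see 10, 11, 12
        (void)q;
    }
}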
So, I'm pretty sure this (ie acquire/release) is exactly what the MS docs say happens for reads/writes of 'volatile' variables in C# (and optionally for MS C++, but this is non-standard). See http://msdn.microsoft.com/en-us/library/aa645755(VS.71).aspx including "A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it..."
I think java is the same, although I'm not as familiar. I suspect it is exactly the same, because you just don't typically need more guarantees than read-acquire/write-release.
In your question you were on the right track when thinking that it is really all about relative order - you just had the orderings backwards (ie "the values that are read are at least as up-to-date as the reads before the barrier?" - no, reads before the barrier are unimportant; it's reads AFTER the barrier that are guaranteed to come after, and vice versa for writes).
And please note, as mentioned, reordering happens on both reads and writes, so only using a barrier on one thread and not the other WILL NOT WORK. ie a write-release isn't enough without the read-acquire. ie even if you write it in the right order, it could be read in the wrong order if you didn't use the read barriers to go with the write barriers.
And lastly, note that lock-free programming and CPU memory architectures can be actually much more complicated than that, but sticking with acquire/release will get you pretty far.
volatile in most programming languages does not imply a real CPU read memory barrier but an order to the compiler not to optimize the reads via caching in a register. This means that the reading process/thread will get the value "eventually". A common technique is to declare a boolean volatile flag to be set in a signal handler and checked in the main program loop.
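A classic sketch of that technique (a hypothetical example):

#include <csignal>

// The flag must be volatile (so the loop re-reads it) and of type
// sig_atomic_t (so reads and writes are indivisible with respect to signals).
volatile std::sig_atomic_t stop_requested = 0;

extern "C" void handle_sigint(int)
{
    stop_requested = 1;  // setting the flag is the only work done here
}

int main()
{
    std::signal(SIGINT, handle_sigint);
    while (!stop_requested)
    {
        // ... main program loop ...
    }
}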
In contrast CPU memory barriers are directly provided either via CPU instructions or implied with certain assembler mnemonics (such as lock prefix in x86) and are used for example when talking to hardware devices where order of reads and writes to memory-mapped IO registers is important or synchronizing memory access in multi-processing environment.
To answer your question - no, memory barrier does not guarantee "latest" value, but guarantees order of memory access operations. This is crucial for example in lock-free programming.
Here is one of the primers on CPU memory barriers.
The Win32 API has a set of InterlockedXXX functions to atomically and synchronously manipulate simple variables; however, there doesn't seem to be any InterlockedRead function to simply retrieve the value of a variable. How come?
MSDN says that:
Simple reads and writes to properly-aligned 32-bit variables are atomic operations
but adds:
However, access is not guaranteed to be synchronized. If two threads are reading and writing from the same variable, you cannot determine if one thread will perform its read operation before the other performs its write operation.
Which means, as I understand it, that a simple read operation of a variable can take place while another operation, say InterlockedAdd, is in progress. So why isn't there an interlocked function to read a variable?
I guess the value can be read as the result of InterlockedAdd-ing zero, but that doesn't seem the right way to go.
The normal way of implementing this is to use a compare-exchange operation (e.g. InterlockedCompareExchange64) where both values are the same. I have a sneaking suspicion this can be performed more efficiently than an add of 0 for some reason, but I have no evidence to back this up.
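A sketch of that idiom (the wrapper name here is hypothetical; InterlockedCompareExchange64 is the real Win32 API):

#include <windows.h>

// Atomically read a 64-bit value by compare-exchanging it with itself:
// if *target == 0 it is "replaced" with 0 (a no-op); either way the
// current value is returned, read in one atomic operation.
LONG64 AtomicRead64(volatile LONG64 *target)
{
    return InterlockedCompareExchange64(target, 0, 0);
}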
Interestingly, .NET's Interlocked class didn't gain a Read method until .NET 2.0. I believe that Interlocked.Read is implemented using Interlocked.CompareExchange. (Note that the documentation for Interlocked.Read strikes me as somewhat misleading - it talks about atomicity, but not volatility, which means something very specific on .NET. I'm not sure what the Win32 memory model guarantees about visibility of newly written values from a different thread, if anything.)
I think that your interpretation of "not synchronized" is wrong. Simple reads are atomic, but you have to take care of reordering and memory visibility issues yourself. The former is handled by using fence instructions at appropriate places, the latter is a non-issue with read (but a potential concurrent write has to ensure proper visibility, which Interlocked functions should do if they map to LOCKED asm instructions).
The crux of this whole discussion is proper alignment, which is defined in Partition I of xxx, in section '12.6.2 Alignment':
Built-in datatypes shall be properly aligned, which is defined as follows:
• 1-byte, 2-byte, and 4-byte data is properly aligned when it is stored at a 1-byte, 2-byte, or 4-byte boundary, respectively.
• 8-byte data is properly aligned when it is stored on the same boundary required by the underlying hardware for atomic access to a native int.
Basically, all 32-bit values have the required alignment, and on a 64-bit platform, 64-bit values also have the required alignment.
Note though: there are attributes to explicitly alter the layout of classes in memory, which may cause you to lose this alignment. These attributes are specifically for that purpose, though, so unless you have set out to alter the layout this should not apply to you.
With that out of the way, the purpose of the Interlocked class is to provide operations that (to paraphrase) can only be observed in their 'before' or 'after' state. Interlocked operations are normally only of concern when modifying memory (typically in some non-trivial compare-exchange type way). As the MSDN article you found indicates, read operations (when properly aligned) can be considered atomic at all times without further precautions.
There are however other considerations when dealing with read operations:
On modern CPUs, although the read may be atomic, it may also return a stale value from a cache somewhere... this is where you may need to make the field 'volatile' to get the behaviour you expect
If you are dealing with a 64-bit value on 32-bit hardware, you may need to use the Interlocked.Read operation to guarantee the whole 64-bit value is read in a single atomic operation (otherwise it may be performed as 2 separate 32-bit reads which can be from either side of a memory update)
Re-ordering of your reads / writes may cause you to not get the value you expected; in which case some memory barrier may be needed (either explicit, or through the use of the Interlocked class operations)
Short summary: as far as atomicity goes, it is very likely that what you are doing does not need any special instruction for the read... there may, however, be other things you need to be careful of, depending on what exactly you are doing.