As far as I know they aren't.
Atomic objects are free of data races, yet they can still suffer from race conditions: two threads might start in an unpredictable order making the program outcome non-deterministic.
Shared data would be "safe" (protected by atomics) but the sequence or timing could still be wrong.
Can you confirm this?
Yes, you are correct that non-atomic operations may still have race condition. If you have non-atomic operations that depend on the state of the atomic object without interference from other threads, you need to use another synchronization technique to maintain consistency.
Atomic operations on the atomic object will be consistent, but not race-free. Non-atomic operations using the atomic object are not race-free.
Not just atomic objects, any primitive that can be used with operations performed by threads running concurrently:
mutex
condition variable
semaphore
barrier
atomic objects...
by definition are only useful if there is a race, an unpredictability in the access pattern. If the accesses where well ordered in a predictable way, you would have used a regular mutable object in the programming language.
But even if the order is a priori unknown, the end result can be deterministic: consider the concurrently running threads serving pages for a static Web server, with a count of pages and bytes served as the only mutable data structure. The statistics can be kept in a data structure protected by a mutex (a mutex isn't needed, it's just a simple example): the order of mutex locking is unpredictable, but the end result is that the data structure contains the sum of pages and bytes served; it doesn't matter in which order each thread adds the counts to the shared data.
Related
I have a shared data structure that is read in one thread and modified in another thread. However, its data changes very occasionally. Most of time, it is read by the thread. I now have a Mutex (or RW lock) locked before read/write and unlocked after read/write.
Because the data rarely changes, lock-unlock every time it is read seems inefficient. If no change is made to the data, I can get rid of the lock because only read to the same structure can run simultaneously without lock.
My question is:
Is there a lock-free solution that allows me changes the data without a lock?
Or, the lock-unlock in read (one thread, in other words, no contention) don't take much of time/resources (no enter to the kernel) at all?
If there's no contention, not kernel call is needed, but still atomic lock acquisition is needed. If the resource is occupied for a short period of time, then spinning can be attempted before kernel call.
Mutex and RW lock implementations, such as (an usual quality implementation of) std::mutex / std::shared_mutex in C++ or CRITICAL_SECTION / SRW_LOCK in Windows already employ above mentioned techniques on their own. Linux mutexes are usually based on futex, so they also avoid kernel call when it its not needed. So you don't need to bother about saving a kernel call yourself.
And there are alternatives to locking. There are atomic types that can be accessed using lock-free reads and writes, they can always avoid lock. There are other patterns, such as SeqLock. There is transaction memory.
But before going there, you should make sure that locking is performance problem. Because use of atomics may be not simple (although it is simple for some languages and simple cases), and other alternatives have their own pitfalls.
An uncontrolled data race may be dangerous. Maybe not. And there may be very thin boundary between cases where it is and where it is not. For example, copying a bunch of integer could only result in garbage integers occasionally obtained, if integers are properly sized and aligned, then there may be only a mix up, but not garbage value of a single integer, and if you add some more complex type, say string, you may have a crash. So most of the times uncontrolled data race is treated as Undefined Behavior.
In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF), meanwhile the first thread reads the variable, couldn’t it be that the first thread reads totally trash from the variable if access wouldn’t be synchronized (on some system level)?
foo 12345678 (before)
89ABCDEF (new data)
••••• (writing thread progress)
89ABC678 (memory content)
Since I never saw those things happen I assume that there is some system level synchronization when writing variables. I assume, that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a topic and not totally fictious from me.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes by threads processes, OS, etc. It is not a matter of synchronization, just required to insure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 et locked when at 1.
The basic method to do that is
got_the_lock=0
while(!got_the_lock)
fetch lock value from memory
set lock value in memory to 1
got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one need atomic memory access. An atomic access is typically a read-modify-write cycle to a data in memory that cannot interrupted and that forbids access to this information until completion. So not all accesses are atomic, only specific read-modify-write operation and it is realized thanks tp specific processor support (see test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be a regular access. Atomic access is mostly use to synchronize threads to insure that only one thread is in a critical section.
So why are atomic access expensive ? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction order may be different from instruction program order, provided the semantic of the program is respected. This is heavily exploited to improve performances : compiler reorder instructions, processor execute them out-of-order, write-back caches write data in memory in any order, and memory write buffer do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffer that requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeable fast, but locking for anything else so expensive? Or, is it equally expensive, and there should be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory access are not atomic. Only specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention due to threads waking up, fighting for lock and only one gets it, and the rest go to sleep resulting in lot of context switches.
However, such contention can be kept at a minimum by using synchronization at a much granular level as in a CAS (compare and swap) operation by CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute as it is doing a IO and therefore run high chance of creating a contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason the non-blocking algorithms like Treiber Stack Michael Scott exist. They do a their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, None of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
I am trying to make "atomic vs non atomic" concept settled in my mind. My first problem is I could not find "real-life analogy" on that. Like customer/restaurant relationship over atomic operations or something similar.
Also I would like to learn about how atomic operations places themselves in thread-safe programming.
In this blog post; http://preshing.com/20130618/atomic-vs-non-atomic-operations/
it is mentioned as:
An operation acting on shared memory is atomic if it completes in a
single step relative to other threads. When an atomic store is
performed on a shared variable, no other thread can observe the
modification half-complete. When an atomic load is performed on a
shared variable, it reads the entire value as it appeared at a single
moment in time. Non-atomic loads and stores do not make those
guarantees.
What is the meaning of "no other thread can observe the modification half-complete"?
That means thread will wait until atomic operation is done? How that thread know about that operation is atomic? For example in .NET I can understand if you lock the object you set a flag to block other threads. But what about atomic? How other threads know difference between atomic and non-atomic operations?
Also if above statement is true, do all atomic operations are thread-safe?
Let's clarify a bit what is atomic and what are blocks. Atomicity means that operation either executes fully and all it's side effects are visible, or it does not execute at all. So all other threads can either see state before the operation or after it. Block of code guarded by a mutex is atomic too, we just don't call it an operation. Atomic operations are special CPU instructions which conceptually are similar to usual operation guarded by a mutex (you know what mutex is, so I'll use it, despite the fact that it is implemented using atomic operations). CPU has a limited set of operations which it can execute atomically, but due to hardware support they are very fast.
When we discuss thread blocks we usually involve mutexes in conversation because code guarded by them can take quite a time to execute. So we say that thread waits on a mutex. For atomic operations situation is the same, but they are fast and we usually don't care for delays here, so it is not that likely to hear words "block" and "atomic operation" together.
That means thread will wait until atomic operation is done?
Yes it will wait. CPU will restrict access to a block of memory where the variable is located and other CPU cores will wait. Note that for performance reasons that blocks are held only between atomic operations themselves. CPU cores are allowed to cache variables for read.
How that thread know about that operation is atomic?
Special CPU instructions are used. It is just written in your program that particular operation should be performed in atomic manner.
Additional information:
There are more tricky parts with atomic operations. For example on modern CPUs usually all reads and writes of primitive types are atomic. But CPU and compiler are allowed to reorder them. So it is possible that you change some struct, set a flag that telling that it is changed, but CPU reorders writes and sets flag before the struct is actually committed to memory. When you use atomic operations usually some additional efforts are done to prevent undesired reordering. If you want to know more, you should read about memory barriers.
Simple atomic stores and writes are not that useful. To make maximal use of atomic operations you need something more complex. Most common is a CAS - compare and swap. You compare variable with a value and change it only if comparison was successful.
On typical modern CPUs, atomic operations are made atomic this way:
When an instruction is issued that accesses memory, the core's logic attempts to put the core's cache in the correct state to access that memory. Typically, this state will be achieved before the memory access has to happen, so there is no delay.
While another core is performing an atomic operation on a chunk of memory, it locks that memory in its own cache. This prevents any other core from acquiring the right to access that memory until the atomic operation completes.
Unless two cores happen to be performing accesses to many of the same areas of memory and many of those accesses are writes, this typically won't involve any delays at all. That's because the atomic operation is very fast and typically the core knows in advance what memory it will need access to.
So, say a chunk of memory was last accessed on core 1 and now core 2 wants to do an atomic increment. When the core's prefetch logic sees the modification to that memory in the instruction stream, it will direct the cache to acquire that memory. The cache will use the intercore bus to take ownership of that region of memory from core 1's cache and it will lock that region in its own cache.
At this point, if another core tries to read or modify that region of memory, it will be unable to acquire that region in its cache until the lock is released. This communication takes place on the bus that connects the caches and precisely where it takes place depends on which cache(s) the memory was in. (If not in cache at all, then it has to go to main memory.)
A cache lock is not normally described as blocking a thread both because it is so fast and because the core is usually able to do other things while it's trying to acquire the memory region that is locked in the other cache. From the point of view of the higher-level code, the implementation of atomics is typically considered an implementation detail.
All atomic operations provide the guarantee that an intermediate result will not be seen. That's what makes them atomic.
The atomic operations you describe are instructions within the processor and the hardware will make sure that a read cannot happen on a memory location until the atomic write is complete. This guarantees that a thread either reads the value before write or the value after the write operation, but nothing in-between - there's no chance of reading half of the bytes of the value from before the write and the other half from after the write.
Code running against the processor is not even aware of this block but it's really no different from using a lock statement to make sure that a more complex operation (made up of many low-level instructions) is atomic.
A single atomic operation is always thread-safe - the hardware guarantees that the effect of the operation is atomic - it'll never get interrupted in the middle.
A set of atomic operations is not atomic in the vast majority of cases (I'm not an expert so I don't want to make a definitive statement but I can't think of a case where this would be different) - this is why locking is needed for complex operations: the entire operation may be made up of multiple atomic instructions but the whole of the operation may still be interrupted between any of those two instructions, creating the possibility of another thread seeing half-baked results. Locking ensures that code operating on shared data cannot access that data until the other operation completes (possibly over several thread switches).
Some examples are shown in this question / answer but you find many more by searching.
Being "atomic" is an attribute that applies to an operation which is enforced by the implementation (either the hardware or the compiler, generally speaking). For a real-life analogy, look to systems requiring transactions, such as bank accounts. A transfer from one account to another involves a withdrawal from one account and a deposit to another, but generally these should be performed atomically - there is no time when the money has been withdrawn but not yet deposited, or vice versa.
So, continuing the analogy for your question:
What is the meaning of "no other thread can observe the modification half-complete"?
This means that no thread could observe the two accounts in a state where the withdrawal had been made from one account but it had not been deposited in another.
In machine terms, it means that an atomic read of a value in one thread will not see a value with some bits from before an atomic write by another thread, and some bits from after the same write operation. Various operations more complex than just a single read or write can also be atomic: for instance, "compare and swap" is a commonly implemented atomic operation that checks the value of a variable, compares it to a second value, and replaces it with another value if the compared values were equal, atomically - so for instance, if the comparison succeeds, it is not possible for another thread to write a different value in between the compare and the swap parts of the operation. Any write by another thread will either be performed wholly before or wholly after the atomic compare-and-swap.
The title to your question is:
Will atomic operations block other threads?
In the usual meaning of "block", the answer is no; an atomic operation in one thread won't by itself cause execution to stop in another thread, although it may cause a livelock situation or otherwise prevent progress.
That means thread will wait until atomic operation is done?
Conceptually, it means that they will never need to wait. The operation is either done, or not done; it is never halfway done. In practice, atomic operations can be implemented using mutexes, at a significant performance cost. Many (if not most) modern processors support various atomic primitives at the hardware level.
Also if above statement is true, do all atomic operations are thread-safe?
If you compose atomic operations, they are no longer atomic. That is, I can do one atomic compare-and-swap operation followed by another, and the two compare-and-swaps will individually be atomic, but they are divisible. Thus you can still have concurrency errors.
Atomic operation means the system performs an operation in its entirety or not at all. Reading or writing an int64 is atomic (64bits System & 64bits CLR) because the system read/write the 8 bytes in one single operation, readers do not see half of the new value being stored and half of the old value. But be carefull :
long n = 0; // writing 'n' is atomic, 64bits OS & 64bits CLR
long m = n; // reading 'n' is atomic
....// some code
long o = n++; // is not atomic : n = n + 1 is doing a read then a write in 2 separate operations
To make atomicity happens to the n++ you can use the Interlocked API :
long o = Interlocked.Increment(ref n); // other threads are blocked while the atomic operation is running
I often find these terms being used in context of concurrent programming . Are they the same thing or different ?
No, they are not the same thing. They are not a subset of one another. They are also neither the necessary, nor the sufficient condition for one another.
The definition of a data race is pretty clear, and therefore, its discovery can be automated. A data race occurs when 2 instructions from different threads access the same memory location, at least one of these accesses is a write and there is no synchronization that is mandating any particular order among these accesses.
A race condition is a semantic error. It is a flaw that occurs in the timing or the ordering of events that leads to erroneous program behavior. Many race conditions can be caused by data races, but this is not necessary.
Consider the following simple example where x is a shared variable:
Thread 1 Thread 2
lock(l) lock(l)
x=1 x=2
unlock(l) unlock(l)
In this example, the writes to x from thread 1 and 2 are protected by locks, therefore they are always happening in some order enforced by the order with which the locks are acquired at runtime. That is, the writes' atomicity cannot be broken; there is always a happens before relationship between the two writes in any execution. We just cannot know which write happens before the other a priori.
There is no fixed ordering between the writes, because locks cannot provide this. If the programs' correctness is compromised, say when the write to x by thread 2 is followed by the write to x in thread 1, we say there is a race condition, although technically there is no data race.
It is far more useful to detect race conditions than data races; however this is also very difficult to achieve.
Constructing the reverse example is also trivial. This blog post also explains the difference very well, with a simple bank transaction example.
According to Wikipedia, the term "race condition" has been in use since the days of the first electronic logic gates. In the context of Java, a race condition can pertain to any resource, such as a file, network connection, a thread from a thread pool, etc.
The term "data race" is best reserved for its specific meaning defined by the JLS.
The most interesting case is a race condition that is very similar to a data race, but still isn't one, like in this simple example:
class Race {
static volatile int i;
static int uniqueInt() { return i++; }
}
Since i is volatile, there is no data race; however, from the program correctness standpoint there is a race condition due to the non-atomicity of the two operations: read i, write i+1. Multiple threads may receive the same value from uniqueInt.
TL;DR: The distinction between data race and race condition depends on the nature of problem formulation, and where to draw the boundary between undefined behavior and well-defined but indeterminate behavior. The current distinction is conventional and best reflects the interface between processor architect and programming language.
1. Semantics
Data race specifically refers to the non-synchronized conflicting "memory accesses" (or actions, or operations) to the same memory location. If there is no conflict in the memory accesses, while there is still indeterminate behavior caused by operation ordering, that is a race condition.
Note "memory accesses" here have specific meaning. They refer to the "pure" memory load or store actions, without any additional semantics applied. For example, a memory store from one thread does not (necessarily) know how long it takes for the data to be written into the memory, and finally propagates to another thread. For another example, a memory store to one location before another store to another location by the same thread does not (necessarily) guarantee the first data written in the memory be ahead of the second. As a result, the order of those pure memory accesses are not (necessarily) able to be "reasoned" , and anything could happen, unless otherwise well defined.
When the "memory accesses" are well defined in terms of ordering through synchronization, additional semantics can ensure that, even if the timing of the memory accesses are indeterminate, their order can be "reasoned" through the synchronizations. Note, although the ordering between the memory accesses can be reasoned, they are not necessarily determinate, hence the race condition.
2. Why the difference?
But if the order is still indeterminate in race condition, why bother to distinguish it from data race? The reason is in practical rather than theoretical. It is because the distinction does exist in the interface between the programming language and processor architecture.
A memory load/store instruction in modern architecture is usually implemented as "pure" memory access, due to the nature of out-of-order pipeline, speculation, multi-level of cache, cpu-ram interconnection, especially multi-core, etc. There are lots of factors leading to indeterminate timing and ordering. To enforce ordering for every memory instruction incurs huge penalty, especially in a processor design that supports multi-core. So the ordering semantics are provided with additional instructions like various barriers (or fences).
Data race is the situation of processor instruction execution without additional fences to help reasoning the ordering of conflicting memory accesses. The result is not only indeterminate, but also possibly very weird, e.g., two writes to the same word location by different threads may result with each writing half of the word, or may only operate upon their locally cached values. -- These are undefined behavior, from the programmer's point of view. But they are (usually) well defined from the processor architect's point of view.
Programmers have to have a way to reason their code execution. Data race is something they cannot make sense, therefore should always avoid (normally). That is why the language specifications that are low level enough usually define data race as undefined behavior, different from the well-defined memory behavior of race condition.
3. Language memory models
Different processors may have different memory access behavior, i.e., processor memory model. It is awkward for programmers to study the memory model of every modern processor and then develop programs that can benefit from them. It is desirable if the language can define a memory model so that the programs of that language always behave as expected as the memory model defines. That is why Java and C++ have their memory models defined. It is the burden of the compiler/runtime developers to ensure the language memory models are enforced across different processor architectures.
That said, if a language does not want to expose the low level behavior of the processor (and is willing to sacrifice certain performance benefits of the modern architectures), they can choose to define a memory model that completely hide the details of "pure" memory accesses, but apply ordering semantics for all their memory operations. Then the compiler/runtime developers may choose to treat every memory variable as volatile in all processor architectures. For these languages (that support shared memory across threads), there are no data races, but may still be race conditions, even with a language of complete sequential consistence.
On the other hand, the processor memory model can be stricter (or less relaxed, or at higher level), e.g., implementing sequential consistency as early-days processor did. Then all memory operations are ordered, and no data race exists for any languages running in the processor.
4. Conclusion
Back to the original question, IMHO it is fine to define data race as a special case of race condition, and race condition at one level may become data race at a higher level. It depends on the nature of problem formulation, and where to draw the boundary between undefined behavior and well-defined but indeterminate behavior. Just the current convention defines the boundary at language-processor interface, does not necessarily mean that is always and must be the case; but the current convention probably best reflects the state-of-the-art interface (and wisdom) between processor architect and programming language.
No, they are different & neither of them is a subset of one or vice-versa.
The term race condition is often confused with the related term data
race, which arises when synchronization is not used to coordinate all
access to a shared nonfinal field. You risk a data race whenever a
thread writes a variable that might next be read by another thread or
reads a variable that might have last been written by another thread
if both threads do not use synchronization; code with data races has
no useful defined semantics under the Java Memory Model. Not all race
conditions are data races, and not all data races are race conditions,
but they both can cause concurrent programs to fail in unpredictable
ways.
Taken from the excellent book - Java Concurrency in Practice by Brian Goetz & Co.
Data races and Race condition
[Atomicity, Visibility, Ordering]
In my opinion definitely it is two different things.
Data races is a situation when same memory is shared between several threads(at least one of them change it (write access)) without synchronoization
Race condition is a situation when not synchronized blocks of code(may be the same) which use same shared resource are run simultaneously on different threads and result of which is unpredictable.
Race condition examples:
//increment variable
1. read variable
2. change variable
3. write variable
//cache mechanism
1. check if exists in cache and if not
2. load
3. cache
Solution:
Data races and Race condition are problem with atomicity and they can be solved by synchronization mechanism.
Data races - When write access to shared variable will be synchronized
Race condition - When block of code is run as an atomic operation
I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer,) but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.
What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
producer/consumer workflows. Web service receives http requests for data, places the request into an internal queue, worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread safe.
Data shared between threads with change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing the object. The memory management system (malloc/free) must be thread safe.
File system. This is almost always an OS service and already fully thread safe, but it's worth including in the list.
Reference counting. Releases the resource when the number of references drops to zero. The increment/decrement/test operations must be thread safe. These can usually be implemented using atomic primitives instead of full-stop kernal mutex locks.
Most real world, concurrent software, has some form of requirement for synchronization at some level. Often, better written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often do simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (ie: use of thread local state data, etc), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time, it's occuring on systems with 4-16 PEs, which means I'm usually restricting to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking between tens of elements instead of potentially millions.