Memory Barrier and Visibility - x64

I have read the Intel document about memory ordering on x64 (http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf). It says that locked instructions act as full barriers, which make processors see updates in a specified order. But it says nothing about the visibility caused by barriers. Do barriers cause other processors to see updates of variables immediately, or do updates propagate to other processors only in the specified order but at an unspecified time?
E.g.
Thread1:
flag = true;
MemoryBarrier();
Thread 2:
MemoryBarrier();
tmp = flag;
Will Thread 2 always see flag == true if Thread 1 executes its code before Thread 2?

The barriers guarantee that other processors will see updates in the specified order, but not when that happens.
Which brings up the follow-up question: how do you define "immediately" in a multiprocessor system [1], and how do you ensure that Thread 1 executes before Thread 2? In this case, one answer would be that Thread 1 uses an atomic instruction such as xchg to do the store to the flag variable, and Thread 2 spins on the flag and proceeds when it notices that the value has changed. (Due to the way the x86 memory model works, Thread 2 can spin using normal load instructions; it is sufficient that the store is done with an atomic instruction.)
[1] One can think of it in terms of relativistic physics, each observer (thread) sees events through its own "light cone". Hence one must abandon concepts such as a single universal time for all observers.
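A minimal C++ sketch of the spin-on-flag approach described above, in case it helps. The function names are made up for illustration, and the notes about which instructions get emitted are what compilers typically do on x86-64, not a guarantee. It counts how many polls Thread 2 needed, to make the point that the store becomes visible eventually but after an unspecified delay.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// `flag` mirrors the variable from the question; everything else is illustrative.
std::atomic<bool> flag{false};

// Thread 1: store the flag with an atomic read-modify-write.
// A seq_cst exchange is typically compiled to XCHG on x86-64.
void thread1() {
    flag.exchange(true);
}

// Thread 2: spin with ordinary loads until the store is observed.
// An acquire load is typically a plain MOV on x86-64.
// Returns how many polls saw the old value; that number is not predictable.
long thread2() {
    long polls = 0;
    while (!flag.load(std::memory_order_acquire)) {
        ++polls;
        std::this_thread::yield();
    }
    return polls;
}

// Starts the spinner first, then the writer, and returns the poll count.
long run_spin_demo() {
    flag.store(false);
    long polls = -1;
    std::thread t2([&] { polls = thread2(); });
    std::thread t1(thread1);
    t1.join();
    t2.join();
    assert(flag.load());  // the store is visible once both threads have finished
    return polls;
}
```

Calling run_spin_demo() repeatedly will generally give different poll counts, which is the "in order, but not at a specified time" behaviour in practice.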

Related

What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core?

Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote).
What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously), so now a read of X wouldn't return 10 as the thread expects. Is there some L1 cache synchronization that occurs when a thread is scheduled on a different core?

All that is required in this case is that the writes performed while on the first processor become globally visible before the process begins executing on the second processor. In the Intel 64 architecture this is accomplished by including one or more instructions with memory fence semantics in the code that the OS uses to transfer the process from one core to another. An example from the Linux kernel:
/*
 * Make previous memory operations globally visible before
 * sending the IPI through x2apic wrmsr. We need a serializing instruction or
 * mfence for this.
 */
static inline void x2apic_wrmsr_fence(void)
{
	asm volatile("mfence" : : : "memory");
}
This ensures that the stores from the original core are globally visible before execution of the inter-processor interrupt that will start the thread running on the new core.
Reference: Sections 8.2 and 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-071, October 2019).

TL;DR: It depends on the architecture and the OS. On x86, this type of read-after-write hazard is mostly not an issue that has to be considered at the software level, except for weakly-ordered WC stores, which require a store fence to be executed in software on the same logical core before the thread is migrated.
Usually the thread migration operation includes at least one memory store. Consider an architecture with the following property:
The memory model is such that memory stores may not become globally observable in program order. This Wikipedia article has a not-accurate-but-good-enough table that shows examples of architectures that have this property (see the row "Stores can be reordered after stores ").
The ordering hazard you mentioned may be possible on such an architecture because even if the thread migration operation completes, it doesn't necessarily mean that all the stores that the thread has performed are globally observable. On architectures with strict sequential store ordering, this hazard cannot occur.
On a completely hypothetical architecture where it's possible to migrate a thread without doing a single memory store (e.g., by directly transferring the thread's context to another core), the hazard can occur even if all stores are sequential on an architecture with the following property:
There is a "window of vulnerability" between the time when a store retires and when it becomes globally observable. This can happen, for example, due to the presence of store buffers and/or MSHRs. Most modern processors have this property.
So even with sequential store ordering, it may be possible that the thread running on the new core may not see the last N stores.
Note that on a machine with in-order retirement, the window of vulnerability is a necessary but insufficient condition for a memory model that supports stores that may not be sequential.
Usually a thread is rescheduled to run on a different core using one of the following two methods:
A hardware interrupt, such as a timer interrupt, occurs that ultimately causes the thread to be rescheduled on a different logical core.
The thread itself performs a system call, such as sched_setaffinity, that ultimately causes it to run on a different core.
The question is: at which point does the system guarantee that retired stores become globally observable? On Intel and AMD x86 processors, hardware interrupts are fully serializing events, so all user-mode stores (including cacheable and uncacheable) are guaranteed to be globally observable before the interrupt handler is executed, in which the thread may be rescheduled to run on a different logical core.
On Intel and AMD x86 processors, there are multiple ways to perform system calls (i.e., change the privilege level), including INT, SYSCALL, SYSENTER, and far CALL. None of them guarantees that all previous stores become globally observable. Therefore, the OS is supposed to do this explicitly when scheduling a thread on a different core by executing a store fence operation. This is done as part of saving the thread context (architectural user-mode registers) to memory and adding the thread to the queue associated with the other core. These operations involve at least one store that is subject to the sequential ordering guarantee. When the scheduler runs on the target core, the full register and memory architectural state of the thread (at the point of the last retired instruction) would be available on that core.
On x86, if the thread uses stores of type WC, which do not guarantee sequential ordering, the OS may not guarantee that it will make these stores globally observable. The x86 spec explicitly states that in order to make WC stores globally observable, a store fence has to be used (either in the thread on the same core or, much simpler, in the OS). An OS generally should do this, as mentioned in @JohnDMcCalpin's answer. Otherwise, if the OS doesn't provide the program order guarantee to software threads, then the user-mode programmer may need to take this into account. One way would be the following:
Save a copy of the current CPU mask and pin the thread to the current core (or any single core).
Execute the weakly-ordered stores.
Execute a store fence.
Restore the CPU mask.
This temporarily disables migration to ensure that the store fence is executed on the same core as the weakly-ordered stores. After executing the store fence, the thread can safely migrate without possibly violating program order.
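The four steps above could look roughly like this on Linux/x86-64. This is only a sketch under some assumptions: the function name is made up, non-temporal stores (_mm_stream_si32) stand in for the weakly-ordered stores, and the glibc affinity calls are used for pinning.

```cpp
#include <cassert>
#include <emmintrin.h>  // _mm_stream_si32 (SSE2 non-temporal store)
#include <sched.h>      // sched_getaffinity, sched_setaffinity, sched_getcpu
#include <xmmintrin.h>  // _mm_sfence

// Hypothetical helper: fills dst[0..n) using non-temporal stores while the
// calling thread is pinned to a single core. Returns false if any of the
// affinity calls fail.
bool pinned_streaming_fill(int* dst, int n, int value) {
    assert(n >= 0);

    // Step 1: save the current CPU mask and pin the thread to the core it is on.
    cpu_set_t saved;
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0) return false;
    int cpu = sched_getcpu();
    if (cpu < 0) return false;
    cpu_set_t pinned;
    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    if (sched_setaffinity(0, sizeof(pinned), &pinned) != 0) return false;

    // Step 2: execute the weakly-ordered stores.
    for (int i = 0; i < n; ++i) {
        _mm_stream_si32(dst + i, value);
    }

    // Step 3: store fence, executed on the same core as the stores above.
    _mm_sfence();

    // Step 4: restore the original CPU mask; migration is allowed again.
    return sched_setaffinity(0, sizeof(saved), &saved) == 0;
}
```

The two extra system calls are not free, so this only makes sense if the OS really does not fence on migration.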
Note that user-mode sleep instructions, such as UMWAIT, cannot cause the thread to be rescheduled on a different core because the OS does not take control in this case.
Thread Migration in the Linux Kernel
The code snippet from @JohnDMcCalpin's answer falls on the path to send an inter-processor interrupt, which is achieved using a WRMSR instruction to an APIC register. An IPI may be sent for many reasons, for example, to perform a TLB shootdown operation. In this case, it's important to ensure that the updated paging structures are globally observable before invalidating the TLB entries on the other cores. That's why x2apic_wrmsr_fence may be needed, which is invoked just before sending an IPI.
That said, I don't think thread migration requires sending an IPI. Essentially, a thread is migrated by removing it from some data structure that is associated with one core and adding it to the one associated with the target core. A thread may be migrated for numerous reasons, such as when the affinity changes or when the scheduler decides to rebalance the load. As mentioned in the Linux source code, all paths of thread migration in the source code end up executing the following:
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg)
where arg holds the task to be migrated and the destination core identifier. migration_cpu_stop is a function that does the actual migration. However, the task to be migrated may be currently running or waiting in some runqueue to run on the source core (i.e., the core on which the task is currently scheduled). It's required to stop the task before migrating it. This is achieved by adding the call to the function migration_cpu_stop to the queue of the stopper task associated with the source core. stop_one_cpu then sets the stopper task as ready for execution. The stopper task has the highest priority. So on the next timer interrupt on the source core (which could be the same as the current core), one of the tasks with the highest priority will be selected to run. Eventually, the stopper task will run and it will execute migration_cpu_stop, which in turn performs the migration. Since this process involves a hardware interrupt, all stores of the target task are guaranteed to be globally observable.
There appears to be a bug in x2apic_wrmsr_fence
The purpose of x2apic_wrmsr_fence is to make all previous stores globally observable before sending the IPI. As discussed in this thread, SFENCE is not sufficient here. To see why, consider the following sequence:
store
sfence
wrmsr
The store fence here can order the preceding store operation, but not the MSR write. The WRMSR instruction doesn't have any serializing properties when writing to an APIC register in x2APIC mode. This is mentioned in the Intel SDM volume 3 Section 10.12.3:
To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers.
The problem here is that MFENCE is also not guaranteed to order the later WRMSR with respect to previous stores. On Intel processors, it's documented to only order memory operations. Only on AMD processors is it guaranteed to be fully serializing. So to make it work on Intel processors, there needs to be an LFENCE after the MFENCE (SFENCE is not ordered with LFENCE, so MFENCE must be used even though we don't need to order loads). Actually, Section 10.12.3 mentions this.
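To make the suggested MFENCE + LFENCE sequence concrete, here is a user-space sketch (x86-64, GCC/Clang inline asm). The helper names are hypothetical, and since WRMSR is privileged, a plain store to a "doorbell" variable stands in for the x2APIC register write. It only shows where the fences sit, not the kernel's actual code.

```cpp
#include <cassert>

// Hypothetical helper modeled on the x2apic_wrmsr_fence snippet quoted earlier.
// MFENCE orders the earlier stores; LFENCE keeps later instructions (the WRMSR,
// in the kernel case) from executing until MFENCE has completed.
static inline void wrmsr_fence_sketch() {
    asm volatile("mfence\n\tlfence" : : : "memory");
}

// Stand-in for "update shared data, fence, then send the IPI".
int publish_then_signal(int* data, volatile int* doorbell) {
    assert(data != nullptr && doorbell != nullptr);
    *data = 1;             // store that must be globally observable first
    wrmsr_fence_sketch();  // mfence; lfence
    *doorbell = 1;         // placeholder for the WRMSR to the x2APIC ICR
    return *data + *doorbell;
}
```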

If a platform is going to support moving a thread from one core to another, whatever code does that moving must respect whatever guarantees a thread is allowed to rely on. If a thread is allowed to rely on the guarantee that a read after a write will see the updated value, then whatever code migrates a thread from one core to another must ensure that guarantee is preserved.
Everything else is platform specific. If a platform has an L1 cache then hardware must make that cache fully coherent or some form of invalidation or flushing will be necessary. On most typical modern processors, hardware makes the cache only partially coherent because reads can also be prefetched and writes can be posted. On x86 CPUs, special hardware magic solves the prefetch problem (the prefetch is invalidated if the L1 cache line is invalidated). I believe the OS and/or scheduler has to specifically flush posted writes, but I'm not entirely sure and it may vary based on the exact CPU.
The CPU goes to great cost to ensure that a read will always see a previous write in the same instruction stream. For an OS to remove this guarantee and require all user-space code to work without it would be a complete non-starter since user-space code has no way to know where in its code it might get migrated.

Adding my two bits here. At first glance, a barrier seems like overkill (see the answers above).
Consider this logic: when a thread wants to write to a cacheline, HW cache coherence kicks in and we need to invalidate all other copies of the cacheline that are present with other cores in the system; the write doesn't proceed without the invalidations. When a thread is re-scheduled to a different core then, it will have to fetch the cacheline from the L1-cache that has write permission thereby maintaining read-after-write sequential behavior.
The problem with this logic is that invalidations from cores aren't applied immediately, hence it is possible to read a stale value after being rescheduled (the read to the new L1-cache somehow beats the pending invalidation present in a queue with that core). This is ok for different threads because they are allowed to slip and slide, but with the same thread a barrier becomes essential.

Context switch between kernel threads vs user threads

Copy pasted from this link:
Thread switching does not require Kernel mode privileges.
User level threads are fast to create and manage.
Kernel threads are generally slower to create and manage than the user threads.
Transfer of control from one thread to another within the same process requires a mode switch to the Kernel.
I never came across these points while reading standard operating systems reference books. Though these points sound logical, I wanted to know how they reflect in Linux. To be precise :
Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
Can someone link me to Linux source code line (say on github) handling context switch.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?

Can someone give detailed steps involved in context switching between user threads and kernel threads, so that I can find the step difference between the two.
Let's imagine a thread needs to read data from a file, but the file isn't cached in memory and disk drives are slow so the thread has to wait; and for simplicity let's also assume that the kernel is monolithic.
For kernel threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes, sets the thread to "blocked waiting for IO" and switches to a different thread (that may belong to a completely different process, depending on global thread priorities). The kernel returns to the user-space of whatever thread it switched to.
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread and unblocks that thread. At this point the kernel might decide to switch to the "now unblocked" thread; and the kernel returns to the user-space of the "now unblocked" thread.
For user threading:
thread calls a "read()" function in a library or something; which must cause at least a switch to kernel code (because it's going to involve device drivers).
the kernel adds the IO request to the disk driver's "queue of possibly many pending requests"; realizes the thread will need to wait until the request completes but can't take care of that because some fool decided to make everything worse by doing thread switching in user space, so the kernel returns to user-space with "IO request has been queued" status.
after the pointless extra overhead of switching back to user-space; the user-space scheduler does the thread switch that the kernel could have done. At this point the user-space scheduler will either tell kernel it has nothing to do and you'll have more pointless extra overhead switching back to kernel; or user-space scheduler will do a thread switch to another thread in the same process (which may be the wrong thread because a thread in a different process is higher priority).
later; the disk hardware causes an IRQ which causes a switch back to the IRQ handler in kernel code. The disk driver finishes up the work it had to do for the (currently blocked) thread; but the kernel isn't able to do the thread switch to unblock the thread because some fool decided to make everything worse by doing thread switching in user space. Now we've got a problem - how does kernel inform the user-space scheduler that the IO has finished? To solve this (without any "user-space scheduler running zero threads constantly polls kernel" insanity) you have to have some kind of "kernel puts notification of IO completion on some kind of queue and (if the process was idle) wakes the process up" which (on its own) will be more expensive than just doing the thread switch in the kernel. Of course if the process wasn't idle then code in user-space is going to have to poll its notification queue to find out if/when the "notification of IO completion" arrives, and that's going to increase latency and overhead. In any case, after lots of stupid pointless and avoidable overhead; the user-space scheduler can do the thread switch.
Can someone explain the difference with actual context switch example or code. May be system calls involved (in case of context switching between kernel threads) and thread library calls involved (in case of context switching between user threads).
The actual low-level context switch code typically begins with something like:
save on the stack whichever registers are "callee preserved" (callee-saved) according to the calling conventions
save the current stack top in some kind of "thread info structure" belonging to the old thread
load a new stack top from some kind of "thread info structure" belonging to the new thread
pop whichever registers are "callee preserved" (callee-saved) according to the calling conventions
return
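For a feel of what those steps look like from user space, here is a small sketch using the POSIX ucontext API available in glibc (getcontext/makecontext/swapcontext). It is not how the kernel does it; swapcontext just performs the same "save registers and stack pointer of the old thread, load those of the new thread, return" sequence. All names other than the ucontext calls are made up.

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

static ucontext_t main_ctx, worker_ctx;  // the two "thread info structures"
static std::vector<int> trace;           // records the order in which code ran

static void worker() {
    trace.push_back(1);
    swapcontext(&worker_ctx, &main_ctx);  // save worker state, resume main
    trace.push_back(3);
    // Returning from here resumes the context named in uc_link (main_ctx).
}

std::vector<int> run_switch_demo() {
    static char stack[64 * 1024];  // stack for the second context
    trace.clear();

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = sizeof(stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);  // switch to worker; it records 1
    trace.push_back(2);
    swapcontext(&main_ctx, &worker_ctx);  // resume worker; it records 3 and returns
    trace.push_back(4);

    assert(trace.size() == 4);
    return trace;
}
```

run_switch_demo() returns {1, 2, 3, 4}: control bounces between the two contexts without the scheduler being involved.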
However:
usually (for modern CPUs) there's a relatively large amount of "SIMD register state" (e.g. for 80x86 with support for AVX-512 I think it's over 4 KiB of stuff). CPU manufacturers often have mechanisms to avoid saving parts of that state if it wasn't changed, and to (optionally) postpone the loading of (pieces of) that state until it's actually used (and avoid it completely if it's not actually used). All of that requires the kernel.
if it's a task switch and not just used for thread switches you might need some kind of "if virtual address space needs to change { change virtual address space }" on top of that
normally you want to keep track of statistics, like how much CPU time a thread has used. This requires some kind of "thread_info.time_used += now() - time_at_last_thread_switch;"; which gets difficult/ugly when "process switching" is separated from "thread switching".
normally there's other state (e.g. pointer to thread local storage, special registers for performance monitoring and/or debugging, ...) that may need to be saved/loaded during thread switches. Often this state is not directly accessible in user code.
normally you also want to set a timer to expire when the thread has used too much time; either because you're doing some kind of "time multiplexing" (e.g. round-robin scheduler) or because it's a cooperative scheduler where you need to have some kind of "terminate this task after 5 seconds of not responding in case it goes into an infinite loop forever" safe-guard.
this is just the low level task/thread switching in isolation. There is almost always higher level code to select a task to switch to, handle "thread used too much CPU time", etc.
Can someone link me to Linux source code line (say on github) handling context switch
Someone probably can't. It's not one line; it's many lines of assembly for each different architecture, plus extra higher-level code (for timers, support routines, the "select a task to switch to" code, for exception handlers to support "lazy SIMD state load", ...); which probably all adds up to something like 10 thousand lines of code spread across 50 files.
I also doubt why context switch between kernel threads requires changing to kernel mode. Aren't we already in kernel mode for first thread?
Yes; often you're already in kernel code when you find out that a thread switch is needed.
Rarely/sometimes (mostly only due to communication between threads belonging to the same process - e.g. 2 or more threads in the same process trying to acquire the same mutex/semaphore at the same time; or threads sending data to each other and waiting for data from each other to arrive) kernel isn't involved; and in some cases (which are almost always massive design failures - e.g. extreme lock contention problems, failure to use "worker thread pools" to limit the number of threads needed, etc) it's possible for this to be the dominant cause of thread switches, and therefore possible that doing thread switches in user space can be beneficial (e.g. as a work-around for the massive design failures).

Don't limit yourself to Linux or even UNIX, they are neither the first nor last word on systems or programming models. The synchronous execution model dates back to the early days of computing, and is not particularly well suited to larger scale concurrent and reactive programming.
Golang, for example, employs a great many lightweight user threads -- goroutines -- and multiplexes them on a smaller set of heavyweight kernel threads to produce a more compelling concurrency paradigm. Some other programming systems take similar approaches.

Out-of-order execution and reordering: can I see what comes after the barrier before the barrier?

According to wikipedia: A memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Usually, articles talk about something like this (I will use monitors instead of membars):
class ReadWriteExample {
    int A = 0;
    int Another = 0;
    //thread1 runs this method
    void writer () {
        lock monitor1; //a new value will be stored
        A = 10; //stores 10 to memory location A
        unlock monitor1; //a new value is ready for reader to read
        Another = 20; //#see my question
    }
    //thread2 runs this method
    void reader () {
        lock monitor1; //a new value will be read
        assert A == 10; //loads from memory location A
        print Another //#see my question
        unlock monitor1;//a new value was just read
    }
}
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
I.e. by definition operations issued prior to barrier can't be pushed down by compiler, but is it possible that operations issued after barrier would be occasionally seen before barrier? (just a probability)
Thanks

My answer below only addresses Java's memory model. The answer really can't be made for all languages as each may define the rules differently.
But I wonder is it possible that compiler or cpu will shuffle the things around in a such way that code will print 20? I don't need guarantee.
Your question seems to be: "Is it possible for the store Another = 20 to be re-ordered above the unlock of the monitor?"
The answer is yes, it can be. If you look at the JSR 166 Cookbook, the first grid shown explains how re-orderings work.
In your writer case, the first operation would be MonitorExit and the second operation would be NormalStore. The grid shows that yes, this sequence is permitted to be re-ordered.
This is known as Roach Motel ordering: memory accesses can be moved into a synchronized block but cannot be moved out.
What about another language? Well, this question is too broad to answer all questions as each may define the rules differently. If this is the case you would need to refine your question.

In Java there is the concept of happens-before. You can read all the details about it in the Java Language Specification. A Java compiler or runtime engine can re-order code but it must abide by the happens-before rules. These rules are important for a Java developer who wants to have detailed control over how their code is re-ordered. I myself have been burnt by re-ordered code: it turned out I was referencing the same object via two different variables, and the runtime engine re-ordered my code, not realizing that the operations were on the same object. If I had either had a happens-before relationship (between the two operations) or used the same variable, then the re-ordering would not have occurred.
Specifically:
It follows from the above definitions that:
An unlock on a monitor happens-before every subsequent lock on that monitor.
A write to a volatile field (§8.3.1.4) happens-before every subsequent
read of that field.
A call to start() on a thread happens-before any actions in the
started thread.
All actions in a thread happen-before any other thread successfully
returns from a join() on that thread.
The default initialization of any object happens-before any other
actions (other than default-writes) of a program.

Short answer - yes. This is very compiler and CPU architecture dependent. You have here the definition of a Race Condition. The scheduling Quantum won't end mid-instruction (can't have two writes to same location). However - the quantum could end between instructions - plus how they are executed out-of-order in the pipeline is architecture dependent (outside of the monitor block).
Now comes the "it depends" complications. The CPU guarantees little (see race condition). You might also look at NUMA (ccNUMA) - it is a method to scale CPU & Memory access by grouping CPUs (Nodes) with local RAM and a group owner - plus a special bus between Nodes.
The monitor doesn't prevent the other thread from running. It only prevents it from entering the code between the monitors. Therefore when the Writer exits the monitor-section it is free to execute the next statement - regardless of the other thread being inside the monitor. Monitors are gates that block access. Also - the quantum could interrupt the second thread after the A== statement - allowing Another to change value. Again - the quantum won't interrupt mid-instruction. Always think of threads executing in perfect parallel.
How do you apply this? I'm a bit out of date (sorry, C#/Java these days) with current Intel processors - and how their Pipelines work (hyperthreading etc). Years ago I worked with a processor called MIPS - and it had (through compiler instruction ordering) the ability to execute instructions that occurred serially AFTER a Branch instruction (Delay Slot). On this CPU/Compiler combination - YES - what you describe could happen. If Intel offers the same - then yes - it could happen. Esp with the NUMA (both Intel & AMD have this, I'm most familiar with AMD implementation).
My point - if threads were running across NUMA nodes - and access was to the common memory location then it could occur. Of course the OS tries hard to schedule operations within the same node.
You might be able to simulate this. I know C++ on MS allows access to NUMA technology (I've played with it). See if you can allocate memory across two nodes (placing A on one, and Another on the other). Schedule the threads to run on specific Nodes.
What happens in this model is that there are two pathways to RAM. I suppose this isn't what you had in mind - probably only a single path/Node model. In which case I go back to the MIPS model I described above.
I assumed a processor that interrupts - there are others that have a Yield model.

What was the `FUTEX_REQUEUE` bug?

I assign the Linux FUTEX(2) man page as required reading in operating systems classes, as a warning to students not to get complacent when designing synchronization primitives.
The futex() system call is the API that Linux provides to allow user-level thread synchronization primitives to sleep and wake up when necessary. The man page describes the 5 different operations that can be invoked using the futex() system call. The two fundamental operations are FUTEX_WAIT (which a thread uses to put itself to sleep when it tries to acquire a synchronization object and someone is already holding it), and FUTEX_WAKE (which a thread uses to wake up any waiting threads when it releases a synchronization object.)
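For readers who have not used the raw system call, here is a rough sketch of FUTEX_WAIT and FUTEX_WAKE building a one-shot event. The wrapper and type names are made up; glibc does not export a futex() function, so the call goes through syscall(2). It also assumes std::atomic<uint32_t> has the same layout as a plain 32-bit word, which is true on mainstream Linux toolchains but is not promised by the C++ standard.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>

// Hypothetical thin wrappers around the raw system call.
static long futex_wait(std::atomic<uint32_t>* word, uint32_t expected) {
    return syscall(SYS_futex, reinterpret_cast<uint32_t*>(word),
                   FUTEX_WAIT, expected, nullptr, nullptr, 0);
}
static long futex_wake(std::atomic<uint32_t>* word, int count) {
    return syscall(SYS_futex, reinterpret_cast<uint32_t*>(word),
                   FUTEX_WAKE, count, nullptr, nullptr, 0);
}

// One-shot event: state 0 = not signaled, 1 = signaled.
struct Event {
    std::atomic<uint32_t> state{0};

    void wait() {
        // Re-check in a loop: FUTEX_WAIT can return spuriously or because
        // the value no longer matches.
        while (state.load(std::memory_order_acquire) == 0) {
            futex_wait(&state, 0);  // sleeps only if state is still 0
        }
    }
    void signal() {
        state.store(1, std::memory_order_release);
        futex_wake(&state, 1);  // wake at most one sleeper
    }
};

// The waiter sleeps until the main thread publishes data and signals.
int run_event_demo() {
    Event ev;
    int data = 0;
    int seen = -1;
    std::thread waiter([&] { ev.wait(); seen = data; });
    data = 7;
    ev.signal();
    waiter.join();
    assert(seen == 7);
    return seen;
}
```

The fast path (state already 1) never enters the kernel, which is the whole point of futexes.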
The next three operations are where the fun starts. The man page description goes like this:
FUTEX_FD (present up to and including Linux 2.6.25)
[...]
Because it was inherently racy, FUTEX_FD has been removed
from Linux 2.6.26 onward.
The paper "Futexes are Tricky" by Ulrich Drepper (2004) describes that race condition (it's a potential missed wakeup). But there's more:
FUTEX_REQUEUE (since Linux 2.5.70)
This operation was introduced in order to avoid a
"thundering herd" effect when FUTEX_WAKE is used and all
processes woken up need to acquire another futex. [...]
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
There was a race in the intended use of FUTEX_REQUEUE, so
FUTEX_CMP_REQUEUE was introduced. [...]
What was the race in FUTEX_REQUEUE? Ulrich's paper doesn't even mention it (the paper describes a function futex_requeue() that is implemented using FUTEX_CMP_REQUEUE, but not the FUTEX_REQUEUE operation).

It looks like the race condition is due to the implementation of mutexes in glibc and their disparity with futexes. FUTEX_CMP_REQUEUE seems to be needed to support the more complicated glibc mutexes:
They are much more complex because they support many more features, such as testing for deadlock, and recursive locking. Due to this, they have an internal lock protecting the extra state. This extra lock means that they cannot use the FUTEX_REQUEUE multiplex function due to a possible race.
Source: http://locklessinc.com/articles/futex_cheat_sheet/

The old requeue operation takes two addresses, addr1 and addr2; it first unparks waiters on addr1, then parks them back on addr2.
The new requeue operation does all that after it verifies *addr1 == user_provided_val.
To find out the possible race condition, consider the following two threads:
wait(cv, mutex);
    lock(&cv.lock);
    cv.mutex_ref = &mutex;
    unlock(&mutex);
    let futexval = ++cv.futex;
    unlock(&cv.lock);
    FUTEX_WAIT(&cv.futex, futexval); // --- (1)
    lock(&mutex);
broadcast(cv);
    lock(&cv.lock);
    let futexval = cv.futex;
    unlock(&cv.lock);
    FUTEX_CMP_REQUEUE(&cv.futex, // --- (2)
                      1 /*wake*/,
                      ALL /*queue*/,
                      &cv.mutex_ref.lock,
                      futexval);
Both syscalls (1) and (2) are executed without holding the lock, but it is required that they are in the same total order as the mutex lock, so that a signal doesn't appear to go missing to the user.
Therefore, in order to detect a wait operation reordering after the actual wake, the futexval acquired in lock is passed to kernel at (2).
Similarly, we pass futexval to the FUTEX_WAIT call at (1). This design is explicitly stated in futex man page:
When executing a futex operation that requests to block a thread,
the kernel will block only if the futex word has the value that
the calling thread supplied (as one of the arguments of the
futex() call) as the expected value of the futex word. The
loading of the futex word's value, the comparison of that value
with the expected value, and the actual blocking will happen
atomically and will be totally ordered with respect to concurrent
operations performed by other threads on the same futex word.
Thus, the futex word is used to connect the synchronization in
user space with the implementation of blocking by the kernel.
Analogously to an atomic compare-and-exchange operation that
potentially changes shared memory, blocking via a futex is an
atomic compare-and-block operation.
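The compare-and-block behaviour can be checked with a small Linux-only sketch. wait_while_equal is a hypothetical helper name, and the timeout is an assumption added so that the example cannot hang when nobody issues a wake.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <linux/futex.h>   // FUTEX_WAIT, FUTEX_PRIVATE_FLAG
#include <sys/syscall.h>   // SYS_futex
#include <time.h>          // timespec
#include <unistd.h>        // syscall()

// Hypothetical helper. Blocks only if *word == expected at the time of the
// call, for at most timeout_ms milliseconds (relative timeout).
// Returns 0 if woken by a FUTEX_WAKE, otherwise the errno value:
// EAGAIN if the value did not match, ETIMEDOUT if the timeout expired
// (EINTR is also possible if a signal arrives).
int wait_while_equal(std::uint32_t* word, std::uint32_t expected, long timeout_ms) {
    assert(timeout_ms >= 0);
    timespec ts{};
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
    long r = syscall(SYS_futex, word, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                     static_cast<long>(expected), &ts, nullptr, 0L);
    return r == 0 ? 0 : errno;
}
```

If the word holds 1, waiting for the value 0 should return EAGAIN at once without sleeping, which is how a waiter that arrives after the futex word has changed avoids blocking on a wake that has already happened.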
IMHO, the reason for calling (2) outside of the lock is mainly performance. Calling wake while holding the lock would lead to a "hurry up and wait" situation, where the waiter wakes up only to find that it cannot acquire the lock.
It's also worth mentioning that the above answer is based on a historical version of the pthread implementation. The latest version of pthread_cond no longer uses REQUEUE (check this patch for details).

Atomic Instructions and Variable Update visibility

On most common platforms (the most important being x86; I understand that some platforms have extremely difficult memory models that provide almost no guarantees useful for multithreading, but I don't care about rare counter-examples), is the following code safe?
Thread 1:
someVariable = doStuff();
atomicSet(stuffDoneFlag, 1);
Thread 2:
while(!atomicRead(stuffDoneFlag)) {} // Wait for stuffDoneFlag to be set.
doMoreStuff(someVariable);
Assuming standard, reasonable implementations of atomic ops:
1) Is Thread 1's assignment to someVariable guaranteed to complete before atomicSet() is called?
2) Is Thread 2 guaranteed to see the assignment to someVariable before calling doMoreStuff() provided it reads stuffDoneFlag atomically?
Edits:
The implementation of atomic ops I'm using contains the x86 LOCK instruction in each operation, if that helps.
Assume stuffDoneFlag is properly cleared somehow. How isn't important.
This is a very simplified example. I created it this way so that you wouldn't have to understand the whole context of the problem to answer it. I know it's not efficient.

If your actual x86 code has the store to someVariable before the store in atomicSet in Thread 1 and load of someVariable after the load in atomicRead in Thread 2, then you should be fine. Intel's Software Developer's Manual Volume 3A specifies the memory model for x86 in Section 8.2, and the intra-thread store-store and load-load constraints should be enough here.
However, there may not be anything preventing your compiler from reordering the instructions generated from whatever higher-level language you are using across the atomic operations.
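One way to constrain both the compiler and the hardware is to express the flag with C++11 std::atomic and release/acquire ordering. This is a minimal sketch, assuming a C++11 compiler rather than the asker's own atomicSet/atomicRead; run_handoff and the value 42 are illustrative stand-ins.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-thread handoff once and returns what the consumer observed.
int run_handoff() {
    int some_variable = 0;                 // plain, non-atomic payload
    std::atomic<bool> stuff_done{false};   // the flag
    int seen = -1;

    std::thread producer([&] {
        some_variable = 42;  // stands in for doStuff()
        // Release store: the write above may not be moved after it.
        stuff_done.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Acquire load: the read below may not be moved before it.
        while (!stuff_done.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = some_variable;  // stands in for doMoreStuff(someVariable)
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer ran to completion
    return seen;
}
```

On x86 a release store and an acquire load typically compile to ordinary mov instructions, so the cost is mostly in what the compiler is no longer allowed to reorder.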

1) Yes
2) Yes
Both work.

This code looks thread safe, but I question the efficiency of your spinlock (the while loop) unless you are only spinning for a very short amount of time. There is no guarantee on any given system that Thread 2 won't completely hog all processing time.
I would recommend using some actual synchronization primitives (looks like boost::condition_variable is what you want here) instead of relying on the spin lock.
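A blocking version might look like the sketch below. It uses std::condition_variable, the standard-library counterpart of the Boost class named above, which is a substitution on my part; run_with_condvar and the value 7 are illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// The consumer sleeps until the producer signals, instead of spinning.
int run_with_condvar() {
    std::mutex m;
    std::condition_variable cv;
    bool stuff_done = false;   // protected by m
    int some_variable = 0;     // protected by m
    int seen = -1;

    std::thread consumer([&] {
        std::unique_lock<std::mutex> lock(m);
        // The predicate form re-checks the flag, so spurious wake-ups and a
        // notify that happens before the wait are both handled.
        cv.wait(lock, [&] { return stuff_done; });
        seen = some_variable;
    });
    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            some_variable = 7;  // stands in for doStuff()
            stuff_done = true;
        }
        cv.notify_one();        // notify after releasing the mutex
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer ran to completion
    return seen;
}
```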

The atomic instructions ensure that thread 2 waits for thread 1 to finish setting the variable before thread 2 proceeds. There are, however, two key issues:
1) someVariable must be declared 'volatile' to ensure that the compiler does not optimise its allocation, e.g. by storing it in a register or deferring a write.
2) the second thread is busy-waiting for the signal (termed spinlocking). Your platform probably provides much better locking and signalling primitives and mechanisms, but a relatively straightforward improvement would be to simply sleep() in the body of thread 2's while() loop.

dsimcha wrote: "Assume stuffDoneFlag is properly cleared somehow. How isn't important."
This is not true!
Consider this scenario:
Thread2 checks stuffDoneFlag and, if it is 1, starts reading someVariable.
Before Thread2 finishes reading, the task scheduler interrupts it and suspends it for some time.
Thread1 accesses someVariable again and changes the memory content.
The task scheduler resumes Thread2, which continues its job, but the memory content of someVariable has changed!
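One way to rule out this scenario is to let the reader clear the flag itself, and to have the writer wait for that before touching the variable again. This is a minimal sketch under that assumption; transfer_sum, slot and full are illustrative names, not part of the original code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Passes the values 1..count through a single shared slot and returns
// the sum the consumer saw.
long transfer_sum(int count) {
    assert(count >= 0);
    int slot = 0;                   // plays the role of someVariable
    std::atomic<bool> full{false};  // plays the role of stuffDoneFlag
    long sum = 0;

    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) {
            // Do not overwrite the slot until the consumer has released it.
            while (full.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            slot = i;
            full.store(true, std::memory_order_release);
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            while (!full.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            sum += slot;  // finish reading before clearing the flag
            full.store(false, std::memory_order_release);
        }
    });

    producer.join();
    consumer.join();
    return sum;
}
```

Because the flag is only cleared after the read has finished, a suspended consumer simply delays the producer rather than letting it change the data mid-read.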
