When can locks be avoided by using memory barriers? For example, in Consumer-Producer-Wiki the last implementation uses memory barriers, while all the others use locks or semaphores.
Is it better to use memory barriers than locks in such cases? If yes, why?
What are the most common situations where locks can be avoided by using memory barriers?
A mutex needs to perform a memory barrier by definition. If a mutex didn't include a memory barrier, operations on the data to which it was providing exclusive access could be reordered outside of the critical region.
Separately, mutex implementations generally need to perform a memory barrier anyway, otherwise the operations that implement the mutex itself could be reordered.
In other words, a mutex implementation includes a memory barrier in order to perform the atomic locking necessary to enforce mutual exclusion; but even if it did not need one for that, it would still include a memory barrier, or it wouldn't be very useful.
Conclusion: if you don't want your code to break, use locks everywhere a memory barrier would be required; the lock supplies the barrier.
// A (slow) memory barrier built from nothing but a lock: locking and then
// unlocking a throwaway mutex orders the surrounding memory operations
// exactly as an explicit barrier would.
inline void mb_()
{
    mutex_t mtx;
    mutex_init(&mtx);
    mutex_lock(&mtx);
    mutex_unlock(&mtx);
}
If you are aiming for performance, you might require advanced techniques.
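For the producer-consumer case in the question, the essence of those advanced techniques is to publish the data with a release store and read it with an acquire load. Below is a minimal sketch (not the Wikipedia code, and only valid for a single producer and a single consumer) using C++11 std::atomic; the names are illustrative:

#include <atomic>

int payload = 0;                    // data handed from producer to consumer
std::atomic<bool> ready{false};     // publication flag

void producer()
{
    payload = 42;                                   // write the data first
    ready.store(true, std::memory_order_release);   // then publish: the payload write
                                                    // cannot be reordered after this store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))  // wait until published
        ;
    int value = payload;                            // guaranteed to observe 42
    (void)value;
}

No lock is needed here because the release/acquire pair already provides the required ordering.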
Say, for example, I have an exclusive atomic-ops-based spinlock implementation as below:
bool TryLock(volatile TInt32 * pFlag)
{
    return !(AtomicOps::Exchange32(pFlag, 1) == 1);
}

void Lock (volatile TInt32 * pFlag)
{
    while (AtomicOps::Exchange32(pFlag, 1) == 1) {
        AtomicOps::ThreadYield();
    }
}

void Unlock (volatile TInt32 * pFlag)
{
    *pFlag = 0; // is this OK, or is atomicity needed here as well for the load and store?
}
Where AtomicOps::Exchange32 is implemented on Windows using InterlockedExchange and on Linux using __atomic_exchange_n.
In most cases, for releasing the resource, just resetting the lock to zero (as you do) is almost OK (e.g. on an Intel Core processor), but you also need to make sure that the compiler will not reorder instructions (see below, and also g-v's post). If you want to be rigorous (and portable), there are two things that need to be considered:
What the compiler does: it may reorder instructions to optimize the code, and thus introduce subtle bugs if it is not "aware" of the multithreaded nature of the code. To avoid that, it is possible to insert a compiler barrier.
What the processor does: some processors (like the Intel Itanium, used in professional servers, or the ARM processors used in smartphones) have a so-called "relaxed memory model". In practice, this means that the processor may decide to change the order of operations. Again, this can be avoided by using special instructions (load barriers and store barriers). For instance, on an ARM processor, the DMB instruction ensures that all store operations are completed before the next instruction (and it needs to be inserted in the function that releases a lock).
Conclusion: it is very tricky to make this kind of code correct; if you have compiler/OS support for these functionalities (e.g., stdatomic.h, or std::atomic in C++11), it is much better to rely on them than to write your own (but sometimes you have no choice). In the specific case of a standard Intel Core processor, I think that what you do is correct, provided you insert a compiler barrier in the release operation (see g-v's post).
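As a sketch of that release operation on x86 (assuming GCC or Clang syntax; your exact Unlock() may differ), a compiler barrier before the plain store is enough, since the hardware is strongly ordered:

void Unlock (volatile TInt32 * pFlag)
{
    __asm__ __volatile__("" ::: "memory"); // compiler barrier: no reordering across this point
    *pFlag = 0;                            // a plain store releases the lock on x86
}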
On compile-time versus run-time memory ordering, see: https://en.wikipedia.org/wiki/Memory_ordering
My code for some atomic / spinlocks implemented on different architectures:
http://alice.loria.fr/software/geogram/doc/html/atomics_8h.html
(but I'm unsure it is 100% correct)
You need two memory barriers in a spinlock implementation:
an "acquire barrier" or "import barrier" in TryLock() and Lock(): it forces operations issued after the spinlock is acquired to become visible only after the pFlag value is updated.
a "release barrier" or "export barrier" in Unlock(): it forces operations issued before the spinlock is released to become visible before the pFlag value is updated.
You also need two compiler barriers for the same reasons.
See this article for details.
This approach is for the generic case. On x86/64:
there are no separate acquire/release barriers, only a single full barrier (memory fence);
there is no need for memory barriers here at all, since this architecture is strongly ordered;
you still need compiler barriers.
More details are provided here.
Below is an example implementation using GCC atomic builtins. It will work for all architectures supported by GCC:
it will insert acquire/release memory barriers on architectures where they are required (or full barrier if acquire/release barriers are not supported but architecture is weakly ordered);
it will insert compiler barriers on all architectures.
Code:
bool TryLock(volatile bool* pFlag)
{
    // acquire memory barrier and compiler barrier
    return !__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE);
}

void Lock(volatile bool* pFlag)
{
    for (;;) {
        // acquire memory barrier and compiler barrier
        if (!__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE)) {
            return;
        }

        // relaxed waiting, usually no memory barriers (optional)
        while (__atomic_load_n(pFlag, __ATOMIC_RELAXED)) {
            CPU_RELAX(); // e.g. an x86 "pause" hint; define per architecture
        }
    }
}

void Unlock(volatile bool* pFlag)
{
    // release memory barrier and compiler barrier
    __atomic_clear(pFlag, __ATOMIC_RELEASE);
}
For the "relaxed waiting" loop, see these two questions.
See also Linux kernel memory barriers as a good reference.
In your implementation:
Lock() calls AtomicOps::Exchange32(), which already includes a compiler barrier and probably an acquire or full memory barrier (we don't know, because you didn't show the actual arguments passed to __atomic_exchange_n()).
Unlock() is missing both the memory barrier and the compiler barrier, so it is broken (a possible fix is sketched below).
Also consider using pthread_spin_lock() if it is an option.
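A possible fix for your Unlock(), assuming the same GCC builtins family as your __atomic_exchange_n(), is a release store (a sketch, not a drop-in for other compilers):

void Unlock (volatile TInt32 * pFlag)
{
    // release memory barrier and compiler barrier, then the store to pFlag
    __atomic_store_n(pFlag, 0, __ATOMIC_RELEASE);
}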
For something simple like a counter that multiple threads will be incrementing: I read that mutex locks can decrease efficiency, since the threads have to wait. So, to me, an atomic counter would be the most efficient, but then I read that internally it is basically a lock? So I'm confused how either could be more efficient than the other.
Atomic operations leverage processor support (compare-and-swap instructions) and don't use locks at all, whereas locks are more OS-dependent and perform differently on, for example, Windows and Linux.
Locks actually suspend thread execution, freeing up CPU resources for other tasks, but incurring the obvious context-switching overhead when stopping/restarting the thread.
On the contrary, threads attempting atomic operations don't wait; they keep trying until they succeed (so-called busy-waiting), so they don't incur context-switching overhead, but neither do they free up CPU resources.
Summing up, atomic operations are generally faster if contention between threads is sufficiently low. You should definitely do benchmarking, as there is no other reliable method of knowing which overhead is lower: that of context switching or that of busy-waiting.
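A rough sketch of such a benchmark in C++ (the thread and iteration counts are arbitrary assumptions; measure on your own workload):

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Runs 'threads' threads, each calling inc() 'iters' times, and returns the elapsed milliseconds.
template <typename IncrementFn>
long long run(int threads, int iters, IncrementFn inc)
{
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] { for (int i = 0; i < iters; ++i) inc(); });
    for (auto& th : pool)
        th.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

int main()
{
    std::atomic<long> atomicCounter{0};
    long mutexCounter = 0;
    std::mutex m;

    std::cout << "atomic: "
              << run(4, 1000000, [&] { atomicCounter.fetch_add(1, std::memory_order_relaxed); })
              << " ms\n";
    std::cout << "mutex:  "
              << run(4, 1000000, [&] { std::lock_guard<std::mutex> g(m); ++mutexCounter; })
              << " ms\n";
}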
If you have a counter for which atomic operations are supported, it will be more efficient than a mutex.
Technically, the atomic operation will lock the memory bus on most platforms. However, there are two ameliorating details:
It is impossible to suspend a thread during the memory-bus lock, but it is possible to suspend a thread while it holds a mutex lock. This is what gives you the lock-free guarantee (which doesn't say anything about not locking - it just guarantees that at least one thread makes progress).
Mutexes eventually end up being implemented with atomics. Since you need at least one atomic operation to lock a mutex and one atomic operation to unlock it, a mutex lock takes at least twice as long as a single atomic operation, even in the best of cases.
A minimal (standards-compliant) mutex implementation requires two basic ingredients:
a way to atomically convey a state change between threads (the 'locked' state)
memory barriers to enforce that memory operations protected by the mutex stay inside the protected area.
There is no way you can make it any simpler than this because of the 'synchronizes-with' relationship the C++ standard requires.
A minimal (correct) implementation might look like this:
#include <atomic>

class mutex {
    std::atomic<bool> flag{false};
public:
    void lock()
    {
        // spin on an atomic exchange until we observe that the flag was previously false
        while (flag.exchange(true, std::memory_order_relaxed));
        // acquire fence: reads/writes in the critical section cannot move above this point
        std::atomic_thread_fence(std::memory_order_acquire);
    }
    void unlock()
    {
        // release fence: reads/writes in the critical section cannot move below this point
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(false, std::memory_order_relaxed);
    }
};
Due to its simplicity (it cannot suspend the thread of execution), it is likely that, under low contention, this implementation outperforms a std::mutex.
But even then, it is easy to see that each integer increment protected by this mutex requires the following operations:
an atomic exchange (a read-modify-write) to acquire the mutex (possibly repeated while spinning)
the integer increment itself
an atomic store to release the mutex
If you compare that with a standalone std::atomic<int> that is incremented with a single (unconditional) read-modify-write (e.g. fetch_add),
it is reasonable to expect that the atomic operation (using the same ordering model) will outperform the case where a mutex is used.
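To make the contrast concrete, here is a sketch (the function names are illustrative) of the two ways of incrementing, using the minimal mutex class above versus a standalone std::atomic<int>:

mutex m;                           // the minimal mutex defined above
int counter = 0;                   // protected by m
std::atomic<int> atomicCounter{0}; // standalone atomic counter

void increment_with_mutex()
{
    m.lock();      // atomic exchange (+ acquire fence), possibly spinning
    ++counter;     // the actual work
    m.unlock();    // release fence + atomic store
}

void increment_with_atomic()
{
    // one unconditional read-modify-write with acquire/release ordering
    atomicCounter.fetch_add(1, std::memory_order_acq_rel);
}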
An atomic integer is a user-mode object and is therefore much more efficient than a mutex, which runs in kernel mode. The scope of an atomic integer is a single application, while the scope of a mutex is all running software on the machine.
The atomic variable classes in Java are able to take advantage of the compare-and-swap instructions provided by the processor.
Here's a detailed description of the differences: http://www.ibm.com/developerworks/library/j-jtp11234/
A mutex is a kernel-level primitive that provides mutual exclusion even at the process level. Note that it can be helpful for extending mutual exclusion across process boundaries, not just within a process (between threads). It is costlier.
An atomic counter, AtomicInteger for example, is based on CAS and usually keeps attempting the operation until it succeeds. Basically, in this case, threads race or compete to increment/decrement the value atomically. Here, you may see a thread burning good CPU cycles trying to operate on a current value that keeps changing.
Since you want to maintain the counter, AtomicInteger/AtomicLong will be the best for your use case.
Most processors support an atomic read or write, and often an atomic compare-and-swap. This means that the processor itself reads or writes the latest value in a single operation, although a few cycles may be lost compared with a normal integer access, especially as the compiler can't optimise around atomic operations nearly as well as around normal ones.
On the other hand, a mutex is a number of lines of code to enter and leave, and during that execution other processors that access the same location are completely stalled, so clearly a big overhead for them. In unoptimised high-level code, the mutex enter/exit and the atomic will be function calls, but for the mutex, any competing processor will be locked out from the moment your mutex-enter function returns until your exit function is started. For the atomic, it is only the duration of the actual operation that locks others out. Optimisation should reduce that cost, but not all of it.
If you are trying to increment, then your modern processor probably supports atomic increment/decrement, which will be great.
If it does not, then it is either implemented using the processor's atomic compare-and-swap, or using a mutex:
Mutex:
  get the lock
  read
  increment
  write
  release the lock

Atomic cmp&swap:
  atomically read the value
  calculate the incremented value
  do {
    atomic cmp&swap of the new value against the value we read
    recalculate the new value
  } while the cmp&swap did not see the expected value
So this second version has a loop (in case another processor increments the value between our atomic operations, so the value no longer matches and our increment would be wrong) that can get long if there are many competitors, but it should generally still be quicker than the mutex version; the mutex version, however, may allow that processor to task-switch.
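A sketch of that cmp&swap loop in C++ (assuming std::atomic; the names are illustrative):

#include <atomic>

std::atomic<int> value{0};

void increment()
{
    int expected = value.load();                          // atomically read the value
    // Retry until no other thread changed 'value' between our read and our write.
    while (!value.compare_exchange_weak(expected, expected + 1)) {
        // 'expected' has been refreshed with the current value; recompute and retry.
    }
}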
I'm working with a dual Cortex-A9 system and I've been trying to understand exactly why spinlock functions need to use DMB. It seems that as long as the merging store buffer is flushed, the lock value should end up in the L1 on the unlocking core, and the SCU should either invalidate or update the value in the L1 of the other core. This is enough to maintain coherency and safe locking, right? And doesn't STREX skip the merging store buffer anyway, meaning we don't even need the flush?

DMB appears to be something of a blunt hammer, especially since it defaults to the system domain, which likely means a write all the way to main memory, which can be expensive.

Are the DMBs in the locks there as a workaround for drivers that don't use smp_mb properly?

I'm currently seeing, based on the performance counters, about 5% of my system cycles disappearing in stalls caused by DMB.
I found these articles that may answer your question:
Locks, SWPs and two Smoking Barriers
Locks, SWPs and two Smoking Barriers (Part 2)
In particular:
You will note the Data Memory Barrier (DMB) instruction that is issued once the lock has been acquired. The DMB guarantees that all memory accesses before the memory barrier will be observed by all of the other CPUs in the system before all memory accesses made after the memory barrier. This makes more sense if you consider that once a lock has been acquired, a program will then access the data structure(s) locked by the lock. The DMB in the lock function above ensures that accesses to the locked data structure are observed after accesses to the lock.
The DMB is needed in the SMP case because, without it, the other processor may observe the memory accesses happening in a different order, i.e. accesses from inside the critical section may appear to happen before the lock is taken, from the point of view of the second core.
So the second core could see itself holding the lock and yet also see updates from inside the critical section still running on the other core, breaking consistency.
I read that a mutex and a binary semaphore differ in only one respect: in the case of a mutex, the thread that locked it has to unlock it, but with a semaphore the locking and unlocking threads can be different. Is that right?
Which one is more efficient?
Assuming you know the basic differences between a semaphore and a mutex:
For fast, simple synchronization, use a critical section.
To synchronize threads across process boundaries, use mutexes.
To synchronize access to limited resources, use a semaphore.
Apart from the fact that mutexes have an owner, the two objects may be optimized for different usage. Mutexes are designed to be held only for a short time; violating this can cause poor performance and unfair scheduling. For example, a running thread may be permitted to acquire a mutex even though another thread is already blocked on it, which can starve the waiting thread. Semaphores may provide more fairness, or fairness can be forced using several condition variables.
I'm coming largely from a C++ background, but I think this question applies to threading in any language. Here's the scenario:
We have two threads (ThreadA and ThreadB), and a value x in shared memory
Assume that access to x is appropriately controlled by a mutex (or other suitable synchronization control)
If the threads happen to run on different processors, what happens if ThreadA performs a write operation, but its processor places the result in its L2 cache rather than the main memory? Then, if ThreadB tries to read the value, will it not just look in its own L1/L2 cache / main memory and then work with whatever old value was there?
If that's not the case, then how is this issue managed?
If that is the case, then what can be done about it?
Your example would work just fine.
Multiple processors use a coherency protocol such as MESI to ensure that data remains in sync between the caches. With MESI, each cache line is considered to be either modified, exclusively held, shared between CPUs, or invalid. Writing a cache line that is shared between processors forces it to become invalid in the other CPUs, keeping the caches in sync.
However, this is not quite enough. Different processors have different memory models, and most modern processors support some level of re-ordering memory accesses. In these cases, memory barriers are needed.
For instance if you have Thread A:
DoWork();
workDone = true;
And Thread B:
while (!workDone) {}
DoSomethingWithResults()
With both running on separate processors, there is no guarantee that the writes done within DoWork() will become visible to thread B before the write to workDone, so DoSomethingWithResults() could proceed with potentially inconsistent state. Memory barriers guarantee some ordering of the reads and writes - adding a memory barrier after DoWork() in Thread A would force all reads/writes done by DoWork() to complete before the write to workDone, so that Thread B would get a consistent view. Mutexes inherently provide a memory barrier, so that reads/writes cannot move past the calls to lock and unlock.
In your case, one processor would signal to the others that it dirtied a cache line and force the other processors to reload from memory. Acquiring the mutex to read and write the value guarantees that the change to memory is visible to the other processor in the order expected.
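As a sketch of how the workDone flag from the example above could be expressed with C++11 atomics, so that the release/acquire pair supplies exactly the barrier described (DoWork() and DoSomethingWithResults() stand in for your own code):

#include <atomic>

void DoWork();
void DoSomethingWithResults();

std::atomic<bool> workDone{false};

void threadA()
{
    DoWork();                                         // all writes made in DoWork()...
    workDone.store(true, std::memory_order_release);  // ...become visible before this store
}

void threadB()
{
    while (!workDone.load(std::memory_order_acquire)) {} // spin until the flag is set
    DoSomethingWithResults();                             // sees everything DoWork() wrote
}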
Most locking primitives like mutexes imply memory barriers. These force a cache flush and reload to occur.
For example,
ThreadA {
    x = 5; // probably writes to cache
    unlock mutex; // forcibly writes local CPU cache to global memory
}

ThreadB {
    lock mutex; // discards data in local cache
    y = x; // x must read from global memory
}
In general, the compiler understands shared memory and takes considerable effort to ensure that shared memory is placed in a sharable place. Modern compilers are very sophisticated in the way they order operations and memory accesses; they tend to understand the nature of threading and shared memory. That's not to say they're perfect, but in general, much of the concern is taken care of by the compiler.
C# has some built-in support for this kind of problem.
You can mark a variable with the volatile keyword, which forces it to be synchronized across all CPUs.
public static volatile int loggedUsers;
The other part is the lock statement, a syntactic wrapper around the .NET methods Threading.Monitor.Enter(x) and Threading.Monitor.Exit(x), where x is the variable to lock on. This causes other threads trying to lock x to wait until the locking thread calls Exit(x).
public List<object> users; // any shared object can be locked on

// In some function:
System.Threading.Monitor.Enter(users);
try {
    // do something with users
}
finally {
    System.Threading.Monitor.Exit(users);
}