On a uniprocessor, we disable interrupts before performing a lock operation (lock acquire, lock release) to prevent a context switch, then re-enable them after the operation.
But on a multiprocessor, just disabling interrupts is not sufficient to make the lock operations atomic.
I read from a source that,
"It happens as each processor has a cache, and they can write to the same memory even with the interrupts being disabled."
Q1. Why does this even matter for atomic lock operations?
Q2. What other issues arise when implementing lock operations in a multiprocessor environment with only interrupt disabling?
Disabling interrupts alone is insufficient because threads running on other processors can still execute the code and touch the data structures inside the synchronization object's functions at the same time; atomicity cannot be achieved just by disabling interrupts.
For example, let L be a LOCK object whose L.status is "FREE", and let X be a process with four threads T1, T2, T3, T4, each running on a separate processor P1, P2, P3, P4.
Let's assume the pseudocode for LOCK::acquire() is as follows:
LOCK::acquire(){
    if(status == BUSY){
        waitList.add(RunningThread);       // lock is taken: block the caller
        TCB t = readyList.remove();        // pick the next ready thread
        thread_switch(RunningThread, t);   // and switch to it
        t.state = RUNNING;
    }
    else{
        status = BUSY;                     // lock is free: take it
    }
}
If we disable only the interrupts, the code of T1, T2, T3, T4 can still run on the corresponding processors. Let's assume that the lock is free at one moment.
If all the threads try to acquire lock L at the same time, they might end up checking the status of the lock simultaneously; in that case each thread will find status == "FREE", and every thread will acquire the lock, which defeats the purpose of this lock implementation.
That is why atomic read-modify-write operations, such as test_and_set, are used when implementing lock objects for multiprocessors. These atomic operations allow only one thread, on one processor, to get through the lock's code at a time.
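As an illustration, here is a minimal sketch of such a test-and-set lock, written with C++ std::atomic_flag (the class name SpinLock and its method names are invented to mirror the pseudocode above; this is a sketch, not a production lock):

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void acquire() {
        // test_and_set atomically sets the flag and returns its old value,
        // so exactly one thread observes 'false' and wins the lock;
        // all others keep spinning.
        while (flag.test_and_set(std::memory_order_acquire))
            ; // spin
    }
    void release() {
        flag.clear(std::memory_order_release);
    }
};

Even if four threads on four processors call acquire() at the same instant, the hardware serializes the test_and_set operations, so only one of them sees the old "free" value.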
I am taking a course on concurrency. The text says that multi-threading allows high throughput because it takes advantage of the multiple cores of the CPU.
I have a question about locking in the context of multiple cores. If we have multiple threads running on different CPU cores, why can't two threads acquire the same lock? How does the OS protect against such scenarios?
Locking and locks are for synchronization to prevent data corruption when multiple threads want to write to the same memory.
Generally you run multiple threads and use locking only around the critical sections.
If two or more threads want to write to the same place at the same time, the benefit of multi-core computation is limited. You can of course skip locking in this situation, but then the results are unpredictable.
For example, in a multi-threaded matrix multiplication you can create one thread per row of the result matrix. No locking is needed because every thread writes to a different place, so this scenario fully benefits from multiple processors.
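A sketch of that per-row scheme in C++ (names are illustrative; it assumes the matrices have matching dimensions and c is pre-sized to the result shape):

#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One thread per row of the result: each thread writes only to c[i],
// so no locking is needed.
void multiply(const Matrix& a, const Matrix& b, Matrix& c) {
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < a.size(); ++i) {
        workers.emplace_back([&, i] {
            for (std::size_t j = 0; j < b[0].size(); ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < b.size(); ++k)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;  // unique (i, j) per thread: no data race
            }
        });
    }
    for (auto& t : workers) t.join();
}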
If you want to permit more than one concurrent access to a resource, you can use a Semaphore (in Java).
If we have multiple threads and they are running in different cpu cores, why can't two threads acquire the same lock?
The purpose of a mutex/lock is to implement mutual exclusion: only one thread can lock a mutex at a time. Or, in other words, many threads cannot lock the same mutex at the same time, by definition. This mechanism is needed to allow multiple threads to store into or read from a shared non-atomic resource without data races.
How does os protect against such scenarios?
OS support is needed to prevent the threads from busy-waiting when locking a mutex that is already locked by another thread. Linux implementations of mutexes (and semaphores) use futex to put the waiting threads to sleep and wake them up when the mutex is released.
Here is a longer explanation from Linus Torvalds of how a mutex is implemented.
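To make the futex idea concrete, here is a heavily simplified, Linux-only sketch (a real implementation such as glibc's tracks waiter counts and handles more states; state, lock, and unlock are made-up names):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> state{0};  // 0 = unlocked, 1 = locked

void lock() {
    int expected = 0;
    while (!state.compare_exchange_strong(expected, 1)) {
        // Sleep in the kernel, but only while the word still reads 1
        // (locked); otherwise FUTEX_WAIT returns at once and we retry.
        syscall(SYS_futex, &state, FUTEX_WAIT, 1, nullptr, nullptr, 0);
        expected = 0;
    }
}

void unlock() {
    state.store(0);
    // Wake one thread sleeping on this futex word.
    syscall(SYS_futex, &state, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}

The point is that the uncontended path is a single atomic instruction; the waiting is delegated to the kernel.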
I get the idea that if locking and unlocking a mutex is an atomic operation, it can protect the critical section of code on a single-processor architecture.
Whichever thread is scheduled first would be able to "lock" the mutex in a single machine-code operation.
But how are mutexes any good when the threads are running on multiple cores, where different threads really can execute at the same time on different cores?
I can't seem to grasp how a multithreaded program can work without any deadlock or race condition on multiple cores.
The general answer:
Mutexes are an operating-system concept. An operating system offering mutexes has to ensure that these mutexes work correctly on all hardware that this operating system wants to support. If implementing a mutex is not possible for specific hardware, the operating system cannot offer mutexes on that hardware. If the operating system requires the existence of mutexes to work correctly, it cannot support that hardware at all. How the operating system implements mutexes for specific hardware is, unsurprisingly, very hardware-dependent and varies a lot between operating systems and their supported hardware.
The detailed answer:
Most general purpose CPUs offer atomic operations. These operations are designed to be atomic across all CPU cores within a system, whether these cores are part of a single or multiple individual CPUs.
With as little as two atomic operations, atomic_or and atomic_and, it is possible to implement a lock. E.g. think of
int atomic_or ( int * addr, int val )
It atomically calculates *addr = *addr | val and returns the old value of *addr prior to performing the calculation. If *lock == 0 and multiple threads call atomic_or(lock, 1), then only one of them gets 0 as the result: the first thread to perform that operation. All other threads get 1 as the result. The one thread that got 0 is the winner; it has the lock, and all other threads register for an event and go to sleep.
The winner thread now has exclusive access to the section following the atomic_or. It performs the desired work and, once done, simply clears the lock again (atomic_and(lock, 0)) and generates a system event announcing that the lock is available again.
The system then wakes up one, some, or all of the threads that registered for this event before going to sleep, and the race for the lock starts all over. Either one of the woken-up threads wins the race, or possibly none of them does, because yet another thread was even faster and grabbed the lock between the atomic_and and the wake-up; that is okay and still correct, as it is still only one thread that has access. All threads that failed to obtain the lock go back to sleep.
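Translated into C++ atomics, the scheme might look like this sketch (fetch_or/fetch_and play the roles of atomic_or/atomic_and; the sleep/wake-up on a system event is replaced by a bare spin, since that part is OS-specific):

#include <atomic>

class OrAndLock {
    std::atomic<int> lock_{0};
public:
    void acquire() {
        // fetch_or returns the previous value: exactly one caller sees 0.
        while (lock_.fetch_or(1) != 0)
            ; // lost the race; a real system would sleep on an event here
    }
    void release() {
        lock_.fetch_and(0);  // clear the lock; a real system would now
                             // signal the event to wake the waiters
    }
};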
Of course, the actual implementations of modern systems are often much more complicated than that. They may take things like thread priorities into account (high-priority threads may be preferred in the lock race), or ensure that every thread waiting for a mutex will eventually also get it (precautions exist that prevent a thread from always losing the lock race). Also, mutexes can be recursive, in which case the system ensures that the same thread can obtain the same mutex multiple times without deadlocking, and this requires some additional bookkeeping.
Probably needless to say, atomic operations are more expensive than ordinary operations, as they require the cores within a system to synchronize their work, and this slows their processing throughput. They may be somewhat expensive if all cores run on a single CPU, but they can be very expensive if there are multiple CPUs, as the synchronization must take place over the bus system that connects the CPUs with each other, and this bus system usually does not operate at CPU speed.
On the other hand, using mutexes always slows down processing to begin with, since providing exclusive access to resources has to stall threads whenever more than one of them requires access at the same time to continue its work; so for implementing mutexes, the extra cost of atomics is irrelevant. In fact, if you can implement a function in a thread-safe way using just atomic operations instead of full-featured mutexes, you will quite often see a noticeable speed benefit, despite these operations being more expensive than normal operations.
Threads are managed by the operating system, which, among other things, is responsible for scheduling threads onto cores, so it can also avoid scheduling a specific thread onto a given core.
A mutex is an operating-system concept. You're basically asking the OS to block a thread until some other thread tells the OS it's okay.
On modern operating systems, threads are an abstraction over the physical hardware. A programmer targets the thread as an abstraction for code execution. There is no separate abstraction for working on a hardware core available. The operating system is responsible for mapping threads to physical cores.
A mutex is a data structure that lives in system memory. Any thread that has access can read that memory position, regardless of which thread or core it is running on. It doesn't matter whether your code is executing on core 1 or core 20; it still has the ability to read the current state of the lock.
In other words, regardless of the number of threads or cores, there is only one shared system memory for them to act on.
I know a spin lock only makes sense on a multiprocessor. But if two threads try to acquire the same resource and one spins on the lock, what prevents the other one from running on the same processor? If that happens, the spinning thread would prevent the thread holding the resource from proceeding. In that case it becomes a deadlock. How does the OS prevent this from happening?
Some background facts first:
Spin-locks (and locks generally) are not limited to multiprocessor systems. They work fine on a single processor, and even a single-threaded application can use them without any harm.
Spin-locks are not only provided by the OS; they have pure user-space implementations as well. For example, TBB provides tbb::spin_mutex.
By default, nothing prevents a thread from running on any available CPU (regardless of the locks it uses).
There are reentrant/recursive types of locks. This means that if a thread has acquired one and tries to acquire it again without releasing it, the attempt will succeed rather than deadlock as with usual locks. But that does not mean the same applies to different threads just because they are scheduled onto the same CPU. With any type of lock, if one software thread has locked a mutex, other threads have to wait.
It is possible for one thread to acquire the lock and be preempted (i.e. interrupted by the OS timer) before it releases the lock. Another thread can then be scheduled onto the same CPU and want to acquire the same lock. With a pure spin-lock, this thread will uselessly spin until it exhausts the time-slice allowed by the OS and is preempted. Eventually, the first thread gets a chance to run and release its lock, so the other thread can acquire it.
As you can see, it is not very efficient to spend time on this hopeless waiting. Thus, more sophisticated implementations, after a number of attempts to acquire the spinlock, ask the OS for help, voluntarily giving away their time-slice to other threads which can possibly unlock the current one.
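A sketch of such a "spin, then ask the OS for help" lock in C++ (the spin budget of 100 is an arbitrary illustrative value):

#include <atomic>
#include <thread>

class SpinYieldLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag.test_and_set(std::memory_order_acquire)) {
            if (++spins == 100) {           // bounded spinning only
                std::this_thread::yield();  // give the rest of our
                spins = 0;                  // time-slice to other threads
            }
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};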
I have an interesting problem related to Java thread livelock. Here it goes.
There are four global locks - L1,L2,L3,L4
There are four threads - T1, T2, T3, T4
T1 requires locks L1,L2,L3
T2 requires locks L2
T3 requires locks L3,L4
T4 requires locks L1,L2
So, the pattern of the problem is: any of the threads can run and acquire its locks in any order. If a thread detects that a lock it needs is not available, it releases all the locks it had previously acquired and waits for a fixed time before retrying. The cycle repeats, giving rise to a livelock condition.
So, to solve this problem, I have two solutions in mind
1) Let each thread wait for a random period of time before retrying.
OR,
2) Let each thread acquire all the locks in a particular order (even if a thread does not require all the locks).
I am not convinced that these are the only two options available to me. Please advise.
Have all the threads enter a single mutex-protected state machine whenever they acquire and release their set of locks. The threads should expose methods that return the set of locks they require to continue, and methods to signal/wait on a private semaphore. The SM should contain a bool for each lock and a 'Waiting' queue/array/vector/list/whatever container to store waiting threads.
If a thread enters the SM mutex to get locks and can immediately get its lock set, it can reset its bool set, exit the mutex and continue on.
If a thread enters the SM mutex and cannot immediately get its lock set, it should add itself to 'Waiting', exit the mutex and wait on its private semaphore.
If a thread enters the SM mutex to release its locks, it sets the lock bools to 'return' its locks and iterates 'Waiting' in an attempt to find a thread that can now run with the set of locks available. If it finds one, it resets the bools appropriately, removes the thread it found from 'Waiting' and signals the 'found' thread semaphore. It then exits the mutex.
You can twiddle with the algorithm that you use to match up the available set lock bools with waiting threads as you wish. Maybe you should release the thread that requires the largest set of matches, or perhaps you would like to 'rotate' the 'Waiting' container elements to reduce starvation. Up to you.
A solution like this requires no polling (with its performance-sapping CPU use and latency), and no continual acquire/release of multiple locks.
It's much easier to develop such a scheme with an OO design. The methods/member functions to signal/wait the semaphore and return the set of locks needed can usually be stuffed somewhere in the thread class inheritance chain.
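A possible sketch of that state machine in C++ (LockManager, Waiter, and the integer lock IDs are invented for illustration; a condition variable per waiter stands in for the "private semaphore"):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

class LockManager {
    std::mutex sm;                                 // the single SM mutex
    bool taken[4] = {false, false, false, false};  // one bool per lock

    struct Waiter {
        std::vector<int> needs;
        std::condition_variable cv;  // the thread's "private semaphore"
        bool granted = false;
    };
    std::deque<Waiter*> waiting;     // the 'Waiting' container

    bool available(const std::vector<int>& needs) const {
        for (int id : needs) if (taken[id]) return false;
        return true;
    }
    void take(const std::vector<int>& needs) {
        for (int id : needs) taken[id] = true;
    }

public:
    // Block until every lock in 'needs' is held by the caller.
    void acquire(const std::vector<int>& needs) {
        std::unique_lock<std::mutex> g(sm);
        if (available(needs)) { take(needs); return; }
        Waiter w;
        w.needs = needs;
        waiting.push_back(&w);
        w.cv.wait(g, [&] { return w.granted; });  // sleep until granted
    }

    // Return the locks, then hand them to the first waiter that can run.
    void release(const std::vector<int>& needs) {
        std::lock_guard<std::mutex> g(sm);
        for (int id : needs) taken[id] = false;
        for (auto it = waiting.begin(); it != waiting.end(); ++it) {
            if (available((*it)->needs)) {
                take((*it)->needs);     // grant the whole set at once
                (*it)->granted = true;
                (*it)->cv.notify_one();
                waiting.erase(it);
                break;
            }
        }
    }
};

With this, T1 would call acquire({0, 1, 2}), T2 acquire({1}), and so on; no thread ever holds a partial set of locks, so the release-and-retry loop and its livelock disappear.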
Unless there is a good reason (performance-wise) not to do so, I would unify all the locks into one lock object.
This is similar to solution 2 you suggested, only simpler in my opinion.
And by the way, not only is this solution simpler and less bug-prone, its performance might also be better than solution 1 you suggested.
Personally, I have never heard of Option 1, but I am by no means an expert on multithreading. After thinking about it, it sounds like it will work fine.
However, the standard way to deal with threads and resource locking is somewhat related to Option 2. To prevent deadlocks, resources need to always be acquired in the same order. For example, if every thread that needs both L1 and L2 always locks L1 before L2, no cycle of threads each holding a lock the next one needs can ever form.
Go with 2a) Let each thread acquire all of the locks that it needs (NOT all of the locks) in a particular order; if a thread encounters a lock that isn't available, then it releases all of its locks.
As long as threads acquire their locks in the same order you can't have deadlock; however, you can still have starvation (a thread might run into a situation where it keeps releasing all of its locks without making forward progress). To ensure that progress is made you can assign priorities to threads (0 = lowest priority, MAX_INT = highest priority) - increase a thread's priority when it has to release its locks, and reduce it to 0 when it acquires all of its locks. Put your waiting threads in a queue, and don't start a lower-priority thread if it needs the same resources as a higher-priority thread - this way you guarantee that the higher-priority threads will eventually acquire all of their locks. Don't implement this thread queue unless you're actually having problems with thread starvation, though, because it's probably less efficient than just letting all of your threads run at once.
You can also simplify things by implementing omer schleifer's condense-all-locks-to-one solution; however, unless threads other than the four you've mentioned are contending for these resources (in which case you'll still need to lock the resources from the external threads), you can more efficiently implement this by removing all locks and putting your threads in a circular queue (so your threads just keep running in the same order).
Take something simple like a counter that multiple threads will be incrementing. I read that mutex locks can decrease efficiency since the threads have to wait. So, to me, an atomic counter would be the most efficient, but then I read that internally it is basically a lock? So I guess I'm confused how either could be more efficient than the other.
Atomic operations leverage processor support (compare-and-swap instructions) and don't use locks at all, whereas locks are more OS-dependent and perform differently on, for example, Windows and Linux.
Locks actually suspend thread execution, freeing up CPU resources for other tasks, but incurring the obvious context-switching overhead when stopping/restarting the thread.
On the contrary, threads attempting atomic operations don't wait; they keep trying until they succeed (so-called busy-waiting), so they don't incur context-switching overhead, but neither do they free up CPU resources.
Summing up, in general atomic operations are faster if contention between threads is sufficiently low. You should definitely do benchmarking, as there is no other reliable method of knowing which is lower: the overhead of context-switching or of busy-waiting.
If you have a counter for which atomic operations are supported, it will be more efficient than a mutex.
Technically, the atomic will lock the memory bus on most platforms. However, there are two ameliorating details:
It is impossible to suspend a thread during the memory-bus lock, but it is possible to suspend a thread during a mutex lock. This is what gives you the lock-free guarantee (which doesn't say anything about not locking; it just guarantees that at least one thread makes progress).
Mutexes eventually end up being implemented with atomics. Since you need at least one atomic operation to lock a mutex and one atomic operation to unlock it, a mutex lock takes at least twice as long, even in the best of cases.
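The difference is easy to see side by side in a sketch (the counter names are invented):

#include <atomic>
#include <mutex>

std::atomic<long> atomic_counter{0};

long plain_counter = 0;
std::mutex counter_mutex;

void bump_atomic() {
    atomic_counter.fetch_add(1, std::memory_order_relaxed);  // one RMW
}

void bump_mutex() {
    std::lock_guard<std::mutex> g(counter_mutex);  // at least one RMW to lock
    ++plain_counter;                               // the actual work
}                                                  // plus a store/RMW to unlock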
A minimal (standards compliant) mutex implementation requires 2 basic ingredients:
A way to atomically convey a state change between threads (the 'locked' state)
Memory barriers to enforce that memory operations protected by the mutex stay inside the protected area.
There is no way you can make it any simpler than this because of the 'synchronizes-with' relationship the C++ standard requires.
A minimal (correct) implementation might look like this:
#include <atomic>

class mutex {
    std::atomic<bool> flag{false};

public:
    void lock()
    {
        // Spin until we are the thread that flips the flag from false to true.
        while (flag.exchange(true, std::memory_order_relaxed))
            ;
        // Keep the protected operations from being reordered before the lock.
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void unlock()
    {
        // Keep the protected operations from being reordered after the unlock.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(false, std::memory_order_relaxed);
    }
};
Due to its simplicity (it cannot suspend the thread of execution), it is likely that, under low contention, this implementation outperforms a std::mutex.
But even then, it is easy to see that each integer increment, protected by this mutex, requires the following operations:
an atomic store to release the mutex
an atomic exchange (read-modify-write) to acquire the mutex (possibly multiple times)
an integer increment
If you compare that with a standalone std::atomic<int> that is incremented with a single (unconditional) read-modify-write (e.g. fetch_add),
it is reasonable to expect that an atomic operation (using the same ordering model) will outperform the case where a mutex is used.
An atomic integer is a user-mode object and is therefore much more efficient than a mutex, which runs in kernel mode. The scope of an atomic integer is a single application, while the scope of a mutex can be all the software running on the machine.
The atomic variable classes in Java are able to take advantage of Compare and swap instructions provided by the processor.
Here's a detailed description of the differences: http://www.ibm.com/developerworks/library/j-jtp11234/
A mutex is a kernel-level semantic which provides mutual exclusion even at the process level. Note that it can be helpful in extending mutual exclusion across process boundaries, not just within a process (between threads). It is costlier.
An atomic counter, AtomicInteger for example, is based on CAS, and usually keeps attempting the operation until it succeeds. Basically, the threads race or compete to increment/decrement the value atomically. Here, you may see good CPU cycles being consumed by a thread trying to operate on the current value.
Since you want to maintain the counter, AtomicInteger/AtomicLong will be the best for your use case.
Most processors support an atomic read or write, and often an atomic cmp&swap. This means that the processor itself reads or writes the latest value in a single operation; there might be a few cycles lost compared to normal integer access, especially as the compiler can't optimise around atomic operations nearly as well as around normal ones.
On the other hand, a mutex is a number of lines of code to enter and leave, and during that execution other processors that access the same location are completely stalled, so clearly there is a big overhead on them. In unoptimised high-level code, the mutex enter/exit and the atomic will be function calls, but for the mutex, any competing processor will be locked out from the moment your mutex-enter function returns until your exit function is started. For the atomic, only the duration of the actual operation locks anything out. Optimisation should reduce that cost, but not all of it.
If you are trying to increment, then your modern processor probably supports atomic increment/decrement, which will be great.
If it does not, then it is either implemented using the processor atomic cmp&swap, or using a mutex.
Mutex:
    get the lock
    read
    increment
    write
    release the lock
Atomic cmp&swap:
    atomic read the value
    calc the increment
    do {
        atomic cmp&swap value, increment
        recalc the increment
    } while the cmp&swap did not see the expected value
So this second version has a loop [in case another processor increments the value between our atomic operations, so the value no longer matches and the increment would be wrong] that can get long [if there are many competitors], but it should generally still be quicker than the mutex version; the mutex version, however, may allow the processor to task-switch.
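In C++, that cmp&swap loop corresponds almost line for line to compare_exchange_weak (a sketch; the function name increment is illustrative):

#include <atomic>

void increment(std::atomic<int>& value) {
    int expected = value.load();  // atomic read the value
    // The CAS stores expected + 1 only if 'value' still equals 'expected';
    // on failure it reloads 'expected' with the value another thread wrote,
    // which is exactly the "recalc the increment" step.
    while (!value.compare_exchange_weak(expected, expected + 1))
        ; // try again with the freshly observed value
}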