C11/C++11 memory model acquire, release, relaxed specifics

C11/C++11 memory model acquire, release, relaxed specifics - multithreading

I have some doubts about the C++11/C11 memory model that I was wondering if anyone can clarify. These are questions about the model/abstract machine, not about any real architecture.
Are acquire/release effects guaranteed to "cascade" from one thread to the next?
Here is a pseudo code example of what I mean (assume all variables start as 0)
[Thread 1]
store_relaxed(x, 1);
store_release(a, 1);
[Thread 2]
while (load_acquire(a) == 0);
store_release(b, 1);
[Thread 3]
while (load_acquire(b) == 0);
assert(load_relaxed(x) == 1);
Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct? Or do we need to use seq cst here in order to be guaranteed that the assert will not fire? I have a feeling acquire/release is enough, but I can't quite find any simple explanation that guarantees it. Most explanations of acquire/release mainly focus on the acquiring thread receiving all the stores made by the releasing thread. However in the example above, Thread 2 never touches variable x, and Thread 1/Thread 3 do not touch the same atomic variable. It's obvious that if Thread 2 were to load x, it would see 1, but is that state guaranteed to cascade over into other threads which subsequently do an acquire/release sync with Thread 2? Or does Thread 3 also need to do an acquire on variable a in order to receive Thread 1's write to x?
According to https://en.cppreference.com/w/cpp/atomic/memory_order:
All writes in the current thread are visible in other threads that acquire the same atomic variable
All writes in other threads that release the same atomic variable are visible in the current thread
Since Thread 1 and Thread 3 don't touch the same atomic variable, I'm not sure if acquire/release alone is enough for the above case. There's probably an answer hiding in the formal description, but I can't quite work it out.
*EDIT: Didn't notice until after the fact, but there is an example at the link I posted ("The following example demonstrates transitive release-acquire ordering...") that is almost the same as my example, but it uses the same atomic variable across all three threads, which seems like it might be significant. I am specifically asking about the case where the variables are not the same.
Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?
Imagine there is a function "get_data" that allocates a buffer, writes some data to it, and returns a pointer to the buffer. And there is a function "use_data" that takes the pointer to the buffer and does something with the data. Thread 1 gets a buffer from get_data and passes it to Thread 2 using a relaxed atomic store to a global atomic pointer. Thread 2 does relaxed atomic loads in a loop until it gets the pointer, and then passes it off to use_data:
int* get_data() {...}
void use_data(int* buf) {...}
int* global_ptr = nullptr;
[Thread 1]
int* buf = get_data();
super_duper_memory_fence();
store_relaxed(global_ptr, buf);
[Thread 2]
int* buf = nullptr;
while ((buf = load_relaxed(global_ptr)) == nullptr);
use_data(buf);
Is there any kind of operation at all that can be put in "super_duper_memory_fence", that will guarantee that by the time use_data gets the pointer, the data in the buffer is also visible? It is my understanding that there is not a portable way to do this, and that Thread 2 must have a matching fence or other atomic operation in order to guarantee that it receives the writes made into the buffer and not just the pointer value. Is this correct?

Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct?
Yes, this is correct. The acquire/release operations establish synchronize-with relations - i.e., store_release(a) synchronizes-with load_acquire(a) and store_release(b) synchronizes-with load_acquire(b). And load_acquire(a) is sequenced-before store_release(b). synchronize-with and sequenced-before are both part of the happens-before definition, and the happens-before relation is transitive. Therefore, store_relaxed(x, 1) happens-before load_relaxed(x).
Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?
This is question is a bit too broad, but overall I would tend to say "yes". In general you have to ensure that there is a proper happens-before relation when operating on some (non-atomic) shared data. If one thread writes to some shared data and some other thread should read that data, you have to ensure that the write happens-before the read. There are different ways to achieve this - atomics with the correct memory orderings are just one way (although one could argue that almost all other methods (like std::mutex) also boil down to atomic operations).
Fences also have to be combined with other fences or atomic operations. Your example would work if super_duper_memory_fence() were a std::atomic_thread_fence(std::memory_order_release) and you put another std::atomic_thread_fence(std::memory_order_acquire) before your call to use_data.
For more details I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers

Related

How does a Mutex work? Does a mutex protect variables globally? Does the scope in which it is defined matter?

Does a mutex lock access to variables globally, or just those in the same scope as the locked mutex?
Note that I had to change the title of this question, as a lot of answers seem to be confused as to what I was asking. This is not a question about the scope (global or otherwise) of a "mutex object", it is a question about what scope of variables are "locked" by a mutex.
I believe the answer to be that a mutex locks access to all variables, ie; all global and locally scoped variables. (This is a result of a mutex blocking thread execution rather than access to specific regions of memory.)
I am attempting to understand Mutexes.
I was attempting to understand what sections of memory, or equivalently, which variables, a mutex would lock.
However my understanding from reading around online is that Mutexes do not lock memory, they lock (or block) simultaneously running threads which are all members of the same process. (Is that correct?)
https://mortoray.com/2011/12/16/how-does-a-mutex-work-what-does-it-cost/
So my question has become simply "are mutexes global?"
... or are they perhaps "generally speaking global, but the stackoverflow community can imagine some special cases in which they are not?"
When originally considering my question, I was interested in things such as those shown in the following example.
// both in global scope, this mutex will lock any global scope variable?
int global_variable;
mutex global_variable_mutex;
int main()
{
// one thread operates here and locks global_variable_mutex
// before reading/writing
{
// local variables in a loop
// launch some threads here, and wait later
int local_variable;
mutex local_variable_mutex;
// wait for launched thread to return
// does the mutex here prevent data races to the variable
// global_variable ???
}
}
One may assume this is pseudo-code for C++ or C, or any other similarly relevant language.
2021 edit: Question title has been changed to better reflect the contents of the question and associated answers.

So my question has become simply "are mutexes global?"
No. A mutex has a lock() and an unlock() method, and the only thing a mutex does is cause its lock() call (from any thread) not to return for as long as another thread has that mutex locked. When the thread that was holding the mutex locked calls unlock(), that is when the lock() call will return in the first thread. That way it is guaranteed that only a single thread will be holding the mutex-lock (i.e. executing in the region between its lock() call and its unlock() call) at any given time.
That's really all there is to it. So a mutex will effect only the threads that call lock() on that particular mutex, and nothing else.
Mutex stands for "Mutual Exclusion" - using one correctly ensures that only one thread at a time will ever be executing any "critical section" protected by the same mutex.
If there are some variables you only ever modify inside critical sections protected by the same mutex, your code doesn't have a data race. No matter whether they're global, static, or pointed to by different variables in different threads or any other way two threads might have a reference to the same object.

When I asked this question I was confused...
When I originally asked this question, I was confused because I had no conceputal understanding of how a "mutex" functions in hardware, whereas I did have a conceptual understanding of many other things that exist in hardware. (For example, how a compiler converts text into machine readable instructions. How cache and memory work. How graphics or coprocessors work. How network hardware and interfaces work, etc.)
Misconception 1: Mutex does not lock memory locations
When I first heard about Mutex, long before writing this question, I misunderstood a mutex to be a feature which locks regions of memory. (That region might be global.)
This is not what happens. Other threads and processes can continue to access main memory and cache if another thread locks a mutex. You can see immediatly why such a design would be inefficient, since it would block all other system processes, for the sake of synchronizing one.
Misconception 2: The scope in which a mutex object is declared is irrelevant
The context of this is C code, and C like languages where you have scoped blocks defined by { and } however the same logic could apply to Python where scope is defined by indentation.
I believe that this misunderstanding came from the existance of scoped_lock objects, and similar concepts where scope is used to manage the lifetime (locking and unlocking, resources) of a Mutex object.
One could also argue that since pointers and references to a Mutex can be passed around a program, the scope of a Mutex couldn't be used to define what variables are "locked" by a mutex.
For example, I misunderstood the following snippet:
{
int x, y, z;
Mutex m;
m.lock();
}
I believed that the above snippet would lock access to variables x, y and z from all other threads because x, y and z are declared in the same scope as the mutex m. This is also not how a mutex works.
Understanding 1: Mutex is typically implemented in hardware using atomic operations
Atomic operations are completely seperate from the concept of mutex, however they are a prerequisite to understanding how a mutex can exist, and how it can work.
When a CPU executes something like c = a + b, this involves a sequence of individual (atomic) operations. The word Atom is derived from Atomos meaning "indivisible", or "fundamental". (Atoms are divisible, but when theorists of Ancient Greece originally concieved of the objects from which matter was composed, they assumed that particles must be divisible down to some fundamental smallest possible component, which itself is indivisible. They were not too far wrong, since an atom is made from other fundamental particles which so far we understand to be indivisible.)
Returning to the point: c = a + b is something like the following:
load a from memory into register 1
load b from memory into register 2
do operation add: add contents of register 2 to register 1, result is in register 1
save register 1 to memory c
The add operation might take several clock cycles, and loading/saving to memory takes typically of order 100 clock cycles on modern x86 machines. However each operation is atomic in the sense that a single CPU instruction is being completed, and this instruction cannot be divided into any smaller step of smaller instructions. The instructions are themselves fundamental computing operations.
With that understood, there exists a set of atomic instructions which can do things such as:
load a value from memory increment it and save it to memory
load a value from memory decrement it and save it to memory
load a value from memory, compare it to a value which is already loaded into a register, and branch depending on the comparison result
Note that such operations are typically significantly slower than their non-atomic sequence counterparts. This is because optimizations such as pipelining are forfit when executing the above instructions. (I think?)
At this point my knowledge becomes a bit less accurate and more hand-wavey, but as far as I understand, these operations are typically implemented by having some digital logic inside the processor which blocks all other processes from running while these atomic operations (listed above) are executing.
Meaning: If there are 8 CPU cores running, if one core encounters an instruction like the above, it signals the other cores to stop running until it has finished that atomic operation. (It is at least something approximatly along these lines.)
Understanding 2: Actual mutex operation
Given the above, it is possible to implement a mutex using these atomic machine instructions. Other answers posted here suggest possible ways of doing it including something similar to reference counting. (Semaphore.)
How an acutal mutex in C++ works is this:
Each mutex object has a variable in memory associated with it, the value of this variable indicates whether a mutex is locked or not
This mutex variable is updated using the special atomic operations that a CPU supports for the purpose of allowing a mutex to be programmed
Elsewhere in memory there are some other variables/data which you want to protect/synchronize access to
This synchronization is done using the mutex variable/data
Before a thread reads/writes to some data/variable which needs to be accessed mutually exclusively by all threads which operate on it, that thread must first "lock" the special mutex data/variable
This is done using the atomic operations built into a CPU for the purpose of supporting mutex programming
So you see, the data which is "locked" and accessed mutually exclusively is entirely independent from the actual data used to store the state of the mutex.
If another thread wants to read/write the data which must be accessed mutually exclusively, it will try to lock the mutex. If the mutex is already locked, that means another thread has the right to access this data, and no other thread is permitted to, therefore this thread will typically go to sleep, and will be re-woken by the operating system when the mutex is next unlocked.
It is important to note the operating system thread (kernel) is critically involved in the mutex process. Typically, before a thread sleeps, it will tell the operating sytem that it wishes to be woken up again when the mutex is free. The operating system is also notified when other threads lock or unlock a mutex. Hence synchronization of information about the state of a mutex is passed via messages through the operating system kernel.
This is why writing a multiple thread OS kernel is (proabably) impossible (if not very difficult). I don't know if this has actually been done successfully. It sounds like a difficult problem which might be the subject of current CS research.
This is pretty much everything I know about the subject. Obviously my knowledge is not absolute...
Note: Feel free to correct my Greek history or x86 Machine Instruction knowledge in the comments section. No doubt not everything here is perfectly accurate.

As your question suggests, I assume you are asking your question independent of any programming language.
First it is important to understand what is a mutex and how it works? A mutex is a binary semaphore. Then what is a semaphore? A semaphore is an integer with following attributes,
You can initialize it into any permitted value (For a mutex, it is 1 or 0).
A thread can access the semaphore and it can increment or decrement its integer value.
When a thread decrements it,
If the result is positive or zero, that thread can continue its process.
If the result is negative, that thread will be waiting and the semaphore value will not be further decremented by any later thread.
If a thread increments it, (in that case semaphore value will be either positive or 0) and the result is 0, one of the waiting threads can continue execution.
So when there's a situation where a thread is trying to access a shared resource it will decrement the mutex value (from 0, so that other thread is waiting). And when it finishes, it will increment the mutex value (So that the waiting thread can continue). That's how the access control happens by means of a mutex (Binary semaphore).
I think you understand that your question is a non-applicable one here. As a simple answer for
So my question has become simply "are mutexes global?"
is simply NO.

A mutex has whatever scope you assign to it. It can be global or local again based on where and how you declare it. If for example you declare a mutex in global memory in a place where you can access it globally, then it is indeed global. If instead you declare it at function or private class scope level, then only that function or class will have access to it.
That said, in order to be useful for synchronization, the mutex needs to be declared in a scope that can be accessed by the threads needing to synchronize on it. Whether that's at global scope or some local scope depends on your program structure. I'd advise declaring it at the highest scope accessible to the threads but no higher.
In your particular example, the mutex is indeed global because you've declared it in global memory.

Locking doesn't operate on the variables it protects, it just works by giving threads a way to arrange that only one thread at a time will be doing something (like reading+writing a data structure). And that it will be finished, with memory effects visible, before the next thread's turn to read and maybe modify that data. (A readers+writers lock allows multiple readers but only one writer).
Any thread that can access the mutex object can lock / unlock it. The mutex object itself is a normal variable that you can put in any scope you want, even a local variable and then put a pointer to it somewhere that other threads can see. (Although normally you wouldn't do that.)
Mutex is named for "Mutual Exclusion" - using one correctly ensures that only one thread at a time will ever be executing any "critical section" (wikipedia) protected by the same mutex. Separate mutexes can allow different threads to hold different locks. Different functions or blocks that use the same mutex (normally because they access the same data) won't both run at once.
If there are some variables you only ever modify inside critical sections protected by the same mutex, those accesses won't be data race, and if you don't have other bugs, your code is thread-safe. No matter whether they're global, static, or pointed to by different variables in different threads or any other way two threads might have a reference to the same object.
If you write code that accesses shared data without taking a lock on a mutex, it might see a partially-updated value, especially for a struct with multiple pointers / integers. (And in C++, simultaneous accesses to non-atomic variables is undefined behaviour if they're not all reads).
Locking is a cooperative activity, normally nothing stops you from getting it wrong. If you're familiar with file locking, you may have heard of advisory vs. mandatory locks (the OS will deny open calls by other programs). Mutexes in multi-threaded programs are advisory; no memory protection or other hardware mechanism stops another thread from executing code that accesses the bytes of an object.
(At a low enough level, that's actually useful for lock-free atomics, especially with some control over ordering of those operations from memory barriers and/or release-store / acquire-load. And CPU cache hardware is up to the task of maintaining coherency from multiple accesses. But if you use locking, you don't have to worry about any of that. If you use locking incorrectly, understanding the possible symptoms might help identify that there is a locking problem.)
Some programs have phases where only a single thread is running, or only one that would need to touch certain variables, so enforced locking for every access to a variable isn't something that every language provides. (C++ std::atomic<T> is sort of like that; every access is as-if there was a lock/unlock of a lock protecting just that T object, except it's limited to operations that most CPUs can do without needing to lock/unlock a separate lock. Unless you use a large T, then there actually is a lock. Or if you use a memory order weaker than the default seq_cst, you can see orderings that wouldn't have been possible if all accesses acquiring/releasing locks.)
Besides, consistency between multiple variables is often important, so it matters that you hold one lock across multiple operations on multiple variables, or multiple members of the same struct.
Some tools can help detect code that doesn't respect a mutex while other threads are running, though, like clang -fsanitize=thread.

The usage case of counting semaphore

To be clear: I do mostly embedded stuff, i.e. it's C and some kind of real-time kernel in microcontroller; but actually this question should be platform-independent.
I've read nice article by Michael Barr: Mutexes and Semaphores Demystified, as well as this related answer on StackOverflow. I understand clearly what binary semaphore is for, and what mutex is for. That's great.
But to be honest I never knew, and still can't understand, what so-called counting semaphore (i.e. semaphore with max count > 1) is for. In what cases should I use it?
Long time ago, before I've read aforementioned article by Michael Barr, I've told something like "you can use it when you have, like, a hotel room with certain number of beds. The number of beds is a maximum count for the semaphore, just like a number of keys for that room".
It probably sounds nicely, but actually I never had such a situation in my programming practice (and can't imagine any), and Michael Barr said this approach is just wrong, and he seems right.
Then, after I've read the article, I supposed it might probably be used when I have, say, some kind of FIFO buffer. Assume the buffer's capacity is 10 elements, and we have two tasks: A (the producer), and B (the consumer). Then:
Semaphore's max count should be set to 10;
When A wants to put data into buffer, it signals the semaphore.
When B wants to get the data from buffer, it waits the semaphore.
Well, but it doesn't work:
What if A tries to put new data to the FIFO, but there is no room? How would it wait for the place: should it call signal before putting new data (and signal then should be able to wait until max count < max count)? If so, semaphore will be signaled before data is actually put in the FIFO, this is wrong.
Semaphore is not enough for the proper synchronization: the FIFO itself needs to be synchronized as well. And then, it produces classic TOCTTOU problem: there is a period of time while semaphore is already either signaled or waited, but FIFO isn't yet modified.
So, when should I use that beast, the counting semaphore?

The 'classic' example is, indeed, a producer-consumer queue.
An unbounded queue requires one semaphore, (to count the queue entries), and a mutex-protected thread-safe queue, (or equivalent lock-free thread-safe queue). The semaphore is intialized to zero. Producers lock the mutex, push an object onto the queue, unlock the mutex and signal the semaphore. Consumers wait on the semaphore, lock the mutex, pop the object and unlock the mutex.
An bounded queue requires two semaphores, (one 'count' to count the entries, the other 'available' to count the free space), and a mutex-protected thread-safe queue, (or equivalent lock-free thread-safe queue). 'count' is initialized to zero and 'available' to the number of spaces free in an empty queue. Producers wait for 'available', lock the mutex, push an object onto the queue, unlock the mutex and signal 'count'. Consumers wait on 'count', lock the mutex, pop the object, unlock the mutex and signal 'available'.
This is a classic use for semaphores and had been around since forever, (well, since Dijkstra, anyway:). It's been tried billions of times, and it works fine for any number of producers/consumers.
There is no TOCTTOU issue, no corner-cases, no races.
The 'mutex' functionality may be provided by yet another semaphore, initialized to 1. This allows 'two semaphore' unbounded, and 'three semaphore' bounded implementations.

I supposed it might probably be used when I have, say, some kind of FIFO buffer. Assume the buffer's capacity is 10 elements, and we have two tasks: A (the producer), and B (the consumer). Then:
Semaphore's max count should be set to 10;
When A wants to put data into buffer, it signals the semaphore.
When B wants to get the data from buffer, it waits the semaphore.
This is not the way semaphores are used in the producer-consumer scenario. The standard solution is to use two counting semaphores, one for the empty slots (initialized to the number of available slots), and another for the filled slots (initialized to 0).
Producers try to allocate empty slots to put items in, so they start with wait-ing on the semaphore assigned to the empty slots. Consumers try to "allocate" (get hold of) filled slots, so they start with wait-ing on the semaphore assigned to the filled slots.
After finishing their work, they both signal the other semaphore since they transform slots from empty to filled and from filled to empty, respectively.
Standard solution scheme:
semaphore mutex = 1;
semaphore filled = 0;
semaphore empty = SIZE;
producer() {
while ( true) {
item = produceItem();
wait(empty);
wait(mutex);
putItemIntoBuffer( item);
signal(mutex);
signal(filled);
}
}
consumer() {
while ( true) {
wait( filled);
wait( mutex);
item = removeItemFromBuffer();
signal( mutex);
signal( empty);
consumeItem( item);
}
}
I think counting semaphores serve well in this situation.
Another, maybe simpler, example could be using a counting semaphore for avoiding deadlock in the Dining philosophers scenario. Since deadlock can occur only when all philosophers sit down simultaneously and pick their (say) left fork, deadlock can be avoided by not allowing all of them into the dining room at the same time. This can be achieved by a counting semaphore (enter) initialized to one less than the number of philosophers.
The protocol of one philosopher then becomes:
wait( enter)
wait( left_fork)
wait( right_fork)
eat()
signal( left_fork)
signal( right_fork)
signal( enter)
This ensures that all philosophers cannot be in the dining room at the same time.

Some of the more popular use cases of counting semaphores are -
Limiting the number of connections in a JDBC connection pool.
Network connection throttling.
Limiting concurrent access to resources such as a disk.

mutex concept in POSIX threads

could anyone tell me how to use the mutex in the posix threads.
we are decalaring a mutex as pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER..
and we are using pthread_mutex_lock(&mutex1) to lock the critical code.
but my question is what does it lock
we didnt specify which code to be locked
if i have 4 threads and i want the thread 1 resources to be locked how could i tell that to the compiler
and im unable to understand it can anyone help me guys.
and whats the real use of mutex
waiting for an answer ..

It's entirely up to you to decide, pthreads doesn't specify what a lock does or does not protect.
E.g. if you want to protect a variable "a", it's up to you to ensure that all accesses to "a" happen under the protection of a mutex (say, "a_lock"). For a simple (untested) example
static int a;
static pthread_mutex_t a_lock = PTHREAD_MUTEX_INITIALIZER;
/* Add argument to static variable a in a thread-safe manner and return the result. (In real life, for this kind of operation you could use an atomic operation) */
int increment_a(int i)
{
int res;
pthread_mutex_lock(&a_lock);
a += i;
res = a;
pthread_mutex_unlock(&a_lock);
return res;
}
That is, there is nothing in the hardware, OS, thread library, or anything else, that specifies any kind of relationship between the data you want to protect (variable "a" in the example above), and the mutex you're using to implement said protection (the mutex "a_lock" above). The pthread library only ensures that only one thread at a time may hold a mutex.

The code it locks, protects, its the code between the lock and the unlock
pthread_mutex_lock(&mtx)
global_var = global_var + 10;
pthread_mutex_unlock(&mtx)
In this simple case you are protecting the increments of this global variable, this is necessary because the process of incrementing a number takes three steps:
Reading the current value from ram
Incrementing that value
Writing that value back into ram
Let's focus on steps 1 and 2. A thread CAN (and will be, be sure of that) be switched just after step 1 (reading), and the other thread will read the same value as the first one, because it never got the opportunity to update back to ram. When the first thread resumes work (just after step 1), it will increment that value it read before and write back the wrong final value to ram, overwriting the second's thread work.
In other words:
[Thread 1] Reads the value 10
[Thread 2] Reads the value 10
[Thread 2] Increments it by 5 (value is now 15)
[Thread 2] Writes it back
[Thread 1] Increments it by 5 **(value is now 15)**
[Thread 1] Writes it back
Final value: 15
The final value should have been 20, not 15. See the problem?
With a mutex you can serialize the access, putting stuff into proper order, so it becomes
[Thread 1] Reads the value 10
[Thread 1] Increments it by 5 (value is now 15)
[Thread 1] Writes it back
[Thread 2] Reads the value 15
[Thread 2] Increments it by 5 (value is now 20)
[Thread 2] Writes it back
Final value: 20
I'm sorry if I was confusing, but thread programming is confusing at first. Takes a while to think multithreaded, but linger there! It is a very important concept to grasp, especially now that multi-core CPUs are the de facto standard.

Threads share memory (except the stack and possibly thread-local storage). So, you need to do something to avoid two threads stomping into each other.
Mutexes are a way to prevent threads from interfering into each other. A mutex is a MUTual EXClusion primitive, only one thread can hold a given mutex at a time.
So, if you want to protect some data structure from simultaneous access from several threads, you associate a mutex with that thread, and wrap every access to that data structure into mutex lock and unlock calls. This way, you ensure that only one thread can access the data structure at a time.
If a thread holds a mutex, and a second thread tries to lock the mutex, the second thread will block (sleep) until the first thread unlocks the mutex.

When should we use mutex and when should we use semaphore

When should we use mutex and when should we use semaphore ?

Here is how I remember when to use what -
Semaphore:
Use a semaphore when you (thread) want to sleep till some other thread tells you to wake up. Semaphore 'down' happens in one thread (producer) and semaphore 'up' (for same semaphore) happens in another thread (consumer)
e.g.: In producer-consumer problem, producer wants to sleep till at least one buffer slot is empty - only the consumer thread can tell when a buffer slot is empty.
Mutex:
Use a mutex when you (thread) want to execute code that should not be executed by any other thread at the same time. Mutex 'down' happens in one thread and mutex 'up' must happen in the same thread later on.
e.g.: If you are deleting a node from a global linked list, you do not want another thread to muck around with pointers while you are deleting the node. When you acquire a mutex and are busy deleting a node, if another thread tries to acquire the same mutex, it will be put to sleep till you release the mutex.
Spinlock:
Use a spinlock when you really want to use a mutex but your thread is not allowed to sleep.
e.g.: An interrupt handler within OS kernel must never sleep. If it does the system will freeze / crash. If you need to insert a node to globally shared linked list from the interrupt handler, acquire a spinlock - insert node - release spinlock.

A mutex is a mutual exclusion object, similar to a semaphore but that only allows one locker at a time and whose ownership restrictions may be more stringent than a semaphore.
It can be thought of as equivalent to a normal counting semaphore (with a count of one) and the requirement that it can only be released by the same thread that locked it(a).
A semaphore, on the other hand, has an arbitrary count and can be locked by that many lockers concurrently. And it may not have a requirement that it be released by the same thread that claimed it (but, if not, you have to carefully track who currently has responsibility for it, much like allocated memory).
So, if you have a number of instances of a resource (say three tape drives), you could use a semaphore with a count of 3. Note that this doesn't tell you which of those tape drives you have, just that you have a certain number.
Also with semaphores, it's possible for a single locker to lock multiple instances of a resource, such as for a tape-to-tape copy. If you have one resource (say a memory location that you don't want to corrupt), a mutex is more suitable.
Equivalent operations are:
Counting semaphore Mutual exclusion semaphore
-------------------------- --------------------------
Claim/decrease (P) Lock
Release/increase (V) Unlock
Aside: in case you've ever wondered at the bizarre letters (P and V) used for claiming and releasing semaphores, it's because the inventor was Dutch. In that language:
Probeer te verlagen: means to try to lower;
Verhogen: means to increase.
(a) ... or it can be thought of as something totally distinct from a semaphore, which may be safer given their almost-always-different uses.

It is very important to understand that a mutex is not a semaphore with count 1!
This is the reason there are things like binary semaphores (which are really semaphores with count 1).
The difference between a Mutex and a Binary-Semaphore is the principle of ownership:
A mutex is acquired by a task and therefore must also be released by the same task.
This makes it possible to fix several problems with binary semaphores (Accidental release, recursive deadlock, and priority inversion).
Caveat: I wrote "makes it possible", if and how these problems are fixed is up to the OS implementation.
Because the mutex has to be released by the same task it is not very good for the synchronization of tasks. But if combined with condition variables you get very powerful building blocks for building all kinds of IPC primitives.
So my recommendation is: if you got cleanly implemented mutexes and condition variables (like with POSIX pthreads) use these.
Use semaphores only if they fit exactly to the problem you are trying to solve, don't try to build other primitives (e.g. rw-locks out of semaphores, use mutexes and condition variables for these)
There is a lot of misunderstanding between mutexes and semaphores. The best explanation I found so far is in this 3-Part article:
Mutex vs. Semaphores – Part 1: Semaphores
Mutex vs. Semaphores – Part 2: The Mutex
Mutex vs. Semaphores – Part 3 (final part): Mutual Exclusion Problems

While #opaxdiablo answer is totally correct I would like to point out that the usage scenario of both things is quite different. The mutex is used for protecting parts of code from running concurrently, semaphores are used for one thread to signal another thread to run.
/* Task 1 */
pthread_mutex_lock(mutex_thing);
// Safely use shared resource
pthread_mutex_unlock(mutex_thing);
/* Task 2 */
pthread_mutex_lock(mutex_thing);
// Safely use shared resource
pthread_mutex_unlock(mutex_thing); // unlock mutex
The semaphore scenario is different:
/* Task 1 - Producer */
sema_post(&sem); // Send the signal
/* Task 2 - Consumer */
sema_wait(&sem); // Wait for signal
See http://www.netrino.com/node/202 for further explanations

See "The Toilet Example" - http://pheatt.emporia.edu/courses/2010/cs557f10/hand07/Mutex%20vs_%20Semaphore.htm:
Mutex:
Is a key to a toilet. One person can have the key - occupy the toilet - at the time. When finished, the person gives (frees) the key to the next person in the queue.
Officially: "Mutexes are typically used to serialise access to a section of re-entrant code that cannot be executed concurrently by more than one thread. A mutex object only allows one thread into a controlled section, forcing other threads which attempt to gain access to that section to wait until the first thread has exited from that section."
Ref: Symbian Developer Library
(A mutex is really a semaphore with value 1.)
Semaphore:
Is the number of free identical toilet keys. Example, say we have four toilets with identical locks and keys. The semaphore count - the count of keys - is set to 4 at beginning (all four toilets are free), then the count value is decremented as people are coming in. If all toilets are full, ie. there are no free keys left, the semaphore count is 0. Now, when eq. one person leaves the toilet, semaphore is increased to 1 (one free key), and given to the next person in the queue.
Officially: "A semaphore restricts the number of simultaneous users of a shared resource up to a maximum number. Threads can request access to the resource (decrementing the semaphore), and can signal that they have finished using the resource (incrementing the semaphore)."
Ref: Symbian Developer Library

Mutex is to protect the shared resource.
Semaphore is to dispatch the threads.
Mutex:
Imagine that there are some tickets to sell. We can simulate a case where many people buy the tickets at the same time: each person is a thread to buy tickets. Obviously we need to use the mutex to protect the tickets because it is the shared resource.
Semaphore:
Imagine that we need to do a calculation as below:
c = a + b;
Also, we need a function geta() to calculate a, a function getb() to calculate b and a function getc() to do the calculation c = a + b.
Obviously, we can't do the c = a + b unless geta() and getb() have been finished.
If the three functions are three threads, we need to dispatch the three threads.
int a, b, c;
void geta()
{
a = calculatea();
semaphore_increase();
}
void getb()
{
b = calculateb();
semaphore_increase();
}
void getc()
{
semaphore_decrease();
semaphore_decrease();
c = a + b;
}
t1 = thread_create(geta);
t2 = thread_create(getb);
t3 = thread_create(getc);
thread_join(t3);
With the help of the semaphore, the code above can make sure that t3 won't do its job untill t1 and t2 have done their jobs.
In a word, semaphore is to make threads execute as a logicial order whereas mutex is to protect shared resource.
So they are NOT the same thing even if some people always say that mutex is a special semaphore with the initial value 1. You can say like this too but please notice that they are used in different cases. Don't replace one by the other even if you can do that.

Trying not to sound zany, but can't help myself.
Your question should be what is the difference between mutex and semaphores ?
And to be more precise question should be, 'what is the relationship between mutex and semaphores ?'
(I would have added that question but I'm hundred % sure some overzealous moderator would close it as duplicate without understanding difference between difference and relationship.)
In object terminology we can observe that :
observation.1 Semaphore contains mutex
observation.2 Mutex is not semaphore and semaphore is not mutex.
There are some semaphores that will act as if they are mutex, called binary semaphores, but they are freaking NOT mutex.
There is a special ingredient called Signalling (posix uses condition_variable for that name), required to make a Semaphore out of mutex.
Think of it as a notification-source. If two or more threads are subscribed to same notification-source, then it is possible to send them message to either ONE or to ALL, to wakeup.
There could be one or more counters associated with semaphores, which are guarded by mutex. The simple most scenario for semaphore, there is a single counter which can be either 0 or 1.
This is where confusion pours in like monsoon rain.
A semaphore with a counter that can be 0 or 1 is NOT mutex.
Mutex has two states (0,1) and one ownership(task).
Semaphore has a mutex, some counters and a condition variable.
Now, use your imagination, and every combination of usage of counter and when to signal can make one kind-of-Semaphore.
Single counter with value 0 or 1 and signaling when value goes to 1 AND then unlocks one of the guy waiting on the signal == Binary semaphore
Single counter with value 0 to N and signaling when value goes to less than N, and locks/waits when values is N == Counting semaphore
Single counter with value 0 to N and signaling when value goes to N, and locks/waits when values is less than N == Barrier semaphore (well if they dont call it, then they should.)
Now to your question, when to use what. (OR rather correct question version.3 when to use mutex and when to use binary-semaphore, since there is no comparison to non-binary-semaphore.)
Use mutex when
1. you want a customized behavior, that is not provided by binary semaphore, such are spin-lock or fast-lock or recursive-locks.
You can usually customize mutexes with attributes, but customizing semaphore is nothing but writing new semaphore.
2. you want lightweight OR faster primitive
Use semaphores, when what you want is exactly provided by it.
If you dont understand what is being provided by your implementation of binary-semaphore, then IMHO, use mutex.
And lastly read a book rather than relying just on SO.

I think the question should be the difference between mutex and binary semaphore.
Mutex = It is a ownership lock mechanism, only the thread who acquire the lock can release the lock.
binary Semaphore = It is more of a signal mechanism, any other higher priority thread if want can signal and take the lock.

All the above answers are of good quality,but this one's just to memorize.The name Mutex is derived from Mutually Exclusive hence you are motivated to think of a mutex lock as Mutual Exclusion between two as in only one at a time,and if I possessed it you can have it only after I release it.On the other hand such case doesn't exist for Semaphore is just like a traffic signal(which the word Semaphore also means).

As was pointed out, a semaphore with a count of one is the same thing as a 'binary' semaphore which is the same thing as a mutex.
The main things I've seen semaphores with a count greater than one used for is producer/consumer situations in which you have a queue of a certain fixed size.
You have two semaphores then. The first semaphore is initially set to be the number of items in the queue and the second semaphore is set to 0. The producer does a P operation on the first semaphore, adds to the queue. and does a V operation on the second. The consumer does a P operation on the second semaphore, removes from the queue, and then does a V operation on the first.
In this way the producer is blocked whenever it fills the queue, and the consumer is blocked whenever the queue is empty.

A mutex is a special case of a semaphore. A semaphore allows several threads to go into the critical section. When creating a semaphore you define how may threads are allowed in the critical section. Of course your code must be able to handle several accesses to this critical section.

I find the answer of #Peer Stritzinger the correct one.
I wanted to add to his answer the following quote from the book Programming with POSIX Threads by David R Butenhof. On page 52 of chapter 3 the author writes (emphasis mine):
You cannot lock a mutex when the calling thread already has that mutex locked. The result of attempting to do so may be an error return (EDEADLK), or it may be a self-deadlock, where the unfortunate thread waits forever. You cannot unlock a mutex that is unlocked, or that is locked by another thread. Locked mutexes are owned by the thread that locks them. If you need an "unowned" lock, use a semaphore. Section 6.6.6 discusses semaphores)
With this in mind, the following piece of code illustrates the danger of using a semaphore of size 1 as a replacement for a mutex.
sem = Semaphore(1)
counter = 0 // shared variable
----
Thread 1
for (i in 1..100):
sem.lock()
++counter
sem.unlock()
----
Thread 2
for (i in 1..100):
sem.lock()
++counter
sem.unlock()
----
Thread 3
sem.unlock()
thread.sleep(1.sec)
sem.lock()
If only for threads 1 and 2, the final value of counter should be 200. However, if by mistake that semaphore reference was leaked to another thread and called unlock, than you wouldn't get mutual exclusion.
With a mutex, this behaviour would be impossible by definition.

Binary semaphore and Mutex are different. From OS perspective, a binary semaphore and counting semaphore are implemented in the same way and a binary semaphore can have a value 0 or 1.
Mutex -> Can only be used for one and only purpose of mutual exclusion for a critical section of code.
Semaphore -> Can be used to solve variety of problems. A binary semaphore can be used for signalling and also solve mutual exclusion problem. When initialized to 0, it solves signalling problem and when initialized to 1, it solves mutual exclusion problem.
When the number of resources are more and needs to be synchronized, we can use counting semaphore.
In my blog, I have discussed these topics in detail.
https://designpatterns-oo-cplusplus.blogspot.com/2015/07/synchronization-primitives-mutex-and.html

Cache consistency & spawning a thread

Background
I've been reading through various books and articles to learn about processor caches, cache consistency, and memory barriers in the context of concurrent execution. So far though, I have been unable to determine whether a common coding practice of mine is safe in the strictest sense.
Assumptions
The following pseudo-code is executed on a two-processor machine:
int sharedVar = 0;
myThread()
{
print(sharedVar);
}
main()
{
sharedVar = 1;
spawnThread(myThread);
sleep(-1);
}
main() executes on processor 1 (P1), while myThread() executes on P2.
Initially, sharedVar exists in the caches of both P1 and P2 with the initial value of 0 (due to some "warm-up code" that isn't shown above.)
Question
Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
With my newfound knowledge of processor caches, it seems entirely possible that at the time of the print() statement, P2 may not have received the invalidation request for sharedVar caused by P1's assignment in main(). Therefore, it seems possible that myThread() could print 0.
References
These are the related articles and books I've been reading:
Shared Memory Consistency Models: A Tutorial
Memory Barriers: a Hardware View for Software Hackers
Linux Kernel Memory Barriers
Computer Architecture: A Quantitative Approach

Strictly speaking – preferably without assuming any particular type of CPU – is myThread() guaranteed to print 1?
Theoretically, it can print either 0 or 1, even on x86, since stores can move after loads on almost any architecture.
In practice, it would be hard to make myThread() print 0.
Spawning a thread will most likely function as an implicit store/release memory barrier, since it would probably:
- have at least one instruction along the execution path that results in a memory barrier - interlocked instructions, explicit memory barrier instructions, etc.,
- or the store would simply be retired/drained from the store buffer by the time myThread() is called, since setting up a new thread results in executing many instructions - among them, many stores.

I'll speak only to Java here: myThread() is guaranteed to print 1, due to the happens before definition from the The Java Language Specification (Section 17.4.5).
The write to sharedVar in main() happens before spawning the thread with function myThread() because the variable assignment comes first in program order. Next, spawning a thread happens before any actions in the thread being started. By the transitivity of the definition in Section 17.4.5 (hb(x, y) and hb(y, z) implies hb(x, z)), writing to the variable sharedVar happens before print() reads sharedVar in myThread().
You might also enjoy reading Brian Goetz's article Java theory and practice: Fixing the Java Memory Model, Part 2 covering this subject, as well as his book Java Concurrency in Practice.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string