A multiprocessor architecture and Ring 3

A multiprocessor architecture and Ring 3 - multithreading

void
sema_down (struct semaphore *sema)
{
old_level = intr_disable ();
while (sema->value == 0)
{
list_push_back (&sema->waiters, &thread_current ()->elem);
thread_block ();
}
sema->value--;
intr_set_level (old_level);
}
The above piece of code is a mechanim locking mutex in PintOS. PintOS is targeted for uniprocessor systems. Because of that fact it is sufficient to just disable interrupts. There is no possibility that the other will take a mutex.
So, let's consider a multiprocessor design:
void
sema_down (struct semaphore *sema)
{
old_level = intr_disable ();
while (!lock cmpxchg(1,0)) // it is just pseudocode-idea
{
list_push_back (&sema->waiters, &thread_current ()->elem);
thread_block ();
}
intr_set_level (old_level);
}
old_level = intr_disable ();. It turned off interrupts but it is crucial only in that CPU's context.
It can be a prototype of function acquiring mutex in MP architecture. But, there is a problem with list_push_back. It must be also safe-multithreading. But, we cannot make it safe with a mutex because we are just implementing it now!
The main question is:
Is it possible that two ( or more) CPUs are executing code on Ring 0 level ( kernel)?
And, subquestions that are dependent on the answer to the first one:
If not, there is no problem I described above. But- how it can be implemented?
If yes ( it seems impossible or very hard to realize), what about my above considerations ( It is just only example of potential problem).
Do we have to use spinlocks or lock-free structures?

Yes, in SMP multiple CPU can execute the same code, even at Ring 0.
Every CPU is symmetrical thus it can execute the same code path as the others (including kernel code), unless the software implements some sort of synchronization.
The Linux kernel also faced this problem, and initially implemented a not-so-good solution: A Big Kernel Lock that was acquired and released upon entering and exiting the kernel.
It was not a good solution because only one CPU at a time could execute the kernel code, but it was quick to implement and it was the equivalent of the item number one in you list.
A better solution is to use finer locks across the whole kernel.
Since is it the kernel that implements the sleeping locks like mutexes or semaphores shown in your example, it cannot rely on those primitives themselves1 and must use spinlocks or other, simpler, form of locking.
Luckily this is not a problem, a spinlock (and its variants) is actually better than a mutex when there is a low contention or the critical path is really short (like updating a list).
You can take a look at mutex_init from Linux to see that a spinlock is used to synchronize the queue of waiting tasks.
49 void
50 __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
51 {
52 atomic_set(&lock->count, 1);
53 spin_lock_init(&lock->wait_lock);
54 INIT_LIST_HEAD(&lock->wait_list);
55 mutex_clear_owner(lock);
56 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
57 osq_lock_init(&lock->osq);
58 #endif
59
60 debug_mutex_init(lock, name, key);
61 }
So the answer to your second item is yes.
1 You can't sleep while waiting to sleep while waiting for a lock.

Related

Can OS actively switch to another thread when the current thread is blocking? If so, does it make the value of asynchronous programming much smaller?

Ok, let me try to restate the 2 questions:
Does OS actively preempt a thread as soon as it starts blocking, and never return to the thread until blocking is done? I feel that the OS has the information about disk IO and network IO so it should have enough information to do so.
If the OS can eliminate the CPU idle time by switching to another thread, do we really need asynchronous programming?

Therefore, even though the thread is blocking, the CPU is not blocking
but running other threads.
Yes, that's correct.
If my understanding above is correct, what is the value of
asynchronous programming?
It's useful if you want your program to make progress on multiple tasks simultaneously.
Of course you could get the same effect by explicitly spawning multiple threads yourself, and having each of them work on a separate task (including any blocking calls), and that would work as well, but then you have to explicitly manage all those threads which can be a bit of a pain to get right. (inter-thread synchronization/communication can be tricky, and in particular the case where you want to cancel an operation is difficult to implement well if one or more of your threads is blocked inside a blocking I/O call and thus can't be easily persuaded to exit quickly -- then your other threads may have to wait a long time, possibly forever, before they can join() that thread and terminate safely)

The re-invigoration of asynchronous programming has little to do with OS/kernel architecture to date. OSes have blocked the way you describe since the 1960s; although the reason has changed somewhat.
In early systems, efficiency was measured by useful CPU cycles; so switching tasks when one was blocked was a natural act.
Modern systems architecture is frequently addressing how to keep the CPUs occupied; for example if there are 800 CPUs but 20 tasks; then 780 CPUs have nothing to do.
As a concrete example, a program to sum all the bytes of a file might look a bit like:
for (count=0; (c = getchar()) != EOF; count += c) {}
A multi-threaded version, for performance increase might look like:
for (n=0; n < NCPU; n++) {
if (threadfork() == 0) {
offset_t o = n* (file_size / NCPU);
offset_t l = (file_size / NCPU);
for (count = 0; l-- && pread(fd, &c, 1, o) == 1; count += c) {}
threadexit(count);
}
}
for (n=0; n < NCPU; n++) {
threadwait(&temp);
total += temp;
}
return total;
which is a bit grim, both because it is complex, and probably has inconsistent speed-ups.
In comparison the function:
int GetSum(char *data, int len) {
int count = 0;
while (len--) {
count += *data++;
}
return count;
}
I could construct a sort of dispatcher which, when a lump of file data became available in ram, invoked GetSum() on it, queuing its return value for later accumulation. This dispatcher could invest in familiarity with optimal i/o patterns etc.. since it may be applicable to many problems; and the programmer has a considerably simpler job to do.
Even without that sort of native support; I could mmap(2) the file, then dispatch many threads to just touch a page, then invoke GetSum on that page. This would effectively emulate an asynchronous model in a plain old unix-y framework.
Of course nothing is quite that easy; even a progress bar in a dispatch-oriented asynchronous model is dubious at best (not that the 1950s- based sequential ones were anything to write home about ). Communicating errors is also cumbersome; and because you use asynch to direct maximum cpu resources at yourself, you need to minimize synchronization operations (duh, async :).
Async has a lot of possibilities; but it really needs languages with defacto async support, not as an aspirational nod from the latest du jour standard of some rickety language.

Will Go's scheduler yield control from one goroutine to another for CPU-intensive work?

The accepted answer at golang methods that will yield goroutines explains that Go's scheduler will yield control from one goroutine to another when a syscall is encountered. I understand that this means if you have multiple goroutines running, and one begins to wait for something like an HTTP response, the scheduler can use this as a hint to yield control from that goroutine to another.
But what about situations where there are no syscalls involved? What if, for example, you had as many goroutines running as logical CPU cores/threads available, and each were in the middle of a CPU-intensive calculation that involved no syscalls. In theory, this would saturate the CPU's ability to do work. Would the Go scheduler still be able to detect an opportunity to yield control from one of these goroutines to another, that perhaps wouldn't take as long to run, and then return control back to one of these goroutines performing the long CPU-intensive calculation?

There are few if any promises here.
The Go 1.14 release notes says this in the Runtime section:
Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.
A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. ...
I quoted part of the third paragraph here because this gives us a big clue as to how this asynchronous preemption works: the runtime system has the OS deliver some OS signal (SIGALRM, SIGVTALRM, etc.) on some sort of schedule (real or virtual time). This allows the Go runtime to implement the same kind of schedulers that real OSes implement with real (hardware) or virtual (virtualized hardware) timers. As with OS schedulers, it's up to the runtime to decide what to do with the clock ticks: perhaps just run the GC code, for instance.
We also see a list of platforms that don't do it. So we probably should not assume it will happen at all.
Fortunately, the runtime source is actually available: we can go look to see what does happen, should any given platform implement it. This shows that in runtime/signal_unix.go:
// We use SIGURG because it meets all of these criteria, is extremely
// unlikely to be used by an application for its "real" meaning (both
// because out-of-band data is basically unused and because SIGURG
// doesn't report which socket has the condition, making it pretty
// useless), and even if it is, the application has to be ready for
// spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more
// likely to be used for real.
const sigPreempt = _SIGURG
and:
// doSigPreempt handles a preemption signal on gp.
func doSigPreempt(gp *g, ctxt *sigctxt) {
// Check if this G wants to be preempted and is safe to
// preempt.
if wantAsyncPreempt(gp) && isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()) {
// Inject a call to asyncPreempt.
ctxt.pushCall(funcPC(asyncPreempt))
}
// Acknowledge the preemption.
atomic.Xadd(&gp.m.preemptGen, 1)
atomic.Store(&gp.m.signalPending, 0)
}
The actual asyncPreempt function is in assembly, but it just does some assembly-only trickery to save user registers, and then calls asyncPreempt2 which is in runtime/preempt.go:
//go:nosplit
func asyncPreempt2() {
gp := getg()
gp.asyncSafePoint = true
if gp.preemptStop {
mcall(preemptPark)
} else {
mcall(gopreempt_m)
}
gp.asyncSafePoint = false
}
Compare this to runtime/proc.go's Gosched function (documented as the way to voluntarily yield):
//go:nosplit
// Gosched yields the processor, allowing other goroutines to run. It does not
// suspend the current goroutine, so execution resumes automatically.
func Gosched() {
checkTimeouts()
mcall(gosched_m)
}
We see the main differences include some "async safe point" stuff and that we arrange for an M-stack-call to gopreempt_m instead of gosched_m. So, apart from the safety check stuff and a different trace call (not shown here) the involuntary preemption is almost exactly the same as voluntary preemption.
To find this, we had to dig rather deep into the (Go 1.14, in this case) implementation. One might not want to depend too much on this.

A little bit more on this to complete #torek's answer.
Goroutines are interruptible when there is a syscall, but also when a routine is waiting on a lock, a chan or sleeping.
As #torek's said, since 1.14 routines can also be preempted even when they do none of the above. The scheduler can mark any routine as preemptible after it ran for more than 10ms.
More reading there: https://medium.com/a-journey-with-go/go-goroutine-and-preemption-d6bc2aa2f4b7

Reading a Global Variable from a Thread and Writing to that Variable from another Thread

My program has 2 threads and a int global variable. One thread is reading from that variable and other thread is writing to that variable. Should I use mutex lock in this situation.
These functions are executing from 2 threads simultaneously and repetitively in my program.
void thread1()
{
if ( condition1 )
iVariable = 1;
else if ( condition2 )
iVariable = 2;
}
void thread2()
{
if ( iVariable == 1)
//do something
else if ( iVarable == 2 )
//do another thing
}

If you don't use any synchronization then it is entirely unpredictable when the 2nd thread sees the updated value. This ranges somewhere between a handful of nanoseconds and never. With the never outcome being particularly troublesome of course, it can happen on a x86 processor when you don't declare the variable volatile and you run the Release build of your program. It can take a long time on processors with a weak memory model, like ARM cores. The only thing you don't have to worry about is seeing a partially updated value, int updates are atomic.
That's about all that can be said about the posted code. Fine-grained locking rarely works well.

Yes you should (under most circumstances). Mutexes will ensure that the data you are protecting will be correctly visible from multiple contending CPUs. Unless you have a performance problem, you should use a mutex. If performance is an issue, look into lock free data structures.

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
quitted = false;
while(!should_quit) {
fill_buffer(bfr);
if(should_pause) {
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
}
}
quitted = true;
}
void ui_thread() {
// new Thread(downloader_thread).start();
// ...
should_pause = true;
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?

Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, java), or do explicit volatile read/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform you compiler.

Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.

Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread letting the next in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.

Answering the questions.
Is this completely thread safe? I would answer no this is not thread safe and I would just not do this at all. Without knowing the details of our device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. e.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
this probably won't work even a single core device as others have said.
Rather than taking a lock, you may try to do better to just do less on the UI thread rather than yield in the middle of processing UI messages. If you think that you have spent too much time on the UI thread then find a way to cleanly exit and register an asynchronous call back.
If you call sleep on a UI thread (or try to acquire a lock or do anyting that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O) you need to deal with the reality of waiting for an indeterminate amount of time to get the I/O which tends to translate from glitch to hang.

This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores because bool reads and writes are not atomic operations. This means each core is not guarenteed to have the same value in the cache or even from memory if the cache from the last write hasn't been flushed.
However, even on resource constrained single core devices this is not safe because you do not have control over the scheduler. Here is an example, for simplicty I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next and in it's time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler prempts the downloader_thread before fill_buffer returns and then activates the ui_thread which runs.
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen everytime, but the fact that you are filling the buffer before you set is_paused to true makes it more likely to happen, but even if you switched the order of those two operations on the downloader_thread you would still have a race condition, but you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock, it just doesn't work. Spinlock's aren't very for wait times that are likely to span to many time slices cause the spin the processor. Your implmentation does sleep which is a bit nicer but the scheduler still has to run your thread and thread context switches aren't cheap. If you are waiting on a critical section or semaphore, the scheduler doesn't active your thread again till the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.

What is a semaphore?

A semaphore is a programming concept that is frequently used to solve multi-threading problems. My question to the community:
What is a semaphore and how do you use it?

Think of semaphores as bouncers at a nightclub. There are a dedicated number of people that are allowed in the club at once. If the club is full no one is allowed to enter, but as soon as one person leaves another person might enter.
It's simply a way to limit the number of consumers for a specific resource. For example, to limit the number of simultaneous calls to a database in an application.
Here is a very pedagogic example in C# :-)
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
namespace TheNightclub
{
public class Program
{
public static Semaphore Bouncer { get; set; }
public static void Main(string[] args)
{
// Create the semaphore with 3 slots, where 3 are available.
Bouncer = new Semaphore(3, 3);
// Open the nightclub.
OpenNightclub();
}
public static void OpenNightclub()
{
for (int i = 1; i <= 50; i++)
{
// Let each guest enter on an own thread.
Thread thread = new Thread(new ParameterizedThreadStart(Guest));
thread.Start(i);
}
}
public static void Guest(object args)
{
// Wait to enter the nightclub (a semaphore to be released).
Console.WriteLine("Guest {0} is waiting to entering nightclub.", args);
Bouncer.WaitOne();
// Do some dancing.
Console.WriteLine("Guest {0} is doing some dancing.", args);
Thread.Sleep(500);
// Let one guest out (release one semaphore).
Console.WriteLine("Guest {0} is leaving the nightclub.", args);
Bouncer.Release(1);
}
}
}

The article Mutexes and Semaphores Demystified by Michael Barr is a great short introduction into what makes mutexes and semaphores different, and when they should and should not be used. I've excerpted several key paragraphs here.
The key point is that mutexes should be used to protect shared resources, while semaphores should be used for signaling. You should generally not use semaphores to protect shared resources, nor mutexes for signaling. There are issues, for instance, with the bouncer analogy in terms of using semaphores to protect shared resources - you can use them that way, but it may cause hard to diagnose bugs.
While mutexes and semaphores have some similarities in their implementation, they should always be used differently.
The most common (but nonetheless incorrect) answer to the question posed at the top is that mutexes and semaphores are very similar, with the only significant difference being that semaphores can count higher than one. Nearly all engineers seem to properly understand that a mutex is a binary flag used to protect a shared resource by ensuring mutual exclusion inside critical sections of code. But when asked to expand on how to use a "counting semaphore," most engineers—varying only in their degree of confidence—express some flavor of the textbook opinion that these are used to protect several equivalent resources.
...
At this point an interesting analogy is made using the idea of bathroom keys as protecting shared resources - the bathroom. If a shop has a single bathroom, then a single key will be sufficient to protect that resource and prevent multiple people from using it simultaneously.
If there are multiple bathrooms, one might be tempted to key them alike and make multiple keys - this is similar to a semaphore being mis-used. Once you have a key you don't actually know which bathroom is available, and if you go down this path you're probably going to end up using mutexes to provide that information and make sure you don't take a bathroom that's already occupied.
A semaphore is the wrong tool to protect several of the essentially same resource, but this is how many people think of it and use it. The bouncer analogy is distinctly different - there aren't several of the same type of resource, instead there is one resource which can accept multiple simultaneous users. I suppose a semaphore can be used in such situations, but rarely are there real-world situations where the analogy actually holds - it's more often that there are several of the same type, but still individual resources, like the bathrooms, which cannot be used this way.
...
The correct use of a semaphore is for signaling from one task to another. A mutex is meant to be taken and released, always in that order, by each task that uses the shared resource it protects. By contrast, tasks that use semaphores either signal or wait—not both. For example, Task 1 may contain code to post (i.e., signal or increment) a particular semaphore when the "power" button is pressed and Task 2, which wakes the display, pends on that same semaphore. In this scenario, one task is the producer of the event signal; the other the consumer.
...
Here an important point is made that mutexes interfere with real time operating systems in a bad way, causing priority inversion where a less important task may be executed before a more important task because of resource sharing. In short, this happens when a lower priority task uses a mutex to grab a resource, A, then tries to grab B, but is paused because B is unavailable. While it's waiting, a higher priority task comes along and needs A, but it's already tied up, and by a process that isn't even running because it's waiting for B. There are many ways to resolve this, but it most often is fixed by altering the mutex and task manager. The mutex is much more complex in these cases than a binary semaphore, and using a semaphore in such an instance will cause priority inversions because the task manager is unaware of the priority inversion and cannot act to correct it.
...
The cause of the widespread modern confusion between mutexes and semaphores is historical, as it dates all the way back to the 1974 invention of the Semaphore (capital "S", in this article) by Djikstra. Prior to that date, none of the interrupt-safe task synchronization and signaling mechanisms known to computer scientists was efficiently scalable for use by more than two tasks. Dijkstra's revolutionary, safe-and-scalable Semaphore was applied in both critical section protection and signaling. And thus the confusion began.
However, it later became obvious to operating system developers, after the appearance of the priority-based preemptive RTOS (e.g., VRTX, ca. 1980), publication of academic papers establishing RMA and the problems caused by priority inversion, and a paper on priority inheritance protocols in 1990, 3 it became apparent that mutexes must be more than just semaphores with a binary counter.
Mutex: resource sharing
Semaphore: signaling
Don't use one for the other without careful consideration of the side effects.

Mutex: exclusive-member access to a resource
Semaphore: n-member access to a resource
That is, a mutex can be used to syncronize access to a counter, file, database, etc.
A sempahore can do the same thing but supports a fixed number of simultaneous callers. For example, I can wrap my database calls in a semaphore(3) so that my multithreaded app will hit the database with at most 3 simultaneous connections. All attempts will block until one of the three slots opens up. They make things like doing naive throttling really, really easy.

Consider, a taxi that can accommodate a total of 3(rear)+2(front) persons including the driver. So, a semaphore allows only 5 persons inside a car at a time.
And a mutex allows only 1 person on a single seat of the car.
Therefore, Mutex is to allow exclusive access for a resource (like an OS thread) while a Semaphore is to allow access for n number of resources at a time.

#Craig:
A semaphore is a way to lock a
resource so that it is guaranteed that
while a piece of code is executed,
only this piece of code has access to
that resource. This keeps two threads
from concurrently accesing a resource,
which can cause problems.
This is not restricted to only one thread. A semaphore can be configured to allow a fixed number of threads to access a resource.

Semaphore can also be used as a ... semaphore.
For example if you have multiple process enqueuing data to a queue, and only one task consuming data from the queue. If you don't want your consuming task to constantly poll the queue for available data, you can use semaphore.
Here the semaphore is not used as an exclusion mechanism, but as a signaling mechanism.
The consuming task is waiting on the semaphore
The producing task are posting on the semaphore.
This way the consuming task is running when and only when there is data to be dequeued

There are two essential concepts to building concurrent programs - synchronization and mutual exclusion. We will see how these two types of locks (semaphores are more generally a kind of locking mechanism) help us achieve synchronization and mutual exclusion.
A semaphore is a programming construct that helps us achieve concurrency, by implementing both synchronization and mutual exclusion. Semaphores are of two types, Binary and Counting.
A semaphore has two parts : a counter, and a list of tasks waiting to access a particular resource. A semaphore performs two operations : wait (P) [this is like acquiring a lock], and release (V)[ similar to releasing a lock] - these are the only two operations that one can perform on a semaphore. In a binary semaphore, the counter logically goes between 0 and 1. You can think of it as being similar to a lock with two values : open/closed. A counting semaphore has multiple values for count.
What is important to understand is that the semaphore counter keeps track of the number of tasks that do not have to block, i.e., they can make progress. Tasks block, and add themselves to the semaphore's list only when the counter is zero. Therefore, a task gets added to the list in the P() routine if it cannot progress, and "freed" using the V() routine.
Now, it is fairly obvious to see how binary semaphores can be used to solve synchronization and mutual exclusion - they are essentially locks.
ex. Synchronization:
thread A{
semaphore &s; //locks/semaphores are passed by reference! think about why this is so.
A(semaphore &s): s(s){} //constructor
foo(){
...
s.P();
;// some block of code B2
...
}
//thread B{
semaphore &s;
B(semaphore &s): s(s){} //constructor
foo(){
...
...
// some block of code B1
s.V();
..
}
main(){
semaphore s(0); // we start the semaphore at 0 (closed)
A a(s);
B b(s);
}
In the above example, B2 can only execute after B1 has finished execution. Let's say thread A comes executes first - gets to sem.P(), and waits, since the counter is 0 (closed). Thread B comes along, finishes B1, and then frees thread A - which then completes B2. So we achieve synchronization.
Now let's look at mutual exclusion with a binary semaphore:
thread mutual_ex{
semaphore &s;
mutual_ex(semaphore &s): s(s){} //constructor
foo(){
...
s.P();
//critical section
s.V();
...
...
s.P();
//critical section
s.V();
...
}
main(){
semaphore s(1);
mutual_ex m1(s);
mutual_ex m2(s);
}
The mutual exclusion is quite simple as well - m1 and m2 cannot enter the critical section at the same time. So each thread is using the same semaphore to provide mutual exclusion for its two critical sections. Now, is it possible to have greater concurrency? Depends on the critical sections. (Think about how else one could use semaphores to achieve mutual exclusion.. hint hint : do i necessarily only need to use one semaphore?)
Counting semaphore: A semaphore with more than one value. Let's look at what this is implying - a lock with more than one value?? So open, closed, and ...hmm. Of what use is a multi-stage-lock in mutual exclusion or synchronization?
Let's take the easier of the two:
Synchronization using a counting semaphore: Let's say you have 3 tasks - #1 and 2 you want executed after 3. How would you design your synchronization?
thread t1{
...
s.P();
//block of code B1
thread t2{
...
s.P();
//block of code B2
thread t3{
...
//block of code B3
s.V();
s.V();
}
So if your semaphore starts off closed, you ensure that t1 and t2 block, get added to the semaphore's list. Then along comes all important t3, finishes its business and frees t1 and t2. What order are they freed in? Depends on the implementation of the semaphore's list. Could be FIFO, could be based some particular priority,etc. (Note : think about how you would arrange your P's and V;s if you wanted t1 and t2 to be executed in some particular order, and if you weren't aware of the implementation of the semaphore)
(Find out : What happens if the number of V's is greater than the number of P's?)
Mutual Exclusion Using counting semaphores: I'd like you to construct your own pseudocode for this (makes you understand things better!) - but the fundamental concept is this : a counting semaphore of counter = N allows N tasks to enter the critical section freely. What this means is you have N tasks (or threads, if you like) enter the critical section, but the N+1th task gets blocked (goes on our favorite blocked-task list), and only is let through when somebody V's the semaphore at least once. So the semaphore counter, instead of swinging between 0 and 1, now goes between 0 and N, allowing N tasks to freely enter and exit, blocking nobody!
Now gosh, why would you need such a stupid thing? Isn't the whole point of mutual exclusion to not let more than one guy access a resource?? (Hint Hint...You don't always only have one drive in your computer, do you...?)
To think about : Is mutual exclusion achieved by having a counting semaphore alone? What if you have 10 instances of a resource, and 10 threads come in (through the counting semaphore) and try to use the first instance?

I've created the visualization which should help to understand the idea. Semaphore controls access to a common resource in a multithreading environment.
ExecutorService executor = Executors.newFixedThreadPool(7);
Semaphore semaphore = new Semaphore(4);
Runnable longRunningTask = () -> {
boolean permit = false;
try {
permit = semaphore.tryAcquire(1, TimeUnit.SECONDS);
if (permit) {
System.out.println("Semaphore acquired");
Thread.sleep(5);
} else {
System.out.println("Could not acquire semaphore");
}
} catch (InterruptedException e) {
throw new IllegalStateException(e);
} finally {
if (permit) {
semaphore.release();
}
}
};
// execute tasks
for (int j = 0; j < 10; j++) {
executor.submit(longRunningTask);
}
executor.shutdown();
Output
Semaphore acquired
Semaphore acquired
Semaphore acquired
Semaphore acquired
Could not acquire semaphore
Could not acquire semaphore
Could not acquire semaphore
Sample code from the article

A semaphore is an object containing a natural number (i.e. a integer greater or equal to zero) on which two modifying operations are defined. One operation, V, adds 1 to the natural. The other operation, P, decreases the natural number by 1. Both activities are atomic (i.e. no other operation can be executed at the same time as a V or a P).
Because the natural number 0 cannot be decreased, calling P on a semaphore containing a 0 will block the execution of the calling process(/thread) until some moment at which the number is no longer 0 and P can be successfully (and atomically) executed.
As mentioned in other answers, semaphores can be used to restrict access to a certain resource to a maximum (but variable) number of processes.

A hardware or software flag. In multi tasking systems , a semaphore is as variable with a value that indicates the status of a common resource.A process needing the resource checks the semaphore to determine the resources status and then decides how to proceed.

Semaphores are act like thread limiters.
Example: If you have a pool of 100 threads and you want to perform some DB operation. If 100 threads access the DB at a given time, then there may be locking issue in DB so we can use semaphore which allow only limited thread at a time.Below Example allow only one thread at a time. When a thread call the acquire() method, it will then get the access and after calling the release() method, it will release the acccess so that next thread will get the access.
package practice;
import java.util.concurrent.Semaphore;
public class SemaphoreExample {
public static void main(String[] args) {
Semaphore s = new Semaphore(1);
semaphoreTask s1 = new semaphoreTask(s);
semaphoreTask s2 = new semaphoreTask(s);
semaphoreTask s3 = new semaphoreTask(s);
semaphoreTask s4 = new semaphoreTask(s);
semaphoreTask s5 = new semaphoreTask(s);
s1.start();
s2.start();
s3.start();
s4.start();
s5.start();
}
}
class semaphoreTask extends Thread {
Semaphore s;
public semaphoreTask(Semaphore s) {
this.s = s;
}
#Override
public void run() {
try {
s.acquire();
Thread.sleep(1000);
System.out.println(Thread.currentThread().getName()+" Going to perform some operation");
s.release();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}

So imagine everyone is trying to go to the bathroom and there's only a certain number of keys to the bathroom. Now if there's not enough keys left, that person needs to wait. So think of semaphore as representing those set of keys available for bathrooms (the system resources) that different processes (bathroom goers) can request access to.
Now imagine two processes trying to go to the bathroom at the same time. That's not a good situation and semaphores are used to prevent this. Unfortunately, the semaphore is a voluntary mechanism and processes (our bathroom goers) can ignore it (i.e. even if there are keys, someone can still just kick the door open).
There are also differences between binary/mutex & counting semaphores.
Check out the lecture notes at http://www.cs.columbia.edu/~jae/4118/lect/L05-ipc.html.

This is an old question but one of the most interesting uses of semaphore is a read/write lock and it has not been explicitly mentioned.
The r/w locks works in simple fashion: consume one permit for a reader and all permits for writers.
Indeed, a trivial implementation of a r/w lock but requires metadata modification on read (actually twice) that can become a bottle neck, still significantly better than a mutex or lock.
Another downside is that writers can be started rather easily as well unless the semaphore is a fair one or the writes acquire permits in multiple requests, in such case they need an explicit mutex between themselves.
Further read:

Mutex is just a boolean while semaphore is a counter.
Both are used to lock part of code so it's not accessed by too many threads.
Example
lock.set()
a += 1
lock.unset()
Now if lock was a mutex, it means that it will always be locked or unlocked (a boolean under the surface) regardless how many threads try access the protected snippet of code. While locked, any other thread would just wait until it's unlocked/unset by the previous thread.
Now imagine if instead lock was under the hood a counter with a predefined MAX value (say 2 for our example). Then if 2 threads try to access the resource, then lock would get its value increased to 2. If a 3rd thread then tried to access it, it would simply wait for the counter to go below 2 and so on.
If lock as a semaphore had a max of 1, then it would be acting exactly as a mutex.

A semaphore is a way to lock a resource so that it is guaranteed that while a piece of code is executed, only this piece of code has access to that resource. This keeps two threads from concurrently accesing a resource, which can cause problems.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string