So the consensus number for a fetch-and-add operation is 2.
I'm having a hard time grasping what that means and how it affects multithreaded programming. I would love some practical examples of how it compares to, say, compare-and-swap. Thank you very much.
Fetch-and-add gives you the ability to read and modify a location in memory atomically. A consensus number of n means that the protocol (fetch-and-add in this case) can solve consensus for n different threads.
What does that mean?
Our goal is to use a protocol where, e.g., one thread proposes a value and that value gets adopted by all other threads. You can google some of those protocols. In multithreaded programming it is important that we can modify memory atomically and that threads can sometimes agree on a single value.
It is important to note that consensus protocols are wait-free: every thread completes the protocol in a finite number of its own steps, even if another thread stalls or dies (for some reason). Therefore it is important, when using multithreading, to be aware of an object's consensus number. For example, an atomic register has consensus number 1, which tells us that we will never be able to implement an object using only atomic registers that can solve consensus for 2 threads. That's why we use constructs such as fetch-and-add in multithreading.
Example
Scheduling in an OS is usually done with FIFO queues. FIFO queues have consensus number 2, so they can solve consensus for two threads (but not, on their own, for three or more).
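To make the difference from compare-and-swap concrete, here is a minimal sketch (my own illustration, not from the question) of the classic 2-thread consensus protocol built on fetch-and-add: each thread publishes its proposed value, then atomically increments a shared counter; whoever saw 0 went first and decides its own value, and the other thread adopts the winner's value.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> turn{0};   // shared counter used by fetch-and-add
int proposals[2];           // proposals[i] is thread i's proposed value

int decide(int my_id, int my_value) {
    proposals[my_id] = my_value;        // publish my proposal
    int ticket = turn.fetch_add(1);     // atomic read-modify-write
    if (ticket == 0)
        return my_value;                // I went first: decide my own value
    return proposals[1 - my_id];        // otherwise adopt the other thread's value
}

int main() {
    int d0 = 0, d1 = 0;
    std::thread a([&] { d0 = decide(0, 10); });
    std::thread b([&] { d1 = decide(1, 20); });
    a.join(); b.join();
    std::cout << d0 << " " << d1 << "\n";   // both threads decide the same value
}
```

With an atomic register alone (consensus number 1) no wait-free decide() like this exists for two threads; fetch-and-add (consensus number 2) makes it possible for two threads but not three or more, whereas compare-and-swap has infinite consensus number and works for any number of threads.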
Related
I've been trying to find an answer for that, and all I could find is that once a thread reaches a critical section it locks it ahead of the other threads (or some other locking mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.
Thanks.
I think you are missing the point of fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well arrive simultaneously. The operating system guarantees that no two threads will be inside the critical section at the same time. Even on multicore machines, this bit is specially implemented (with lots of trickery, even) to give that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have arrived in the same microsecond, BUT if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally the program should not malfunction. But any code can have bugs, including your code and the operating system's semaphore code. So it is safe to assume that in some edge cases the program will indeed malfunction, but that assumption is true of any code in general.
Locking and critical sections are rather tricky to implement correctly, so for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose things like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations that provide somewhat softer guarantees at higher performance. As I said, when doing critical sections, it is critical to choose the correct primitive and also to use it correctly.
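For illustration, here is a minimal sketch of what "use the system-provided primitive" looks like (C++ with std::mutex here; pthread_mutex_lock/pthread_mutex_unlock or a semaphore follow the same pattern). Whichever thread arrives second, even in the same microsecond, simply waits:

```cpp
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;              // the system-provided locking primitive
long shared_counter = 0;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // enter the critical section
        ++shared_counter;                     // only one thread executes this at a time
    }                                         // lock released when `lock` goes out of scope
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << shared_counter << "\n";      // always 200000, regardless of thread timing
}
```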
...But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Short answer: the memory-system hardware makes it impossible for two different processors to access the same memory location at the same time. I'm not a computer architect, so I can't explain exactly how it works, but the memory system serializes all of the accesses to shared main memory by the various CPUs in a multi-CPU system.
"Entering a critical section" means locking a mutex, and a mutex basically is just a flag in shared memory that is accesses by a specific protocol.
It is the task of the cache-coherence protocol to make sure there are no two simultaneous writes to the same chunk of memory (cache line). With MESI there can be multiple readers of the same cache line, but only one writer.
So if two threads want to write to the same cache line at the same time, their requests will be serialized by the cache-coherence protocol.
Most CPU architectures support atomic operations like CAS. On x86 this is done using a LOCK prefix. The CPU will lock the cache line when it starts the CAS instruction and will not respond to cache-coherence requests from other cores until it has finished the atomic operation.
So if you have two CPUs that both want to do a CAS, these operations are serialized by the underlying hardware.
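As a small illustration from the software side (C++ atomics compile down to those locked instructions): if two threads race a compare-and-swap on the same location, the hardware serialization described above guarantees exactly one of them succeeds.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> value{0};
    bool won[2] = {false, false};

    auto racer = [&](int id) {
        int expected = 0;
        // On x86 this compiles to LOCK CMPXCHG: the core owns the cache line
        // exclusively for the duration, so the two attempts are serialized.
        won[id] = value.compare_exchange_strong(expected, id + 1);
    };

    std::thread a(racer, 0), b(racer, 1);
    a.join(); b.join();

    std::cout << "stored " << value.load()
              << ", thread 0 won: " << won[0]
              << ", thread 1 won: " << won[1] << "\n";  // exactly one CAS wins
}
```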
I've been working a lot with concurrency at the practical level, and therefore I've also started to study it theoretically to gain insight into this field of computer science.
However, I've trouble understanding the following:
Why can't a lock for 2 threads be implemented using only 1 shared variable while satisfying mutual exclusion and deadlock freedom?
More generally, why are at least n shared variables needed for an n-thread lock satisfying mutual exclusion and deadlock freedom?
Consider two threads A and B. I see that A must write to this variable in order to signal that it has acquired the lock. The variable could be a boolean. Is it because A needs to read the variable before writing it, and those are two separate operations that are not done atomically?
Most likely, you're reading things that make assumptions about the platform's capabilities that are no longer realistic. You're probably considering the case where a CPU has no prefetching, no posted writes, total read and store ordering, no compiler optimizations that affect memory visibility or memory-operation ordering, and no risk of word tearing, but also does not have an atomic "read-modify-write" operation like increment or compare-exchange. With these assumptions, there's really no way to do it with one variable.
This is an interesting theoretical problem, but has very little practical relevance. Modern CPUs do have all of those optimizations -- they prefetch reads, they post writes to buffers, they re-order reads and stores, and compilers optimize away memory operations. Word tearing is typically not an issue for aligned operations to native integer types. But, more importantly, modern CPUs have sophisticated, high-performance atomic operations such as increment, decrement, compare-exchange, and so on.
When you write synchronization primitives, the exercise is highly platform-specific. The combination of capabilities available to you varies from platform to platform. Even more importantly, their costs vary drastically from platform to platform, so even if many solutions are possible, they may not be equally good.
Lastly, you have to have a deep understanding of what each primitive actually makes the platform do. For example, on modern Intel CPUs, there is hyper-threading. It's important that, for example, a thread waiting for a spinlock doesn't starve another thread sharing the physical core. That requires deep understanding of how hyper-threading actually works. Similarly, it's easy to code a spinlock so that you take the mother of all mispredicted branches when you acquire the lock and blow out the pipelines at the instant where performance is the most critical. You need to understand how branch prediction works and how it interacts with instruction pipelining to avoid this issue.
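Purely to illustrate those platform-specific concerns (and emphatically not as production code), here is a sketch of a test-and-test-and-set spinlock: the inner read-only loop spins in the local cache instead of hammering the bus, and an x86 pause hint (_mm_pause from <immintrin.h>, assuming an x86 target and GCC/Clang/MSVC) tells the core that the thread is spin-waiting so it doesn't starve its hyper-threading sibling.

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause(); x86-specific spin-wait hint

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // Try to take the lock with one atomic exchange.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Lock was held: spin on plain loads (cheap, stays in cache) and
            // issue a pause hint so the sibling hyper-thread gets resources.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```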
The vast majority of programmers should never, ever write synchronization primitives and use them in actual, real world code. Getting them to work with assured reliability is hard, and getting them to perform properly is much, much harder. And to top it off, it's not possible to measure their performance easily. (Of course, it's great to experiment, so long as you don't get an exaggerated sense of the usefulness of your experimental code.)
Concurrent data structures:
[1] http://www.cs.tau.ac.il/~shanir/concurrent-data-structures.pdf
[2] http://en.wikipedia.org/wiki/Concurrent_data_structure
Are concurrent data structures and thread safe data structures interchangeable terms?
My understanding differs from #xxa's (my answer is: no). Though I have no rigorous definition either, "concurrent" implies thread-safe, but nowadays it also implies simultaneous access, while "thread-safe" gives no such assurance. See this quote from the mentioned Wiki article:
Today, as multiprocessor computer architectures that provide parallelism become the dominant computing platform (through the proliferation of multi-core processors), the term has come to stand mainly for data structures that can be accessed by multiple threads which may actually access the data simultaneously because they run on different processors that communicate with one another.
For example, STL containers are claimed to be thread-safe under certain conditions (read-only use); moreover, they allow simultaneous reads by any number of threads (the STL says they are "as safe as int"). But only one thread may modify them, and only in the absence of readers. Can we call them 'concurrent'? No. Practical concurrent containers (see TBB, for example), by contrast, allow two or more threads to work with the container (including modifying it) at the same time.
And one more point. You could implement std::queue so that push() and pop() do not fail when used by different threads. But does this make it a concurrent_queue? No, because queue::front() and queue::pop() do not provide a way for two or more threads to get elements simultaneously without external synchronization. To become a concurrent_queue it needs a different interface, one that takes care of atomicity by combining pop() with returning the value, as in the sketch below.
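Here is a sketch of that interface difference. This deliberately simple version just wraps std::queue in a mutex, so it is merely thread-safe rather than a real concurrent container like tbb::concurrent_queue, but it shows the essential point: removal and retrieval are one atomic call, never a separable front()/pop() pair.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Minimal "combined pop" interface (names are mine, loosely modeled on
// try_pop-style APIs); not lock-free, just illustrative.
template <typename T>
class concurrent_queue {
    std::queue<T> q;
    std::mutex m;
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(value));
    }
    // Atomically remove an element AND return it. With std::queue the caller
    // would call front() and then pop(), and two threads could interleave those.
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty())
            return std::nullopt;
        T value = std::move(q.front());
        q.pop();
        return value;
    }
};
```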
These terms do not have a rigorous definition, but in general it can be asserted that the answer to your question is yes.
Both refer to data structures that are stored in shared memory (http://en.wikipedia.org/wiki/Shared_memory) and manipulated by several threads or processes.
In fact [1] asserts the same when it refers to threads explicitly:
"The primary source of this additional difficulty is concurrency: Because threads are executed concurrently on different processors, ..."
I'm reading up on concurrency. I've got a bit over my head with terms that have confusingly similar definitions. Namely:
Processes
Threads
"Green threads"
Protothreads
Fibers
Coroutines
"Goroutines" in the Go language
My impression is that the distinctions rest on (1) whether truly parallel or multiplexed; (2) whether managed at the CPU, at the OS, or in the program; and (3..5) a few other things I can't identify.
Is there a succinct and unambiguous guide to the differences between these approaches to parallelism?
OK, I'm going to do my best. There are caveats everywhere, but I'll give my understanding of these terms, along with references to something that approximates the definition I've given.
Process: OS-managed; possibly truly concurrent, at least in the presence of suitable hardware support. Processes exist within their own address spaces.
Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
Coroutines: Exactly fibers, except not OS-managed.
Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.
It's also worth noting that there are other understandings in concurrency theory of the term "process", in the process calculus sense. This definition is orthogonal to those above, but I just thought it worth mentioning so that no confusion arises should you see process used in that sense somewhere.
Also, be aware of the difference between parallel and concurrent. It's possible you were using the former in your question where I think you meant the latter.
I mostly agree with Gian's answer, but I have different interpretations of a few concurrency primitives. Note that these terms are often used inconsistently by different authors. These are my favorite definitions (hopefully not too far from the modern consensus).
Process:
OS-managed
Each has its own virtual address space
Can be interrupted (preempted) by the system to allow another process to run
Can run in parallel with other processes on different processors
The memory overhead of processes is high (includes virtual memory tables, open file handles, etc)
The time overhead for creating and context switching between processes is relatively high
Threads:
OS-managed
Each is "contained" within some particular process
All threads in the same process share the same virtual address space
Can be interrupted by the system to allow another thread to run
Can run in parallel with other threads on different processors
The memory and time overheads associated with threads are smaller than processes, but still non-trivial
(For example, typically context switching involves entering the kernel and invoking the system scheduler.)
Cooperative Threads:
May or may not be OS-managed
Each is "contained" within some particular process
In some implementations, each is "contained" within some particular OS thread
Cannot be interrupted by the system to allow a cooperative peer to run
(The containing process/thread can still be interrupted, of course)
Must invoke a special yield primitive to allow peer cooperative threads to run
Generally cannot be run in parallel with cooperative peers
(Though some people think it's possible: http://ocm.dreamhosters.com/.)
There are lots of variations on the cooperative thread theme that go by different names:
Fibers
Green threads
Protothreads
User-level threads (user-level threads can be interruptable/preemptive, but that's a relatively unusual combination)
Some implementations of cooperative threads use techniques like split/segmented stacks or even individually heap-allocating every call frame to reduce the memory overhead associated with pre-allocating a large chunk of memory for the stack
Depending on the implementation, calling a blocking syscall (like reading from the network or sleeping) will either cause a whole group of cooperative threads to block or implicitly cause the calling thread to yield
Coroutines:
Some people use "coroutine" and "cooperative thread" more or less synonymously
I do not prefer this usage
Some coroutine implementations are actually "shallow" cooperative threads; yield can only be invoked by the "coroutine entry procedure"
The shallow (or semi-coroutine) version is easier to implement than threads, because each coroutine does not need a complete stack (just one frame for the entry procedure)
Often coroutine frameworks have yield primitives that require the invoker to explicitly state which coroutine control should transfer to
Generators:
Restricted (shallow) coroutines
yield can only return control back to whichever code invoked the generator, as in the sketch below
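A tiny example of that restriction (C++ here for consistency; it assumes a C++23 compiler with std::generator, and Python generators would look analogous): co_yield can hand a value back only to the loop that is iterating the generator.

```cpp
#include <generator>   // C++23
#include <iostream>

// A generator: yield returns control only to whoever invoked it.
std::generator<int> counter(int limit) {
    for (int i = 0; i < limit; ++i)
        co_yield i;    // suspend here; resume on the caller's next iteration
}

int main() {
    for (int v : counter(3))
        std::cout << v << '\n';   // prints 0, 1, 2
}
```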
Goroutines:
An odd hybrid of cooperative and OS threads
Cannot be interrupted (like cooperative threads)
Can run in parallel on a language runtime-managed pool of OS threads
Event handlers:
Procedures/methods that are invoked by an event dispatcher in response to some action happening
Very popular for user interface programming
Require little to no language/system support; can be implemented in a library
At most one event handler can be running at a time; the dispatcher must wait for a handler to finish (return) before starting the next
Makes synchronization relatively simple; different handler executions never overlap in time
Implementing complex tasks with event handlers tends to lead to "inverted control flow"/"stack ripping"
Tasks:
Units of work that are doled out by a manager to a pool of workers
The workers can be threads, processes or machines
Of course the kind of worker a task library uses has a significant impact on how one implements the tasks
In this list of inconsistently and confusingly used terminology, "task" takes the crown. Particularly in the embedded systems community, "task" is sometimes used to mean "process", "thread" or "event handler" (usually called an "interrupt service routine"). It is also sometimes used generically/informally to refer to any kind of unit of computation.
One pet peeve that I can't stop myself from airing: I dislike the use of the phrase "true concurrency" for "processor parallelism". It's quite common, but I think it leads to much confusion.
For most applications, I think task-based frameworks are best for parallelization. Most of the popular ones (Intel's TBB, Apple's GCD, Microsoft's TPL & PPL) use threads as workers. I wish there were some good alternatives that used processes, but I'm not aware of any.
If you're interested in concurrency (as opposed to processor parallelism), event handlers are the safest way to go. Cooperative threads are an interesting alternative, but a bit of a wild west. Please do not use threads for concurrency if you care about the reliability and robustness of your software.
Protothreads are just a switch/case implementation that acts like a state machine but makes implementation of the software a whole lot simpler. It is based around the idea of saving an int value before a case label, returning, and then getting back to the point after the case by reading back that variable and using the switch to figure out where to continue. So protothreads are a sequential implementation of a state machine.
Protothreads are great when implementing sequential state machines. Protothreads are not really threads at all, but rather a syntax abstraction that makes it much easier to write a switch/case state machine that has to switch states sequentially (from one to the next, etc.).
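To show the trick concretely, here is a minimal sketch of that saved-integer/switch mechanism (simplified macros of my own; the real Protothreads library uses the same __LINE__ idea but with more machinery):

```cpp
// The "thread" is an ordinary function. The saved integer is the line number
// where it last yielded, and the switch jumps straight back to that point.
struct pt { int resume_line = 0; };

#define PT_BEGIN(p)  switch ((p)->resume_line) { case 0:
#define PT_YIELD(p)  do { (p)->resume_line = __LINE__; return false; \
                          case __LINE__:; } while (0)
#define PT_END(p)    } (p)->resume_line = 0; return true

// Runs one step per call, resuming where it left off.
bool blink(pt* p, int& led_state) {
    PT_BEGIN(p);
    led_state = 1;
    PT_YIELD(p);      // return to the caller; the next call resumes below
    led_state = 0;
    PT_YIELD(p);
    led_state = 1;
    PT_END(p);        // finished; the state resets so it can run again
}
```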
I have used protothreads to implement asynchronous io: http://martinschroder.se/asynchronous-io-using-protothreads/
Back in my days as a BeOS programmer, I read this article by Benoit Schillings, describing how to create a "benaphore": a method of using an atomic variable to enforce a critical section that avoids the need to acquire/release a mutex in the common (no-contention) case.
I thought that was rather clever, and it seems like you could do the same trick on any platform that supports atomic-increment/decrement.
On the other hand, this looks like something that could just as easily be included in the standard mutex implementation itself... in which case implementing this logic in my program would be redundant and wouldn't provide any benefit.
Does anyone know if modern locking APIs (e.g. pthread_mutex_lock()/pthread_mutex_unlock()) use this trick internally? And if not, why not?
What your article describes is in common use today. Most often it's called a "Critical Section", and it consists of an interlocked variable, a bunch of flags, and an internal synchronization object (a Mutex, if I remember correctly). Generally, in scenarios with little contention, the Critical Section executes entirely in user mode, without involving the kernel synchronization object. This guarantees fast execution. When contention is high, the kernel object is used for waiting, which releases the time slice and is conducive to faster turnaround.
Generally, there is very little sense in implementing synchronization primitives in this day and age. Operating systems come with a big variety of such objects, and they are optimized and tested in a significantly wider range of scenarios than a single programmer can imagine. It literally takes years to invent, implement and test a good synchronization mechanism. That's not to say that there is no value in trying :)
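For reference, here is a sketch of the benaphore idea itself in modern C++ (assuming C++20 for std::counting_semaphore; the original BeOS version used atomic_add and a Be semaphore). The atomic counter keeps the kernel semaphore out of the picture unless there is actual contention:

```cpp
#include <atomic>
#include <semaphore>

class Benaphore {
    std::atomic<int> count{0};
    std::counting_semaphore<> sem{0};    // kernel-backed; starts with no permits
public:
    void lock() {
        // Fast path: the first arriving thread sees 0 and never touches the semaphore.
        if (count.fetch_add(1, std::memory_order_acquire) > 0)
            sem.acquire();               // contended: block in the kernel
    }
    void unlock() {
        // If another thread incremented the count while we held the lock, wake it.
        if (count.fetch_sub(1, std::memory_order_release) > 1)
            sem.release();
    }
};
```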
Java's AbstractQueuedSynchronizer (and its sibling AbstractQueuedLongSynchronizer) works similarly, or at least it could be implemented similarly. These types form the basis for several concurrency primitives in the Java library, such as ReentrantLock and FutureTask.
It works by way of using an atomic integer to represent state. A lock may define the value 0 as unlocked, and 1 as locked. Any thread wishing to acquire the lock attempts to change the lock state from 0 to 1 via an atomic compare-and-set operation; if the attempt fails, the current state is not 0, which means that the lock is owned by some other thread.
AbstractQueuedSynchronizer also facilitates waiting on locks and notification of conditions by maintaining CLH queues, which are lock-free linked lists representing the line of threads waiting either to acquire the lock or to receive notification via a condition. Such notification moves one or all of the threads waiting on the condition to the head of the queue of those waiting to acquire the related lock.
Most of this machinery can be implemented in terms of an atomic integer representing the state as well as a couple of atomic pointers for each waiting queue. The actual scheduling of which threads will contend to inspect and change the state variable (via, say, AbstractQueuedSynchronizer#tryAcquire(int)) is outside the scope of such a library and falls to the host system's scheduler.
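Not the actual AbstractQueuedSynchronizer code (which is Java and considerably more elaborate, with its CLH queue of waiters), but a rough C++ analogue of the state-integer idea, using C++20 atomic wait/notify for the parking so that, as noted above, the choice of which blocked thread runs next is left to the host system's scheduler:

```cpp
#include <atomic>

class SimpleLock {
    std::atomic<int> state{0};           // 0 = unlocked, 1 = locked
public:
    bool try_acquire() {
        int expected = 0;
        // Analogue of tryAcquire: CAS the state from 0 to 1.
        return state.compare_exchange_strong(expected, 1,
                                             std::memory_order_acquire);
    }
    void acquire() {
        while (!try_acquire())
            state.wait(1);               // block until a release notifies (C++20)
    }
    void release() {
        state.store(0, std::memory_order_release);
        state.notify_one();              // wake one waiter; the OS decides which thread runs
    }
};
```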