How can coroutines be faster than threads? - multithreading

I am trying to find a situation where changing from multithreading to coroutines will speed up the affected code section. While I have found that coroutines use less CPU and heap space than threads, I still can't find a case where coroutines are faster. Although I know that coroutine creation and context switching are much cheaper than the corresponding operations with threads, the speed difference I measured was imperceptible (and if thread creation is excluded from the measurement, both cases are essentially the same).
So, is it even possible to find a case where coroutines speed up execution more than threads do?

One thing to note is that coroutines are vastly superior when you have lots and lots of them. You can create and execute thousands of coroutines without a second thought; if you attempted to do that with threads, all the overhead associated with threads might quickly kill the host. So this enables you to think about massive parallelization without having to manage worker threads and runnables. Coroutines also make it easy to implement asynchronous computation patterns that would be very unwieldy to build with bare threads, such as channels and actors.
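For a rough sense of scale, here is a minimal Kotlin sketch using kotlinx.coroutines (the 100,000 figure is illustrative, not a benchmark); attempting the same with 100,000 platform threads would typically exhaust memory or grind the machine down:

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    // 100,000 lightweight coroutines; the same count of platform threads would
    // typically exhaust memory or bring the machine to a crawl.
    val jobs = List(100_000) {
        launch {
            delay(1_000L)   // suspends without blocking an OS thread
            print(".")
        }
    }
    jobs.forEach { it.join() }
}
```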
Out of scope for your question, but still noteworthy, is the generality of the concept: the use cases for coroutines are not limited to asynchronous computation. The core of coroutines is suspendable functions, which, for example, also enables generators like the ones you have in Python, something you would not immediately connect to asynchronous programming.
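As a small illustration, Kotlin's standard `sequence { }` builder uses the same suspension machinery to give you a Python-style generator, with no threads or asynchrony involved:

```kotlin
// A lazy Fibonacci generator built on suspendable functions; no threads,
// no asynchrony, just suspension inside the sequence builder.
val fibonacci = sequence {
    var a = 0
    var b = 1
    while (true) {
        yield(a)            // suspend until the consumer asks for the next value
        val next = a + b
        a = b
        b = next
    }
}

fun main() {
    println(fibonacci.take(10).toList())  // [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
}
```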

Related

Goroutines vs asyncio tasks + thread pool for CPU-bound calls

Are goroutines roughly equivalent to Python's asyncio tasks, with the additional feature that any CPU-bound task is routed to a ThreadPoolExecutor instead of being added to the event loop (assuming, of course, that we use a Python interpreter without the GIL)?
Is there any substantial difference between the two approaches that I'm missing? Of course, apart from the efficiencies and code clarity that result from the concurrency being an integral part of Go.
I think I know part of the answer. I tried to summarize my understanding of the differences, in order of importance, between asyncio tasks and goroutines:
1) Unlike under asyncio, you rarely need to worry that a goroutine will block for too long. On the other hand, memory sharing across goroutines is akin to memory sharing across threads rather than across asyncio tasks, since goroutine execution-order guarantees are much weaker (even if the hardware has only a single core).
asyncio will only switch context on explicit await, yield and certain event loop methods, while Go runtime may switch on far more subtle triggers (such as certain function calls). So asyncio is perfectly cooperative, while goroutines are only mostly cooperative (and the roadmap suggests they will become even less cooperative over time).
A really tight loop (such as numeric computation) could still block the Go runtime (well, the thread it's running on). If that happens, it's going to have less of an impact than in Python - unless it occurs in multiple threads.
2) Goroutines have off-the-shelf support for parallel computation, which would require a more sophisticated approach under asyncio.
The Go runtime can run threads in parallel (if multiple cores are available), so it's somewhat similar to running multiple asyncio event loops in a thread pool under a GIL-less Python runtime, with a language-aware load balancer in front.
3) Go runtime will automatically handle blocking syscalls in a separate thread; this needs to be done explicitly under asyncio (e.g., using run_in_executor).
That said, in terms of memory cost, goroutines are very much like asyncio tasks rather than threads.
I suppose you could think of it working that way underneath, sure. It's not really accurate, but, close enough.
But there is a big difference: in Go you can write straight line code, and all the I/O blocking is handled for you automatically. You can call Read, then Write, then Read, in simple straight line code. With Python asyncio, as I understand it, you need to queue up a function to handle the reads, rather than just calling Read.
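For what it's worth, Kotlin coroutines give a similar straight-line style: the sketch below uses hypothetical `fetch`/`save` functions (with `delay` standing in for real I/O) purely to show the shape of the code; it reads like blocking calls, but each suspending call releases the thread while waiting:

```kotlin
import kotlinx.coroutines.*

// Hypothetical suspending I/O functions; real code would call a coroutine-aware
// HTTP or database client here. delay() stands in for the actual waiting.
suspend fun fetch(url: String): String { delay(100); return "payload from $url" }
suspend fun save(data: String) { delay(50) }

// Straight-line code: each call looks blocking, but suspends instead of
// holding an OS thread while it waits.
suspend fun copyOnce() {
    val data = fetch("https://example.com/a")
    save(data)
    val more = fetch("https://example.com/b")
    save(more)
}

fun main() = runBlocking { copyOnce() }
```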

Lightweight Threads in Operating Systems

It is said that one of the main benefits of Node (and presumably Twisted et al.) over more conventional threaded servers is the very high concurrency enabled by the event-loop model. The biggest reason for this is that each thread has a high memory footprint and context switching is comparatively expensive. When you have thousands of threads, the server spends most of its time swapping from thread to thread.
My question is, why don't operating systems or the underlying hardware support much more lightweight threads? If they did, could you solve the C10K problem with plain threads? If they can't, why not?
Modern operating systems can support the execution of a very large number of threads.
More generally, hardware keeps getting faster (and recently it has been getting faster in a way that is much friendlier to multithreading and multiprocessing than to single-threaded event loops - i.e., more cores rather than more throughput per core). If you can't afford the overhead of a thread today, you can probably afford it tomorrow.
What the cooperative multitasking systems of Twisted (and presumably Node.js et al.) offer over pre-emptive multithreading (at least in the form of pthreads) is ease of programming.
Correctly using multithreading involves being much more careful than correctly using a single thread. An event loop is just the means of getting multiple things done without going beyond your single thread.
Considering the proliferation of parallel hardware, it would be ideal for multithreading or multiprocessing to become easier to do (and easier to do correctly). Actors, message passing, and maybe even Petri nets are some of the solutions people have proposed for this problem. They are still very marginal compared to the mainstream multithreading approach (pthreads). Another approach is SEDA, which uses multiple threads to run multiple event loops. This also hasn't caught on.
So, the people using event loops have probably decided that programmer time is worth more than CPU time, and the people using pthreads have probably decided the opposite, and the people exploring actors and such would like to value both kinds of time more highly (clearly insane, which is probably why no one listens to them).
The issue isn't really how heavyweight the threads are. To write correct multithreaded code you need locks on shared items, and that prevents the code from scaling with the number of threads: threads end up waiting for each other to acquire locks, and you rapidly reach the point where adding more threads has no effect or even slows the system down as lock contention grows.
In many cases you can avoid locking, but it's very difficult to get right, and sometimes you simply need a lock.
So if you are limited to a small number of threads, you might well find that removing the overhead of having to lock resources at all, or even think about locking, makes a single-threaded program faster than a multithreaded program, no matter how many threads you add.
Basically locks can (depending on your program) be really expensive and can stop your program scaling beyond a few threads. And you almost always need to lock something.
It's not the overhead of a thread that's the problem, it's the synchronization between the threads. Even if you could switch between threads instantly and had infinite memory, none of that helps if each thread just ends up waiting in a queue for its turn at some shared resource.
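A minimal Kotlin sketch of that effect, assuming a shared counter guarded by a single lock (illustrative only, not a benchmark): the total comes out right, but the threads spend their time queueing for the lock rather than working in parallel.

```kotlin
import kotlin.concurrent.thread

// All increments funnel through one lock, so the work is effectively serial
// no matter how many threads we start.
object Counter {
    private val lock = Any()
    var value = 0L
        private set

    fun increment() {
        synchronized(lock) { value++ }
    }
}

fun main() {
    val threads = List(8) {
        thread {
            repeat(1_000_000) { Counter.increment() }
        }
    }
    threads.forEach { it.join() }
    println(Counter.value)   // correct total, but throughput barely beats one thread
}
```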

Thread pooling and multi core systems

Do you think the thread-pooling design pattern is the way to go for the multi-core future?
A thread-pooling library, for instance, if used extensively, makes/forces the application writer
(1) to break the problem into separate parallel jobs, hence promoting (enforcing :) ) parallelism, and
(2) by abstracting away all the low-level OS calls, synchronization, etc., makes the programmer's life easier (especially for C programmers :) ).
I strongly believe it's the best way (or one of the "best" ways :) ) for the multi-core future...
So, my question is, am I right in thinking so, or am I under some delusion? :)
Regards,
Microkernel
Thread pooling is a technique that involves a queue and a number of threads taking jobs from the queue and processing them. This is in contrast to the technique of simply starting a new thread whenever a new task arrives.
The benefits are that the maximum number of threads is limited, to avoid too much threading, and that there is less overhead for each new task (a thread is already running and picks up the task; no thread start is needed).
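On the JVM this pattern is available off the shelf; a minimal Kotlin sketch using a fixed-size ExecutorService as the pool:

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

fun main() {
    // Four long-lived worker threads pull tasks from the executor's internal
    // queue; submitting work never starts a new thread once the pool is warm.
    val pool = Executors.newFixedThreadPool(4)
    repeat(20) { i ->
        pool.execute {
            println("task $i on ${Thread.currentThread().name}")
        }
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
}
```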
Whether this is a good design highly depends on your problem. If you have many short jobs that come to your program at a very fast rate, then this is a good idea because the lower overhead is really a benefit. If you have extremely large numbers of concurrent tasks this is a good idea to keep your scheduler from having to do too much work.
There are many areas where thread pooling is just not helpful, so you cannot generalize. Sometimes multithreading is not possible at all, or not even desired, as multithreading adds an unpredictable element (race conditions) to your code that is extremely hard to debug.
A thread-pooling library can hardly "force" you to use it. You still need to think things through, and if you just start one thread... it won't help.
As with almost every topic in computing, the answer is: it depends.
The pooling approach works well for embarrassingly parallel problems: http://en.wikipedia.org/wiki/Embarrassingly_parallel
For other tasks, where you need more synchronization between threads, it's not that good.
For the Windows NT engine thread pools are usually much less efficient than I/O Completion Ports. These are covered extensively in numerous questions and answers here. IOCPs enable event-driven processing in that multiple threads can wait on the IOCP until an event occurs due to an IOC (read or write) on a socket or handle which is then queued to the IOCP. The IOCP in turn pairs a waiting thread with the id of the event and releases the thread for processing. After the thread has processed the event and initiated a new I/O it returns to the IOCP to wait for the next event (which may or may not be the completion of the I/O it just initiated).
IOCs may also be artificially signalled by explicit posting from a non-event.
Using IOCPs is not polling. The optimal IOCP implementation will have as many threads waiting on the IOCP as there are cores in the system. The threads may all execute the same physical code if that is deemed efficient. Since a thread processes from an IOC up until it issues an I/O it does nothing which forces it to wait for other resources except perhaps to compete for access to thread-safe areas. It is a natural choice to move away from the "one handle per thread" paradigm. IOCP-controlled threads are therefore as efficient as the programmer is able to construct them.
I like the answer by #yaankee a lot, except I would argue that a thread pool is almost always the right way to go. The reason: a thread pool can degenerate into a simple static work-partitioning model for problems like matrix-matrix multiply. OpenMP's guided scheduling is kind of along those lines.
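For example, a pool can be used as a static partitioner by handing each worker one fixed, contiguous slice of the data up front; a rough Kotlin sketch of the idea (not an OpenMP equivalent):

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

fun main() {
    val data = LongArray(10_000_000) { it.toLong() }
    val workers = Runtime.getRuntime().availableProcessors()
    val pool = Executors.newFixedThreadPool(workers)

    // One task per worker, each summing a fixed, contiguous slice of the array:
    // the pool degenerates into static work partitioning.
    val chunk = (data.size + workers - 1) / workers
    val futures = (0 until workers).map { w ->
        pool.submit(Callable {
            val from = w * chunk
            val to = minOf(from + chunk, data.size)
            var sum = 0L
            for (i in from until to) sum += data[i]
            sum
        })
    }
    println(futures.sumOf { it.get() })
    pool.shutdown()
}
```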

Processes, threads, green threads, protothreads, fibers, coroutines: what's the difference?

I'm reading up on concurrency. I've gotten a bit in over my head with terms that have confusingly similar definitions. Namely:
Processes
Threads
"Green threads"
Protothreads
Fibers
Coroutines
"Goroutines" in the Go language
My impression is that the distinctions rest on (1) whether truly parallel or multiplexed; (2) whether managed at the CPU, at the OS, or in the program; and (3..5) a few other things I can't identify.
Is there a succinct and unambiguous guide to the differences between these approaches to parallelism?
OK, I'm going to do my best. There are caveats everywhere, but I'm going to do my best to give my understanding of these terms and references to something that approximates the definition I've given.
Process: OS-managed (possibly) truly concurrent, at least in the presence of suitable hardware support. Exist within their own address space.
Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
Coroutines: Exactly fibers, except not OS-managed.
Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.
It's also worth noting that there are other understandings in concurrency theory of the term "process", in the process calculus sense. This definition is orthogonal to those above, but I just thought it worth mentioning so that no confusion arises should you see process used in that sense somewhere.
Also, be aware of the difference between parallel and concurrent. It's possible you were using the former in your question where I think you meant the latter.
I mostly agree with Gian's answer, but I have different interpretations of a few concurrency primitives. Note that these terms are often used inconsistently by different authors. These are my favorite definitions (hopefully not too far from the modern consensus).
Process:
OS-managed
Each has its own virtual address space
Can be interrupted (preempted) by the system to allow another process to run
Can run in parallel with other processes on different processors
The memory overhead of processes is high (includes virtual memory tables, open file handles, etc)
The time overhead for creating and context switching between processes is relatively high
Threads:
OS-managed
Each is "contained" within some particular process
All threads in the same process share the same virtual address space
Can be interrupted by the system to allow another thread to run
Can run in parallel with other threads on different processors
The memory and time overheads associated with threads are smaller than processes, but still non-trivial
(For example, typically context switching involves entering the kernel and invoking the system scheduler.)
Cooperative Threads:
May or may not be OS-managed
Each is "contained" within some particular process
In some implementations, each is "contained" within some particular OS thread
Cannot be interrupted by the system to allow a cooperative peer to run
(The containing process/thread can still be interrupted, of course)
Must invoke a special yield primitive to allow peer cooperative threads to run
Generally cannot be run in parallel with cooperative peers
(Though some people think it's possible: http://ocm.dreamhosters.com/.)
There are lots of variations on the cooperative thread theme that go by different names:
Fibers
Green threads
Protothreads
User-level threads (user-level threads can be interruptible/preemptive, but that's a relatively unusual combination)
Some implementations of cooperative threads use techniques like split/segmented stacks or even individually heap-allocating every call frame to reduce the memory overhead associated with pre-allocating a large chunk of memory for the stack
Depending on the implementation, calling a blocking syscall (like reading from the network or sleeping) will either cause a whole group of cooperative threads to block or implicitly cause the calling thread to yield
Coroutines:
Some people use "coroutine" and "cooperative thread" more or less synonymously
I do not prefer this usage
Some coroutine implementations are actually "shallow" cooperative threads; yield can only be invoked by the "coroutine entry procedure"
The shallow (or semi-coroutine) version is easier to implement than threads, because each coroutine does not need a complete stack (just one frame for the entry procedure)
Often coroutine frameworks have yield primitives that require the invoker to explicitly state which coroutine control should transfer to
Generators:
Restricted (shallow) coroutines
yield can only return control back to whichever code invoked the generator
Goroutines:
An odd hybrid of cooperative and OS threads
Cannot be interrupted (like cooperative threads)
Can run in parallel on a language runtime-managed pool of OS threads
Event handlers:
Procedures/methods that are invoked by an event dispatcher in response to some action happening
Very popular for user interface programming
Require little to no language/system support; can be implemented in a library
At most one event handler can be running at a time; the dispatcher must wait for a handler to finish (return) before starting the next
Makes synchronization relatively simple; different handler executions never overlap in time
Implementing complex tasks with event handlers tends to lead to "inverted control flow"/"stack ripping"
Tasks:
Units of work that are doled out by a manager to a pool of workers
The workers can be threads, processes or machines
Of course the kind of worker a task library uses has a significant impact on how one implements the tasks
In this list of inconsistently and confusingly used terminology, "task" takes the crown. Particularly in the embedded systems community, "task" is sometimes used to mean "process", "thread" or "event handler" (usually called an "interrupt service routine"). It is also sometimes used generically/informally to refer to any kind of unit of computation.
One pet peeve that I can't stop myself from airing: I dislike the use of the phrase "true concurrency" for "processor parallelism". It's quite common, but I think it leads to much confusion.
For most applications, I think task-based frameworks are best for parallelization. Most of the popular ones (Intel's TBB, Apple's GCD, Microsoft's TPL & PPL) use threads as workers. I wish there were some good alternatives that used processes, but I'm not aware of any.
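By way of illustration, Kotlin's `async` over `Dispatchers.Default` offers a similar task-style API backed by a pool of worker threads (a small sketch, not one of the frameworks named above):

```kotlin
import kotlinx.coroutines.*

// Stand-in for a CPU-bound unit of work.
fun expensive(n: Int): Long = (1..n).fold(0L) { acc, i -> acc + i }

fun main() = runBlocking {
    // Each async block is a task; Dispatchers.Default schedules tasks onto a
    // shared pool of worker threads sized to the number of CPU cores.
    val results = (1..8).map { n ->
        async(Dispatchers.Default) { expensive(n * 100_000) }
    }.awaitAll()
    println(results)
}
```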
If you're interested in concurrency (as opposed to processor parallelism), event handlers are the safest way to go. Cooperative threads are an interesting alternative, but a bit of a wild west. Please do not use threads for concurrency if you care about the reliability and robustness of your software.
Protothreads are just a switch/case implementation that acts like a state machine but makes implementing the software a whole lot simpler. It is based around the idea of saving an int value before a case label, returning, and then getting back to the point after the case by reading that variable back and using the switch to figure out where to continue. So protothreads are a sequential implementation of a state machine.
Protothreads are great when implementing sequential state machines. Protothreads are not really threads at all, but rather a syntax abstraction that makes it much easier to write a switch/case state machine that has to switch states sequentially (from one to the next, etc.).
I have used protothreads to implement asynchronous io: http://martinschroder.se/asynchronous-io-using-protothreads/
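The core trick can be sketched in any language: save an integer state, return, and on the next call jump back to that point. Here is a rough Kotlin approximation using `when` in place of C's `switch` (real protothreads do this with macros wrapped around `switch`/`case`):

```kotlin
// A hand-rolled "protothread": each call to step() resumes where the previous
// call left off, using nothing more than a saved integer and a when-expression.
class Blinker {
    private var state = 0

    /** Returns false once the sequence has finished. */
    fun step(): Boolean = when (state) {
        0 -> { println("LED on");  state = 1; true }
        1 -> { println("LED off"); state = 2; true }
        else -> { println("done"); false }
    }
}

fun main() {
    val pt = Blinker()
    while (pt.step()) { /* keep stepping until the machine finishes */ }
}
```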

When Should I Use Threads?

As far as I'm concerned, the ideal amount of threads is 3: one for the UI, one for CPU resources, and one for IO resources.
But I'm probably wrong.
I'm just getting introduced to them, but I've always used one for the UI and one for everything else.
When should I use threads and how? How do I know if I should be using them?
Unfortunately, there are no hard and fast rules for using threads. If you have too many threads, the processor will spend all its time creating and switching between them; use too few and you will not get the throughput you want in your application. Additionally, using threads is not easy. A language like C# makes it easier on you because you have tools like ThreadPool.QueueUserWorkItem. This allows the system to manage thread creation and destruction, which helps mitigate the overhead of creating a new thread just to hand the work off. You have to remember that creating a thread is not an operation you get for "free"; there are costs associated with starting a thread, and they should always be taken into consideration.
The language you use to write your application will dictate how much you need to worry about managing threads yourself.
The times I find most often that I need to consider creating threads explicitly are:
Asynchronous operations
Operations that can be parallelized
Continually running background operations
The answer totally depends on what you're planning to do. However, one thread for CPU resources is a bad move - a retail CPU may have up to six cores plus hyperthreading, and most CPUs have two or more. In that case you should have as many threads as CPU cores, plus a few more to cover scheduling mishaps. The CPU is not a single-threaded beast; it may have many cores and need many threads for 100% utilization.
You should use threads if and only if your target demographic will virtually all have multi-core (as is the case in current desktop/laptop markets), and you have determined that one core is not enough performance.
Herb Sutter wrote an article for Dr. Dobb's Journal in which he talks about the three pillars of concurrency. This article does a very good job of breaking down which problems are good candidates for being solved via threading constructs.
From the SQLite FAQ: "Threads are evil. Avoid Them." Only use them when you absolutely have to.
If you have to, then take steps to avoid the usual carnage. Use thread pools to execute fine-grained tasks with no interdependencies, using GUI-framework-provided facilities to dispatch outcomes back to the UI. Avoid sharing data between long-running threads; use message queues to pass information between them (and to synchronise).
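A minimal Kotlin sketch of the message-queue approach, with one long-running worker that owns its own state and receives work items over a standard `BlockingQueue` (the `Work`/`Stop` message types are made up for the example):

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

sealed interface Message
data class Work(val payload: String) : Message
object Stop : Message

fun main() {
    val inbox = LinkedBlockingQueue<Message>()

    // The worker owns its state; other threads only ever touch the queue.
    val worker = thread {
        while (true) {
            when (val msg = inbox.take()) {
                is Work -> println("processing ${msg.payload}")
                Stop -> return@thread
            }
        }
    }

    listOf("a", "b", "c").forEach { inbox.put(Work(it)) }
    inbox.put(Stop)
    worker.join()
}
```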
A more exotic solution is to use a language such as Erlang that is explicitly designed for fine-grained parallelism without sacrificing safety and comprehensibility. Concurrency itself is of fundamental importance to the future of computation; threads are simply a horrible, broken way to express it.
The "ideal number of threads" depends on your particular problem and how much parallelism you can exploit. If you have a problem that is "embarassingly parallel" in that it can be subdivided into independent problems with little to no communication between them required, and you have enough cores that you can actually get true parallelism, then how many threads you use depends on things like the problem size, the cache line size, the context switching and spawning overhead, and various other things that is really hard to compute before hand. For such situations, you really have to do some profiling in order to choose an optimal sharding/partitioning of your problem across threads. It typically doesn't make sense, though, to use more threads than you do cores. It is also true that if you have lots of synchronization, then you may, in fact, have a performance penalty for using threads. It's highly dependent on the particular problem as well as how interdependent the various steps are. As a guiding principle, you need to be aware that spawning threads and thread synchronization are expensive operations, but performing computations in parallel can increase throughput if communication and other forms of synchronization is minimal. You should also be aware that threading can lead to very poor cache performance if your threads end up invalidating a mutually shared cache line.
