C++11 threading vs. OpenMP for simple parallel loops. Which, when? - multithreading

This is something of a follow-up to this other question of mine.
I would like to know whether parallelized loops with a reduction operation, such as a parallelized integration, belong to the domain of applicability of C++11 threading, or whether OpenMP is better suited for tasks like this.
Now, consider the same setting but with threads executing computations that may throw exceptions. Does that change the scenario? Would C++11 threading now be better suited?
Thank you.

IMO, I would prefer OpenMP for any HPC / scientific and engineering computing codes. It more directly targets data parallelism. C++11 threading represents more task parallelism, which is preferable for other kinds of software (e.g., network server applications).
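As a rough illustration of the data-parallel style meant here, this is a minimal sketch of the parallelized integration from the question using an OpenMP reduction. The function name `integrate_pi` and the integrand are assumptions chosen for the example, not something from the question. It needs an OpenMP-enabled build (e.g. `-fopenmp` with GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
// Midpoint-rule integral of f(x) = 4 / (1 + x^2) over [0, 1], which equals pi.
// The pragma asks OpenMP to split the loop across threads and to combine the
// per-thread partial sums into `sum` when the loop ends.
double integrate_pi(long steps) {
    const double h = 1.0 / static_cast<double>(steps);
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```

The serial loop and the parallel loop are the same code; only the directive differs, which is a large part of the appeal for this kind of loop.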
The situation might change in the future; there are some efforts to integrate more parallelism into C++, such as parallel STL algorithms. However, we do not yet know what this parallelism will look like.
You also rarely build codes from scratch. There are many performance-aware multi-threaded libraries that support OpenMP (sorting, linear algebra, ...), but few that support C++11 threads.

As best as I can determine, OpenMP offers greater performance potential, simply because there are a lot more tricks a compiler can use (particularly if your CPU supports vectorized computations) when it is directly instructed to parallelize a construct. Host/dispatch threading models (like the threading models in Java and C++11) can't really do that without remarkably intelligent code analysis tools.
However, OpenMP does impose a tax on both code readability and design flexibility. Parallel execution of heterogeneous tasks is possible in OpenMP, but much more verbose to implement and much more difficult to parse. And because it depends on preprocessor directives (pragmas, which C++ purists don't like anyway), it's virtually impossible to set dynamic state about the threading model itself.
Personally, having worked on enterprise-level code, I think I prefer host/dispatch threading (i.e., C++11 threads). It may represent a performance sacrifice, but as the saying goes: "Processor cycles are much cheaper than developer cycles". And if you really, really are in a performance-constrained environment, it either means you have an algorithm problem, which switching to OpenMP probably wouldn't fix, or it means you should probably be looking into compute cards or OpenCL/CUDA programming.
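On the exceptions part of the question, one point in favour of the C++11 facilities: a task started with `std::async` stores any exception it throws in its future, and `get()` rethrows it on the calling thread. In OpenMP, by contrast, an exception thrown inside a parallel region has to be caught inside that region by the same thread. The following is a minimal sketch, not a tuned implementation; the name `integrate_async` and its parameters are assumptions made for this example.

```cpp
#include <functional>
#include <future>
#include <stdexcept>
#include <vector>

// Integrates f over [a, b] with the midpoint rule, one std::async task per chunk.
// If f throws inside a task, the exception is stored in that task's future and
// rethrown by get() on the calling thread.
double integrate_async(const std::function<double(double)>& f,
                       double a, double b, long steps, long tasks) {
    const double h = (b - a) / static_cast<double>(steps);
    std::vector<std::future<double>> parts;
    for (long t = 0; t < tasks; ++t) {
        const long begin = steps * t / tasks;
        const long end = steps * (t + 1) / tasks;
        parts.push_back(std::async(std::launch::async, [=, &f]() -> double {
            double local = 0.0;
            for (long i = begin; i < end; ++i) {
                local += f(a + (static_cast<double>(i) + 0.5) * h);
            }
            return local;
        }));
    }
    double sum = 0.0;
    for (auto& p : parts) {
        sum += p.get();  // a task's exception, if any, is rethrown here
    }
    return sum * h;
}
```

A caller can wrap `integrate_async(...)` in an ordinary `try`/`catch`; the remaining futures are waited for when the vector is destroyed during unwinding.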

Related

How Haskell handles parallel computing on a multicore machine/cluster

I'm considering learning a new language these days for high-performance computing on a cluster of computers we have. Among the candidates, I'm considering Haskell.
I have read some about Haskell, but still have questions about using it for high-performance and distributed computing, which the language is known for. I have also read some debates claiming that Haskell is not good for those kinds of systems because of laziness. I can summarize my questions as follows:
1. Haskell uses green threads, which is great for handling a large number of concurrent connections. But what happens when one task takes longer than average and blocks the rest? Does the whole thread block (Node.js style), is the next task forwarded to another processor/thread (Golang), is a reductions technique used (Erlang), which kicks the task out of the processing context after a pre-determined number of ticks, or something else?
2. In a distributed computing environment, what happens to lazily-evaluated functions? Do they have to be forced strict?
3. If one function/module requires strict evaluation but depends on other lazy functions/modules, do I have to modify the code of those other functions/modules to make them strict as well, or will the compiler handle this for me and force everything in that chain to be strict or lazy?
4. When processing a very large sequence of data, how does Haskell handle parallel processing? Does it follow some kind of implicit map-reduce technique, or do I have to do it myself?
5. Is there a clustering abstraction in the language that handles the computing power for me, automatically forwarding the next task to a free processor wherever it is, be it on the same computer or on another computer in the same cluster?
6. How does Haskell ensure that work is distributed evenly across all the available cores on the same computer or across the available cluster?
1. GHC uses a pool of available work (called sparks) and a work-stealing system: when a thread runs out of work, it will look for work in the pool or on the work queues of other threads that it can steal.
2. There is no built-in support for distributed computing as there is in (say) Erlang. The semantics are whatever your implementation defines. There are existing implementations like Cloud Haskell that you can look at for examples.
3. Neither. Haskell will automatically do whatever work is necessary to provide a value that is demanded and no more.
4. Haskell (and GHC in particular) does not do anything to automatically parallelize evaluation because there is no known universal strategy for parallelizing that is strictly better than not parallelizing. See Why is there no implicit parallelism in Haskell? for more info.
5. No. See (2).
6. For the same machine, it uses the pool of sparks and the work-stealing system described above. There is no notion of "clustering".
For an overview of parallel and concurrent programming in Haskell, see the free book of the same name by Simon Marlow, a primary author of GHC's runtime system.
Multithreading
As far as SMP parallelism† is concerned, Haskell is very effective. It's not quite automatic, but the parallel library makes it really easy to parallelise just about anything. Because the sparks are so cheap, you can be pretty careless and just ask for lots of parallelism; the runtime will then know what to do.
Unlike in most other languages, it is not a big problem if you have highly branched data structures, tricky dynamic algorithms etc. – thanks to the purely functional paradigm, parallel Haskell never needs to worry about locks when accessed data is shared.
I think the biggest caveat is memory: GHC's garbage collector is not concurrent, and the functional style is rather allocation-happy.
Apart from that, it's possible to write programs that look like they're parallel, but really don't do any work at all but just start and immediately return because of laziness. Some testing and experience is still necessary. But laziness and parallelism are not incompatible; at least not if you make sure you have big enough “chunks” of strictness in it. And forcing something strict is largely trivial.
Simpler, common parallelism tasks (which could be expressed in a map-reduce manner, or the classic array-vector stuff – the ones which are also easy in many languages) can generally be handled even easier in Haskell with libraries that parallelise the data structures; the best-known of these is repa.
Distributed computing
There has been quite some work on Cloud Haskell, which is basically Erlang in library form. This kind of task is less straightforward: the idea of any explicit message sending is a bit against Haskell's grain, and many aspects of the workflow become more cumbersome if the language is so heavily focused on its strong static typing (which is in Haskell otherwise often a huge bonus that doesn't just improve safety and performance but also makes it easier to write).
I think it's not far off to use Haskell in a distributed concurrent manner, but we can't say it's mature in that role yet. For distributed concurrent tasks, Erlang itself is certainly the way to go.
Clusters
Honestly, Haskell won't help you much at all here. A cluster is of course in principle a special case of a distributed setup, so you could employ Cloud Haskell; but in practice the needs are very different. The HPC world today (and probably quite some time into the future) hinges on MPI, and though there is a bit of existing work on MPI bindings, I haven't found them usable, at least not just like that.
MPI is definitely also quite against Haskell's grain, what with its FORTRAN-oriented array centrism, weird ways of handling types and so on. But unless you go nuts with Haskell's cool features (though often it is so tempting!) there is no reason you couldn't write typical number-crunching code in Haskell too. The only problem is again support/maturity, but it's a considerable problem; so for cluster computing I'd recommend C++, Python or Julia instead.
An interesting alternative is to generate MPI-parallelised C or C++ code from Haskell. Paraiso is one nice project that does this.
Pipe dreams
I have often thought about what could be done to make distributed computing feasible in idiomatic Haskell. In principle I believe laziness could be a big help there. The model I'd envision is to let all machines independently compute the same program, but make use of the fact that Haskell evaluation generally has no predetermined order. The order would be randomised on each machine. The runtime would also track how long some computation branch took to complete, and how big the result is. If a result is deemed both expensive and compact enough to warrant it, it would then be broadcast to the other nodes, together with some suitable hash that would allow them to shortcut that computation.
Such a system would never be quite as efficient as a hand-optimised MPI application, but it could at least offer the same asymptotics in many cases. And it could handle vastly more complex algorithms with ease.
But again, that's totally just my vague hopes for the not-so-near future.
†You said concurrency (which isn't so much about computation as about interaction), but it seems your question is in essence about pure computations?

How well do common languages perform multi-threading?

In my CS class we're discussing threads and processes. I'm curious to know what common programming languages (Java, C/C++, C#, Python) can actually implement multi-threading and, if they do, how efficiently they do it.
We were shown a simple multi-threading structure in C, but they didn't demonstrate the difference by running it or with a chart of collected results from a previous test. I assume that the gains for some languages using multi-threading may be negligible.
EDIT
PDizzle pointed out that the gains in efficiency aren't necessarily dependent upon the language, but rather on what the application/software in question requires, as well as on how well threading is implemented for said application/software.
When a program creates a separate thread for processing, it all boils down to the program making a call to the operating system to request resources for a thread.
Each operating system has an API that programming languages use to request threads for a program. The implementation is platform dependent. C++ (now) has std::thread, which wraps the operating-system-dependent calls. Java has classes that implement calls from the virtual machine to the operating system for requesting a thread.
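To make the std::thread part concrete, here is a minimal sketch of starting and joining threads through the portable C++11 interface; which operating system call sits underneath is up to the implementation. The function name `squares_in_threads` is an assumption made for this example.

```cpp
#include <thread>
#include <vector>

// Starts `count` threads; thread i writes i * i into its own slot, so no
// locking is needed. join() blocks until the corresponding thread has finished.
std::vector<int> squares_in_threads(int count) {
    std::vector<int> out(count, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i) {
        workers.emplace_back([&out, i] { out[i] = i * i; });
    }
    for (auto& w : workers) {
        w.join();
    }
    return out;
}
```

On most toolchains this needs the platform's thread library at link time (for example `-pthread` with GCC or Clang on Linux).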
"I assume that the gains for some languages using multi-threading may be negligible"
No, the gains from using multi-threading in general may be negligible, depending on the application requirements. I would say it's more important how an application uses threading to accomplish a task than to worry about the overhead each language has to access multi-threading.
I think most modern languages do multitasking well. Modern being C++11, Java, C#, D, etc.
However, most programs don't benefit from multitasking, not because of the language in use, but because the algorithm being multi-threaded doesn't benefit from parallel processing. Think sorting algorithms and the like.

Mutexes, atomic and fences : what offers the best tradeoff and portability ? C++11

I'm trying to dig deeper to better understand what options I have when writing multi-threaded applications in C++11.
In short, I see these 3 options so far:
mutexes, with an explicit locking and freeing mechanism: they keep the threads in sync by locking and freeing. This is costly and doesn't guarantee the ordering of the execution of my code, but this solution is often quite portable among different memory models.
atomic operations: since atomic = 1 single operation without a race, and it is always consistent, the sync is accomplished without locking and freeing (there is no need for locking without a race), using highly optimized atomic operations. But atomics still can't guarantee the order in which my code will be executed.
fences: they create a block in my code where nothing can be re-ordered by the compiler. They are less flexible and tend to be costly in terms of code maintenance, because I always have to keep an eye on what is really being executed and in what order, but they also improve caching techniques, and among these 3 solutions they are probably the one with the most predictable behaviour.
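For readers who want the first two options side by side, here is a minimal sketch of the same shared counter protected once by a mutex and once by an atomic. All function names are assumptions made for this example. Both versions give the same result on any conforming C++11 implementation; they differ in cost and in what else they can protect (a mutex can guard several variables at once, an atomic only itself).

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Helper: runs `threads` threads that each call `body` `iterations` times.
template <typename F>
void run_in_threads(int threads, int iterations, F body) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([=] {
            for (int i = 0; i < iterations; ++i) body();
        });
    }
    for (auto& th : pool) th.join();
}

// Option 1: a mutex serializes access to a plain variable.
long count_with_mutex(int threads, int iterations) {
    long counter = 0;
    std::mutex m;
    run_in_threads(threads, iterations, [&] {
        std::lock_guard<std::mutex> lock(m);  // released automatically at scope exit
        ++counter;
    });
    return counter;
}

// Option 2: an atomic read-modify-write, no lock involved.
long count_with_atomic(int threads, int iterations) {
    std::atomic<long> counter(0);
    run_in_threads(threads, iterations, [&] {
        // Relaxed is enough here: only the final total matters, and join()
        // makes all increments visible before the load below.
        counter.fetch_add(1, std::memory_order_relaxed);
    });
    return counter.load();
}
```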
This is more or less the core of what I got from the first lessons about threading and memory models. My problem is:
I was going for lock-free data structures and atomics to achieve flexibility and good performance. The problem is that apparently an x86 machine performs memory re-ordering differently from an ARM one, and I would like to keep my code as portable as possible, at least across these 2 platforms. So what kind of approach can you suggest for writing portable multi-threaded software when 2 platforms are not guaranteed to have the same re-ordering mechanisms? Or are atomic operations the best choice as things stand, and I got all this wrong?
For example, I noticed that the Intel TBB library (which is not C++11 code) is being ported to ARM/Android with heavy modifications to the part dedicated to atomics, so maybe I can write portable multi-threaded code in C++11, with lock-free data structures, and optimize the atomics part later on when porting my library to another platform?
The issues surrounding multi-threaded programming are not language-specific or architecture-specific. You are better off studying them first with a generalized view - and only after, as a second step, specializing your general understanding to specific languages, libraries, platforms, etc, etc.
The textbook required when I went to school was:
Principles of Concurrent and Distributed Programming - Ben-Ari
The second edition is 2006 I believe. There may be better ones, but this should suffice for starters.
Yep, X86 and ARM have different memory models.
The C++11 memory model is however not platform-specific, it has the same behavior everywhere.
That means implementation of the C++11 atomics is different on each platform - on x86, which has a fairly strong memory model, the implementation of std::atomic might get away without special assembler instructions when storing a value, while on ARM, the implementation needs special locking or fence instructions internally.
So you can simply use the atomic classes in C++11; they will work the same on all platforms. If you want to, you can even tweak the memory order if you are absolutely sure what you are doing. A weaker memory order might be faster, since the implementation of the atomics might need fewer assembler instructions for locks and fences internally.
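As a small illustration of tweaking the memory order, this is a sketch of the usual release/acquire hand-off: one thread writes ordinary data and then sets an atomic flag with release semantics, the other waits for the flag with acquire semantics and may then read the data. The function name `publish_and_read` is an assumption made for this example. The source code is identical for x86 and ARM; only the instructions the compiler emits differ.

```cpp
#include <atomic>
#include <thread>

int publish_and_read() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready(false);  // the flag that publishes it
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release: everything written before this store becomes visible to a
        // thread that observes `true` with an acquire load.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's write to payload
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Leaving out the explicit orders gives `std::memory_order_seq_cst`, the default, which is the safer choice until profiling shows that it matters.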
I can highly recommend watching Herb Sutter's talk Atomic Weapons for some detailed explanations about this.

Does pthreads provide any advantages over GCD?

Having recently learned Grand Central Dispatch, I've found multithreaded code to be pretty intuitive (with GCD). I like the fact that no locks are required (and the fact that it uses lockless data structures internally), and that the API is very simple.
Now, I'm beginning to learn pthreads, and I can't help but be a little overwhelmed by the complexity. Thread joins, mutexes, condition variables: none of these things are necessary in GCD, but they account for a lot of API calls in pthreads.
Does pthreads provide any advantages over GCD? Is it more efficient? Are there normal use cases where pthreads can do things that GCD cannot do (excluding kernel-level software)?
In terms of cross-platform compatibility, I'm not too concerned. After all, libdispatch is open source, Apple has submitted their closure changes as patches to GCC, clang supports closures, and already (e.g. FreeBSD) we're starting to see some non-Apple implementations of GCD. I'm mostly interested in use of the API (specific examples would be great!).
I am coming from the other direction: started using pthreads in my application, which I recently replaced with C++11's std::thread. Now, I am playing with higher-level constructs like the pseudo-boost threadpool, and even more abstract, Intel's Threading Building Blocks. I would consider GCD to be at or even higher than TBB.
A few comments:
imho, pthread is not more complex than GCD: at its basic core, pthread actually contains very few commands (just a handful: using just the ones mentioned in the OP will give you 95%+ of the functionality that you ever need). Like any lower-level library, it's how you put them together and how you use it which gives you its power. Don't forget that ultimately, libraries like GCD and TBB will call a threading library like pthreads or std::thread.
sometimes, it's not what you use, but how you use it, which determines success vs failure. As proponents of their libraries, TBB or GCD will tell you about all the benefits of using them, but until you try them out in a real application context, all of it is of theoretical benefit. For example, when I read about how easy it was to use a finely-grained parallel_for, I immediately used it in a task which I thought could benefit from parallelism. Naturally, I, too, was drawn by the fact that TBB would handle all the details about optimal load balancing and thread allocation. The result? TBB took five times longer than the single-threaded version! But I do not blame TBB: in retrospect, this is obviously a case of misuse of parallel_for. When I read the fine print, I discovered the overhead involved in using parallel_for and posited that in my case, the costs of context-switching and added function calls outweighed the benefits of using multiple threads. So you must profile your case to see which one will run faster. You may have to reorganize your algorithm to use less threading overhead.
why does this happen? How can pthread or no threads be faster than GCD or TBB? When a designer designs GCD or TBB, he must make assumptions about the environment in which tasks will run. In fact, the library must be general enough that it can handle strange, unforeseen use cases by the developer. These general implementations do not come for free. On the plus side, a library will query the hardware and the current running environment to do a better job of load-balancing. Will it work to your benefit? The only way to know is to try it out.
is there any benefit to learning lower-level libraries like std::thread when higher-level libraries are available? The answer is a resounding YES. The advantage of using higher-level libraries is, abstraction from the implementation details. The disadvantage of using higher-level libraries is also abstraction from the implementation details. When using pthreads, I am supremely aware of shared state and lifetimes of objects, because if I let my guard down, especially in a medium to large size project, I can very easily get race conditions or memory faults. Do these problems go away when I use a higher-level library? Not really. It seems like I don't need to think about them, but in fact, if I get sloppy with those details, the library implementation will also crash. So you will find that if you understand the lower-level constructs, all those libraries actually make sense, because at some point, you will be thinking about implementing them yourself, if you use the lower-level calls. Of course, at that point, it's usually better to use a time-tested and debugged library call.
So, let's break down the possible implementations:
TBB/GCD library calls: greatest benefit is for beginners of threading. They have lower barriers to entry compared to learning lower level libraries. However, they also ignore/hide some of the traps of using multi-threading. Dynamic load balancing will make your application more portable without additional coding on your part.
pthread and std::thread calls: there are actually very few calls to learn, but to use them correctly takes attention to detail and deep awareness of how your application works. If you can understand threads at this level, the APIs of higher-level libraries will certainly make more sense.
single-threaded algorithm: let us not forget the benefits of a simple single-threaded segment. For most applications, a single-thread is easier to understand and much less error-prone than multi-threading. In fact, in many cases, it may be the appropriate design choice. The fact of the matter is, a real application goes through various multi-threading phases and single-threading phases: there may be no need to be multi-threaded all the time.
Which one is fastest? The surprising truth is, it could be any of the three of the above. To get speed benefits of multi-threading, you may need to drastically reorganize your algorithms. Whether or not the benefits outweigh the costs is highly case-dependent.
Oh, and the OP asked about cases where a thread_pool is not appropriate. Easy case: if you have a tight loop that does not require many cycles per loop to compute, using thread_pool may cost more than the benefits without serious reworking. Also be aware of the overhead of function calls like lambda through thread pools vs the use of a single tight loop.
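One common way to rework such a loop is to hand each thread a large contiguous chunk, and to skip threading altogether when the chunks would be too small to pay for the overhead. The following is a minimal sketch of that idea; the name `chunked_sum` and the `grain` parameter are assumptions made for this example, and a sensible grain size has to be found by profiling.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data`, using several threads only when every thread would get at
// least `grain` elements; otherwise it falls back to a single tight loop.
long long chunked_sum(const std::vector<int>& data, std::size_t grain) {
    const unsigned hw = std::thread::hardware_concurrency();  // may report 0
    const std::size_t max_threads = (hw == 0) ? 2 : hw;
    const std::size_t min_chunk = std::max<std::size_t>(grain, 1);
    const std::size_t threads = std::min(max_threads, data.size() / min_chunk);
    if (threads <= 1) {
        return std::accumulate(data.begin(), data.end(), 0LL);
    }
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = data.size() * t / threads;
        const std::size_t end = data.size() * (t + 1) / threads;
        workers.emplace_back([&data, &partial, begin, end, t] {
            // One call per chunk instead of one call per element.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```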
For most applications, multi-threading is a kind of optimization, so do it at the right time and in the right places.
That overwhelming feeling that you are experiencing.. that's exactly why GCD was invented.
At the most basic level there are threads. pthreads is a POSIX API for threads, so you can write code for any compliant OS and expect it to work. GCD is built on top of threads (although I'm not sure if they actually used pthreads as the API). I believe GCD only works on OS X and iOS; that, in a nutshell, is its main disadvantage.
Note that projects that make heavy use of threads and require high performance implement their own version of thread pools. GCD allows you to avoid (re)inventing the wheel for the umpteenth time.
GCD is an Apple technology, and not the most cross-platform compatible; pthreads are available on just about everything: OS X, Linux, Unix, Windows... including this toaster.
GCD is optimized for thread pool parallelism. Pthreads are (as you said) very complex building blocks for parallelism; you are left to develop your own models. I highly recommend picking up a book on the topic if you're interested in learning more about pthreads and different models of parallelism.
Like any declarative/assisted approach such as OpenMP or Intel TBB, GCD should be very good at embarrassingly parallel problems and will probably easily beat a naïve, manually pthread-ed parallel sort. I would suggest you still learn pthreads though. You'll understand concurrency better, you'll be able to apply the right tool in each particular situation, and if for nothing else, there's a ton of pthread-based code out there, so you'll be able to read "legacy" code.
Usual: 1 task per pthread. Implementations use mutexes (an OS feature).
GCD:
1 task per block, grouped into queues. 1 thread per virtual CPU can get a queue and run without mutexes through all the tasks. This reduces thread management overhead and mutex overhead, which should increase performance.
GCD abstracts threads and gives you dispatch queues. It creates threads as it deems necessary taking into account the number of processor cores available.
GCD is open source and is available through the libdispatch library. FreeBSD includes libdispatch as of 8.1. GCD and C Blocks are major contributions from Apple to the C programming community. I would never use any OS that doesn't support GCD.

switch to parallel coding

We are all writing code for a single processor.
I wonder when we will all be able to write code for multiple processors?
What do we need (software tools, logic, algorithms) to make this switch?
Edit: in my view, since we do many tasks in parallel in real life, we need to translate those real-life solutions (algorithms) into computer languages in the same way, just as OOP did for procedural coding. OOP is a more real-life coding style than the procedural one, so I hope for that kind of solution.
I think the most important requirement is a good language that has native constructs that support parallelism or one that can automatically generate parallel code. There are quite a few languages that fit that description, but none of them is popular enough to really be considered for mainstream use. That, in turn is caused by several things:
By their very nature, these languages are very different from today's imperative languages, and are therefore harder to learn (or at least seem that way).
They often lack good tools and libraries, making them unusable for any "real" project.
Of course, if it were more popular more people would be willing to learn it and there would be more support, so it's a kind of cycle that's pretty hard to break out of. I guess all we can do is hope. :)
An example of a language designed with heavy parallelization in mind is Erlang - and it's actually used in commercial projects.
What we need are natural abstractions for highly-concurrent algorithms. Actors (think: Erlang) go a long way in this direction, but they aren't a one-size-fits-all solution. Some more specific abstractions like fork/join or map/reduce can be even easier to apply to common problems.
The trick with all of these concurrency abstractions is they require functional-style programming. Concurrency doesn't mesh well with shared mutable state. As they say, "Locks considered harmful". Since most developers come from a strictly imperative background, switching to a shared-nothing continuation passing approach is often extremely challenging.
Incidentally, with respect to concurrency abstractions, Clojure has some very interesting features in this direction. Not only does it have sort-of actors, but it also defines a transactional memory model (think: databases) along with a global, atomic references mechanism. These two features allow concurrent operations to share "mutable" state without ever having to worry about locking or race conditions.
In the end, it comes down to education. Much of the needed theoretical work into concurrency abstractions has already been done, we just need to accept it. Unfortunately, as Erlang and Haskell prove, sometimes the best ideas remain relegated to an extremely fringe demographic. Hopefully efforts like Scala and Clojure will succeed in bringing the more advanced abstractions into the mainstream by sneaking them onto an existing, well-supported platform (the JVM).
Unfortunately for massive concurrent programming - unless there is a breakthrough in compilers to help, we will be throwing out a lot of what we know about algorithms (I think Don Knuth even said that). Read about Erlang for a glimpse of this possible future.
There are several tools/languages that are popular or are gaining popularity. If you use FORTRAN, C, or C++, you can use OpenMP (not too hard to implement) or the Message Passing Interface (MPI) libraries (powerful and greatest speedup potential, but also complex and difficult). OpenMP uses preprocessor directives to mark areas that can be parallelized, especially loops. MPI uses messages that pass data back and forth between processes, and the greatest difficulty is keeping everything synchronized without hitting bottlenecks and keeping processes waiting. I would say MPI is definitely on the way out, however. It's become clear in the scientific/high-performance computing communities that the speedup is rarely worth the additional development time.
As for up-and-coming languages, check out Fortress. It's still being designed, but the goal is to create a language even easier for scientific computing than FORTRAN. Programs will be specified in a very high-level mathematical syntax. Additionally, parallelism will be implicit; the programmer will have to work to do things in serial. Plus, it's being championed by Sun and is based on Java, so it will be portable.
There is no simple answer, and in many ways even the complex answers are currently inadequate or incomplete. You'll get a better answer if you are more specific about the replies you want: pointers to dev libraries and tools, instructional materials, pointers to current research projects and issues in this area, or something else?
The most important requirement is to be able to split your problem into smaller problems that can be solved independently of each other. Once you've worked out how you're going to do that, everything else is easier to think about and further questions of implementation (e.g. "parts of my calculation depend on other parts - how do I wait for them to have finished?") become concrete, specific things you can research or ask here about.
For Java, you can now look at the Parallel Java Library or DPJ (Deterministic Parallel Java).
They will offer you great help in extracting parallelism from code!
