What is the Rust equivalent to Intel's tbb::concurrent_queue?

I am looking for an equivalent of concurrent_queue from Intel's TBB library in Rust. I have found some crates:
multiqueue
two-lock-queue
crossbeam-deque
and even
futures-pool
thread-pool
They all seem to do similar things, but their docs suggest they use different underlying algorithms.
Though I don't know a lot about programming in C++, I am pretty sure that TBB's concurrent_queue is a very fast MPMC queue implementation. You cannot get close to that performance by simply wrapping a queue container in a Mutex (a friend of mine has benchmarked this).
Since efficiency (both latency and throughput) is my main concern, what should I use in Rust? The queue could be either bounded or unbounded, and I probably need Acquire-Release ordering.

I think the crossbeam::sync::MsQueue and the crossbeam::sync::SegQueue from the crossbeam crate have the same capabilities as the concurrent_queue you linked.
They are lock-free queues that can be used in a non-blocking way with push and try_pop.
This benchmark indicates that SegQueue is faster than MsQueue, but that may still depend on your use case.
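As a minimal sketch of how SegQueue is used (assuming the crossbeam version where the queues live in crossbeam::sync and expose push/try_pop, as named above; newer releases move SegQueue to crossbeam::queue with a slightly different pop API):

```rust
extern crate crossbeam;

use crossbeam::sync::SegQueue;
use std::sync::Arc;
use std::thread;

fn main() {
    // An unbounded, lock-free MPMC queue.
    let queue = Arc::new(SegQueue::new());

    let producer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            for i in 0..1000 {
                queue.push(i);
            }
        })
    };

    let consumer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            let mut received = 0;
            while received < 1000 {
                // try_pop never blocks; it returns None when the queue is empty.
                if queue.try_pop().is_some() {
                    received += 1;
                }
            }
        })
    };

    producer.join().unwrap();
    consumer.join().unwrap();
}
```

Note that both MsQueue and SegQueue are unbounded, so if you need a bounded queue you will have to add backpressure yourself.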

Related

What happened to libgreen?

As far as I understand, libgreen is no longer part of the Rust standard library. I also can't find a separate libgreen package. There are a few alternatives: coroutine, which does not provide actual green threads for now, and green-rs, which is broken. Am I right in understanding that there are currently no lightweight Go-like processes in Rust?
You are correct that there's no lightweight tasking library in std (or the rest of the main distribution), that green doesn't compile and that coroutine doesn't seem to fully handle the threading aspect yet. I do not know of any other library in this space.
As for what happened: the RFC linked to by that issue—RFC 230—is the canonical source of information. The summary is that it was found that the method by which green threading/IO was handled (std tried to abstract across both models, allowing them to be used interoperably automagically) was not worth the downsides. Now, std aims to just provide a minimum base-line of useful support: for IO/threading, that means "thin", safe wrappers for operating system functionality.
Read this https://aturon.github.io/blog/2016/08/11/futures/ and also:
Steve Klabnik's response in the comments:
In the beginning, Rust had only green threads. Eventually, it was
decided that a systems language without systems threads is... strange.
So we needed to add them. Why not add choice? Since the interfaces
could be the same, why not abstract over them, and you could just
choose which one you wanted?
At the same time, the problems with green threads by default were
becoming issues. Segmented stacks cause slow C interop. You need a
runtime to manage them, etc. Furthermore, the overall abstraction was
causing an unacceptable cost. The green threads weren't very green.
Plus, with the need to actually release someday looming, decisions
needed to be made regarding tradeoffs. And since Rust is supposed to
be a systems language, having 1:1 threads and basically no runtime
makes more sense than N:M threads and a runtime. So libgreen was
removed, and the interface was redone to be 1:1 thread-centric.
The 'release someday looming' is a big part of it. We want to be
really stable with Rust, and with all the things to do to actually
ship a 1.0, we didn't want to crystallize an interface we weren't
happy with. Heck, we pulled out a lot of libraries that are even less
important for similar reasons, like rand. Engineering is all about
tradeoffs, and we decided to choose minimalism.
mio is a non starter for us, as are most of the other async i/o frameworks for Rust, because we need Windows and besides we don't want
to get locked into an expensive to replace library which may get
orphaned.
Totally understood here, especially in the general case. In the
specific case, mio is going to either have Windows support, or a
windows-specific version of mio is going to be released, with a
higher-level package providing the features for all platforms. And in
this case, it's maintained by one of the people who's currently using
Rust heavily in production, so it's not likely to go away anytime
soon. But, unless you're actively involved, it's hard to know things
like that, which is, of itself an issue.
One of the reasons we were comfortable removing libgreen is that you
can write your own libraries to do different kinds of IO. 1.0 is a
strong core that we feel good about stabilizing forever, not the final
bit. Libraries like https://github.com/carllerche/mio can test out
different ways of handling things like async IO, and, when they're
mature enough, we can always pull them back in the standard library if
need be. But in the meantime, it's just one line to your Cargo.toml to
add them in.
And this text from Reddit:
Unfortunately they ended up canning the greenlet support because
theirs were slower than kernel threads which in turn demonstrates
someone didn’t understand how to get a language compiler to generate
stackless coroutines effectively (not surprising, the number of
engineers wired the right way is not many in this world, but see
http://www.reddit.com/r/rust/comments/2l0a4b/do_rust_web_servers_use_libuv_through_libgreen_or/
for more detail). And they canned the async i/o because libuv is
“slow” (which it is only because it is single threaded only, plus
forces a malloc + free per async operation as the buffers must last
until completion occurs, plus it enforces a penalty over synchronous
i/o see
http://blog.kazuhooku.com/2014/09/the-reasons-why-i-stopped-using-libuv.html),
which was a real shame - they should have taken the opportunity to
replace libuv with something better (hint: ASIO + AFIO, and yes I know
they are both C++, but Rust could do with much better C++ interop than
the presently none it currently has) instead of canning
always-async-everything in what could have been an amazing step up
from C++ with most of the benefits of Erlang without the disadvantages
of Erlang.
For newcomers, there is now may, a crate that implements green threads similar to goroutines.

Mutexes, atomics and fences: what offers the best tradeoff and portability in C++11?

I'm trying to dig a bit deeper to better understand how many options I have when writing multi-threaded applications in C++11.
In short, I see these 3 options so far:
mutexes with an explicit lock/unlock mechanism: they keep threads in sync by locking and unlocking; this is costly and doesn't guarantee the ordering of my code's execution, but this solution is often quite portable across different memory models.
atomic operations: since an atomic is a single operation that cannot race and is always consistent, synchronization is accomplished without locking and unlocking; there is no need for a lock where there is no race, and atomic operations are highly optimized, but atomics still can't guarantee the order in which my code will be executed.
fences: they create a region in my code where nothing can be reordered by the compiler; they are less flexible and tend to be costly in terms of code maintenance, because I always have to keep an eye on what is really being executed and in what order, but they also play well with caching techniques, and among these 3 solutions they are probably the one with the most predictable behaviour.
This is more or less the core of what I got from my first lessons about threading and memory models. My problem is:
I was going for lock-free data structures and atomics to achieve flexibility and good performance. The problem is that an x86 machine apparently performs memory re-ordering differently from an ARM one, and I would like to keep my code as portable as possible, at least across these 2 platforms. So what kind of approach would you suggest for writing portable multi-threaded software when the 2 platforms are not guaranteed to have the same re-ordering mechanisms? Or are atomic operations the best choice as things stand, and have I got all this wrong?
For example, I noticed that the Intel TBB library (which is not C++11 code) is being ported to ARM/Android with heavy modifications to the part dedicated to atomics. So could I write portable multi-threaded code in C++11 with lock-free data structures, and optimize the atomics part later when porting my library to another platform?
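For concreteness, here is a minimal sketch of options 1 and 2 from the list above, applied to the same shared counter. It is written in Rust rather than C++ (matching the opening question of this compilation; Rust's atomics implement the C++11 memory model, so the trade-off carries over), and the thread and iteration counts are arbitrary:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Option 1: a mutex-protected counter; every increment locks and unlocks.
    let locked = Arc::new(Mutex::new(0u64));
    // Option 2: an atomic counter; one hardware read-modify-write, no lock.
    let atomic = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let (locked, atomic) = (Arc::clone(&locked), Arc::clone(&atomic));
            thread::spawn(move || {
                for _ in 0..10_000 {
                    *locked.lock().unwrap() += 1;
                    // Relaxed: we need atomicity here, not ordering
                    // relative to other memory operations.
                    atomic.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*locked.lock().unwrap(), 40_000);
    assert_eq!(atomic.load(Ordering::Relaxed), 40_000);
}
```

Option 3 in Rust is std::sync::atomic::fence, which imposes ordering without being attached to a particular variable.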
The issues surrounding multi-threaded programming are not language-specific or architecture-specific. You are better off studying them first with a generalized view - and only after, as a second step, specializing your general understanding to specific languages, libraries, platforms, etc, etc.
The textbook required when I went to school was:
Principles of Concurrent and Distributed Programming - Ben-Ari
The second edition is 2006 I believe. There may be better ones, but this should suffice for starters.
Yep, X86 and ARM have different memory models.
The C++11 memory model is however not platform-specific, it has the same behavior everywhere.
That means implementation of the C++11 atomics is different on each platform - on x86, which has a fairly strong memory model, the implementation of std::atomic might get away without special assembler instructions when storing a value, while on ARM, the implementation needs special locking or fence instructions internally.
So you can simply use the atomic classes in C++11; they will work the same on all platforms. If you want, you can even tweak the memory order, if you are absolutely sure of what you are doing. A weaker memory order might be faster, since the implementation of the atomics might need fewer assembler instructions for locks and fences internally.
I can highly recommend watching Herb Sutter's talk Atomic Weapons for some detailed explanations about this.
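To make the memory-order tweaking concrete, here is a minimal message-passing sketch. It is in Rust rather than C++ to match the rest of this compilation; Rust's atomics implement the same C++11 memory model, so the Acquire/Release reasoning is identical:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            // Release: every write before this store becomes visible to
            // any thread that observes `ready == true` with an Acquire load.
            ready.store(true, Ordering::Release);
        })
    };

    let reader = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Spin until the flag is set; Acquire pairs with the Release store.
            while !ready.load(Ordering::Acquire) {}
            // Guaranteed to see 42 on x86 and ARM alike, despite their
            // different hardware memory models.
            assert_eq!(data.load(Ordering::Relaxed), 42);
        })
    };

    writer.join().unwrap();
    reader.join().unwrap();
}
```

On x86 the Release store compiles to a plain store; on ARM it needs a barrier instruction. That is exactly the platform difference the answer describes, and the language handles it for you.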

Scope of POSIX Threads

I have been learning thread programming in Java, where there are sophisticated APIs for thread management. I recently came across this. I am curious to know whether these are used now. Is POSIX threading obsolete, or is it the standard used now for threading in C++? I am not familiar with threading in any language apart from Java.
pthreads is the current standard POSIX threading library. It is missing some important new things, and I hope it will be updated to accommodate them. The C++1x standard will also have some threading primitives built in.
pthreads is mostly missing atomic value operations. For example, there are no thread-safe primitive counter operations that can be expected to compile down to 1-5 machine instructions.
These are needed because, while the semantics of the volatile keyword seem to suggest that you might be able to use it for some of these things, this is not the case. Modern CPUs manage their L1, L2 and L3 caches in a way that frequently results in reads and writes being seen in a different order by different CPUs. And current optimizing compilers can significantly re-order operations, so the order in which they happen no longer bears much resemblance to the order in which they appear in the source code.
Mutexes, even the modern Linux version that avoids any system calls unless there is contention, are too heavyweight for something like a reference count.
C and C++ could be changed so the language made these guarantees happen all the time. But that would be contrary to their spirit of being 'high-level assembly'.
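To illustrate the kind of primitive the answer says pthreads lacks: here is a toy atomic reference count, sketched in Rust (the pattern mirrors what Rust's own Arc does internally; the type and method names are made up for illustration). The decrement is one atomic read-modify-write instruction instead of a mutex round-trip:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct RefCount {
    count: AtomicUsize,
}

impl RefCount {
    fn new() -> Self {
        RefCount { count: AtomicUsize::new(1) }
    }

    fn incref(&self) {
        // Relaxed suffices: a new reference can only be created
        // by a thread that already holds one.
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns true when the last reference is gone and cleanup may run.
    fn decref(&self) -> bool {
        // Release publishes this thread's writes to whichever thread frees.
        if self.count.fetch_sub(1, Ordering::Release) == 1 {
            // Acquire pairs with the other threads' Release decrements,
            // so the freeing thread sees all of their writes.
            fence(Ordering::Acquire);
            return true;
        }
        false
    }
}

fn main() {
    let rc = RefCount::new();
    rc.incref();
    assert!(!rc.decref());
    assert!(rc.decref()); // last reference dropped
}
```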

How efficient is a try_lock on a mutex?

How efficient is a try_lock on a mutex? I.e., how many assembler instructions are likely involved, and how much time do they take, in both possible cases (i.e. the mutex was already locked, or it was free and could be locked)?
In case the question is hard to answer as stated, here is how to approach it:
If the answer depends a lot on the OS implementation and hardware: please answer it for common OSes (e.g. Linux, Windows, Mac OS X), recent versions of them (in case they differ a lot from earlier versions), and common hardware (x86, amd64, ppc, arm).
If it also depends on the library: take pthreads as an example.
Please also answer whether they really differ at all. And if they differ, please state the differences: what do they do differently? What common algorithms are around? Do all common systems implement mutexes the same way, or are there different algorithms in use?
Per this Meta discussion, this really should be a separate question.
Also, I have asked this separately from the question about the performance of a lock because I am not sure whether try_lock may behave differently, perhaps also depending on the implementation. Then again, please answer it for common implementations. And this very similar/related question shows that it is an interesting question which can be answered.
A mutex is a logical construction that is independent of any implementation. Operations on mutexes therefore are neither efficient nor inefficient - they are simply defined.
Your question is therefore akin to asking "How efficient is a car?", without reference to what kind of car you might be talking about.
I could implement mutexes in the real world with smoke signals, carrier pigeons or a pencil and paper. I could also implement them on a computer. I could implement a mutex with certain operations on a Cray 1, on an Intel Core 2 Duo, or on the 486 in my basement. I could implement them in hardware. I could implement them in software in the operating system kernel, or in userspace, or using some combination of the two. I might simulate mutexes (but not implement them) using lock-free algorithms that are guaranteed conflict-free within a critical section.
EDIT: Your subsequent edits don't help the situation. "In a low level language (like C or whatever)" is mostly irrelevant, because then we're into measuring language implementation performance, and that's a slippery slope at best. "[F]rom pthread or whatever the native system library provides" is similarly unhelpful, because as I said, there are so many ways that one could implement mutexes in different environments that it's not even a useful comparison to make.
This is why your question is unanswerable.
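Unanswerable in terms of exact instruction counts, perhaps, but the semantics being asked about are easy to pin down. A minimal Rust sketch (Rust to match the opening question of this compilation): on common futex-based implementations, an uncontended try_lock is roughly one atomic compare-and-swap in user space, and a contended one fails immediately without a system call, though, as the answer says, the exact cost depends on the implementation.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let shared = Arc::new(Mutex::new(0u64));

    // A thread that holds the lock for a while.
    let holder = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let _guard = shared.lock().unwrap();
            thread::sleep(Duration::from_millis(50));
        })
    };

    thread::sleep(Duration::from_millis(10)); // let the holder win the race

    // try_lock returns immediately in both cases: with a guard if the
    // mutex was free, or with an error if it was already locked.
    match shared.try_lock() {
        Ok(mut guard) => *guard += 1,
        Err(_) => println!("mutex was already locked; doing something else"),
    }

    holder.join().unwrap();
}
```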

Does pthreads provide any advantages over GCD?

Having recently learned Grand Central Dispatch, I've found multithreaded code to be pretty intuitive (with GCD). I like the fact that no locks are required (and that it uses lockless data structures internally), and that the API is very simple.
Now, I'm beginning to learn pthreads, and I can't help but be a little overwhelmed by the complexity. Thread joins, mutexes, condition variables: none of these are necessary in GCD, but they account for a lot of API calls in pthreads.
Does pthreads provide any advantages over GCD? Is it more efficient? Are there normal-use cases where pthreads can do things that GCD cannot (excluding kernel-level software)?
In terms of cross-platform compatibility, I'm not too concerned. After all, libdispatch is open source, Apple has submitted their closure changes as patches to GCC, clang supports closures, and already (e.g. on FreeBSD) we're starting to see some non-Apple implementations of GCD. I'm mostly interested in use of the API (specific examples would be great!).
I am coming from the other direction: I started using pthreads in my application, which I recently replaced with C++11's std::thread. Now I am playing with higher-level constructs like the pseudo-boost thread pool and, even more abstract, Intel's Threading Building Blocks. I would consider GCD to be at the same level of abstraction as TBB, or even higher.
A few comments:
imho, pthreads is not more complex than GCD: at its basic core, pthreads actually contains very few commands (just a handful; using just the ones mentioned in the OP will give you 95%+ of the functionality you will ever need). Like any lower-level library, it's how you put them together and how you use them that gives you their power. Don't forget that, ultimately, libraries like GCD and TBB call down into a threading library like pthreads or std::thread.
sometimes it's not what you use, but how you use it, which determines success or failure. As proponents of their libraries, TBB or GCD will tell you about all the benefits of using them, but until you try them out in a real application context, all of it is of theoretical benefit. For example, when I read about how easy it was to use a fine-grained parallel_for, I immediately used it in a task which I thought could benefit from parallelism. Naturally, I too was drawn by the fact that TBB would handle all the details of optimal load balancing and thread allocation. The result? TBB took five times longer than the single-threaded version! But I do not blame TBB: in retrospect, this is obviously a case of misuse of parallel_for. When I read the fine print, I discovered the overhead involved in using parallel_for and posited that in my case, the costs of context switching and added function calls outweighed the benefits of using multiple threads. So you must profile your case to see which one will run faster. You may have to reorganize your algorithm to use less threading overhead.
why does this happen? How can pthreads, or no threads at all, be faster than GCD or TBB? When a designer designs GCD or TBB, he must make assumptions about the environment in which tasks will run. In fact, the library must be general enough that it can handle strange, unforeseen use-cases by the developer. These general implementations will not come for free. On the plus side, a library will query the hardware and the current running environment to do a better job of load balancing. Will it work to your benefit? The only way to know is to try it out.
is there any benefit to learning lower-level libraries like std::thread when higher-level libraries are available? The answer is a resounding YES. The advantage of using higher-level libraries is, abstraction from the implementation details. The disadvantage of using higher-level libraries is also abstraction from the implementation details. When using pthreads, I am supremely aware of shared state and lifetimes of objects, because if I let my guard down, especially in a medium to large size project, I can very easily get race conditions or memory faults. Do these problems go away when I use a higher-level library? Not really. It seems like I don't need to think about them, but in fact, if I get sloppy with those details, the library implementation will also crash. So you will find that if you understand the lower-level constructs, all those libraries actually make sense, because at some point, you will be thinking about implementing them yourself, if you use the lower-level calls. Of course, at that point, it's usually better to use a time-tested and debugged library call.
So, let's break down the possible implementations:
TBB/GCD library calls: greatest benefit is for beginners of threading. They have lower barriers to entry compared to learning lower level libraries. However, they also ignore/hide some of the traps of using multi-threading. Dynamic load balancing will make your application more portable without additional coding on your part.
pthread and std::thread calls: there are actually very few calls to learn, but to use them correctly takes attention to detail and deep awareness of how your application works. If you can understand threads at this level, the APIs of higher-level libraries will certainly make more sense.
single-threaded algorithm: let us not forget the benefits of a simple single-threaded segment. For most applications, a single-thread is easier to understand and much less error-prone than multi-threading. In fact, in many cases, it may be the appropriate design choice. The fact of the matter is, a real application goes through various multi-threading phases and single-threading phases: there may be no need to be multi-threaded all the time.
Which one is fastest? The surprising truth is, it could be any of the three of the above. To get speed benefits of multi-threading, you may need to drastically reorganize your algorithms. Whether or not the benefits outweigh the costs is highly case-dependent.
Oh, and the OP asked about cases where a thread pool is not appropriate. Easy case: if you have a tight loop that does not require many cycles per iteration, using a thread pool may cost more than it gains without serious reworking. Also be aware of the overhead of function calls (e.g. lambdas dispatched through thread pools) versus a single tight loop; see the sketch after this answer.
For most applications, multi-threading is a kind of optimization, so do it at the right time and in the right places.
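To make the "threading overhead can dominate cheap loops" point concrete, here is a minimal Rust sketch (Rust to match the rest of this compilation; std::thread::scope needs Rust 1.63+, and the element count is arbitrary). Whether the parallel version wins depends entirely on how much work each iteration does:

```rust
use std::thread;
use std::time::Instant;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();

    // Single-threaded: a tight loop doing almost no work per element.
    let t = Instant::now();
    let serial: u64 = data.iter().sum();
    println!("serial:   {:?} (sum = {})", t.elapsed(), serial);

    // Four threads, one chunk each. For work this cheap, the cost of
    // spawning and joining threads may well exceed the speedup.
    let t = Instant::now();
    let chunk = data.len() / 4;
    let parallel: u64 = thread::scope(|s| {
        data.chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect::<Vec<_>>() // spawn all threads before joining any
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    });
    println!("parallel: {:?} (sum = {})", t.elapsed(), parallel);
}
```

Profile both on your own workload, as the answer says; the winner flips as soon as the per-iteration work grows.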
That overwhelming feeling that you are experiencing.. that's exactly why GCD was invented.
At the most basic level there are threads; pthreads is a POSIX API for threads, so you can write code on any compliant OS and expect it to work. GCD is built on top of threads (although I'm not sure whether they actually used pthreads as the API). I believe GCD only works on OS X and iOS; that, in a nutshell, is its main disadvantage.
Note that projects that make heavy use of threads and require high performance implement their own version of thread pools. GCD allows you to avoid (re)inventing the wheel for the umpteenth time.
GCD is an Apple technology, and not the most cross-platform compatible; pthreads are available on just about everything from OS X, Linux, Unix, and Windows... including this toaster.
GCD is optimized for thread pool parallelism. Pthreads are (as you said) very complex building blocks for parallelism, you are left to develop your own models. I highly recommend picking up a book on the topic if you're interested in learning more about pthreads and different models of parallelism.
Like any declarative/assisted approach, such as OpenMP or Intel TBB, GCD should be very good at embarrassingly parallel problems and will probably easily beat a naïve manually-pthreaded parallel sort. I would suggest you still learn pthreads, though. You'll understand concurrency better, you'll be able to apply the right tool in each particular situation, and, if nothing else, there's a ton of pthread-based code out there, so you'll be able to read "legacy" code.
Usual pthread implementations: 1 task per thread, with mutexes (an OS feature) for synchronization.
GCD: 1 task per block, grouped into queues; 1 thread per virtual CPU can take a queue and run through all of its tasks without mutexes. This reduces thread management overhead and mutex overhead, which should increase performance.
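GCD's actual API is C blocks submitted with dispatch_async; as a toy analogue of a serial dispatch queue, here is a Rust sketch (Rust to match the rest of this compilation; the Task type and channel-based design are illustrative, not how libdispatch is implemented):

```rust
use std::sync::mpsc;
use std::thread;

// One worker thread drains tasks in FIFO order: tasks on the same
// queue never run concurrently, so they need no mutex between them.
type Task = Box<dyn FnOnce() + Send>;

fn main() {
    let (tx, rx) = mpsc::channel::<Task>();

    let worker = thread::spawn(move || {
        for task in rx {
            task(); // run each "block" to completion, in submission order
        }
    });

    for i in 0..3 {
        tx.send(Box::new(move || println!("task {}", i))).unwrap();
    }

    drop(tx); // closing the channel lets the worker's loop end
    worker.join().unwrap();
}
```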
GCD abstracts threads and gives you dispatch queues. It creates threads as it deems necessary taking into account the number of processor cores available.
GCD is open source and is available through the libdispatch library. FreeBSD includes libdispatch as of 8.1. GCD and C blocks are major contributions from Apple to the C programming community. I would never use any OS that doesn't support GCD.
