InfiniBand: transfer rate depends on MPI_Test* frequency - multithreading

I'm writing a multi-threaded OpenMPI application, using MPI_Isend and MPI_Irecv from several threads to exchange hundreds of messages per second between ranks over InfiniBand RDMA.
Transfers are on the order of 400-800 KB, generating about 9 Gbps in and out for each rank, well within the capacity of FDR. Simple MPI benchmarks also show good performance.
Completion of the transfers is checked by polling all active requests with MPI_Testsome in a dedicated thread.
The transfer rates I achieve depend on the message rate, but more importantly also on the polling frequency of MPI_Testsome. That is, if I poll, say, every 10ms, the requests finish later than if I poll every 1ms.
I'd expect that if I poll every 10ms instead of every 1ms, I'd at most be informed of finished requests 9ms later. I would not expect fewer calls to MPI_Testsome to delay the transfers themselves and thus lower the total transfer rate. I'd expect MPI_Testsome to be entirely passive.
Anyone here have a clue why this behaviour could occur?

The observed behaviour is due to the way operation progression is implemented in Open MPI. Posting a send or receive, no matter whether it is done synchronously or asynchronously, results in a series of internal operations being queued. Progression is basically the processing of those queued operations. There are two modes you can select at library build time: one with an asynchronous progression thread and one without, the latter being the default.
When the library is compiled with the async progression thread enabled, a background thread services the queue. This allows background transfers to proceed in parallel with the user's code but increases latency. Without async progression, operations are faster but progression can only happen when the user code calls into the MPI library, e.g. while in MPI_Wait or MPI_Test and family. The MPI_Test family of functions is implemented so as to return as fast as possible. That means the library has to balance a trade-off between doing work in the call, thus slowing it down, and returning quickly, which means fewer operations are progressed on each call.
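For reference, a polling loop of the kind described in the question might look like the minimal sketch below; the request array, its size, and the 1 ms interval are illustrative placeholders rather than anything from the actual application. Each call to MPI_Testsome also gives the library a chance to progress outstanding operations, which is why polling more often can raise the achieved transfer rate when there is no progression thread.

#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

/* Minimal sketch of a dedicated polling thread driving completion with
   MPI_Testsome. The request array, its size and the 1 ms poll interval
   are placeholders for illustration only. */
void poll_until_done(MPI_Request reqs[], int nreqs)
{
    int *indices = malloc(nreqs * sizeof *indices);
    int remaining = nreqs;

    while (remaining > 0) {
        int completed;
        MPI_Testsome(nreqs, reqs, &completed, indices, MPI_STATUSES_IGNORE);
        if (completed == MPI_UNDEFINED)   /* no active requests left */
            break;
        remaining -= completed;           /* finished requests become MPI_REQUEST_NULL */
        usleep(1000);                     /* 1 ms between polls; a longer interval also
                                             delays internal progression, not just detection */
    }
    free(indices);
}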
Some of the Open MPI developers, notably Jeff Squyres, visit Stack Overflow every now and then. He could possibly provide more details.
This behaviour is hardly specific to Open MPI. Unless heavily hardware-assisted, MPI is usually implemented following the same methods.

Related

Does a large Max Degree Of Parallelism cause queuing?

I would like to know if my understanding is correct that setting a Max Degree Of Parallelism (MDOP) value larger than a machine's available processor count causes the queueing effect I have described below.
Please treat this as a purely asynchronous I/O operation:
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
If there is a requirement for 100 http end points to be called and the MDOP is set to 100, this will create 100 http request tasks at the same time, all running in parallel. The problem is that only 16 will ever be handled at once, meaning the rest are effectively queued and will be handled once a processor frees up, resulting in an increased response time. Also, the process will be slowed down further because other parts of the system also demand use of the 16 available processors.
Setting the MDOP to half the available processor count (8, for example, on a 16 processor machine) means that 8 http request tasks will be in flight at any one time. The response times of the 8 requests will be minimal because there is no queueing of the tasks, as the set MDOP is well under the machine's available processor resources. Further to this, there are also another 8 processors available to handle any other tasks required by the machine.
The main difference is that the overall response time for 100 calls will be faster with an MDOP of 100, as all 100 tasks were started at the same time, whereas with 8 there are only ever 8 requests in flight at once.
The implicit assumptions made in the question are not correct.
IO operations generally come nowhere near saturating a core. Synchronous and asynchronous requests result in different behaviour. The former is not efficient and should be avoided. Neither should be limited by the number of available cores but rather by the maximum concurrency of the target device completing the IO operations, assuming the software stack is doing its job correctly.
For synchronous requests, most of the time is spent waiting for the operation to complete. For example, for a network operation, the OS sends the request buffer to the NIC, which sends it asynchronously over the network link. It takes some time to be sure the data has been sent, so the NIC needs to wait a bit before it can mark the sending request as completed. It also sometimes needs to wait for the link to be ready. During this time, the processor can be free and can actually queue new requests to the NIC. Not to mention that the response to the request will take a significant time (during which neither the processor nor the link works for this specific request). When a synchronous operation needs to wait for the target device, the IO scheduler of the OS does a context switch (assuming the user code does a proper passive wait). This enables the processor to start new IO requests of other threads or to overlap the IO requests with computation when the load is high. If there are not enough threads to do IO operations, then that is the main issue, not the number of cores itself. Increasing the number of threads is not efficient: it just increases the number of context switches and thread migrations, resulting in significant overheads. Asynchronous operations should be used instead. Regarding the OS stack, asynchronous operations may also cause many context switches, but these are generally scheduled more efficiently by the OS. Moreover, using asynchronous IO operations removes the artificial limitation of the number of threads (i.e. the maximum degree of parallelism).
For asynchronous operations, one thread can start a lot of IO requests before any of them complete. Having more cores does not directly mean more requests can be completed in a given fixed time. This is only true if the OS IO stack is truly parallel and if the operations are limited by the OS stack rather than by the concurrency of the target device (this tends to be true nowadays, for example on SSDs, which are massively parallel). The thing is that modern processors are very fast, so a few threads should theoretically be enough to saturate the queue of most target devices, although in practice not all OS stacks are designed efficiently for modern IO devices.
Every software and hardware stack has a maximum degree of parallelism meant to saturate the device and so to mitigate the latency of IO requests. Because IO latency is generally high, IO request queues are large. "Queuing" does not mean much here, since requests are eventually queued anyway. The question is whether they are queued in the OS stack rather than in the device, that is, whether the degree of parallelism of the software stack (including the OS) is bigger than that of the target device (which may or may not truly process the incoming requests in its queue in parallel). The answer is generally yes if the target application sends a lot of requests and the OS stack does not provide any mechanism to regulate the amount of incoming requests. That being said, some APIs provide such a mechanism or even guarantee it (asynchronous IO ring buffers are a good example).
Put shortly, it depends on the exact target device, the target operating system, the OS API/stack used, as well as the application itself. The system can be seen as a big platform-dependent dataflow where queues are everywhere, so one needs to carefully specify what "MDOP" and "queuing" mean in this context.
You cannot expect anyone to know what you mean by MDOP unless you mention the precise technology in the context of which you are using this term. Microsoft SQL Server has a concept of MDOP, but you are talking about HTTP requests, so you are probably not talking about MSSQL. So, what are you talking about? Anyway, on with the question.
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
No, it doesn't mean that. It means that the computer can execute 16 CPU instructions simultaneously. (If we disregard pipelines, superscalar pipelines, memory contention, etc.) A "Task" is a very high-level concept which involves all sorts of things besides just executing CPU instructions. For example, it involves waiting for I/O to complete, or even waiting for events to occur, events which might be raised by other tasks.
When a system allows you to set the value of some concept such as a "degree of parallelism", this means that there is no silver bullet for that value, so depending on the application at hand, different values will yield different performance benefits. Knowing your specific usage scenario you can begin with an educated guess for a good value, but the only way to know the optimal value is to try and see how your actual system performs.
Specifically with degree of parallelism, it depends on what your threads are doing. If your threads are mostly computation-bound, then a degree of parallelism close to the number of physical cores will yield best results. If your threads are mostly I/O bound, then your degree of parallelism should be orders of magnitude larger than the number of physical cores, and the ideal value depends on how much memory each thread is consuming, because with too many threads you might start hitting memory bottlenecks.
Proof: check how many threads are currently alive on your computer. (Use some built-in system monitor utility, or download one if need be.) Notice that you have thousands of threads running. And yet look at your CPU utilization: it is probably close to zero. That's because virtually all of those thousands of threads are doing nothing most of the time but waiting for stuff to happen, like for you to press a key or for a network packet to arrive.
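As a language-neutral illustration of the idea (this is not the .NET MDOP machinery itself), here is a hedged C sketch in which a counting semaphore caps how many I/O-bound tasks are in flight at once; NUM_TASKS, MAX_IN_FLIGHT and do_request() are made-up placeholders. Tasks beyond the cap simply block on the semaphore, which is the "queuing" effect the question describes.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NUM_TASKS      100
#define MAX_IN_FLIGHT    8   /* stands in for the "degree of parallelism" */

static sem_t slots;

/* do_request() is a made-up placeholder for one blocking I/O call. */
static void do_request(int id)
{
    printf("request %d\n", id);
}

static void *task(void *arg)
{
    int id = *(int *)arg;
    sem_wait(&slots);        /* queue here while MAX_IN_FLIGHT tasks are already running */
    do_request(id);
    sem_post(&slots);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_TASKS];
    int ids[NUM_TASKS];

    sem_init(&slots, 0, MAX_IN_FLIGHT);
    for (int i = 0; i < NUM_TASKS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, task, &ids[i]);
    }
    for (int i = 0; i < NUM_TASKS; i++)
        pthread_join(threads[i], NULL);
    sem_destroy(&slots);
    return 0;
}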

Why is threading used for sockets?

Ever since I discovered sockets, I've been using the nonblocking variants, since I didn't want to bother with learning about threading. Since then I've gathered a lot more experience with threading, and I'm starting to ask myself.. Why would you ever use it for sockets?
A big premise of threading seems to be that threads only make sense if they each work on their own set of data. Once you have two threads working on the same set of data, you will have situations such as:
if (!hashmap.hasKey("bar"))
{
    doStuff(); // <-- meanwhile another thread inserts "bar" into hashmap
    hashmap["bar"] = "foo"; // <-- our premise that the key didn't exist
                            //     (likely to avoid overwriting something) is now invalid
}
Now imagine the hashmap maps remote IPs to passwords. You can see where I'm going. I mean, sure, the likelihood of such thread interaction going wrong is pretty small, but it still exists, and to keep one's program secure, you have to account for every eventuality. This will significantly increase the effort going into design, compared to a simple, single-threaded workflow.
I can completely see how threading is great for working on separate sets of data, or for programs that are explicitly optimized to use threading. But for the "general" case, where the programmer is only concerned with shipping a working and secure program, I cannot find any reason to use threading over polling.
But seeing as the "separate thread" approach is extremely widespread, maybe I'm overlooking something. Enlighten me! :)
There are two common reasons for using threads with sockets, one good and one not-so-good:
The good reason: Because your computer has more than one CPU core, and you want to make use of the additional cores. A single-threaded program can only use a single core, so with a heavy workload you'd have one core pinned at 100%, and the other cores sitting unused and going to waste.
The not-so-good reason: You want to use blocking I/O to simplify your program's logic -- in particular, you want to avoid dealing with partial reads and partial writes, and keep each socket's context/state on the stack of the thread it's associated with. But you also want to be able to handle multiple clients at once, without slow client A causing an I/O call to block and hold off the handling of fast client B.
The reason the second one is not-so-good is that while having one thread per socket seems to simplify the program's design, in practice it usually complicates it. It introduces the possibility of race conditions and deadlocks, and makes it difficult to safely access shared data (as you mentioned). Worse, if you stick with blocking I/O, it becomes very difficult to shut the program down cleanly (or to affect a thread's behavior in any other way from anywhere other than the thread's socket), because the thread is typically blocked in an I/O call (possibly indefinitely) with no reliable way to wake it up. (Signals don't work reliably in multithreaded programs, and going back to non-blocking I/O means you lose the simplified program structure you were hoping for.)
In short, I agree with cib -- multithreaded servers can be problematic and therefore should generally be avoided unless you absolutely need to make use of multiple cores -- and even then it might be better to use multiple processes rather than multiple threads, for safety's sake.
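For concreteness, here is a rough C sketch of the one-thread-per-socket, blocking-I/O pattern described in the second reason; the port number and the echo-style handler are arbitrary, and error handling and shutdown logic (the hard parts) are omitted.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Rough sketch of the thread-per-client blocking-I/O pattern. The port and
   the echo-style handler are arbitrary examples; real servers also need
   error handling and a clean shutdown strategy, which is exactly where this
   model gets awkward. */
static void *handle_client(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    char buf[512];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)  /* blocks; per-client state lives on this stack */
        write(fd, buf, n);
    close(fd);
    return NULL;
}

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(7777);                 /* arbitrary port */
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 16);

    for (;;) {
        int *fd = malloc(sizeof *fd);
        *fd = accept(listener, NULL, NULL);      /* one new thread per accepted client */
        pthread_t t;
        pthread_create(&t, NULL, handle_client, fd);
        pthread_detach(t);
    }
}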
The biggest advantage of threads is to prevent accumulated lag time when processing requests. When polling, you use a loop to service every socket with a state change. For a handful of clients this is not very noticeable, but it could lead to significant delays when dealing with a large number of clients.
Assume that each transaction requires some pre-processing and post-processing (depending on the protocol this may be a trivial amount of processing, or it could be relatively significant, as is the case with BEEP or SOAP). The combined time to pre-process/post-process requests could lead to a backlog of pending requests.
For illustration purposes, imagine that the pre-processing, processing, and post-processing stages of a request each consume 1 millisecond, so that the total request takes 3 milliseconds to complete. In a single-threaded environment the system would become overwhelmed if incoming requests exceed 334 requests per second (since it would take 1.002 seconds to service all requests received within a 1 second period), leading to a time deficit of 0.002 seconds each second. However, if the system were using threads, then it would be theoretically possible to require only 0.336 seconds of processing time (0.334 for shared data access + 0.001 pre-processing + 0.001 post-processing) to complete all of the requests received in a 1 second period.
Although it is theoretically possible to process all requests in 0.336 seconds, this would require each request to have its own thread. More reasonable would be to divide the combined pre/post-processing time (0.668 seconds) by the number of configured threads and add the shared processing time (0.334 seconds). For example, using the same 334 incoming requests and processing times, theoretically 2 threads would complete all requests in 0.668 seconds (0.668 / 2 + 0.334), 4 threads in 0.501 seconds, and 8 threads in 0.418 seconds.
If the highest request volume your daemon receives is relatively low, then a single-threaded implementation with non-blocking I/O is sufficient; however, if you expect occasional bursts of a high volume of requests then it is worth considering a multi-threaded model.
I've written more than a handful of UNIX daemons which have relatively low throughput, and I've used a single-threaded model for its simplicity. However, when I wrote a custom netflow receiver for an ISP, I used a threaded model for the daemon, and it was able to handle peak times of Internet usage with minimal bumps in system load average.

measuring http request time with node.js

I use node.js to send an http request. I have a requirement to measure how much time it took.
start = getTime()
http.send(function(data) {end=getTime()})
If I call getTime inside the http response callback, there is the risk that my callback is not called immediately when the response comes back, due to other events in the queue. Such a risk also exists if I use regular Java or C# synchronous code for this task, since maybe another thread gets attention before mine.
start = getTime()
http.send()
end=getTime()
How does node.js compare to other (synchronous) platforms - does it make my chances of a good measurement better or worse?
Great observations!
Theory:
If you are performing micro-benchmarking, there are a number of considerations which can potentially skew the measurements:
1. Other events in the event loop which are ready to fire along with the http send in question, and which get executed sequentially before the send gets a chance - node specific.
2. Thread / process switching, which can happen at any time within the span of the send operation - generic.
3. The kernel's I/O buffers being limited in size, causing arbitrary delays - OS / workload / system-load specific.
4. Latency incurred in gathering the system time - language / runtime specific.
5. Chunking / buffering of data: socket [http implementation] specific.
Practice:
Node suffers from (1), while a dedicated thread in Java / C# does not have this issue. But as node implements an event-driven, non-blocking I/O model, other events will not cause blocking effects; rather they will be placed into the event queue. Only the ones which are ready will get fired, and the latency incurred due to them will be a function of how much I/O work they have to carry out, and any CPU-bound actions performed in their associated callbacks. These, in practice, would be negligible and evened out in the comparison, due to the more visible effects of items (2) to (5). In addition, writes are generally non-blocking, which means they will be carried out without waiting for the next loop iteration to run. And finally, when the write is carried out, the callback is issued in-line and sequentially; there is no yielding to another event in between.
In short, if you compare a dedicated Java thread doing blocking I/O with Node code, you will see Java's measurements look good, but in large-scale applications the thread context-switching effort will offset this gain, and Node's performance will stand out.
Hope this helps.

Why are message queues used instead of multithreading?

I have the following query which I need someone to please help me with. I'm new to message queues and have recently started looking at the Kestrel message queue.
As I understand it, both threads and message queues are used for concurrency in applications, so what is the advantage of using message queues over multithreading?
Please help
Thank you.
Message queues allow you to communicate outside your program.
This allows you to decouple your producer from your consumer. You can spread the work to be done over several processes and machines, and you can manage/upgrade/move around those programs independently of each other.
A message queue also typically consists of one or more brokers that take care of distributing your messages and making sure the messages are not lost in case something bad happens (e.g. your program crashes, you upgrade one of your programs, etc.).
Message queues might also be used internally in a program, in which case it's often just a facility to exchange/queue data from a producer thread to a consumer thread to do async processing.
Actually, one facilitates the other. A message queue is a nice and simple multithreading pattern: when you have a control thread (usually, but not necessarily, an application's main thread) and a pool of (usually looping) worker threads, message queues are the easiest way to facilitate control over the thread pool.
For example, to start processing a relatively heavy task, you submit a corresponding message into the queue. If you have more messages than you can currently process, your queue grows; if fewer, it shrinks. When your message queue is empty, your threads sleep (usually by blocking on a mutex).
So, there is nothing to compare: message queues are part of multithreading and hence they're used in some more complicated cases of multithreading.
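To make the pattern concrete, here is a hedged C sketch of an in-process message queue feeding a pool of worker threads; the queue capacity, the int message type and the worker count are arbitrary choices, and a production queue would also handle the full-queue and shutdown cases.

#include <pthread.h>
#include <stdio.h>

#define CAPACITY 64
#define WORKERS   4

/* Sketch of the control-thread / worker-pool pattern described above: an
   in-process message queue protected by a mutex, with workers sleeping on a
   condition variable while the queue is empty. */
static int queue[CAPACITY];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void submit(int msg)                       /* called by the control thread */
{
    pthread_mutex_lock(&lock);
    if (count < CAPACITY) {                /* a real queue would block or grow here */
        queue[tail] = msg;
        tail = (tail + 1) % CAPACITY;
        count++;
        pthread_cond_signal(&not_empty);
    }
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)                 /* workers sleep while the queue is empty */
            pthread_cond_wait(&not_empty, &lock);
        int msg = queue[head];
        head = (head + 1) % CAPACITY;
        count--;
        pthread_mutex_unlock(&lock);
        printf("processing message %d\n", msg);   /* stands in for the heavy task */
    }
    return NULL;
}

int main(void)
{
    pthread_t pool[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < 100; i++)
        submit(i);
    pthread_join(pool[0], NULL);           /* workers loop forever in this sketch */
    return 0;
}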
Creating threads is expensive, and every thread that is simultaneously "live" will add a certain amount of overhead, even if the thread is blocked waiting for something to happen. If program Foo has 1,000 tasks to be performed and doesn't really care in what order they get done, it might be possible to create 1,000 threads and have each thread perform one task, but such an approach would not be terribly efficient. A second alternative would be to have one thread perform all 1,000 tasks in sequence. If there were other processes in the system that could employ any CPU time that Foo didn't use, this latter approach would be efficient (and quite possibly optimal), but if there isn't enough work to keep all CPUs busy, CPUs would waste some time sitting idle. In most cases, leaving a CPU idle for a second is just as expensive as spending a second of CPU time (the main exception is when one is trying to minimize electrical energy consumption, since an idling CPU may consume far less power than a busy one).
In most cases, the best strategy is a compromise between those two approaches: have some number of threads (say 10) that start performing the first ten tasks. Each time a thread finishes a task, have it start work on another until all tasks have been completed. Using this approach, the overhead related to threading will be cut by 99%, and the only extra cost will be the queue of tasks that haven't yet been started. Since a queue entry is apt to be much cheaper than a thread (likely less than 1% of the cost, and perhaps less than 0.01%), this can represent a really huge savings.
The one major problem with using a job queue rather than threading is that if some jobs cannot complete until jobs later in the list have run, it's possible for the system to become deadlocked since the later tasks won't run until the earlier tasks have completed. If each task had been given a separate thread, that problem would not occur since the threads associated with the later tasks would eventually manage to complete and thus let the earlier ones proceed. Indeed, the more earlier tasks were blocked, the more CPU time would be available to run the later ones.
It makes more sense to contrast message queues with other concurrency primitives, such as semaphores, mutexes, condition variables, etc. They can all be used in the presence of threads, though message-passing is also commonly used in non-threaded contexts, such as inter-process communication, whereas the others tend to be confined to inter-thread communication and synchronisation.
The short answer is that message-passing is easier on the brain. In detail...
Message-passing works by sending stuff from one agent to another. There is generally no need to coordinate access to the data. Once an agent receives a message it can usually assume that it has unqualified access to that data.
The "threading" style works by giving all agents open-slather access to shared data but requiring them to carefully coordinate their access via primitives. If one agent misbehaves, the process becomes corrupted and all hell breaks loose. Message passing tends to confine problems to the misbehaving agent and its cohort, and since agents are generally self-contained and often programmed in a sequential or state-machine style, they tend not to misbehave as often — or as mysteriously — as conventional threaded code.

Thread vs async execution. What's different?

I believed that any kind of asynchronous execution spins up a thread in some invisible area. But if so,
Async code does not offer any performance gain over threaded code.
But then I can't understand why so many developers are making so many features asynchronous.
Could you explain the difference between them, and their costs?
The purpose of an asynchronous execution is to prevent the code calling the asynchronous method (the foreground code) from being blocked. This allows your foreground code to go on doing useful work while the asynchronous thread is performing your requested work in the background. Without asynchronous execution, the foreground code must wait until the background task is completed before it can continue executing.
The cost of an asynchronous execution is the same as that of any other task running on a thread.
Typically, an async result object is registered with the foreground code. The async result object can either raise an event when the background task is completed, or the foreground code can periodically check the async result object to see if its completion flag has been set.
Concurrency does not necessarily require threads.
In Linux, for example, you can perform non-blocking syscalls. Using this type of call, you can, for instance, start a number of network reads. Your code can keep track of the reads manually (using handles in a list or similar) and periodically ask the OS if new data is available on any of the connections. Internally, the OS also keeps a list of ongoing reads. Using this technique, you can thus achieve concurrency without any (extra) threads, neither in your program nor in the OS.
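As a rough illustration of that technique, the following C sketch lets a single thread watch several already-connected, non-blocking sockets with poll(); buffer handling, error paths and the 64-descriptor cap are simplifications.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Rough sketch of single-threaded concurrency over non-blocking sockets
   using poll(). Assumes fds[] holds at most 64 already-connected sockets
   set to O_NONBLOCK; error handling and buffering are trimmed. */
void serve(const int fds[], int nfds)
{
    struct pollfd pfds[64];
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;               /* "tell me when data is readable" */
    }

    for (;;) {
        poll(pfds, nfds, 1000);                /* ask the OS which connections are ready */
        for (int i = 0; i < nfds; i++) {
            if (pfds[i].revents & POLLIN) {
                char buf[512];
                ssize_t n = read(pfds[i].fd, buf, sizeof buf);   /* returns immediately */
                if (n > 0)
                    printf("fd %d: %zd bytes\n", pfds[i].fd, n);
            }
        }
    }
}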
If you use threads and blocking IO, you would typically start one thread per read. In this scenario, the OS will instead have a list of ongoing threads, which it parks when a thread tries to read data and there is none available. Threads are resumed as data becomes available.
Having the OS switch between threads might involve slightly more overhead in the form of context switching - switching program counter and register content. But the real deal breaker is usually stack allocation per thread. This size is a couple of megabytes by default on Linux. If you have a lot of concurrency in your program, this might push you in the direction of using non-blocking calls to handle more concurrency per thread.
So it is possible to do async programming without threads. If you want to do async programming using only blocking OS calls you need to dedicate a thread to do the blocking while you continue. But if you use non-blocking calls you can do a lot of concurrent things with just a single thread. Have a look at Node.js, which has great support for many concurrent connections while being single-threaded for most operations.
Also check out Golang, which achieves a similar effect using a sort of green thread called goroutines. Multiple goroutines run concurrently on the same OS thread and they use very little stack memory, pushing the limit much further.
Async code does not offer any performance gain over threaded code.
Asynchronous execution is one of the traits of multi-threaded execution, which is becoming more relevant as processors pack in more cores.
For servers, multi-core is only mildly relevant, as they are already written with concurrency in mind and will scale naturally, but multi-core is particularly relevant for desktop apps, which traditionally do only a few things concurrently - often just one foreground task with a background thread. Now, they have to be coded to do many things concurrently if they are to take advantage of the power of the multi-core CPU.
As to the performance - on a single core - asynchronous tasks slow down the system as much as they would if run sequentially (this is a simplification, but true for the most part). So, running task A, which takes 10s, and task B, which takes 5s, on a single core, the total time needed will be 15s, whether B is run asynchronously or not. The reason is that as B runs, it takes away CPU resources from A - A and B compete for the same CPU.
With a multi-core machine, additional tasks run on otherwise unused cores, and so the situation is different - the additional tasks don't really consume any time - or, more correctly, they don't take away time from the core running task A. So, running tasks A and B asynchronously on multi-core will consume just 10s - not 15s as with a single core. B's execution runs at the same time as A, and on a separate core, so A's execution time is unaffected.
As the number of tasks and cores increase, then the potential improvements in performance also increase. In parallel computing, exploiting parallelism to produce an improvement in performance is known as speedup.
We are already seeing 64-core CPUs, and it's estimated that 1024-core CPUs will be commonplace in a few years. That's a potential speedup of 1024 times, compared to the single-threaded synchronous case. So, to answer your question, there clearly is a performance gain to be had by using asynchronous execution.
I believed that any kind of asynchronous execution spins up a thread in some invisible area.
This is your problem - this actually isn't true.
The thing is, your whole computer is actually massively asynchronous - requests to RAM, communication via a network card, accessing a HDD... those are all inherently asynchronous operations.
Modern OSes are actually built around asynchronous I/O. Even when you do a synchronous file request, for example (e.g. File.ReadAllText), the OS sends an asynchronous request. However, instead of giving control back to your code, it blocks while it waits for the response to the asynchronous request. And this is where proper asynchronous code comes in - instead of waiting for the response, you give the request a callback - a function to execute when the response comes back.
For the duration of the asynchronous request, there is no thread. The whole thing happens on a completely different level - say, the request is sent to the firmware on your NIC, and given a DMA address to fill the response. When the NIC finishes your request, it fills the memory, and signals an interrupt to the processor. The OS kernel handles the interrupt by signalling the owner application (usually an IOCP "channel") the request is done. This is still all done with no thread whatsoever - only for a short time right at the end, a thread is borrowed (in .NET this is from the IOCP thread pool) to execute the callback.
So, imagine a simple scenario. You need to send 100 simultaneous requests to a database engine. With multi-threading, you would spin up a new thread for each of those requests. That means a hundred threads, a hundred thread stacks, the cost of starting a new thread itself (starting a new thread is cheap - starting a hundred at the same time, not so much), quite a bit of resources. And those threads would just... block. Do nothing. When the response comes, the threads are awakened, one after another, and eventually disposed.
On the other hand, with asynchronous I/O, you can simply post all the requests from a single thread - and register a callback when each of those is finished. A hundred simultaneous requests will cost you just your original thread (which is free for other work as soon as the requests are posted), and a short time with threads from the thread pool when the requests are finished - in "worst" case scenario, about as many threads as you have CPU cores. Provided you don't use blocking code in the callback, of course :)
This doesn't necessarily mean that asynchronous code is automatically more efficient. If you only need a single request, and you can't do anything until you get a response, there's little point in making the request asynchronous. But most of the time, that's not your actual scenario - for example, you need to maintain a GUI in the meantime, or you need to make simultaneous requests, or your whole code is callback-based, rather than being written synchronously (a typical .NET Windows Forms application is mostly event-based).
The real benefit from asynchronous code comes from exactly that - simplified non-blocking UI code (no more "(Not Responding)" warnings from the window manager), and massively improved parallelism. If you have a web server that handles a thousand requests simultaneously, you don't want to waste 1 GiB of address space just for the completely unnecessary thread stacks (especially on a 32-bit system) - you only use threads when you have something to do.
So, in the end, asynchronous code makes UI and server code much simpler. In some cases, mostly with servers, it can also make it much more efficient. The efficiency improvements come precisely from the fact that there is no thread during the execution of the asynchronous request.
Your comment only applies to one specific kind of asynchronous code - multi-threaded parallelism. In that case, you really are wasting a thread while executing a request. However, that's not what people mean when saying "my library offers an asynchronous API" - after all, that's a 100% worthless API; you could have just called await Task.Run(TheirAPIMethod) and gotten the exact same thing.
