Boost: multithread performance, reuse of threads/sockets

I'll first describe my task and then present my questions below.
I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each module has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads with 320 synchronous TCP receivers, on a 1 GbE NIC with 8 CPU cores.
The 320 threads do not have to do any computing on the incoming data: they only receive the data, generate and attach a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.
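In outline, each receiver thread does something like the following (a stripped-down sketch rather than my actual code; the storage type and buffer handling are simplified):

// Sketch of one receiver thread: it owns one socket on its dedicated port,
// blocks until data arrives, then timestamps and stores the packet locally.
#include <boost/asio.hpp>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

using boost::asio::ip::tcp;

void receiver_thread(unsigned short port)
{
    boost::asio::io_service io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));
    tcp::socket socket(io);
    acceptor.accept(socket);                       // one long-lived connection

    std::vector<char> packet(1500);                // <= standard MTU
    std::vector<std::pair<std::chrono::steady_clock::time_point,
                          std::vector<char> > > store;   // thread-owned memory

    for (;;)
    {
        boost::system::error_code ec;
        std::size_t n = socket.read_some(boost::asio::buffer(packet), ec);
        if (ec)
            break;                                 // connection closed or error
        store.push_back(std::make_pair(
            std::chrono::steady_clock::now(),      // timestamp on arrival
            std::vector<char>(packet.begin(), packet.begin() + n)));
    }
}

320 such threads are launched into a boost::thread_group, one per port.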
Our requirement is that the threads should read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-sized packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets arrive less than a few microseconds apart). When the number of data packets is very small (fewer than 10), I find that the threads' timestamps are separated by a few microseconds. However, with more than 10 packets the timestamps are spread by as much as 0.7 s.
My questions are:
Have I totally misunderstood the C10K issue and messed up the implementation? 320 does seem trivial compared to C10K.
Any hints as to what's going wrong?
Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)

320 threads is chump change in terms of resources, but the scheduling may pose issues.
320*0.25 = 80 requests per second, implying at least 80 context switches per second because you decided you must have each connection on its own thread.
I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).
Q. Having read about the C10K and this post I expected that each thread will easily process the equivalent of at least 1K MTU-sized packets every second
Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, competing for a physical core.
So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.
That is, assuming you already use the async_* versions of the ASIO API functions.
I think the vast majority, if not all, of the Boost Asio examples of asynchronous IO show how to run the service on a limited number of threads only.
http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/examples.html
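For illustration only, the shape of such a solution might look like this (a minimal sketch, not one of the linked examples; acceptor setup and session bookkeeping are omitted):

// Sketch: all connections share one io_service, serviced by a small pool of
// threads (e.g. one per core) instead of one thread per connection.
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <array>
#include <cstddef>
#include <memory>

using boost::asio::ip::tcp;

struct session : std::enable_shared_from_this<session>
{
    explicit session(boost::asio::io_service& io) : socket(io) {}

    void start()                                   // queue the next async read
    {
        auto self = shared_from_this();
        socket.async_read_some(boost::asio::buffer(packet),
            [self](boost::system::error_code ec, std::size_t n)
            {
                if (!ec)
                {
                    // timestamp and store the n received bytes here,
                    // then go back to waiting for the next packet
                    self->start();
                }
            });
    }

    tcp::socket socket;
    std::array<char, 1500> packet;
};

int main()
{
    boost::asio::io_service io;
    // ...set up the acceptors and call session::start() on each accepted socket...

    boost::thread_group pool;                      // a handful of threads, not 320
    for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
        pool.create_thread([&io] { io.run(); });   // all of them drain the same queue
    pool.join_all();
}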

Related

Is Nginx's approach to CPU scalability (a per-process epoll event queue) optimal?

Nginx's approach to CPU scalability is based on creating a number of almost independent processes, each owning an event queue, and then using SO_REUSEPORT to spread incoming connections, IRQs, and NIC packets over all cores relatively evenly.
Does it lead to better scalability (less kernel data sharing = fewer locks) than creating only one Linux process with an array of threads equal in number to the CPUs, and a per-thread event queue in every thread?
Here is an example of Nginx scaling up to only around 32 CPUs. Disabled hyper-threading and a total of 36 real cores could be the main reason for this, as well as relative NIC saturation or even a drop in core clock speed due to overheating:
https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/
Also: https://dzone.com/articles/inside-nginx-how-we-designed
Theoretically, purely asynchronous calls, in a situation where you don't use (red) threads and don't need to share data, would be better than using (red) threads, because you avoid the context-switching overhead of bouncing in and out of the kernel just to switch to another thread. You may also be less likely to get contention (threads can accidentally share something internal, such as a cache line).
In reality it could go either way, depending on the program in question, the vagaries of the programming language, the kernel, whether the threads are red or green, the hardware, the task, the skill of the programmer, etc.
Coming back to your original question, NGINX's approach is going to be good and the overheads are going to be low (contrast it with Apache, for example). For pure "packet pushing" it's an excellent low-overhead approach, but you may find a tradeoff when it comes to flexibility. It's also worth noting that NGINX can spin up a worker per core, so at that point it can reap the benefits of affinity (less data moving around because everything is hopefully local) while still being lower overhead...
Depending on where the data is coming from and going to, you can likely best NGINX in specific scenarios (e.g. by using something like DPDK) or by using technology built around techniques like io_uring, but perhaps at some point in the future NGINX itself will adopt such technology...
So it looks like we can get hard data to answer this question by comparing Nginx to Envoy Proxy, because Envoy uses the architecture you are curious about:
Envoy uses a single process with multiple threads architecture. A single master thread controls various sporadic coordination tasks while some number of worker threads perform listening, filtering, and forwarding
While they were initially developed years apart to solve different problems, they currently have extremely similar capabilities and are often compared against one another.
Looking at one such comparison, Envoy showed better throughput and latency. Another comparison has Ambassador (based on Envoy) vs Nginx, and again Envoy shows better results.
Given this data, I'd say that yes, the single-process, event-loop-plus-thread-pool model (Envoy) seems to scale better than the multiple-processes-with-shared-IPC model (Nginx).
Does it lead to better scalability (less kernel data sharing = fewer locks) than creating only one Linux process with an array of threads equal in number to the CPUs, and a per-thread event queue in every thread?
From The SO_REUSEPORT socket option article:
The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases. In early discussions on SO_REUSEPORT, Tom noted that he was dealing with applications that accepted 40,000 connections per second.
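To make the mechanism concrete, each worker process (or thread) creates its own listening socket on the same port with SO_REUSEPORT set, and the kernel then spreads incoming connections across those listeners. A minimal sketch (Linux 3.9+ assumed, error handling omitted):

// Sketch: every worker calls this for the same port; the kernel load-balances
// incoming connections across all the resulting listening sockets.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_reuseport_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;   // each worker then runs its own accept/epoll loop on this fd
}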

Java Threads and number of Cores

Is it recommended that the number of threads in a Java application should be less than the number of CPU cores?
If so, why is this the case, and what are the implications of using more threads than the number of CPU cores?
You will probably not get any definitive answer to the question of how many threads an app should have, generally speaking, in relation to the number of cores the underlying computer has.
One may also argue that, in the age of PaaS software design and elastic clusters, the notion of a fixed number of cores for any given process might be overrated.
Still, the first part of your question:
Is it recommended that the number of threads in a Java application should be less than the number of CPU cores?
This has a definitive answer, which is "no" (once more, as a general rule). And the reason why, in short, is that not all created threads are typically running (or, maybe more importantly, runnable) at once, meaning there is an opportunity to optimize here.
As a support for this discussion, I'll contrast two ways of building apps, which you could call "classical" versus "reactive", although this is not a universally accepted division. Still, let's use it as a framework.
Classical application design
I label as classical those applications that rely mostly on "blocking" calls and/or the "thread per request" pattern. Consider the traditional way I/O is done (socket communication like HTTP or database connections, hard-drive-based file reading, ...): your app thread calls some kind of read or write method, which usually triggers an OS-level call that blocks your app thread and fills some device buffer at the OS level (say, reading from a disk). Once the buffer has received enough data, the OS signals your Java app and thread to resume activity, and the read method returns with the data from the buffer.
The whole time the OS is working (usually just a tiny fraction of a second, but still a large amount of time compared to your typical GHz CPU speed), your Java thread is in a blocked/waiting state, waiting for the OS to signal that it can resume. This happens all the time. A code-profiling tool, like JProfiler or YourKit, can help you measure this time. If you do so, you'll notice that in many apps doing I/O, this is a significant part of the so-called "wall time" or "clock time" that is spent... waiting.
So we have one thread waiting, meaning it is not using any CPU time. It can be scheduled out, and the OS is free to give CPU time to anybody else.
Suppose this is a one-core CPU; then NOW would be a good time to have another thread to feed the CPU. Meaning that having two or more threads could be a good design to maximize CPU usage even on a single-core CPU, and get the most out of your hardware.
Most "classical" web applications are typically subject to this type of CPU underuse if you follow the rule of "one thread per CPU core", because Socket communications (or more typically : the time spent waiting for a response to your SQL queries) will incur so much blocking.
If you raise the number of threads your app has, then even if one or two long-running requests remain waiting, other, faster requests will have runnable threads to run them, and you'll get better CPU usage and better performance (number of concurrent requests). That is... until something else reaches saturation (too many heavy requests on your DB, too many simultaneous hard-drive reads/writes...).
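To illustrate the point outside of any particular framework, here is a minimal sketch in C++ (the sleep stands in for a blocking I/O call; the thread counts are arbitrary):

// Sketch: a pool with more threads than cores serving blocking requests.
// While a thread is blocked in an I/O call it consumes no CPU, so the
// scheduler can run another thread and keep the cores busy.
#include <chrono>
#include <thread>
#include <vector>

void handle_request()
{
    // stand-in for a blocking socket/DB/disk call
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    // ...followed by a short burst of actual computation on the result...
}

int main()
{
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < 4 * cores; ++i)       // deliberately more threads than cores
        pool.emplace_back([] { for (int r = 0; r < 100; ++r) handle_request(); });
    for (auto& t : pool)
        t.join();
}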
Reactive app design
Recognizing this typical behavior of apps, and using different sets of OS features, some application frameworks now use non-blocking patterns (even for I/O) to mitigate the above issues. Examples in the Java ecosystem are NIO-based networking stacks like Netty, or actor-pattern implementations like Akka.
In a typical "reactive" app, one usually abandons the "thread per request" pattern that we have in classical apps (meaning one thread is responsible for handling everything from start to finish of a given user request, waiting when need be for external resources to become available), in favor of a vastly more modular and non-blocking approach.
Threads are given finer-grained bits of work to do, and each thread hands off work to other threads, using callbacks to hear back when the work it depends on is done. This "handing off" of units of work means each thread can quickly grab new units of work it is able to handle. That means one of two things: either you achieve higher CPU usage with far fewer threads in your app (because each can grab work more efficiently, instead of just sitting and waiting), or you can instantiate many, many more threads because they'll mostly be waiting (not saturating the CPUs), and the dynamic hand-off will still allow for good CPU usage.
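As a rough illustration of the hand-off idea (a C++ sketch using boost::asio::thread_pool; real frameworks like Netty or Akka add mailboxes, back-pressure and much more):

// Sketch: small units of work are posted to a fixed pool and each stage hands
// off to the next via a callback, instead of one thread owning a whole request.
#include <boost/asio.hpp>
#include <iostream>

int main()
{
    boost::asio::thread_pool pool(4);              // a handful of threads in total

    auto stage2 = [](int result) { std::cout << "done: " << result << "\n"; };

    boost::asio::post(pool, [&pool, stage2]
    {
        int partial = 21 * 2;                      // stage 1: one small unit of work
        boost::asio::post(pool, [=] { stage2(partial); });   // hand off to stage 2
    });

    pool.join();                                   // wait for all posted work
}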
Conclusion
Anyway, you don't design the number of threads solely based on the number of available cores. The nature of your implementation and workload dictates the optimal number of threads to create.
Under a classical app-design philosophy, the two numbers are more closely related than under a reactive one, but we still have different profiles:
a very simple server app can accommodate many more threads than CPU cores, because it will allow for better throughput (the limit being, say, the outgoing network bandwidth).
a SQL-heavy app should be scaled to the point where your app server saturates the SQL backend; as your app server will mostly be waiting on your SQL server, that is the limit.
a mixed application, consisting of some SQL-heavy work and some lightweight work, will need precise tuning, because you don't want the stuck threads (those blocked waiting for the DB) starving the light requests that would otherwise be served more rapidly.
a compute-intensive program (say, a cryptography service) will probably benefit from a number of threads close to the number of CPU cores (if the algorithm is implemented in a classical way), because creating more threads than you are able to run is pointless. In an actor-based implementation, creating more threads could actually be a win.

Does Akka.IO Tcp have a bottleneck in bidirectional communication?

EDIT: This is a duplicate of Does Akka Tcp support full-duplex communication? (please don’t ask the same question multiple times, the same goes for duplicating on mailing lists, this wastes the time of those who volunteer their help, reducing your chances of getting answers in the future)
I've modified the Echo server from https://github.com/akka/akka/blob/master/akka-docs/rst/scala/code/docs/io/EchoServer.scala#L96
case Received(data) =>
  // reply on the same connection actor that delivered the data
  connection ! Write(data, Ack(currentOffset))
  log.debug("same {}", sender.eq(connection)) // true
  buffer(data) // keep the data until the corresponding Ack arrives
That means incoming and outgoing messages are handled by the same actor, so a single worker thread (the one that takes messages from the mailbox) will process both read and write operations. That looks like a potential bottleneck.
In "classical" world I can create one thread to read from a socket and another for a writing and get simultaneous communication.
Update
Discussion in google group https://groups.google.com/forum/#!topic/akka-dev/mcs5eLKiAVQ
While there is a single Actor that either reads or writes at any given point in time, each of these operations takes very few cycles since it only occurs when there are data to be read or buffer space available to be written to. The system call overhead of ~1µs means that with the default buffer sizes of 128kiB you should be able to transfer up to 100GiB/s in total, which sure is a bottleneck but probably not today and in practice (this roughly coincides with typical CPU memory bandwidth, so more data rate is currently impossible anyway). Once this changes we can split the reading and writing responsibilities between different selectors and wake up different Actors, but before doing that we’ll need to verify that there actually is a measurable effect.
The other question that needs answering is which operating system kernels actually allow concurrent operations on a single socket from multiple threads. I have not researched this yet, but I would not be surprised to find that fully independent locking will be hard to do and there might not (yet) be a reason to expend that effort.

libevent / epoll number of worker threads?

I am following this example. Line #37 says that the number of worker threads should be equal to the number of CPU cores. Why is that so?
If there are 10k connections and my system has 8 cores, does that mean 8 worker threads will be processing 10k connections? Why shouldn't I increase this number?
Context Switching
It takes a little bit of time for an OS to context switch between threads. Having a lot of threads, each one doing comparatively little work, means that the context-switch time starts becoming a significant portion of the overall runtime of the application.
For example, it could take an OS about 10 microseconds to do a context switch; if the thread does only 15 microseconds worth of work before going back to sleep then 40% of the runtime is just context switching!
This is inefficient, and that sort of inefficiency really starts to show up when you're up-scaling as your hardware, power and cooling costs go through the roof. Having few threads means that the OS doesn't have to switch contexts anything like as much.
So in your case if your requirement is for the computer to handle 10,000 connections and you have 8 cores then the efficiency sweet spot will be 1250 connections per core.
More Clients Per Thread
In the case of a server handling client requests it comes down to how much work is involved in processing each client. If that is a small amount of work, then each thread needs to handle requests from a number of clients so that the application can handle a lot of clients without having a lot of threads.
In a network server this means getting familiar with the select() or epoll() system calls. When called, these will both put the thread to sleep until one of the mentioned file descriptors becomes ready in some way. However, if there are no other threads pestering the OS for runtime, the OS won't necessarily need to perform a context switch; the thread can just sit there dozing until there's something to do (at least that's my understanding of what OSes do; everyone, correct me if I'm wrong!). When some data turns up from a client, the thread can resume a lot faster.
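For reference, the core of such an epoll()-based loop looks roughly like this (a sketch with all error handling omitted; the listening socket is assumed to be set up elsewhere):

// Sketch: one thread dozes in epoll_wait() and services whichever connections
// become readable, instead of one blocked thread per connection.
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event ready[64];
    char buf[4096];

    for (;;)
    {
        int n = epoll_wait(ep, ready, 64, -1);     // the thread sleeps here
        for (int i = 0; i < n; ++i)
        {
            int fd = ready[i].data.fd;
            if (fd == listen_fd)                   // new client: register it too
            {
                int client = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            }
            else                                   // existing client is readable
            {
                // may deliver only part of what the client will eventually send
                ssize_t got = read(fd, buf, sizeof(buf));
                if (got <= 0)
                    close(fd);
                // ...otherwise hand the received bytes to the protocol handler...
            }
        }
    }
}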
This of course makes the thread's source code a lot more complicated. You can't do a blocking read of data from the clients, for instance; being told by epoll() that a file descriptor has become ready for reading does not mean that all the data you're expecting to receive from the client can be read immediately. And if the thread stalls because of a bug, it affects more than one client. But that's the price paid for attaining the highest possible efficiency.
And it's not necessarily the case that you would want just 8 threads to go with your 8 cores and 10,000 connections. If there's something your thread has to do for each connection every time it handles that connection, then that's an overhead that needs to be minimised (by having more threads and fewer connections per thread). [The select() system call is like that, which is why epoll() was invented.] You have to balance that overhead against the overhead of context switching.
10,000 file descriptors is a lot (too many?) for a single process in Linux, so you might have to have several processes instead of several threads. And then there's the small matter of whether the hardware is fundamentally able to support 10,000 connections within whatever response-time/connection requirements your system has. If it isn't, then you're forced to distribute your application across two or more servers, and that can start getting really complicated!
Understanding exactly how many clients to handle per thread depends on what the processing is doing, whether there's hard-disk activity involved, etc. So there's no single answer; it's different for different applications, and also for the same application on different machines. Tuning the clients-per-thread ratio to achieve peak efficiency is a really hard job. This is where profiling tools like dtrace on Solaris and ftrace on Linux (especially when used like this, which I've used a lot on Linux on x86 hardware) can help, because they allow you to understand at a very fine scale precisely what runtime is involved in your thread handling a request from a client.
Outfits like Google are of course very keen on efficiency; they get through a lot of electricity. I gather that when Google choose a CPU, hard drive, memory, etc. to put into their famously home grown servers they measure performance in terms of "Searches per Watt". Obviously you have to be a pretty big outfit before you get that fastidious about things, but that's the way things go ultimately.
Other Efficiencies
Handling things like TCP network connections can take up a lot of CPU time in its own right, and it can be difficult to understand whereabouts in a system all your CPU runtime has gone. For network connections, things like TCP offload in smarter NICs can bring a real benefit, because that frees the CPU from the burden of doing things like error-correction calculations.
TCP offload mirrors what we do in the high speed large scale real time embedded signal processing world. The (weird) interconnects that we use require zero CPU time to run them. So all of the CPU time is dedicated to processing data, and specialised hardware looks after moving data around. That brings about some quite astonishing efficiencies, so one can build a system with more modest, lower cost, less power hungry CPUs.
Language can have a radical effect on efficiency too. Things like Ruby, PHP, and Perl are all very well and good, but outfits that started with them and then grew rapidly have ended up moving to something more efficient like Java/Scala, C++, etc.
Your question is even better than you think! :-P
If you do networking with libevent, it can do non-blocking I/O on sockets. One thread could do this (using one core), but that would under-utilize the CPU.
But if you do "heavy" file I/O, then there is no non-blocking interface to the kernel. (Many systems have nothing for that at all; others have some half-baked, non-portable stuff in that field. Libevent doesn't use that.) If file I/O is bottlenecking your program/test, then a higher number of threads will make sense! If a hard disk is used, and the I/O scheduler is reordering requests to avoid disk-head moves, etc., it will depend on how many requests the scheduler takes into account to do its job best. 100 pending requests might work better than 8.
Why shouldn't you increase the thread number?
If non-blocking I/O is done: all cores are already working with thread count = core count. More threads only mean more thread switching with no gain.
For blocking I/O: you should increase it!
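In libevent terms, the usual pattern behind that advice is one event_base per worker thread, with the connections distributed among them. A minimal sketch (per-connection event registration and all error handling omitted):

// Sketch: each worker thread runs its own event_base; a connection is assigned
// to one base and all its I/O callbacks then run on that worker's thread.
#include <event2/event.h>
#include <sys/socket.h>
#include <thread>
#include <vector>

void on_readable(evutil_socket_t fd, short /*events*/, void* /*arg*/)
{
    char buf[4096];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);     // non-blocking: take what's there
    (void)n;                                       // ...and hand it to the application
}

int main()
{
    unsigned workers = std::thread::hardware_concurrency();   // ~ number of cores
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < workers; ++i)
    {
        threads.emplace_back([]
        {
            event_base* base = event_base_new();
            // ...for every connection owned by this worker:
            //    event* ev = event_new(base, fd, EV_READ | EV_PERSIST, on_readable, nullptr);
            //    event_add(ev, nullptr);
            event_base_dispatch(base);             // loop, sleeping until sockets are ready
            event_base_free(base);
        });
    }
    for (auto& t : threads)
        t.join();
}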

Would handling each TCP connection in a separate thread improve latency?

I have an FTP server, implemented on top of QTcpServer and QTcpSocket.
I take advantage of the signals and slots mechanism to support multiple TCP connections simultaneously, even though I have a single thread. My code returns to the event loop as soon as possible; it doesn't block (no wait functions), and it doesn't use nested event loops anywhere. That way I already have cooperative multitasking, like Win3.1 applications had.
But a lot of other FTP servers are multithreaded. Now I'm wondering if using a separate thread for handling each TCP connection would improve performance, and especially latency.
On one hand, threads add to latency because you need to start a new thread for each new connection, but on the other, with my cooperative multitasking, other TCP connections have to wait until I've returned to the main loop before their readyRead()/bytesWritten() signals can be handled.
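For reference, the connection handling is roughly like this (a trimmed-down sketch rather than my real code; the actual server parses FTP commands and manages per-connection state):

// Sketch: one thread, one event loop; every connection is serviced through
// signals/slots and every handler returns to the event loop as soon as it can.
#include <QCoreApplication>
#include <QHostAddress>
#include <QTcpServer>
#include <QTcpSocket>

int main(int argc, char** argv)
{
    QCoreApplication app(argc, argv);

    QTcpServer server;
    QObject::connect(&server, &QTcpServer::newConnection, [&server]
    {
        QTcpSocket* sock = server.nextPendingConnection();
        QObject::connect(sock, &QTcpSocket::readyRead, sock, [sock]
        {
            // handle whatever arrived and return immediately;
            // no waitFor...() calls, no nested event loops
            QByteArray cmd = sock->readAll();
            Q_UNUSED(cmd);
            sock->write("500 Not implemented\r\n");   // placeholder response
        });
    });
    server.listen(QHostAddress::Any, 2121);            // placeholder port

    return app.exec();
}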
In your current system, and ignoring file I/O time, one processor is always doing something useful if there's something useful to be done, and waiting ready to go if there's nothing useful to be done. If this were a single-processor (single-core) system, you would have maximized throughput. This is often a very good design, particularly for an FTP server, where you don't usually have a human waiting on a packet-by-packet basis.
You have also minimized average latency (for a single-processor system). What you do not have is consistent latency: measuring your system's performance is likely to show a lot of jitter, that is, a lot of variation in the time it takes to handle a packet. Again, because this is FTP and not real-time process control or human interaction, jitter may not be a problem.
Now, however, consider that there is probably more than one processor available on your system and that it may be possible to overlap I/O time and processing time.
To take full advantage of a multi-processor(core) system you need some concurrency.
This normally translates to using multiple threads, but it may be possible to achieve concurrency via asynchronous (non-blocking) file reads and writes.
However, adding multiple threads to a program opens up a huge can-of-worms.
If you do decide to go the MT route, I'd suggest that you consider depending on a thread-aware I/O library. Qt may provide that for you (I'm not sure). If not, take a look at boost::asio (or ACE for an older but still solid solution). You'll discover that using the MT capabilities of such a library involves a considerable investment in learning time; however, as it turns out, the time needed to add multithreading "by hand" and get it right is even worse.
So I'd say stay with your existing solution, unless you are worried about unused processor cycles and/or jitter, in which case start learning Qt's multithreading support or boost::asio.
Do you need to start a new thread for each new connection? Could you not just have a pool of threads that acts on requests as and when they arrive? This should reduce some of the latency. I have to say that, in general, a multi-threaded FTP server should be more responsive than a single-threaded one. Is it possible to have an event-based FTP server?
