Does Akka.IO Tcp have a bottleneck in bidirectional communication? - multithreading

EDIT: This is a duplicate of Does Akka Tcp support full-duplex communication? (Please don't ask the same question multiple times; the same goes for duplicating it on mailing lists. This wastes the time of those who volunteer their help and reduces your chances of getting answers in the future.)
I've modified the Echo server from https://github.com/akka/akka/blob/master/akka-docs/rst/scala/code/docs/io/EchoServer.scala#L96 as follows:
case Received(data) =>
  connection ! Write(data, Ack(currentOffset))
  log.debug("same {}", sender.eq(connection)) // true
  buffer(data)
That means incoming and outgoing messages are handled by the same actor, so a single worker thread (the one that takes messages from the mailbox) will process both read and write operations. That looks like a potential bottleneck.
In the "classical" world I can create one thread to read from a socket and another to write to it, and get simultaneous communication in both directions.
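For comparison, here is a minimal sketch of that "classical" two-thread approach with plain POSIX sockets (fd is assumed to be an already-connected TCP socket; the names and the lack of error handling are illustrative only, this is not taken from any of the posts above):

#include <sys/socket.h>
#include <sys/types.h>
#include <functional>
#include <thread>
#include <vector>

// One thread blocks in recv(), the other blocks in send(), so reads and
// writes on the same socket can proceed at the same time.
void reader(int fd) {
    char buf[64 * 1024];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
        // hand buf[0..n) off to the application here
    }
}

void writer(int fd, const std::vector<char>& outgoing) {
    size_t sent = 0;
    while (sent < outgoing.size()) {
        ssize_t n = send(fd, outgoing.data() + sent, outgoing.size() - sent, 0);
        if (n <= 0) break;
        sent += static_cast<size_t>(n);
    }
}

void fullDuplex(int fd, const std::vector<char>& outgoing) {
    std::thread r(reader, fd);                       // blocks in recv()
    std::thread w(writer, fd, std::cref(outgoing));  // blocks in send()
    r.join();
    w.join();
}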
Update
Discussion in the akka-dev Google group: https://groups.google.com/forum/#!topic/akka-dev/mcs5eLKiAVQ

While there is a single Actor that either reads or writes at any given point in time, each of these operations takes very few cycles since it only occurs when there are data to be read or buffer space available to be written to. The system call overhead of ~1µs means that with the default buffer sizes of 128kiB you should be able to transfer up to 100GiB/s in total, which sure is a bottleneck but probably not today and in practice (this roughly coincides with typical CPU memory bandwidth, so more data rate is currently impossible anyway). Once this changes we can split the reading and writing responsibilities between different selectors and wake up different Actors, but before doing that we’ll need to verify that there actually is a measurable effect.
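(A rough sanity check of that figure: moving 128 KiB per system call at ~1 µs per call works out to 128 KiB / 1 µs ≈ 130 GB/s, i.e. roughly 120 GiB/s per direction, which is the order of magnitude quoted above.)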
The other question that needs answering is which operating system kernels actually allow concurrent operations on a single socket from multiple threads. I have not researched this yet, but I would not be surprised to find that fully independent locking will be hard to do and there might not (yet) be a reason to expend that effort.

Related

Is Nginx's approach to CPU scalability (a per-process epoll event queue) optimal?

Nginx's approach to CPU scalability is based on creating a number of almost independent processes, each owning an event queue, and then using SO_REUSEPORT to spread incoming connections, IRQs, and NIC packets over all cores relatively evenly.
Does it lead to better scalability (less kernel data sharing = fewer locks) than creating only one Linux process, followed by an array of threads (still equal to the number of CPUs) with a per-thread event queue in every thread?
Here is an example of Nginx scaling up to around 32 CPUs. Disabled hyper-threading and an overall count of 36 real cores could be the main reason for this, as could relative NIC saturation or even a drop in core clock speed due to overheating:
https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/
Also: https://dzone.com/articles/inside-nginx-how-we-designed
Theoretically, purely asynchronous calls in a situation where you don't use (red) threads and don't need to share data would be better than using (red) threads because you will avoid the context switching overhead that forces you to bounce in and out of the kernel just to switch to another thread. You may also be less likely to get contention (threads can accidentally share something internal such as a cache-line).
In reality it could go either way, depending on the program in question, the vagaries of the programming language, the kernel, whether the threads are red or green, the hardware, the task, the skill of the programmer, etc.
Coming back to your original question, NGINX's approach is going to be good and the overheads are going to be low (contrast it with Apache, for example). For pure "packet pushing" it's an excellent low-overhead approach, but you may find a tradeoff when it comes to flexibility. It's also worth noting NGINX can spin up a worker per core, so at that point it can reap the benefits of affinity (less data moving because everything is hopefully local) while still being lower overhead...
Depending on where the data is coming from and going to, you can likely best NGINX in specific scenarios (e.g. by using something like DPDK, or technology built around techniques like io_uring), but perhaps at some point in the future NGINX itself will adopt such technology...
So it looks like we can get hard data to answer this question by comparing Nginx to Envoy Proxy because it uses the architecture you are curious about:
Envoy uses a single process with multiple threads architecture. A single master thread controls various sporadic coordination tasks while some number of worker threads perform listening, filtering, and forwarding
While they were initially developed years apart to solve different problems, they currently have extremely similar capabilities and are often compared against one another.
Looking at one such comparison, Envoy showed better throughput and latency. Another comparison has Ambassador (based on Envoy) vs Nginx, and again Envoy shows better results.
Given this data, I'd say that yes, the single-process, event-loop-and-thread-pool model (Envoy) seems to scale better than the multiple-processes-with-shared-IPC model (Nginx).
Q. Does it lead to better scalability (less kernel data sharing = fewer locks) than creating only one Linux process, followed by an array of threads (still equal to the number of CPUs) with a per-thread event queue in every thread?
From The SO_REUSEPORT socket option article:
The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases. In early discussions on SO_REUSEPORT, Tom noted that he was dealing with applications that accepted 40,000 connections per second.
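For illustration, a minimal sketch of the SO_REUSEPORT pattern described above (Linux 3.9+, error handling omitted; the function name and the way workers are spawned are placeholders): each worker process or thread opens its own listening socket on the same port, and the kernel spreads incoming connections across them.

#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>

// Each worker calls this to get its own listening socket bound to the same
// port; with SO_REUSEPORT the kernel load-balances new connections among them.
int make_reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;   // each worker then runs its own accept()/epoll loop on this fd
}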

Boost: multithread performance, reuse of threads/sockets

I'll first describe my task and then present my questions below.
I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each of the modules has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads (320 synchronous TCP receivers) on a 1 GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computing on the incoming data - they only receive the data, generate and add a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.
Our requirement is that the threads should read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-sized packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets are less than a few microseconds apart). When the number of data packets is very small (less than 10), I find that the threads' timestamps are separated by a few microseconds. However, if it is more than 10, then the timestamps are spread by as much as 0.7 s.
My questions are:
Have I totally misunderstood the C10K issue and messed up the implementation? 320 does seem trivial compared to C10K.
Any hints as to what's going wrong?
Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
320 threads is chump change in terms of resources, but the scheduling may pose issues.
320*0.25 = 80 requests per second, implying at least 80 context switches because you decided you must have each connection on its own thread.
I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).
Q. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-sized packets every second
Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, competing for the physical cores.
So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.
That is, assuming you already use the *_async versions of the ASIO API functions.
I think the vast majority - if not all - of the Boost Asio examples of asynchronous IO show how to run the service on a limited number of threads only:
http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/examples.html
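As a rough sketch of what that looks like (using modern Boost.Asio naming, i.e. io_context rather than the io_service of the 1.57 docs linked above; the Session class, port, and buffer size are placeholders, not code from the question): every connection posts its reads asynchronously to one io_context, and a small fixed pool of threads services all of them.

#include <boost/asio.hpp>
#include <array>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

using boost::asio::ip::tcp;

// Per-connection session: each completed read immediately queues the next,
// so no thread is ever parked on an idle socket.
class Session : public std::enable_shared_from_this<Session> {
public:
    explicit Session(tcp::socket socket) : socket_(std::move(socket)) {}
    void start() { doRead(); }

private:
    void doRead() {
        auto self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(buffer_),
            [self](boost::system::error_code ec, std::size_t n) {
                if (!ec) {
                    // timestamp and store the first n bytes of buffer_ here
                    self->doRead();
                }
            });
    }

    tcp::socket socket_;
    std::array<char, 2048> buffer_;
};

int main() {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9000));

    // Accept loop: every accepted socket becomes a Session on the same io_context.
    std::function<void()> doAccept = [&] {
        acceptor.async_accept([&](boost::system::error_code ec, tcp::socket s) {
            if (!ec) std::make_shared<Session>(std::move(s))->start();
            doAccept();
        });
    };
    doAccept();

    // A small, fixed pool of threads services all connections.
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back([&io] { io.run(); });
    for (auto& t : pool) t.join();
}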

Would handling each TCP connection in a separate thread improve latency?

I have an FTP server, implemented on top of QTcpServer and QTcpSocket.
I take advantage of the signals and slots mechanism to support multiple TCP connections simultaneously, even though I have a single thread. My code returns to the event loop as soon as possible; it doesn't block (no wait functions), and it doesn't use nested event loops anywhere. That way I already have cooperative multitasking, like Win3.1 applications had.
But a lot of other FTP servers are multithreaded. Now I'm wondering if using a separate thread for handling each TCP connection would improve performance, and especially latency.
On one hand, threads add to latency because you need to start a new thread for each new connection, but on the other, with my cooperative multitasking, other TCP connections have to wait until I've returned to the main loop before their readyRead()/bytesWritten() signals can be handled.
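For reference, a minimal sketch of the single-threaded signal/slot pattern the question describes (Qt 5 style; the port and the lambdas are placeholders, not the asker's actual code): everything is driven by one event loop and no handler blocks.

#include <QCoreApplication>
#include <QHostAddress>
#include <QTcpServer>
#include <QTcpSocket>

int main(int argc, char* argv[]) {
    QCoreApplication app(argc, argv);

    QTcpServer server;
    QObject::connect(&server, &QTcpServer::newConnection, [&server] {
        QTcpSocket* socket = server.nextPendingConnection();
        // readyRead fires from the single event loop; readAll() never blocks
        QObject::connect(socket, &QTcpSocket::readyRead, [socket] {
            QByteArray data = socket->readAll();
            // parse the command / queue the next chunk to send, then return quickly
        });
        QObject::connect(socket, &QTcpSocket::disconnected,
                         socket, &QObject::deleteLater);
    });
    server.listen(QHostAddress::Any, 2121);   // placeholder port

    return app.exec();   // the one event loop servicing every connection
}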
In your current system, and ignoring file I/O time, one processor is always doing something useful if there's something useful to be done, and waiting ready-to-go if there's nothing useful to be done. If this were a single-processor (single-core) system you would have maximized throughput. This is often a very good design -- particularly for an FTP server, where you don't usually have a human waiting on a packet-by-packet basis.
You have also minimized average latency (for a single-processor system). What you do not have is consistent latency. Measuring your system's performance is likely to show a lot of jitter -- a lot of variation in the time it takes to handle a packet. Again, because this is FTP and not real-time process control or human interaction, jitter may not be a problem.
Now, however, consider that there is probably more than one processor available on your system and that it may be possible to overlap I/O time and processing time.
To take full advantage of a multi-processor(core) system you need some concurrency.
This normally translates to using multiple threads, but it may be possible to achieve concurrency via asynchronous (non-blocking) file reads and writes.
However, adding multiple threads to a program opens up a huge can of worms.
If you do decide to go the MT route, I'd suggest that you consider depending on a thread-aware I/O library. QT may provide that for you (I'm not sure). If not, take a look at boost::asio (or ACE for an older, but still solid, solution). You'll discover that using the MT capabilities of such a library involves a considerable investment in learning time; however, as it turns out, the time to add on multithreading "by hand" and get it right is even worse.
So I'd say stay with your existing solution unless you are worried about unused processor cycles and/or jitter, in which case start learning QT's multithreading support or boost::asio.
Do you need to start a new thread for each new connection? Could you not just have a pool of threads that acts on requests as and when they arrive? This should reduce some of the latency. I have to say that in general a multi-threaded FTP server should be more responsive than a single-threaded one. Is it possible to have an event-based FTP server?

Why is threading used for sockets?

Ever since I discovered sockets, I've been using the nonblocking variants, since I didn't want to bother with learning about threading. Since then I've gathered a lot more experience with threading, and I'm starting to ask myself.. Why would you ever use it for sockets?
A big premise of threading seems to be that threads only make sense if they get to work on their own set of data. Once you have two threads working on the same set of data, you will have situations such as:
if (!hashmap.hasKey("bar"))
{
    dostuff(); // <-- meanwhile another thread inserts "bar" into hashmap
    hashmap["bar"] = "foo"; // <-- our premise that the key didn't exist
                            //     (likely to avoid overwriting something) is now invalid
}
Now imagine the hashmap maps remote IPs to passwords. You can see where I'm going. I mean, sure, the likelihood of such thread interaction going wrong is pretty small, but it still exists, and to keep one's program secure, you have to account for every eventuality. This will significantly increase the effort going into design, as compared to a simple, single-threaded workflow.
I can completely see how threading is great for working on separate sets of data, or for programs that are explicitly optimized to use threading. But for the "general" case, where the programmer is only concerned with shipping a working and secure program, I cannot find any reason to use threading over polling.
But seeing as the "separate thread" approach is extremely widespread, maybe I'm overlooking something. Enlighten me! :)
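(As an aside, the usual way to close that window is to make the check and the insert one atomic step, e.g. by holding a lock across both; a minimal sketch, not from the original post:)

#include <mutex>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, std::string> hashmap;
std::mutex hashmapMutex;

void insertIfAbsent(const std::string& key, const std::string& value) {
    std::lock_guard<std::mutex> lock(hashmapMutex);   // held for the check *and* the insert
    if (hashmap.find(key) == hashmap.end()) {
        hashmap[key] = value;   // no other thread can slip in between
    }
}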
There are two common reasons for using threads with sockets, one good and one not-so-good:
The good reason: Because your computer has more than one CPU core, and you want to make use of the additional cores. A single-threaded program can only use a single core, so with a heavy workload you'd have one core pinned at 100%, and the other cores sitting unused and going to waste.
The not-so-good reason: You want to use blocking I/O to simplify your program's logic -- in particular, you want to avoid dealing with partial reads and partial writes, and keep each socket's context/state on the stack of the thread it's associated with. But you also want to be able to handle multiple clients at once, without slow client A causing an I/O call to block and hold off the handling of fast client B.
The reason the second one is not so good is that, while having one thread per socket seems to simplify the program's design, in practice it usually complicates it. It introduces the possibility of race conditions and deadlocks, and makes it difficult to safely access shared data (as you mentioned). Worse, if you stick with blocking I/O, it becomes very difficult to shut the program down cleanly (or in any other way affect a thread's behavior from anywhere other than the thread's socket), because the thread is typically blocked in an I/O call (possibly indefinitely) with no reliable way to wake it up. (Signals don't work reliably in multithreaded programs, and going back to non-blocking I/O means you lose the simplified program structure you were hoping for.)
In short, I agree with cib -- multithreaded servers can be problematic and therefore should generally be avoided unless you absolutely need to make use of multiple cores -- and even then it might be better to use multiple processes rather than multiple threads, for safety's sake.
The biggest advantage of threads is preventing lag time from accumulating while processing requests. When polling, you use a loop to service every socket with a state change. For a handful of clients this is not very noticeable; however, it can lead to significant delays when dealing with a significantly larger number of clients.
Assume that each transaction requires some pre-processing and post-processing (depending on the protocol this may be a trivial amount of processing, or it could be relatively significant, as is the case with BEEP or SOAP). The combined time to pre-process/post-process requests could lead to a backlog of pending requests.
For illustration purposes, imagine that the pre-processing, processing, and post-processing stages of a request each consume 1 millisecond, so that a request takes 3 milliseconds in total to complete. In a single-threaded environment the system would become overwhelmed if incoming requests exceed 334 requests per second (since it would take 1.002 seconds to service all requests received within a 1-second period), leading to a time deficit of 0.002 seconds each second. However, if the system were using threads, then it would theoretically be possible to require only 0.336 seconds (0.334 for shared data access + 0.001 pre-processing + 0.001 post-processing) of processing time to complete all of the requests received in a 1-second time period.
Although it is theoretically possible to process all requests in 0.336 seconds, this would require each request to have its own thread. A more reasonable approach is to divide the combined pre/post-processing time for all requests (0.668 seconds) by the number of configured threads. For example, using the same 334 incoming requests and processing times, 2 threads would theoretically complete all requests in 0.668 seconds (0.668 / 2 + 0.334), 4 threads in 0.501 seconds, and 8 threads in 0.418 seconds.
If the highest request volume your daemon receives is relatively low, then a single-threaded implementation with non-blocking I/O is sufficient; however, if you expect occasional bursts of high request volume then it is worth considering a multi-threaded model.
I've written more than a handful of UNIX daemons which have relatively low throughput, and I've used a single-threaded model for its simplicity. However, when I wrote a custom netflow receiver for an ISP, I used a threaded model for the daemon, and it was able to handle peak times of Internet usage with minimal bumps in system load average.

How to find out the optimal amount of threads?

I'm planning to make a piece of software with a lot of peer-to-peer-like network connections. Normally I would create a dedicated thread for every connection to send and receive data, but in this case, with 300-500+ connections, it would mean continuously creating and destroying a lot of threads, which I guess would be a big overhead. And making one thread that handles all the connections sequentially could probably slow things down a little. (I'm not really sure about this.)
The question is: how many threads would be optimal to handle this kind of problem? Would it be possible to calculate it in the software, so it can decide by itself to create fewer threads on an old computer with fewer resources and more on new ones?
It's a theoretical question; I wouldn't like to make it implementation- or language-dependent. However, I think a lot of people would advise something like "Just use a ThreadPool, it will handle stuff like that", so let's say it will not be a .NET application. (I'll probably have to reuse some parts of the code from an old Delphi project, so the language will probably be Delphi or maybe C++, but it's not decided yet.)
Understanding the performance of your application under load is key; as mentioned before, profiling, measuring, and re-testing is the way to go.
As a general guide, Goetz talks about having
threads = number of CPUs + 1
for CPU-bound applications, and
threads = number of CPUs * (1 + wait time / service time)
for IO-bound contexts.
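A small sketch of applying that second formula (the wait/service ratio has to come from profiling your own workload; the numbers in the comment are only an example):

#include <algorithm>
#include <thread>

// Goetz-style pool sizing: cores * (1 + wait time / service time).
unsigned optimalThreads(double waitMs, double serviceMs) {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    return static_cast<unsigned>(cores * (1.0 + waitMs / serviceMs));
}

// Example: a connection that waits ~9 ms on the network for every 1 ms of CPU
// work, on an 8-core machine -> 8 * (1 + 9) = 80 threads.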
If this is Windows (you did mention .Net?), you should definitely implement this using I/O completion ports. This is the most efficient way to do Windows sockets I/O. There is an I/O-specific discussion of thread pool size at that documentation link.
The most important property of an I/O completion port to consider carefully is the concurrency value. The concurrency value of a completion port is specified when it is created with CreateIoCompletionPort via the NumberOfConcurrentThreads parameter. This value limits the number of runnable threads associated with the completion port. When the total number of runnable threads associated with the completion port reaches the concurrency value, the system blocks the execution of any subsequent threads associated with that completion port until the number of runnable threads drops below the concurrency value.
Basically, your reads and writes are all asynchronous and are serviced by a thread pool whose size you can modify. But try it with the default first.
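A minimal sketch of that shape (Win32, error handling omitted; the worker count and the use of dwNumberOfProcessors as the concurrency value are only illustrative defaults):

#include <windows.h>
#include <thread>
#include <vector>

// Workers block in GetQueuedCompletionStatus; the kernel keeps at most
// 'concurrency' of them runnable at once.
void worker(HANDLE iocp) {
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED* ov = nullptr;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        // 'key' identifies the socket, 'ov' the finished read/write -
        // process the completed I/O and issue the next WSARecv/WSASend here.
    }
}

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD concurrency = si.dwNumberOfProcessors;   // the NumberOfConcurrentThreads value

    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, concurrency);

    // A slightly larger pool than the concurrency value is common, so that a
    // worker blocked in other work can be replaced by a runnable one.
    std::vector<std::thread> pool;
    for (DWORD i = 0; i < concurrency * 2; ++i)
        pool.emplace_back(worker, iocp);

    // ... associate each accepted socket with 'iocp' via CreateIoCompletionPort
    //     and post overlapped WSARecv calls ...

    for (auto& t : pool) t.join();
}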
A good, free example of how to do this is at the Free Framework. There are some gotchas that looking at working code could help you short-circuit.
You could do a calculation based on CPU speed, core count, and memory space on the install machine, and set a constant somewhere to tell your application how many threads to use. Semaphores and thread pools come to mind.
Personally I would separate the listening sockets from the sending ones and open sending sockets at runtime instead of running them as daemons; listening sockets can run as daemons.
Multithreading can be its own headache and introduce many bugs. The best thing to do is make a thread do one thing and block when processing to avoid undesired and unpredictable results.
1. Make the number of threads configurable.
2. Target a few specific configurations that are the most common ones you expect to support.
3. Get a good performance profiler / instrument your code, and then rigorously test with different values of (1) across the different configurations from (2) till you find an optimal value that works for each configuration.
I know this might seem like a not-so-smart way to do things, but I think when it comes to performance, benchmarking the results via testing is the only sure-fire way to really know how well / badly it will work.
Edit: +1 to the question whose link is posted by paxDiablo above as a comment. It's almost the same question and there's loads of information there, including a very detailed reply by paxDiablo himself.
One thread per CPU, each processing several (hundreds of) connections.
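A sketch of that shape on Linux (one epoll instance per thread, each thread owning its share of the connections; how sockets get distributed across threads is up to you, and everything here is illustrative):

#include <sys/epoll.h>
#include <thread>
#include <vector>

// Each thread owns one epoll instance and the sockets registered with it,
// so hundreds of connections share a single thread without blocking it.
void eventLoop(int epollFd) {
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epollFd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            // read/write 'fd' without blocking, then re-arm it if needed
        }
    }
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> loops;
    for (unsigned i = 0; i < cores; ++i) {
        int epollFd = epoll_create1(0);
        // ... register this thread's share of the sockets with epoll_ctl ...
        loops.emplace_back(eventLoop, epollFd);
    }
    for (auto& t : loops) t.join();
}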

Resources