Fairness of socket write() in 2 parallel connections? - linux

Suppose I have a multi-threaded program in which each of the 2 threads:
has its own socket socket_fd in default (blocking) mode
repeatedly sends data using write(socket_fd, data, data_len) so that the network becomes a bottleneck
the size of the data passed to write (i.e. data_len) is always equal to the MSS; for simplicity, assume data_len = 500
I'm wondering about the fairness of writes assuming a single network interface card, i.e.: if thread 2 calls write 9x less frequently, is there a weak guarantee that the data sent by thread 2 will be roughly 1/(1 + 9) of the total data sent within a reasonable time (i.e. thread 2 will eventually send its data even though thread 1 keeps the underlying media very busy by constantly sending an excessive amount of data)?
I am primarily interested in the case where thread 1 (which sends more data) uses TCP and thread 2 uses DCCP. Nevertheless, answers for the scenarios in which thread 2 uses UDP and TCP are also welcome.

It depends on the queuing discipline that schedules outgoing packets on the network interface. pfifo_fast, the default Linux qdisc, organizes outgoing packets into FIFO bands indexed by the ToS field. Outgoing packets with the same ToS are sent in the order in which the kernel receives them from applications.
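For illustration (not part of the original answer): an application can influence which pfifo_fast band its traffic lands in by setting the ToS byte on its socket. With the default priomap, IPTOS_LOWDELAY (0x10) maps to band 0, which pfifo_fast drains first:

// Sketch only: mark this socket's packets "minimize delay" so pfifo_fast
// places them in its highest-priority band.
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>   // IPTOS_LOWDELAY
#include <cstdio>

static void request_low_delay(int socket_fd)
{
    int tos = IPTOS_LOWDELAY;
    if (setsockopt(socket_fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
        std::perror("setsockopt(IP_TOS)");
}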

Related

Apache Pulsar - ioThreads/listenerThreads and message ordering

We are developing an application that requires messages with the same key to be processed strictly in sequence. In addition, for performance/throughput reasons, we need to introduce parallel processing.
Parallelizing is easy - we can have a single thread receiving the messages, calculating a hash on the key, and using hash % number of workers to distribute each message to a particular blocking queue with a worker on the other side. This guarantees that messages with the same key are dispatched to the same worker, so ordering is guaranteed - as long as the receiver gets the messages in order.
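A minimal sketch of that dispatch scheme (the queue and message types here are my own illustration, not Pulsar API): the receiver hashes the key and pushes into a per-worker blocking queue, so equal keys always reach the same worker:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

struct Message { std::string key; std::string payload; };

// A minimal blocking queue; one instance per worker thread.
class BlockingQueue {
public:
    void push(Message m) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push_back(std::move(m)); }
        cv_.notify_one();
    }
    Message pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Message m = std::move(q_.front());
        q_.pop_front();
        return m;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<Message> q_;
};

// Same key always hashes to the same worker, so per-key order is preserved
// as long as the receiver itself sees the messages in order.
void dispatch(const Message& msg, std::vector<BlockingQueue>& workers)
{
    std::size_t h = std::hash<std::string>{}(msg.key);
    workers[h % workers.size()].push(msg);
}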
The questions are:
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
If we increase them, are we still guaranteed ordering?
The Pulsar documentation is not clear...
Does increasing ioThreads and listenerThreads (default = 1) have an impact on performance, i.e. should we expect to see more messages flowing through or will I/O always be the limiting factor?
It might, depending on various factors.
IoThreads: this is the thread pool used to manage the TCP connections with brokers. If you're producing/consuming across many topics, you'll most likely be interacting with multiple brokers and thus have multiple TCP connections open. Increasing the ioThreads count might remove the "single thread bottleneck", though it would only be effective if such a bottleneck is indeed present (most of the time it won't be). You can check the CPU utilization in your consumer process, across all threads, to see if any thread is approaching 100% (of a single CPU core).
ListenerThreads: this is the thread pool used when you are using the message listener in the consumer. Typically this is the thread pool the application uses to process the messages (unless processing hops to a different thread). It might make sense to increase the thread count here if the application's processing is approaching the limit of one CPU core.
If we increase them, are we still guaranteed ordering?
Yes.
IO threads: 1 TCP connection is always mapped to 1 IO thread
ListenerThreads: 1 Consumer is assigned to 1 listener thread
You may also want to look at using the new key-shared subscription type that was introduced in Pulsar 2.4. Per the documentation,
Messages are delivered in a distribution across consumers and message with same key or same ordering key are delivered to only one consumer.

How do I eliminate EAGAIN errors on blocking send() calls on Linux

I am trying to write a test program that writes data across a TCP socket through Localhost on a Linux machine (CentOS 6.5 to be exact). I have one program writing, and one program reading. The reading is done via non-blocking recv() calls triggered from an epoll. There are enough cores on the machine to handle all CPU load without contention or scheduling issues.
The sends are buffers of smaller packets (about 100 bytes), aggregated up to 1400 bytes. Changing to aggregate larger (64K) buffers makes no apparent difference. When I do the sends, after tens of MB of data, I start getting EAGAIN errors on the sender. Verifying via fcntl, the socket is definitely configured as blocking. I also noticed that calling ioctl(SIOCOUTQ) whenever I get EAGAIN yields larger and larger numbers.
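For reference, a sketch of the checks described above (the function name and the extra SO_SNDTIMEO check are my additions; a send timeout set with SO_SNDTIMEO is one documented way a nominally blocking send() can return EAGAIN):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <linux/sockios.h>   // SIOCOUTQ
#include <cstdio>

static void report_send_state(int sock_fd)
{
    // 1. Confirm the socket really is blocking.
    int flags = fcntl(sock_fd, F_GETFL, 0);
    std::printf("O_NONBLOCK is %s\n", (flags & O_NONBLOCK) ? "set" : "clear");

    // 2. Bytes queued but not yet sent (what ioctl(SIOCOUTQ) reports).
    int unsent = 0;
    if (ioctl(sock_fd, SIOCOUTQ, &unsent) == 0)
        std::printf("unsent bytes in send buffer: %d\n", unsent);

    // 3. A nonzero send timeout makes a blocking send() return EAGAIN on expiry.
    timeval tv{};
    socklen_t len = sizeof(tv);
    if (getsockopt(sock_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, &len) == 0)
        std::printf("SO_SNDTIMEO: %lds %ldus\n", (long)tv.tv_sec, (long)tv.tv_usec);
}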
The receiver is slower in processing the data read than the sender is in creating the data. Adding receiver threads is not an option. The fact that it is slower is OK assuming I can throttle the sender.
Now, my understanding of blocking sockets is that a send() should block in the send until the data goes out (this is based upon past experience) - meaning, the tcp stack should force the input side to be self-throttling.
Also, the Linux manual page for send() (and other sources on the web) indicates that EAGAIN is only returned for non-blocking sockets.
Can anyone give me some understanding as to what is going on, and how I can get my send() to actually block until enough data goes out for me to put more in? I would rather not have to rework the logic to put usleep() or similar workarounds in the code. It is a lightly loaded system; yield() is insufficient to allow it to drain.

Boost: multithread performance, reuse of threads/sockets

I'll first describe my task and then present my questions below.
I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approx once every 0.25ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each of the modules has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads as 320 synchronous TCP receivers, on a 1GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computing on the incoming data - they only receive the data, generate and add a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.
Our requirement is that the threads should read their individual socket connections with as little time lag as possible. Having read about C10K and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-size packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets are less than a few microseconds apart). When the number of data packets is very small (less than 10), I find that the threads' timestamps are separated by a few microseconds. However, if there are more than 10, the timestamps are spread by as much as 0.7 sec.
My questions are:
Have I totally misunderstood the C10K issue and messed up the implementation? 320 does seem trivial compared to C10K
Any hints as to what's going wrong?
Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
320 threads is chump change in terms of resources, but the scheduling may pose issues.
320*0.25 = 80 requests per second, implying at least 80 context switches because you decided you must have each connection on a thread.
I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).
Q. Having read about C10K and this post I expected that each thread would easily process the equivalent of at least 1K MTU-size packets every second
Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, competing for a physical core.
So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.
That is, assuming you already use the *_async versions of the ASIO API functions.
I think the vast majority - if not all - of the Boost Asio examples of asynchronous IO show how to run the service on a limited number of threads only.
http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/examples.html
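As a sketch only (the port number, buffer size, and the commented-out handle_packet() handler are assumptions of mine, not the poster's code), this is the shape of an ASIO receiver that services every connection from a pool sized to the core count instead of one thread per connection:

// Build with: g++ -std=c++11 receiver.cpp -lboost_system -lboost_thread -lpthread
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <array>
#include <functional>
#include <memory>

using boost::asio::ip::tcp;

struct connection : std::enable_shared_from_this<connection> {
    explicit connection(boost::asio::io_service& io) : sock(io) {}

    void start() { do_read(); }

    void do_read() {
        auto self = shared_from_this();
        sock.async_read_some(boost::asio::buffer(buf),
            [self](const boost::system::error_code& ec, std::size_t n) {
                if (ec) return;          // peer closed or error: drop the chain
                // handle_packet(self->buf.data(), n);  // timestamp + store (hypothetical)
                self->do_read();         // re-arm the read for the next packet
            });
    }

    tcp::socket sock;
    std::array<char, 1500> buf;          // one MTU-sized buffer per connection
};

int main() {
    boost::asio::io_service io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 5000));

    // Keep accepting connections; each gets its own read chain, not its own thread.
    std::function<void()> do_accept = [&] {
        auto conn = std::make_shared<connection>(io);
        acceptor.async_accept(conn->sock,
            [&, conn](const boost::system::error_code& ec) {
                if (!ec) conn->start();
                do_accept();
            });
    };
    do_accept();

    // A small pool (roughly one thread per physical core) services all sockets.
    boost::thread_group pool;
    for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
        pool.create_thread([&io] { io.run(); });
    pool.join_all();
}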

UDP server app on Linux can't receive packets from more than 150 clients

I've written a UDP C++ server app on Linux and am now load testing it to see how many clients it can handle. I find that it peaks at about 150 simultaneous clients sending packets at a rate of 2-4 per second.
Clients added after that will cause some other clients' packets to be dropped.
The server itself is not stressed, using less than 10% of the CPU and memory. The network is not stressed at all either, at about 15K bytes/second. The packets are arriving at the server (which uses one UDP socket for both read and write) at about 200 packets/second. The server threads themselves still sleep for short periods at this load level.
Question:
Any ideas on what is the bottleneck here? CPU, network and server code itself all seem to be unstressed. Will the OS be unable to handle this # of UDP packets?
The hardware is very low power - a single-core Pentium equivalent at 1.5 GHz. The NIC is 100 Mbit/second. I'm running Ubuntu 11.1.
This article might be related: Upper limit to UDP performance on windows server 2008
Update:
The server sets up a UDP socket, then creates 3 threads and 2 queues. The first thread blocks on socket read and looks like:
while (1)
{
    recvfrom(this->socket, readBuf, BUFSIZE, 0, (sockaddr *)&address, &addressLen);
    pushBack(this->inputQueue, message);
}
The second thread sleeps on the inputQueue. It wakes up when a condition is signaled and processes messages. It sends processed messages to an outputQueue:
while (1)
{
    sleepOnQ(this->inputQueue);
    popFront(this->inputQueue);
    processMessage();
    pushBack(this->outputQueue, message);
}
The 3rd thread sleeps on the outputQueue and sends messages out the UDP socket to the destination. Note, it's the same socket that's used for reading.
while (1)
{
    sleepOnQ(this->outputQueue);
    popFront(this->outputQueue);
    processMessage();
    sendto(this->socket, message, ... );
}
The amount of processing per client and message is small. As I mentioned, when the server is handling 200 messages/second, it's using about 10% of a really wimpy CPU.
Here are some of the kernel parameters on the system:
net.core.wmem_max = 114688
net.core.rmem_max = 114688
net.core.wmem_default = 114688
net.core.rmem_default = 114688
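(As an aside, not part of the original post: a sketch, with invented names, of raising the receive buffer from the application side. Whatever you request is capped by net.core.rmem_max above, unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN.)

#include <sys/socket.h>
#include <cstdio>

static void bump_rcvbuf(int sock_fd, int bytes)
{
    // Request a larger receive buffer; the kernel doubles the value internally
    // and clamps it to net.core.rmem_max.
    if (setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        std::perror("setsockopt(SO_RCVBUF)");

    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        std::printf("effective receive buffer: %d bytes\n", actual);
}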
More info on data synchronization
The answers so far have made me think two things are going on:
1. the OS read buffer is filling up, but with low CPU this shouldn't happen
2. but #1 can happen if the threads are waiting for other events and because of that don't read the socket quickly enough.
Logging could be an issue; I'll try turning it off and report results. However, probably more important is the contention for the queues between threads. Since CPU is low, maybe the threads spend a lot of time waiting for access to the queues.
In the first iteration of this server I tried to be tricky about locking data. The server was very fast but crashed when it got to 800 packets/second. The current version locks the entire queue. Maybe I need a better way to synchronize the threads.
Question answered
The information I got here was very helpful. The problem was a bonehead error with the test client, but doing the investigation helped me eliminate the cause suggested here.
FYI here are my results. Once I fixed the problem with the client, the server accepted about 800 packets/second with 70% CPU utilization. I had increased the OS read/write buffers to 12MB from 128K. I didn't test whether the read buffer was filled; I doubt the OS read buffer was an issue, because at top speed the server read thread was still blocking on read for short periods every 10th or 20th read.
800 packets/second is still too slow, so I removed logging from the server. This made a huge difference. The server was able to receive 2900 messages/second from 1400+ clients at 70% CPU utilization.
I also did some testing of whether the read thread was waiting for the lock. Even at top speed I found it never had to wait more than 1 ms, so it wasn't a factor at 2900 messages/second. Perhaps it will be on a faster CPU.
At this point the server is CPU bound and to find the next bottleneck I'll need to get on a more powerful CPU.
Thanks for your help!
The most likely reason for the lossage is that the UDP socket's incoming-packets buffer fills up before your first thread can empty it; any incoming UDP packets that are received while the buffer is already full will be dropped.
The most likely reason that the first thread can't empty the buffer fast enough to keep it from filling up is that something else is holding it off of the CPU for too long... since it sounds like you are running on a single-core CPU, this is likely the case. You might want to try setting your second and third threads to a lower priority (so that the first thread will get first dibs on the CPU whenever there is contention) and see if that helps. If that still isn't good enough, you might set your process to SCHED_RR 'real time' priority, to make sure that any other processes running in the OS don't keep your first thread away from the CPU. (You can still run your other threads at lower priority, of course, since it doesn't matter so much exactly when they execute).
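A minimal sketch of the scheduling change suggested above (the priority value is arbitrary; SCHED_RR needs root or CAP_SYS_NICE):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Put the calling thread (e.g. the socket-reading thread) on the SCHED_RR
// real-time policy so other processes can't starve it; the worker threads can
// stay at default (or lower) priority so the reader wins any contention.
static bool make_reader_realtime(int rt_priority)   // 1..99 on Linux
{
    sched_param sp{};
    sp.sched_priority = rt_priority;
    int rc = pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setschedparam: error %d\n", rc);
        return false;
    }
    return true;
}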
If an incoming UDP datagram does not fit into the UDP receive buffer (usually because the buffer is full), the kernel will discard it.
If the buffer is full when the packet rate is only 200/s and CPU load is low, then your program is wasting time waiting for something other than new packets (the end of a sleep, some resource, etc.).
Double-check your code, and try to get rid of all sleep, nanosleep and similar sleeping functions.
If you print a lot of (debug) output to a serial console, it may start blocking your program, because serial ports are not that fast. Try to eliminate this kind of bottleneck as well.

Socket send concurrency guarantees

If I have a single socket shared between two processes (or two threads), and in both of them I try to send a big message (bigger than the underlying protocol buffer) that blocks, is it guaranteed that both messages will be sent sequentially? Or is it possible for the messages to be interleaved inside the kernel?
I am mainly interested in TCP over IP behavior, but it would be interesting to know if it varies according to socket's protocol.
You're asking: if you write() message A, then B, on the same socket, is A guaranteed to arrive before B? For SOCK_STREAM (e.g. TCP) and SOCK_SEQPACKET (almost never used) sockets, the answer is an unqualified yes. For SOCK_DGRAM over the internet (i.e. UDP packets) the answer is no: packets can be reordered by the network. On a single host, a unix domain datagram socket will (on all systems I know) preserve ordering, but I don't believe that's guaranteed by any standard and I'm sure there are edge cases.
Or wait: maybe you're asking if the messages written by the two processes won't be mixed? Yes: single system calls (write/writev/sendto/sendmsg) always place their content into a file descriptor atomically. But obviously if you or your library splits that write into multiple calls, you lose that guarantee.
For UDP, if two threads write to a socket handle simultaneously, both messages will be sent as separate datagrams. They might undergo IP fragmentation if a packet is larger than the MTU, but the resulting datagrams will be preserved and reassembled correctly by the receiver. In other words, you are safe for UDP, except for the normal issues associated with UDP (datagram reordering, packet loss, etc...).
For TCP, which is stream based, I don't know. Your question is essentially asking the equivalent of "if two threads try to write to the same file handle, will the file still be legible?" I actually don't know the answer.
The simplest thing you can do is just use a thread-safe lock (mutex) to guard send/write calls to the socket so that only one thread can write to the socket at a time.
For TCP, I would suggest having a dedicated thread for handling all socket IO. Then just invent a way in which messages from the worker threads can get asynchronously queued to the socket thread for it to send on. The socket thread could also handle recv() calls and notify the other threads when the socket connection is terminated by the remote side.
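A minimal sketch of the mutex-guarded approach (names are illustrative, not from the original post): one lock serializes the writers and loops over short writes, so whole messages never interleave on the stream:

#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>
#include <mutex>

static std::mutex send_mutex;

// Send the entire message while holding the lock, retrying on short writes,
// so two threads can never interleave the bytes of their messages.
static bool send_all(int sock_fd, const char* data, size_t len)
{
    std::lock_guard<std::mutex> guard(send_mutex);
    while (len > 0) {
        ssize_t n = send(sock_fd, data, len, 0);
        if (n < 0)
            return false;        // caller can inspect errno
        data += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}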
If you try to send a large message on a STREAM socket that exceeds the underlying buffer size, it's pretty much guaranteed that you will get a short write -- the write or send call will write only part of the data (as much as will fit in the buffer) and then return the amount written, leaving you to do another write for the remaining data.
If you do this in multiple threads or processes, then each write (or send) will thus write a small portion of the message atomically into the send buffer, but the subsequent writes might happen in any order, with the result that the large buffers being sent will get interleaved.
If you send messages on DGRAM sockets, on the other hand, either the entire message will be sent atomically (as a single layer 4 packet, which might be fragmented and reassembled by lower layers of the protocol stack), or you will get an error (EMSGSIZE on Linux and other UNIX variants).
