maximizing linux raw socket throughput - linux

I have a simple application which receives packets of fixed ethertype via raw socket (the transport is ethernet), and sends two duplicates over another interface (via raw socket):
recvfrom() //blocking
//make duplicate
//add tail
sendto(packet1);
sendto(packet2);
I want two increase throughput. I need at least 4000 frames/second, can't change packet size. How can I achieve these? The system is embedded (AM335x SoC), kernel is 4.14.40... How can I encrease the performance?

A few considerations:
I know you said you can't change "packet size", but using buffered writes and infrequent flushes might really help performance
You could enable jumbo frames
You could disable Nagle
Usually people don't want to change their packet sizes, because they expect a 1-to-1 correspondence between their send()'s and recv()'s. This is not a good thing, because TCP specifically does not ensure that your send()'s and recv()'s will have a 1-1 correspondence. They usually will be 1-1, but they are not guaranteed to do so. Transmitting data with Nagle enabled or over many router hops or without Path MTU Discovery enabled makes the 1-1 relationship less likely.
So if you use buffering, and frame your data somehow (EG 1: terminate messages with a nul byte, if your data cannot otherwise have a nul or EG 2: transfer lengths as network shorts or something, so you know how much to read), you'll likely be killing two birds with one stone - that is, you'll get better speed and better reliability.

Related

Why TCP/IP speed depends on the size of sending data?

When I sent small data (16 bytes and 128 bytes) continuously (use a 100-time loop without any inserted delay), the throughput of TCP_NODELAY setting seems not as good as normal setting. Additionally, TCP-slow-start appeared to affect the transmission in the beginning.
The reason is that I want to control a device from PC via Ethernet. The processing time of this device is around several microseconds, but the huge latency of sending command affected the entire system. Could you share me some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, sometimes the received time dropped until 4.2 us. Do you have any idea of this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket library are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please shown me some documents explaining this problem? Thank you in advance.
To establish a tcp connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on linux, don't know on windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receivers advertised window has the space (use socket option SO_RCVBUF to set). Finally, to close the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge affect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much affect as long as the writer is effectively writing quickly. It only prevents sending less than full MSS segements, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers as compared to the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.2KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.
When you make your test with a significative amount of data, for example your bigger test (512Mib, 536 millions bytes), the following happens.
The data is sent by TCP layer, breaking them in segments of a certain length. Let assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is a overhead (control and management added data to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for ethernet, for a total of 56 bytes every segment. Please note that this number is the minimum, not accounting the ethernet preamble for example; moreover sometimes IP and TCP overhead can be bigger because optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512Mib, you also transmit 56*367,000 = 20M bytes on the line. The total number of bytes becomes 536+20 = 556 millions of bytes, or 4.448 millions of bits. If you divide this number of bits by the time elapsed, 4.6 seconds, you get a bitrate of 966 megabits per second, which is higher than what you calculated not taking in account the overhead.
From the above calculus, it seems that your ethernet is a gigabit. It's maximum transfer rate should be 1,000 megabits per second and you are getting really near to it. The rest of the time is due to more overhead we didn't account for, and some latencies that are always present and tend to be cancelled as more data is transferred (but they will never be defeated completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, latencies of the protocol and other nice things get more and more important. For example, if you transmit 16 bytes in 165 microseconds (first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty acqueduct by dropping some tears of water in it: the tears get lost before they reach the other end. This is because TCP/IP and ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP+IP and ethernet, it is very normal that, for little data, performances are not optimal. From your tests you see that the transfer rate climbs steeply when the data size reaches 64Kbytes. This is not a coincidence.
From a comment you leaved above, it seems that you are looking for a low-latency communication, instead than one with big bandwidth. It is a common mistake to confuse different kind of performances. Moreover, in respect to this, I must say that TCP/IP and ethernet are completely non-deterministic. They are quick, of course, but nobody can say how much because there are too many layers in between. Even in your simple setup, if a single packet get lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN. Its design is exactly what you want: it transmits little data with high speed, low latency, deterministic time (just microseconds after you transmitted a packet, you know if it has been received or not. To be more precise: exactly at the end of the transmission of a packet you know if it reached the destination or not).
TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.

Capturing performance with pcap vs raw socket

When capturing network traffic for debugging, there seem to be two common approaches:
Use a raw socket.
Use libpcap.
Performance-wise, is there much difference between these two approaches? libpcap seems a nice compatible way to listen to a real network connection or to replay some canned data, but does that feature set come with a performance hit?
The answer is intended to explain more about the libpcap.
libpcap uses the PF_PACKET to capture packets on an interface. Refer to the following link.
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
From the above link
In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet, it requires two if you want to get packet's timestamp
(like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size 
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
highest bandwidth. By using a shared buffer between the kernel and the user
also has the benefit of minimizing packet copies.
performance improvement may vary depending on PF_PACKET implementation is used. 
From https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt -
It is said that TPACKET_V3 brings the following benefits:
 *) ~15 - 20% reduction in CPU-usage
 *) ~20% increase in packet capture rate
The downside of using libpcap -
If an application needs to hold the packet then it may need to make
a copy of the incoming packet.
Refer to manpage of pcap_next_ex.
pcap_next_ex() reads the next packet and returns a success/failure indication. If the packet was read without problems, the pointer
pointed to by the pkt_header argument is set to point to the
pcap_pkthdr struct for the packet, and the pointer pointed to by the
pkt_data argument is set to point to the data in the packet. The
struct pcap_pkthdr and the packet data are not to be freed by the
caller, and are not guaranteed to be valid after the next call to
pcap_next_ex(), pcap_next(), pcap_loop(), or pcap_dispatch(); if the
code needs them to remain valid, it must make a copy of them.
Performance penalty if application only interested in incoming
packets.
PF_PACKET works as taps in the kernel i.e. all the incoming and outgoing packets are delivered to PF_SOCKET.  Which results in an expensive call to packet_rcv for all the outgoing packets.  Since libpcap uses the PF_PACKET, so libpcap can capture all the incoming as well outgoing packets.
if application is only interested in incoming packets then outgoing packets can be discarded by setting pcap_setdirection on the libpcap handle. libpcap internally discards the outgoing packets by checking the flags on the packet metadata.
So in essence, outgoing packets are still seen by the libpcap but only to be discarded later. This is performance penalty for the application which is interested in incoming packets only.
Raw packet works on IP level (OSI layer 3), pcap on data link layer (OSI layer 2). So its less a performance issue and more a question of what you want to capture. If performance is your main issue search for PF_RING etc, that's what current IDS use for capturing.
Edit: raw packets can be either IP level (AF_INET) or data link layer (AF_PACKET), pcap might actually use raw sockets, see Does libpcap use raw sockets underneath them?

socket UDP recvfrom not working for large packets even if buffer given is enough on LINUX

I am calling a recvfrom api of a valid address , where i am trying to read data of size 9600 bytes , the buffer i have provided i of size 12KB , I am not even getting select read events.
Even tough recommended MTU size is 1.5 KB, I am able to send and receive packets of 4 KB.
I am using android NDK , (Linux) for development.
Please help . Is there a socket Option i have to set to read large buffers ?
If you send a packet larger than the MTU, it will be fragmented. That is, it'll be broken up into smaller pieces, each which fits within the MTU. The problem with this is that if even one of those pieces is lost (quite likely on a cellular connection...), the entire packet will effectively disappear.
To determine whether this is the case you'll need to use a packet sniffer on one (or both) ends of the connection. Wireshark is a good choice on a PC end, or tcpdump on the android side (you'll need root). Keep in mind that home routers may reassemble fragmented packets - this means that if you're sniffing packets from inside a home router/firewall, you might not see any fragments arrive until all of them arrive at the router (and obviously if some are getting lost this won't happen).
A better option would be to simply ensure that you're always sending packets smaller than the MTU, of course. Fragmentation is almost never the right thing to be doing. Keep in mind that the MTU may vary at various hops along the path between server and client - you can either use the common choice of a bit less than 1500 (1400 ought to be safe), or try to probe for it by setting the MTU discovery flag on your UDP packets (via IP_MTU_DISCOVER) and always sending less than the value returned by getsockopt's IP_MTU option (including on retransmits!)

Alternative to pcap (Linux)

On a Linux router I wrote a C-program which uses pcap to get the IP header, and length of the packet. In that way I am able to gather statistics and measure bandwidth based on IP. Pretty neat. :-)
Now the traffic and number of users has grown, and the old program starts to struggle. That is, the router struggles to cope with the massive amount of packets. It's over 50000 packets per second all in all in "prime time".
The program itself is pretty optimized. I don't want to show off, but I believe it's as good as it can get. It reads the IP header, and the packet length. It then converts the IP to a index (just a simple subtract), and the length of the packet is stored (accumulated) in an array. Every now and then (actually a SIGALRM) it stores the array in a MySQL database.
My question is: Is there any other way to tap into an ethernet device to get the bit-stream "cheaper" than pcap?
I can of course modify the ethernet driver to include single IP statistics gathering, but that seems a little overkill.
Basically my program is a 'tcpdump' on a busy eth0 and that will eventually kill my router.
Have you considered PF_RING? It's still the pcap-like world, but on steroids - thanks to the zero-copy mechanism:
As you see, there is a kernel module that provides low-level packet copying into the PF_RING buffer, and there is the userland part that allows to access this buffer.
Who needs PF_RING?
Basically everyone who has to handle many packets per second. The term ‘many’ changes according to the hardware you use for traffic analysis. It can range from 80k pkt/sec on a 1,2GHz ARM to 14M pkt/sec and above on a low-end 2,5GHz Xeon. PF_RING not only enables you to capture packets faster, it also captures packets more efficiently preserving CPU cycles....
I highly recommend you to use PF_RING ZC. It could be found under /userland/examples_zc. it is part of pf_ring.
you can handle and capture tens of Gbps traffics in line rate by pf_ring zc.

I need a TCP option (ioctl) to send data immediately

I've got an unusual situation: I'm using a Linux system in an embedded situation (Intel box, currently using a 2.6.20 kernel.) which has to communicate with an embedded system that has a partially broken TCP implementation. As near as I can tell right now they expect each message from us to come in a separate Ethernet frame! They seem to have problems when messages are split across Ethernet frames.
We are on the local network with the device, and there are no routers between us (just a switch).
We are, of course, trying to force them to fix their system, but that may not end up being feasible.
I've already set TCP_NODELAY on my sockets (I connect to them), but that only helps if I don't try to send more than one message at a time. If I have several outgoing messages in a row, those messages tend to end up in one or two Ethernet frames, which causes trouble on the other system.
I can generally avoid the problem by using a timer to avoid sending messages too close together, but that obviously limits our throughput. Further, if I turn the time down too low, I risk network congestion holding up packet transmits and ending up allowing more than one of my messages into the same packet.
Is there any way that I can tell whether the driver has data queued or not? Is there some way I can force the driver to send independent write calls in independent transport layer packets? I've had a look through the socket(7) and tcp(7) man pages and I didn't find anything. It may just be that I don't know what I'm looking for.
Obviously, UDP would be one way out, but again, I don't think we can make the other end change anything much at this point.
Any help greatly appreciated.
IIUC, setting the TCP_NODELAY option should flush all packets (i.e. tcp.c implements setting of NODELAY with a call to tcp_push_pending_frames). So if you set the socket option after each send call, you should get what you want.
You cannot work around a problem unless you're sure what the problem is.
If they've done the newbie mistake of assuming that recv() receives exactly one message then I don't see a way to solve it completely. Sending only one message per Ethernet frame is one thing, but if multiple Ethernet frames arrive before the receiver calls recv() it will still get multiple messages in one call.
Network congestion makes it practically impossible to prevent this (while maintaining decent throughput) even if they can tell you how often they call recv().
Maybe, set TCP_NODELAY and set your MTU low enough so that there would be at most 1 message per frame? Oh, and add "dont-fragment" flag on outgoing packets
Have you tried opening a new socket for each message and closing it immediately? The overhead may be nauseating,but this should delimit your messages.
In the worst case scenario you could go one level lower (raw sockets), where you have better control over the packets sent, but then you'd have to deal with all the nitty-gritty of TCP.
Maybe you could try putting the tcp stack into low-latency mode:
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
That should favor emitting packets as quickly as possible over combining data. Read the man on tcp(7) for more information.

Resources