How to tune Linux network buffer sizes

I'm reading "Kafka: The Definitive Guide"; on page 35 (networking section) it says:
... The first adjustment is to change the default and maximum amount of memory allocated for the send and receive buffers for each socket. This will significantly increase performance for large transfers. The relevant parameters for the send and receive buffer default size per socket are net.core.wmem_default and net.core.rmem_default ...
In addition to the socket settings, the send and receive buffer sizes for TCP sockets must be set separately using the net.ipv4.tcp_wmem and net.ipv4.tcp_rmem parameters.
Why should we set both net.core.wmem and net.ipv4.tcp_wmem?

Short answer - r/wmem_default are used for setting static socket buffer sizes, while tcp_r/wmem are used for controlling TCP send/receive window size and buffers dynamically.
More details:
By tracking the usages of r/wmem_default and tcp_r/wmem (kernel 4.14) we can see that r/wmem_default are only used in sock_init_data():
void sock_init_data(struct socket *sock, struct sock *sk)
{
    sk_init_common(sk);
    ...
    sk->sk_rcvbuf = sysctl_rmem_default;
    sk->sk_sndbuf = sysctl_wmem_default;
This initializes the socket's send and receive buffers; they might later be overridden in sock_setsockopt():
int sock_setsockopt(struct socket *sock, int level, int optname,
                    char __user *optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;
    ...
    sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
    ...
    sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
Usages of tcp_rmem are found in these functions: tcp_select_initial_window() in tcp_output.c, and __tcp_grow_window(), tcp_fixup_rcvbuf(), tcp_clamp_window() and tcp_rcv_space_adjust() in tcp_input.c. In all of these, the value is used to control the receive window and/or the socket's receive buffer dynamically, taking the current traffic and system parameters into consideration.
A similar search for tcp_wmem shows that it is only used for dynamic changes to the socket's send buffer, in tcp_init_sock() (tcp.c) and tcp_sndbuf_expand() (tcp_input.c).
So when you want the kernel to better tune your traffic, the most important values are tcp_r/wmem. The per-socket buffer size is usually overridden by the user anyway, so the default value doesn't really matter much. For exact tuning operations, try reading the comments in tcp_input.c marked as "tuning". There's a lot of valuable information there.
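To illustrate that last point, here is a rough sketch of how an application typically overrides the per-socket default from user space (a hedged example, not from the book; note that, as the sock_setsockopt() excerpt above shows, the kernel stores double the requested value, and the request is bounded by net.core.wmem_max):
#include <stdio.h>
#include <sys/socket.h>

/* sock: any socket; request a 1 MiB send buffer and read back the
 * size the kernel actually applied. */
void override_sndbuf(int sock)
{
    int requested = 1024 * 1024;
    int effective = 0;
    socklen_t len = sizeof(effective);

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt(SO_SNDBUF)");

    /* Typically reports 2 * requested (capped by net.core.wmem_max),
     * matching the val * 2 seen in sock_setsockopt() above. */
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &effective, &len) == 0)
        printf("effective send buffer: %d bytes\n", effective);
}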
Hope this helps.

Related

netperf socket size vs buffer set for send/recv calls?

While I was trying to implement benchmark testware using netperf, I happened to read its manual, where I got this query.
In the TCP_STREAM-specific tests there are options, -s and -S, to specify the local (netperf client) and remote (netperf server) socket buffer sizes respectively. Is that a regular BSD socket buffer size? There are also options to specify the local send message size (-m) and the remote receive message size (-M). Is this the total message size after all TCP/IP encapsulation? Can anybody throw some light on this? It would be great if you could illustrate, using a use case, why we need these separate parameters, as the BSD socket buffer size appears to be the upper boundary here.
The socket buffer sizes (set via -s and -S) will control how much data may be outstanding on the connection at one time, by affecting either the receiver's advertised window (which will be based on SO_RCVBUF) or how much data the sender can hold waiting for acknowledgement (which will be based on SO_SNDBUF).
The send and receive message sizes (-m and -M) control how much data is presented in any one "send" (-m) or requested in any one "recv" (-M) call.
As TCP is a streaming protocol, it is perfectly legal/possible to make a send call with a number of bytes larger than the socket buffer(s). When the socket is blocking (as netperf uses), it simply means the send call will remain there until the last of its bytes have been put into the send socket buffer. On the receive side, one can ask for more than a socket buffer's worth of data in a single receive, but the semantics are such that the call will return with however many bytes happen to be there at the time if there are any, and will return with however many bytes arrive if the socket buffer was empty at the time of the call (again because netperf uses blocking sockets/calls).
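For illustration, a minimal sketch of those blocking-socket semantics (sock is assumed to be a connected, blocking TCP socket; the buffer size is arbitrary):
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* sock: a connected, blocking TCP socket. */
void recv_example(int sock)
{
    static char big[1 << 20];      /* legally ask for up to 1 MiB in one call */
    ssize_t got = recv(sock, big, sizeof(big), 0);

    /* recv() returns as soon as *some* data is available, so 'got' may be
     * far smaller than sizeof(big) - at most roughly a receive-socket-
     * buffer's worth. A send() larger than SO_SNDBUF is the mirror image:
     * it blocks until its last byte has been copied into the send socket
     * buffer, then returns. */
    if (got < 0)
        perror("recv");
    else
        printf("received %zd bytes\n", got);
}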

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
    .iov_base = buf,
    .iov_len = len
};

pipe(pipes);

sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
    perror("vmsplice");
    return -1;
}

sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
    perror("splice");
    return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() with a function that just prints the data pointed to by buf (or with a send() without MSG_ZEROCOPY), everything works just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
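Putting those pieces together, a rough sketch of the submission path might look like the following (dev, dma_chan, page, length and the my_dma_complete()/my_ctx names are assumptions standing in for the driver's own; note that with the dmaengine API the transfer normally does not start until dma_async_issue_pending() is called):
struct dma_async_tx_descriptor *dma_desc;
dma_addr_t dma_addr;

/* Hand the buffer to the device. */
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, dma_addr))
    return -ENOMEM;

dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length,
                                       DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
if (!dma_desc) {
    dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
    return -EIO;
}

dma_desc->callback = my_dma_complete;   /* hypothetical completion handler */
dma_desc->callback_param = my_ctx;      /* hypothetical driver context     */

dmaengine_submit(dma_desc);
dma_async_issue_pending(dma_chan);      /* actually start the transfer */

/* In my_dma_complete(): hand ownership back to the CPU so caches are
 * coherent before the data is read. */
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);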
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
    int res;
    ...
    for (i = 0; i < (1 << page_order); ++i) {
        if ((res = vm_insert_page(vma, vma->vm_start + i * PAGE_SIZE, &page[i])) < 0) {
            break;
        }
    }
    vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
    ...
    return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < (1 << page_order); ++i) {
    __free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
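For reference, a minimal user-space sketch of how such a buffer might then be used (the device path and sizes are made up for illustration; MSG_ZEROCOPY and SO_ZEROCOPY need reasonably recent kernel headers, and MSG_ZEROCOPY is only honoured after SO_ZEROCOPY has been enabled on the socket):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* tcp_fd: a connected TCP socket; len: size of the driver's buffer. */
int send_dma_buffer(int tcp_fd, size_t len)
{
    int one = 1;

    if (setsockopt(tcp_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        perror("setsockopt(SO_ZEROCOPY)");
        return -1;
    }

    int dev_fd = open("/dev/my_dma_dev", O_RDWR);   /* hypothetical device node */
    if (dev_fd < 0) {
        perror("open");
        return -1;
    }

    void *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, dev_fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(dev_fd);
        return -1;
    }

    ssize_t sent = send(tcp_fd, buf, len, MSG_ZEROCOPY);
    if (sent < 0)
        perror("send(MSG_ZEROCOPY)");

    /* The kernel keeps referencing these pages until a zerocopy completion
     * notification arrives on the socket's error queue (recvmsg with
     * MSG_ERRQUEUE), so the buffer must not be reused before then. */
    return sent < 0 ? -1 : 0;
}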
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you understand why alloc_pages requires a power-of-two page count.
Because page allocation happens very frequently, the Linux kernel uses a per-CPU page cache and the buddy allocator to serve it and to reduce external fragmentation (there is another allocator, slab, for allocations smaller than a page).
The per-CPU page cache serves single-page allocation requests, while the buddy allocator keeps 11 free lists, holding blocks of 2^0 up to 2^10 contiguous physical pages respectively. These lists make allocating and freeing pages fast, but the premise is that you request a power-of-two-sized buffer.
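For example, alloc_pages(GFP_USER, 3) asks the buddy allocator for a single block of 2^3 = 8 contiguous physical pages (32 KiB with 4 KiB pages).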

What is the relationship between RX/TX ring and sk_buff?

I know that each NIC has RX/TX rings in RAM that the OS uses for receiving/transmitting packets, and that one item (packet descriptor) in a ring includes the physical address of a packet, the length of the packet, etc. I wonder: does this descriptor point to an sk_buff? What happens if the packet is a GSO packet? Is it true that one descriptor in the ring = one packet = one sk_buff?
Does this descriptor point to an sk_buff?
Not exactly. sk_buff is a software construct; roughly, a data structure containing meta-information that describes some chunk of network data AND points to the data itself. So a NIC descriptor doesn't need to point to an sk_buff - it may only point to a data buffer (a DMA/physical address is used).
What happens if the packet is a GSO packet?
It's quite an ambiguous question to answer, since such offloads may be implemented in software (say, by the network stack) or may be done in hardware.
In the former case there is nothing to discuss in terms of NIC SW descriptors - the upper layer application provides a contiguous chunk of data, and the network stack produces smaller packets from it, so that sk_buff-s handed over to the network driver already describe small packets.
In the latter case (HW offload) the network driver is supplied with huge chunks of data (by handing single sk_buff-s or sk_buff chains over to it), and the network driver in turn posts appropriate descriptors to the NIC. It may be one descriptor pointing to a big chunk of data, or a handful of descriptors pointing to smaller parts of the same contiguous data buffer - it doesn't matter much, since the offload magic takes place in the HW: the overall data chunk will be sliced and packet headers prepended accordingly, yielding many smaller network packets to be put on the wire.
Is it true that one descriptor in the ring = one packet = one sk_buff?
Strictly speaking, no. It depends. Your network driver may be asked to transmit one sk_buff describing one data buffer. However, your driver under certain circumstances may decide to post multiple descriptors pointing to the same chunk of data but with different offsets - i.e. submission will be done in parts and there will be multiple descriptors in the NIC's ring related to a single sk_buff. Also, one packet is not always the same as one sk_buff - a packet may be presented as a handful of segments each described with a separate sk_buff forming an sk_buff chain (please find the next and prev fields in sk_buff).
The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, it invokes the DMA engine to place the packet into kernel memory via empty sk_buff's stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, packet data remains in the same kernel memory, avoiding any extra memory copies.
http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf
That last sentence seems to indicate that incoming packet data is kept in kernel memory in sk_buff structs without redundancy. So I'd say the answer to your question is yes, that descriptor would point to an sk_buff. And yes, each packet is put in its own sk_buff in rx_ring.
sk_buff has nothing to do with physical network interfaces (not directly, at least). sk_buff lists store data as seen by the socket accessing software and kernel protocol handlers (which manipulate those lists to add/remove headers and/or alter data, e.g. when encryption is employed).
It is the responsibility of the low-level driver to translate sk_buff list contents into something the physical network adapter will understand. In particular, network hardware can be really dumb (as when doing networking over serial lines), in which case the driver will basically read sk_buff lists byte by byte and send those over the wire.
The more advanced adapters are usually capable of doing scatter/gather DMA - given a list of addresses in RAM, they will be able to access each address and either obtain packet data from there or put received data back. However, the exact details of this mechanism are very much adapter specific and in many cases are not even consistent between a single vendor's products.
Does this descriptor point to an sk_buff?
The answer is YES. This avoids copying memory from one place (the rx_ring DMA buffer) to another (the sk_buff).
You can check the implementation of the b44 NIC driver (drivers/net/ethernet/broadcom/b44.c): the function b44_init_rings() pre-allocates a constant number of sk_buffs for the rx_ring, and these are also used as DMA buffers for the NIC.
static void b44_init_rings(struct b44 *bp)
{
    int i;

    b44_free_rings(bp);

    memset(bp->rx_ring, 0, B44_RX_RING_BYTES);
    memset(bp->tx_ring, 0, B44_TX_RING_BYTES);

    if (bp->flags & B44_FLAG_RX_RING_HACK)
        dma_sync_single_for_device(bp->sdev->dma_dev, bp->rx_ring_dma,
                                   DMA_TABLE_BYTES, DMA_BIDIRECTIONAL);

    if (bp->flags & B44_FLAG_TX_RING_HACK)
        dma_sync_single_for_device(bp->sdev->dma_dev, bp->tx_ring_dma,
                                   DMA_TABLE_BYTES, DMA_TO_DEVICE);

    for (i = 0; i < bp->rx_pending; i++) {
        if (b44_alloc_rx_skb(bp, -1, i) < 0)
            break;
    }
}

Minimizing copies when writing large data to a socket

I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of sending the data back to the client are:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, in_fd, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing, given that jemalloc probably allocates memory through mmap in the first place. I am not sure about (3) though. Will this really lead to any benefits? Figure 4 in this article on Linux zero-copy methods suggests that a further copy can be prevented using sendfile:
no data is copied into the socket buffer. Instead, only descriptors with information about the whereabouts and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final copy.
This seems like a win if everything works out. I don't know if my mmapped buffer counts as a kernel buffer though. Also I don't know when it is safe to re-use this buffer. Since only descriptors and lengths are appended to the socket buffer, I assume that the kernel actually writes the data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?
It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and in many cases do return before the data sent over TCP by the method call has been acknowledged. These methods return as soon as all data is written into the socket buffers (sk_buff) and is pushed to the TCP write queue, the TCP engine can manage alone from that point on. In other words at the time sendfile returns the last TCP send window is not actually sent to the remote host but queued. In cases where scatter-gather DMA is supported there is no separate buffer which holds these bytes, rather the buffers (sk_buffs) just hold pointers to the pages of OS buffer cache, where the contents of file is located. This might lead to a race condition if we modify the content of the file corresponding to the data in the last TCP send window as soon as sendfile is returned. As a result TCP engine may send newly written data to the remote host instead of what we originally intended to send.
Provided the buffer from an mmapped file is even considered "DMA-able", it seems there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network socket, it is not possible to say when all data has been sent. Even if splice() returns, the network stack may not have sent all data yet. So reusing the buffer may overwrite unsent data.
For cases 1 and 2 - does the operation you marked as // Store image data in this buffer require any conversion? Is it just a plain copy from memory to buf?
If it's just plain copy, you can use write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
int result;
int sent = 0;
while (sent < size) {
    result = write(socket, img + sent, size - sent);
    if (result < 0) {
        /* error handling here */
        break;
    }
    sent += result;
}
This works correctly for blocking I/O (the default behavior). If you need to write the data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3 - sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to the client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
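For illustration, a minimal sketch of what the sendfile variant could look like when the image lives in a file (the path and error handling are simplified; on Linux the in_fd must be something mmap-able such as a regular file):
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole contents of a file over a connected socket. */
int send_image_file(int sock_fd, const char *path)   /* path is hypothetical */
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        perror("fstat");
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n < 0) {
            perror("sendfile");
            close(file_fd);
            return -1;
        }
    }

    close(file_fd);
    return 0;
}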

sendto on Tru64 is returning ENOBUF

I am currently running an old system on Tru64 which involves lots of UDP sockets using the sendto() function. The sockets are used in our code to send messages to/from various processes and then eventually on to a thick client app that is connected remotely. Occasionally the socket to the thick client gets stuck, and this can cause some of these messages to build up. My question is how I can determine the current buffer size, and how I can determine the maximum message buffer size. The code below gives a snippet of how I set up the port and use the sendto function.
/* need to adjust the maximum size we can send on this */
/* as it needs to be able to cope with the biggest     */
/* messages we send                                    */
int lenlen, len ;

lenlen = sizeof(len) ;
len = 2 * 32000;    /* allow double for when the system is under load */

msg_socket = socket( AF_UNIX, SOCK_DGRAM, 0);
result = setsockopt(msg_socket, SOL_SOCKET, SO_SNDBUF, (char *)&len, lenlen) ;

result = sendto( msg_socket,
                 (char *)message,
                 (int)message_len,
                 flags,
                 dest_addr,
                 addrlen);
Note. We have ported this application to Linux and the problem does not seem to appear there.
Any help would be greatly appreciated.
Regards
The UDP send buffer size is different from TCP's - it just limits the size of the datagram. Quoting Stevens, UNIX Network Programming Vol. 1:
...
A UDP socket has a send buffer size (which we can change with SO_SNDBUF socket option, Section 7.5), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned. Since UDP is unreliable, it does not need to keep a copy of the application's data and does not need an actual send buffer. (The application data is normally copied into a kernel buffer of some form as it passes down the protocol stack, but this copy is discarded by the datalink layer after the data is transmitted.)
UDP simply prepends an 8-byte header and passes the datagram to IP. IPv4 or IPv6 prepends its header, determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue. If a UDP application sends large datagrams (say 2,000-byte datagrams), there's a much higher probability of fragmentation than with TCP, because TCP breaks the application data into MSS-sized chunks, something that has no counterpart in UDP.
The successful return from write to a UDP socket tells us that either the datagram or all fragments of the datagram have been added to the datalink output queue. If there is no room on the queue for the datagram or one of its fragments, ENOBUFS is often returned to the application.
Unfortunately, some implementations do not return this error, giving the application no indication that the datagram was discarded without even being transmitted.
The last footnote needs attention - but it looks like Tru64 has this error code listed in the manual page.
The proper way of doing it though is to queue your outstanding messages in the application itself and to carefully check return values and the errno after each system call. This still does not guarantee delivery (since UDP receivers might drop the packets without any notice to the senders). Check the UDP packet discard counters with netstat -s on both/all sides, see if they are growing. There is really no way around this besides switching to TCP or implementing your own timeout/ack and re-transmission logic.
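As an illustration of that "check return values and errno" advice, a minimal sketch of what the sending side might do (the retry count and backoff are only examples, not recommendations for your protocol):
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to send one datagram, retrying briefly if the outgoing queue is full. */
int send_msg(int sock, const void *msg, size_t len,
             const struct sockaddr *dest, socklen_t dest_len)
{
    int tries;

    for (tries = 0; tries < 5; tries++) {
        ssize_t n = sendto(sock, msg, len, 0, dest, dest_len);
        if (n >= 0)
            return 0;                 /* queued on the datalink output queue */
        if (errno == ENOBUFS || errno == EAGAIN) {
            usleep(1000);             /* example backoff; queue in the app instead if you can */
            continue;
        }
        perror("sendto");             /* any other error is fatal for this message */
        return -1;
    }
    return -1;                        /* still no room - caller should queue or drop */
}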
You should probably be using some sort of congestion control to avoid overloading the network. By far the easiest way to do this is to use TCP instead of UDP.
It fails less often on Linux because there UDP sockets wait for space in the local network interface queue (unless you set them non-blocking). However, with any operating system, if the overfull queue is not on the local system, the packet will be dropped silently.
