I was going through the process of packet transmission and packet reception in latest linux kernel. I can see that, there is a framework in skb where it supports "linear" data as well as "paged" data.
It has a separate structure called skb_shared_info to represent page fragments.
Now my doubt is, how will the device DMA the entire contents of the packet? Is it not going to be scattered across the memory?
Thanks
CHID
It depends on the capability of the networking hardware. Most “modern” NICs can do gather/scatter DMA and handle tranferring a packet into multiple, non-contiguous buffers, but the Linux kernel networking stack will only give an skb with nonlinear data to a driver/netdev if NETIF_F_SG is set (indicating the device can handle scatter/gather). If the device driver sets NETIF_F_SG then it is telling the stack that it can handle multiple physical buffers per packet.
Related
We need to transfer a lot of data (think Gbps) from ethernet to ethernet, adjusting it on the fly. Data from input (present in ram) must be scatter-gathered from pieces (many udp frames, parts of payload) to one contiguous chunk (in another ram place) and sent.
This is great opportunity to use scatter-gather DMA with memory to memory mode. But I don't see, that it is possible to use DMA engine from userspace linux on x64 cpus. Do I miss something? Maybe cpu is so optimized copying data, there is no need for userspace DMA?
We are talking about generic x64 server architecture, 32 cores.
This more a general question. Consider an external device. From time to time this device writes data via its device driver to a specific memory address. I want to write a small C program which read out this data. Is there a better way than just polling this address to check if the value has been changed? I want to keep the CPU load low.
I have done some further research
Is "memory mapped IO" an option? My naive idea is to let the external device writes a flag to a "memory mapped IO"-address which triggers a kernel device driver. The driver then "informs" the program which proceed the value. Can this work? How can a driver informs the program?
The answer may depend on what processor you intend to use, what the device is and possibly whether you are using an operating system or RTOS.
Memory mapped I/O per se is not a solution, that simply refers to I/O device registers that can be directly addressed via normal memory access instructions. Most devices will generate an interrupt when certain registers are updated or contain new valid data.
In general if using an RTOS you can arrange for the device driver to signal via a suitable IPC mechanism any client thread(s) that need to handle the data. If you are not using an RTOS, you could simply register a callback with the device driver which it would call whenever the data is updated. What the client does in the call back is its business - including reading the new data.
If the device in question generates interrupts, then the handling can be done on interrupt, if the device is capable of DMA, then it can handle blocks of data autonomously before teh DMA controller generates an DMA interrupt to a handler.
What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
TL;DR - They are different concepts, but it is quite likely that zero copy is supported within kernel bypass API/framework.
User Bypass
This mode of communicating should also be considered. It maybe possible for DMA-to-DMA transactions which do not involve the CPU at all. The idea is to use splice() or similar functions to avoid user space at all. Note, that with splice(), the entire data stream does not need to bypass user space. Headers can be read in user space and data streamed directly to disk. The most common downfall of this is splice() doesn't do checksum offloading.
Zero copy
The zero copy concept is only that the network buffers are fixed in place and are not moved around. In many cases, this is not really beneficial. Most modern network hardware supports scatter gather, also know as buffer descriptors, etc. The idea is the network hardware understands physical pointers. The buffer descriptor typically consists of,
Data pointer
Length
Next buffer descriptor
The benefit is that the network headers do not need to exist side-by-side and IP, TCP, and Application headers can reside physically seperate from the application data.
If a controller doesn't support this, then the TCP/IP headers must precede the user data so that they can be filled in before sending to the network controller.
zero copy also implies some kernel-user MMU setup so that pages are shared.
Kernel Bypass
Of course, you can bypass the kernel. This is what pcap and other sniffer software has been doing for some time. However, pcap does not prevent the normal kernel processing; but the concept is similar to what a kernel bypass framework would allow. Ie, directly deliver packets to user space where processing headers would happen.
However, it is difficult to see a case where user space will have a definite win unless it is tied to the particular hardware. Some network controllers may have scatter gather supported in the controller and others may not.
There are various incarnation of kernel interfaces to accomplish kernel by-pass. A difficulty is what happens with the received data and producing the data for transmission. Often this involve other devices and so there are many solutions.
To put this together...
Are they two phrases meaning the same thing, or different?
They are different as above hopefully explains.
Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
It is the opposite. Kernel bypass can use zero copy and most likely will support it as the buffers are completely under control of the application. Also, there is no memory sharing between the kernel and user space (meaning no need for MMU shared pages and whatever cache/TLB effects that may cause). So if you are using kernel bypass, it will often be advantageous to support zero copy; so the things may seem the same at first.
If scatter-gather DMA is available (most modern controllers) either user space or the kernel can use it. zero copy is not as useful in this case.
Reference:
Technical reference on OnLoad, a high band width kernel by-pass system.
PF Ring as of 2.6.32, if configured
Linux kernel network buffer management by David Miller. This gives an idea of how the protocols headers/trailers are managed in the kernel.
Zero-copy networking
You're doing zero-copy networking when you never copy the data between the user-space and the kernel-space (I mean memory space). By example:
C language
recv(fd, buffer, BUFFER_SIZE, 0);
By default the data are copied:
The kernel gets the data from the network stack
The kernel copies this data to the buffer, which is in the user-space.
With zero-copy method, the data are not copied and come to the user-space directly from the network stack.
Kernel Bypass
The kernel bypass is when you manage yourself, in the user-space, the network stack and hardware stuff. It is hard, but you will gain a lot of performance (there is zero copy, since all the data are in the user-space). This link could be interesting if you want more information.
ZERO-COPY:
When transmitting and receiving packets,
all packet data must be copied from user-space buffers to kernel-space buffers for transmitting and vice versa for receiving. A zero-copy driver avoids this by having user space and the driver share packet buffer memory directly.
Instead of having the transmit and receive point to buffers in kernel space which will later require to copy, a region of memory in user space is allocated, and mapped to a given region of physical memory, to be shared memory between the kernel buffers and the user-space buffers, then point each descriptor buffer to its corresponding place in the newly allocated memory.
Other examples of kernel bypass and zero copy are DPDK and RDMA. When an application uses DPDK it is bypassing the kernel TCP/IP stack. The application is creating the Ethernet frames and the NIC grabbing those frames with DMA directly from user space memory so it's zero copy because there is no copy from user space to kernel space. Applications can do similar things with RDMA. The application writes to queue pairs that the NIC directly access and transmits. RDMA iblibverbs is used inside the kernel as well so when iSER is using RDMA it's not Kernel bypass but it is zero copy.
http://dpdk.org/
https://www.openfabrics.org/index.php/openfabrics-software.html
What is the difference between DMA and memory-mapped IO? They both look similar to me.
Memory-mapped I/O allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.
DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.
Here is a paper has a thorough comparison between MMIO and DMA.
Design Guidelines for High Performance RDMA Systems
Since others have already answered the question, I'll just add a little bit of history.
Back in the old days, on x86 (PC) hardware, there was only I/O space and memory space. These were two different address spaces, accessed with different bus protocol and different CPU instructions, but able to talk over the same plug-in card slot.
Most devices used I/O space for both the control interface and the bulk data-transfer interface. The simple way to access data was to execute lots of CPU instructions to transfer data one word at a time from an I/O address to a memory address (sometimes known as "bit-banging.")
In order to move data from devices to host memory autonomously, there was no support in the ISA bus protocol for devices to initiate transfers. A compromise solution was invented: the DMA controller. This was a piece of hardware that sat up by the CPU and initiated transfers to move data from a device's I/O address to memory, or vice versa. Because the I/O address is the same, the DMA controller is doing the exact same operations as a CPU would, but a little more efficiently and allowing some freedom to keep running in the background (though possibly not for long as it can't talk to memory).
Fast-forward to the days of PCI, and the bus protocols got a lot smarter: any device can initiate a transfer. So it's possible for, say, a RAID controller card to move any data it likes to or from the host at any time it likes. This is called "bus master" mode, but for no particular reason people continue to refer to this mode as "DMA" even though the old DMA controller is long gone. Unlike old DMA transfers, there is frequently no corresponding I/O address at all, and the bus master mode is frequently the only interface present on the device, with no CPU "bit-banging" mode at all.
Memory-mapped IO means that the device registers are mapped into the machine's memory space - when those memory regions are read or written by the CPU, it's reading from or writing to the device, rather than real memory. To transfer data from the device to an actual memory buffer, the CPU has to read the data from the memory-mapped device registers and write it to the buffer (and the converse for transferring data to the device).
With a DMA transfer, the device is able to directly transfer data to or from a real memory buffer itself. The CPU tells the device the location of the buffer, and then can perform other work while the device is directly accessing memory.
Direct Memory Access (DMA) is a technique to transfer the data from I/O to memory and from memory to I/O without the intervention of the CPU. For this purpose, a special chip, named DMA controller, is used to control all activities and synchronization of data. As result, compare to other data transfer techniques, DMA is much faster.
On the other hand, Virtual memory acts as a cache between main memory and secondary memory. Data is fetched in advance from the secondary memory (hard disk) into the main memory so that data is already available in the main memory when needed. It allows us to run more applications on the system than we have enough physical memory to support.
I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in the kernel space, then to the user space for whatever application that then works on the packet data. Then I read about DMA, where the NIC directly copies the packet into memory, bypassing the CPU. So is the NIC -> kernel memory -> User space memory flow still valid? Also, do most NIC (e.g. Myricom) use DMA to improve packet capture rates?
Secondly, does RSS (Receive Side Scaling) work similarly in both Windows and Linux systems? I can only find detailed explanations on how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concept of RSS and MSI-X should still apply for linux systems, right?
Thank you.
Regards,
Rayne
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
At driver initialization, it will allocate some number of buffers and give these to the NIC.
When a packet is received by the NIC, it pulls the next address off its list of buffers, DMAs the data directly into it, and notifies the driver via an interrupt.
The driver gets the interrupt, and can either turn the buffer over to the kernel or it will allocate a new kernel buffer and copy the data. "Zero copy networking" is the former and obviously requires support from the operating system. (more below on this)
The driver needs to either allocate a new buffer (in the zero-copy case) or it will re-use the buffer. In either case, the buffer is given back to the NIC for future packets.
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it'd may as well copy the data, too: looking at the data incurs cache misses and copying it elsewhere can be for free with tuned code).
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and both processes are able to recv() incoming data.
TCP packets on the Internet are usually 1460 bytes of payload (MTU of 1500 = 20 byte IP header + 20 byte TCP header + 1460 bytes data). 1460 is not a power of 2 and won't match a page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented. There is no distinction between sender writes, and two 1000 byte writes waiting at the received will be consumed entirely in a 2000 byte read.
Taking this further, consider the user buffers. These are allocated by the application. In order to be used for zero-copy all the way down, the buffer needs to be page-aligned and not share that memory page with anything else. At recv() time, the kernel could theoretically remap the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above since successive packets will be on separate pages. The kernel could limit the data it hands back to each packet's payload, but this will mean a lot of additional system calls, page remapping and likely lower throughput overall.
I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
take a look at this paper, http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf it might help clearing out some of the memory management questions