copying skb->data to multiple descriptors

copying skb->data to multiple descriptors - linux

i am studying 8139too.c driver. for the transmit, the driver calls skb_copy_and_csum_dev() to copy the entire socket buffer into a descriptor ring whose buffer is big enough for the entire socket buffer.
if the descriptor ring buffer is smaller than skb->data, what is the correct way to break break skb->data up and copy skb->data into multiple descriptors?
(assuming scatter/gather is not being used)
thank you very much.

In the function **rtl8139_start_xmit() ** of 8139 driver, it first check whether the length of skb->data is larger than TX_BUF_SIZE, which is the MAX_ETH_FRAME_SIZE. If it is larger than TX_BUF_SIZE, the driver drops the packet.
if (likely(len < TX_BUF_SIZE)) {
if (len < ETH_ZLEN)
memset(tp->tx_buf[entry], 0, ETH_ZLEN);
skb_copy_and_csum_dev(skb, tp->tx_buf[entry]);
dev_kfree_skb(skb);
} else {
dev_kfree_skb(skb);
tp->stats.tx_dropped++;
return 0;
}
Generally, if the packet you try to send is larger than MAX_ETH_FRAME_SIZE, the IP layer of the protocol stack will fragment the packet, like you said "break XXX up". But when packet goes down to driver, it won't be broken up any more.
More infomation:
IP fragmentation on wikipedia

Related

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
perror("send");
return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
.iov_base = buf,
.iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
perror("vmsplice");
return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
perror("splice");
return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.

As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
int res;
...
for (i = 0; i < 2**page_order; ++i) {
if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
break;
}
}
vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < 2**page_order; ++i) {
__free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)

Maybe this will help you to understand why alloc_pages requires a power-of-2 page number.
To optimize the page allocation process(and decrease external fragmentations), which is frequently engaged, Linux kernel developed per-cpu page cache and buddy-allocator to allocate memory(there is another allocator, slab, to serve memory allocations that are smaller than a page).
Per-cpu page cache serve the one-page allocation request, while buddy-allocator keeps 11 lists, each containing 2^{0-10} physical pages respectively. These lists perform well when allocate and free pages, and of course, the premise is you are requesting a power-of-2-sized buffer.

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. With the specified Baudrate it should take around 5 seconds, but it takes nearly 10 times the expected time.
As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
I would expect a slower transmission as the optimal, but not this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tryed blocking
if (bus < 0) {
return;
}
struct termios options;
memset (&options, 0, sizeof options);
if(tcgetattr(bus, &options) != 0){
close(bus);
bus = -1;
return;
}
cfsetspeed (&options, B230400);
cfmakeraw(&options); // <- also tried this manually. did not make a difference
if(tcsetattr(bus, TCSANOW, &options) != 0)
{
close(bus);
bus = -1;
return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength){
tcdrain(bus); // <- taking way longer than expected
int32_t r = write(bus, &data[res], dataLength - res);
if(r == 0)
break;
if(r == -1){
break;
}
res += r;
}

B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
This might be helpful.

As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytese written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1). Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2).
In order to achieve the maximum throughput, the tranmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a sub-optinal manner.
Your termios configuration appears to be incomplete.
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The Ignore modem control lines and Enable receiver flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not properly check the return value for a possible error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.

What is the relationship between RX/TX ring and sk_buff?

I know that each NIC has its RX/TX ring in RAM for OS receiving/transmitting packets. And one item(packet descriptor) in the ring includes physical address of a packet, length of a packet and etc. I wonder that does this descriptor point to a sk_buff? And what happens if the packet is a GSO packet?Is this true that one descriptor in the ring = one packet = one sk_buff?

I wonder that does this descriptor point to a sk_buff?
Not exactly. sk_buff is a software construct, roughly, a data structure containing meta information to describe some chunk of network data AND point to the data itself. So, NIC descriptor doesn't need to point to sk_buff - it may only point to a data buffer (DMA/physical address is used).
And what happens if the packet is a GSO packet?
It's quite ambiguous question to answer since such offloads may be implemented in software (say, by the network stack) and may be done in hardware.
In the former case there is nothing to discuss in terms of NIC SW descriptors - the upper layer application provides a contiguous chunk of data, and the network stack produces smaller packets from it, so that sk_buff-s handed over to the network driver already describe small packets.
In the latter case (HW offload) the network driver is supplied with huge chunks of data (by means of handing over single sk_buff-s or sk_buff chains to it), and the network driver in turn posts appropriate descriptors to NIC - it may be one descriptor pointing to a big chunk of data, or a handful of descriptors pointing to smaller parts of the same contiguous data buffer - it doesn't matter a lot since the offload magic will take place in the HW - the overall data chunk will be sliced and packet headers will be prepended accordingly yielding many smaller network packets to be put on wire.
Is this true that one descriptor in the ring = one packet = one sk_buff?
Strictly speaking, no. It depends. Your network driver may be asked to transmit one sk_buff describing one data buffer. However, your driver under certain circumstances may decide to post multiple descriptors pointing to the same chunk of data but with different offsets - i.e. submission will be done in parts and there will be multiple descriptors in the NIC's ring related to a single sk_buff. Also, one packet is not always the same as one sk_buff - a packet may be presented as a handful of segments each described with a separate sk_buff forming an sk_buff chain (please find the next and prev fields in sk_buff).

The Linux kernel uses an sk_buff data structure to describe each
packet. When a packet arrives at the NIC, it invokes the DMA engine to
place the packet into the kernel memory via empty sk_buff's stored in
a ring buffer called rx_ring . An incoming packet is dropped if
the ring buffer is full. When a packet is processed at higher layers,
packet data remains in the same kernel memory, avoiding any extra
memory copies.
http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf
That last sentence seems to indicate that incoming packet data is kept in kernel memory in sk_buff structs without redundancy. So I'd say the answer to your question is yes, that descriptor would point to an sk_buff. And yes, each packet is put in it's own sk_buff in rx_ring.

sk_buff has nothing to do with physical network interfaces (not directly, at least). sk_buff lists store data as seen by the socket accessing software and kernel protocol handlers (which manipulate those lists to add/remove headers and/or alter data, e.g. when encryption is employed).
It is a responsibility of a low level driver to translate sk_buff list contents into something physical network adapter will understand. In particular, network hardware can be really dumb (like when doing networking over serial lines), in which case the driver will basically read sk_buff lists byte by byte and send those over the wire.
The more advanced adapters are usually capable of doing scatter/gather DMA - given a list of addresses in RAM they will be able to access each address and either obtain a packet data from there or put a received data back. However the exact details of this mechanism are very much adapter specific and in many cases are not even consistent between single vendor's products.

I wonder that does this descriptor point to a sk_buff?
The answer is YES. This avoids copying memory from one place (the rx_ring dma buffer) to another (the sk_buff).
you can check the implementation of the b44 NIC driver (in drivers/net/ethernet/broadcom/b44.c), the function b44_init_rings pre-allocates a constant number of sk_buff for the rx_ring, which are also used as DMA buffers for the NIC.
static void b44_init_rings(struct b44 *bp)
{
int i;
b44_free_rings(bp);
memset(bp->rx_ring, 0, B44_RX_RING_BYTES);
memset(bp->tx_ring, 0, B44_TX_RING_BYTES);
if (bp->flags & B44_FLAG_RX_RING_HACK)
dma_sync_single_for_device(bp->sdev->dma_dev, bp->rx_ring_dma,
DMA_TABLE_BYTES, DMA_BIDIRECTIONAL);
if (bp->flags & B44_FLAG_TX_RING_HACK)
dma_sync_single_for_device(bp->sdev->dma_dev, bp->tx_ring_dma,
DMA_TABLE_BYTES, DMA_TO_DEVICE);
for (i = 0; i < bp->rx_pending; i++) {
if (b44_alloc_rx_skb(bp, -1, i) < 0)
break;
}
}

LL_ALLOCATED_SPACE and other considerations

I have a kernel module wherein I capture a packet in PRE-ROUTING hook for some processing. I then allocate a new skb(cant do it in the same skb) and put the processed payload of the input skb and the ip header in this new skb. I would then want to do a netif_rx for this new skb and let it traverse the kernel networking stack.
I am little confused with the size of the new skb I should allocate, where my skb->data should point to (to network_header or mac header). What should be the skb->len, should it consider mac header or not?
len; // total length of new ip datagram
skb_new = dev_alloc_skb(len + LL_ALLOCATED_SPACE(skb->dev) + ETH_LEN);
after this how much should I reserve for LL header and trailer and where should my skb_new->data point to.
I want to call netif_rx(skb_new) after filling in the required details in skb. Basically what should follow after allocating the skb and before calling netif_rx.
Any link or description will help.
Thanks in advance.

A socket buffer (skb) holds ALL of the information for the current protocol data unit (PDU) and accounts for ALL of the data being passed around for this PDU. Regardless of how much slack space you leave at the beginning of the skb, you point your skb->data to the beginning of YOUR data.

Serial port : not able to write big chunk of data

I am trying to send text data from one PC to other using Serial cable. One of the PC is running linux and I am sending data from it using write(2) system call. The log size is approx 65K bytes but the write(2) system call returns some 4K bytes (i.e. this much amount of data is getting transferred). I tried breaking the data in chunks of 4K but write(2) returns -1.
My question is that "Is there any buffer limit for writing data on serial port? or can I send data of any size?. Also do I need to continously read data from other PC as I write 4K chunk of data"
Do I need to do any special configuration in termios structure for sending (huge) data?

The transmit buffer is one page (took a look at Linux 2.6.18 sources) - which is 4K in most (if not all) cases.
The other end must read (don't know the size of the receive buffer), but more importantly you should not write faster than the serial port can transmit, if you are using 115200 bps 8-N-1 you can write the 4K chunk approximately 3 times a second. (115200 / 9 / 4096 = 3.125)

Yes, there is a buffer limit - but when you reach that limit, the write() should block.
When write() returns -1, what is errno set to?

Make sure that the receiver is reading.
You should update the current position it your buffer from the write(), and continue the next write from there. (Applies to all writes(), regardless if the fd is a serial port, tcp socket or a file.)
If you get an error back for subsequent writes. Judging by the manpage, its safe to retry the writes for the following errnos: EAGAIN, EINTR, and probably ENOSPC. Use perror() to see what you get. (..and post it, I am curious.)
EFBIG would seem to indicate that you are trying to write using a buffer (or rather count) that is too large, but that is probably much larger than 64k.
If the internal buffer is filled up, because you are writing to fast, try to (nano)sleep a little between the writes. There are several clever ways of doing this (like tcp does), but if the rate is known, just write at a fixed rate.
If you think the receiver is actually reading, but not much happens, have a look at the serial ports flow-control options and if the cable is wired for DTS/RTS.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string