Will freeing a cloned skb (created with skb_clone) free the original skb? - linux

I want to enqueue an skb in more than one queue, so I thought of using the cloning option.
Now my question is: if I call kfree_skb() on the cloned skb, will it release the original skb, or just drop one reference?
Thanks!

kfree_skb() will do the right thing with cloned skbuffs, i.e. it frees the sk_buff structure itself but not the data, if the data is still referenced by other skbuffs.
This is done in skb_release_data(), which checks whether the skbuff is not a clone, or whether this was the last reference to skb->data (done in a roundabout way to support headerless skbuffs, which hold references to the payload part of skb->data (the upper 16 bits of skb->dataref) besides the usual reference to the whole of skb->data).
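A simplified sketch of that check (paraphrasing older kernels; exact field names and helpers vary between versions), following the dataref layout described above:

static void skb_release_data_sketch(struct sk_buff *skb)
{
    /* Non-clones own their data outright; clones share it via dataref.
     * Headerless skbuffs (skb->nohdr) also drop one of the payload-only
     * references kept in the upper 16 bits of dataref. */
    if (skb->cloned &&
        atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
                          &skb_shinfo(skb)->dataref))
        return;  /* other skbuffs still reference skb->data */

    /* Last reference gone: free the fragments and skb->head here. */
}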

When an skb is cloned, new memory is allocated for the cloned sk_buff, and all of the struct sk_buff members of the clone are private to it.
The data, i.e. the packet, however, is shared between the original skb and its clone; only the sk_buff structure is copied to new memory. If you free the original skb, the data (your packet) is freed only once the dataref count drops to zero.
If you don't want to lose the data when freeing either skb, use skb_copy instead of skb_clone: skb_copy copies both the sk_buff and the packet to a new memory area.
EDIT: revised the earlier version of this answer with some corrections.
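A minimal sketch of the difference between the two calls (GFP_ATOMIC assumed, as in softirq context):

struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC); /* new sk_buff header, packet data shared with the original */
struct sk_buff *copy  = skb_copy(skb, GFP_ATOMIC);  /* new sk_buff header plus a private copy of the packet data */

kfree_skb(clone); /* frees only the clone's header; the data stays as long as the original references it */
kfree_skb(copy);  /* frees both the header and the copied data */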

Related

Will data be copied byte by byte when moving a variable?

I'm new to Rust, and I am wondering what exactly happens when moving a variable.
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 1 };
    let q = p;
}
When let q = p; runs, will the data (8 bytes in size) be copied from one memory address to another? Since p is moved here and thus cannot be used anymore, I think it would be fine to make q's underlying memory address equal to p's. In other words, I think it is OK if nothing is copied in the machine code.
So my question is: will data be copied byte by byte when moving a variable? If it will, why?
[W]ill data be copied byte by byte when moving a variable?
In general, yes. To move a value, Rust simply performs a bitwise copy. If the value is not Copy, the source won't be used anymore after the move. If the value is Copy, both the source and the destination can be used.
However, there are many cases where the compiler backend can eliminate the copy by proving that the code behaves identically without it. This optimization happens completely in LLVM. In your example, the LLVM IR still contains the instructions to move the data, but the generated code does not contain the move even in debug mode.
If it will, why?
There are many reasons why the compiler can be unable to use the same memory for source and destination. In your example, with two variables in the same stack frame, it's easy to see that the move is not needed, but the code is a bit pointless anyway (though sometimes people do move values inside a function to make a variable immutable).
Here are just a few illustrations why the compiler may be unable to reuse the source memory for the destination:
The source value may be on the stack, while the destination is on the heap, or vice versa. The statement let b = Box::new(3); will move the value 3 from the stack to the heap; let i = *b; will move it from the heap back to the stack. It's still possible that the compiler can eliminate these moves, e.g. by writing the constant 3 to the heap immediately, without writing it to the stack first.
Source and destination may be on different stack frames, when moving values across functions – e.g. when passing a value into a function, or when returning a value from a function.
Source and destination values may be stored in struct fields, so they need to end up at the right offset inside the struct.
These are just a few examples. The takeaway is that in general, a move may result in a bitwise copy. Keep in mind that a bitwise copy is very cheap, though, and that the optimizer usually does a good job, so you should only worry about this if you actually have a proven performance bottleneck.

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
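For context, the mmap() file operation in such a driver usually boils down to a single dma_mmap_coherent() call; here is a sketch with illustrative names (struct my_dev and its cpu_addr/dma_handle fields stand in for the driver's own bookkeeping):

static int coherent_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct my_dev *mdev = file->private_data;
    size_t size = vma->vm_end - vma->vm_start;

    /* dma_mmap_coherent() maps the preallocated coherent buffer into the
     * calling process; it uses remap_pfn_range() under the hood, which is
     * why the vma ends up with VM_PFNMAP set (see the update below). */
    return dma_mmap_coherent(mdev->dev, vma, mdev->cpu_addr, mdev->dma_handle, size);
}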
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
    .iov_base = buf,
    .iov_len = len
};

pipe(pipes);

sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
    perror("vmsplice");
    return -1;
}

sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
    perror("splice");
    return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() with a function that just prints the data pointed to by buf (or use send() without MSG_ZEROCOPY), everything works just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update to my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of a struct page* associated with each page, which the zero-copy path needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
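In code, those two steps look roughly like this (a sketch; error handling trimmed and page_order chosen by the driver):

struct page *page;

page = alloc_pages(GFP_USER, page_order);  /* one block of 2^page_order contiguous pages */
if (!page)
    return -ENOMEM;

split_page(page, page_order);  /* page[0] .. page[(1 << page_order) - 1] are now independent 0-order pages */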
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
    int res;
    ...
    for (i = 0; i < (1 << page_order); ++i) {
        if ((res = vm_insert_page(vma, vma->vm_start + i * PAGE_SIZE, &page[i])) < 0) {
            break;
        }
    }
    vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
    ...
    return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < (1 << page_order); ++i) {
    __free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
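The user-space side then follows the standard MSG_ZEROCOPY pattern (a sketch; reading the completion notifications from the socket's error queue is omitted):

int one = 1;
ssize_t sent_bytes;

/* MSG_ZEROCOPY has to be enabled per socket first. */
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
    perror("setsockopt(SO_ZEROCOPY)");
    return -1;
}

sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}

/* buf must not be reused until the completion for this send has been read
 * from the error queue with recvmsg(fd, &msg, MSG_ERRQUEUE). */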
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size of buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single-buffer calls.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you understand why alloc_pages requires a power-of-2 number of pages.
To optimize the page-allocation path (and to decrease external fragmentation), which is exercised very frequently, the Linux kernel developed the per-CPU page cache and the buddy allocator to allocate memory (there is another allocator, slab, which serves memory allocations smaller than a page).
The per-CPU page cache serves single-page allocation requests, while the buddy allocator keeps 11 lists, holding blocks of 2^0 through 2^10 physical pages respectively. These lists perform well when allocating and freeing pages, and of course, the premise is that you are requesting a power-of-2-sized buffer.
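For example, in-kernel callers usually round a byte size up to the matching order with get_order() before asking the buddy allocator for pages; a small sketch:

/* Allocate at least `size` bytes of physically contiguous memory. The buddy
 * allocator only hands out 2^order pages, so get_order() rounds up and any
 * space beyond `size` is simply unused. */
unsigned int order = get_order(size);           /* smallest order with (PAGE_SIZE << order) >= size */
struct page *page = alloc_pages(GFP_KERNEL, order);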

Memory Leaks while using gstbuffer

I have a pipeline which takes data from a webcam and processes it.
For the processing, I need to pull the buffer to an appsink and push it into a pipeline using an appsrc element.
While pushing, I used the gst_buffer_new_wrapped function.
A new buffer is then allocated every time I push data, but how to free that memory is the problem.
I tried gst_buffer_unref(buffer);
and got the error below:
Error in `./uuHiesSoaServer': free(): invalid pointer: 0x00007fddf52f6000
I took the data into an unsigned char pointer and then wrapped it into a GstBuffer based on the size.
Now, how do I free the allocated memory?
g_signal_emit_by_name (Source, "push-buffer", Buffer, &ret);
I used the above function for pushing data into Source (the appsrc).
That function is called continuously on a separate thread.
When data is available, the thread function creates a buffer using
gst_buffer_new_wrapped((void *)data, Size);
When checking for memory leaks in Valgrind, the above line was reported as a leak.
How to solve this?
How do you push the buffer into appsrc?
If you use the gst_app_src_push_buffer function, I guess you do not have to free resources, because gst_app_src_push_buffer takes ownership of the buffer (which means it also frees it).
Check this example
If you use the need-data callback, you may need to free the data - check this example.
HTH
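One more note: gst_buffer_new_wrapped() assumes the wrapped data was allocated with g_malloc() and frees it with g_free(), which would explain the invalid-pointer error if the memory came from a different allocator. In that case, gst_buffer_new_wrapped_full() lets you pass the matching free function, so the memory is released when the buffer's refcount drops to zero. A sketch, assuming data and Size as in the question and data allocated with plain malloc():

GstBuffer *buffer = gst_buffer_new_wrapped_full(0,                      /* GstMemoryFlags */
                                                (gpointer)data, Size,   /* data, maxsize */
                                                0, Size,                /* offset, size */
                                                (gpointer)data,         /* user_data passed to notify */
                                                (GDestroyNotify)free);  /* called when the buffer is finalized */

g_signal_emit_by_name(Source, "push-buffer", buffer, &ret);
gst_buffer_unref(buffer);  /* the push-buffer action signal does not take ownership */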

LL_ALLOCATED_SPACE and other considerations

I have a kernel module wherein I capture a packet in the PRE-ROUTING hook for some processing. I then allocate a new skb (I can't do it in the same skb) and put the processed payload of the input skb and the IP header in this new skb. I would then want to call netif_rx for this new skb and let it traverse the kernel networking stack.
I am a little confused about the size of the new skb I should allocate, and where skb->data should point (to the network header or the MAC header). What should skb->len be? Should it include the MAC header or not?
len; // total length of new ip datagram
skb_new = dev_alloc_skb(len + LL_ALLOCATED_SPACE(skb->dev) + ETH_LEN);
After this, how much should I reserve for the LL header and trailer, and where should skb_new->data point?
I want to call netif_rx(skb_new) after filling in the required details in the skb. Basically, what should happen between allocating the skb and calling netif_rx?
Any link or description will help.
Thanks in advance.
A socket buffer (skb) holds ALL of the information for the current protocol data unit (PDU) and accounts for ALL of the data being passed around for this PDU. Regardless of how much slack space you leave at the beginning of the skb, you point your skb->data to the beginning of YOUR data.
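A sketch of the usual sequence for an IP-only skb (no MAC header) handed to netif_rx(); payload and len are placeholders for your processed datagram, and checksum handling depends on what you changed:

struct sk_buff *skb_new;

skb_new = dev_alloc_skb(len + LL_ALLOCATED_SPACE(skb->dev));
if (!skb_new)
    return NF_DROP;  /* or whatever error handling fits your hook */

skb_reserve(skb_new, LL_RESERVED_SPACE(skb->dev));  /* headroom in case a link-layer header is added later */
skb_put(skb_new, len);                              /* skb->len == len; skb->data is the start of the IP header */
memcpy(skb_new->data, payload, len);                /* your processed IP datagram */

skb_reset_network_header(skb_new);                  /* network header at skb->data */
skb_new->dev = skb->dev;
skb_new->protocol = htons(ETH_P_IP);
skb_new->ip_summed = CHECKSUM_UNNECESSARY;          /* or recompute the checksums you modified */

netif_rx(skb_new);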

Minimizing copies when writing large data to a socket

I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of sending the data back to the client is:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, file, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing, given that jemalloc probably allocates memory through mmap in the first place. I am not sure about (3) though. Will this really lead to any benefits? Figure 4 in this article on Linux zero-copy methods suggests that a further copy can be prevented using sendfile:
no data is copied into the socket buffer. Instead, only descriptors
with information about the whereabouts and length of the data are
appended to the socket buffer. The DMA engine passes data directly
from the kernel buffer to the protocol engine, thus eliminating the
remaining final copy.
This seems like a win if everything works out. I don't know if my mmapped buffer counts as a kernel buffer, though. Also, I don't know when it is safe to re-use this buffer. Since the fd and length are the only things appended to the socket buffer, I assume that the kernel actually writes this data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?
It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and
in many cases do return before the data sent over TCP by the method
call has been acknowledged. These methods return as soon as all data
is written into the socket buffers (sk buff) and is pushed to the TCP
write queue, the TCP engine can manage alone from that point on. In
other words at the time sendfile returns the last TCP send window is
not actually sent to the remote host but queued. In cases where
scatter-gather DMA is supported there is no separate buffer which
holds these bytes, rather the buffers(sk buffs) just hold pointers to
the pages of OS buffer cache, where the contents of file is located.
This might lead to a race condition if we modify the content of the
file corresponding to the data in the last TCP send window as soon as
sendfile is returned. As a result TCP engine may send newly written
data to the remote host instead of what we originally intended to
send.
Provided the buffer from an mmapped file is even considered "DMA-able", it seems like there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network
socket, it is not possible to say when all data has been sent. Even if
splice() returns, the network stack may not have sent all data yet. So
reusing the buffer may overwrite unsent data.
For cases 1 and 2 - does the operation you marked as // Store image data in this buffer require any conversion? Is it just a plain copy from memory to buf?
If it's just a plain copy, you can use write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
int result;
int sent = 0;

while (sent < size) {
    result = write(socket, img + sent, size - sent);
    if (result < 0) {
        /* error handling here */
        break;
    }
    sent += result;
}
This works correctly for blocking I/O (the default behavior). If you need to write the data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3 - sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to a client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
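A minimal sketch of that file-to-socket path with sendfile(2), assuming a blocking socket and a regular file of file_size bytes open as file_fd:

off_t offset = 0;

while (offset < file_size) {
    ssize_t n = sendfile(socket_fd, file_fd, &offset, file_size - offset);
    if (n < 0) {
        /* error handling here */
        break;
    }
    /* sendfile() advances offset by the number of bytes it queued */
}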
