netfilter hook is not retrieving complete packet - linux

I'm writing a netfilter module that does deep packet inspection. However, during testing I found that the netfilter module is not receiving packets in full.
To verify this, I wrote the following code to dump packets destined for port 80 and write the result to the dmesg buffer:
const struct iphdr *ip_header = ip_hdr(skb);

if (ip_header->protocol == IPPROTO_TCP)
{
    const struct tcphdr *tcp_header = tcp_hdr(skb);
    char *buff;

    if (ntohs(tcp_header->dest) != 80)
    {
        return NF_ACCEPT;
    }
    buff = kzalloc(skb->len * 10, GFP_KERNEL);
    if (buff != NULL)
    {
        int pos = 0, i = 0;

        for (i = 0; i < skb->len; i++)
        {
            pos += sprintf(buff + pos, "%02X", skb->data[i] & 0xFF);
        }
        pr_info("(%pI4):%d --> (%pI4):%d, len=%d, data=%s\n",
                &ip_header->saddr,
                ntohs(tcp_header->source),
                &ip_header->daddr,
                ntohs(tcp_header->dest),
                skb->len,
                buff);
        kfree(buff);
    }
}
In a virtual machine running locally, I can retrieve the full HTTP request; on Alibaba Cloud, and some other OpenStack-based VPS providers, the packet is cut off in the middle.
To verify this, I executed curl http://VPS_IP from another VPS and got the following output in the dmesg buffer:
[ 1163.370483] (XXXX):5007 --> (XXXX):80, len=237, data=451600ED000040003106E3983D87A950AC11D273138F00505A468086B44CE19E80180804269300000101080A1D07500A000D2D90474554202F20485454502F312E310D0A486F73743A2033392E3130372E32342E37370D0A4163636570743A202A2F2A0D0A557365722D4167656E743A204D012000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000001E798090F5FFFF8C0000007B00000000E0678090F5FFFF823000003E00000040AE798090F5FFFF8C0000003E000000000000000000000000000000000000000000000000000000000000
When decoded, everything after "User-Agent: M" is gone or zeroed out, which is totally weird. Although skb->len is 237, half of the packet is missing.
Any ideas? I tried both PRE_ROUTING and LOCAL_IN hooks, with no change.

It appears that sometimes you are getting a linear skb, and sometimes your skb is not linear. In the latter case you are not reading the full data contents of the skb.
If skb->data_len is zero, then your skb is linear and the full data contents of the skb are in skb->data. If skb->data_len is not zero, then your skb is not linear, and skb->data contains just the first (linear) part of the data. The length of this area is skb->len - skb->data_len; the skb_headlen() helper function calculates that for convenience. The skb_is_nonlinear() helper function tells whether an skb is linear or not.
The rest of the data can be in paged fragments and in skb fragments, in that order.
skb_shinfo(skb)->nr_frags tells the number of paged fragments. Each paged fragment is described by an entry in the array skb_shinfo(skb)->frags[0 .. skb_shinfo(skb)->nr_frags - 1]. The skb_frag_size() and skb_frag_address() helper functions, which take the address of the structure describing a paged fragment, help with accessing this data. There are other useful helpers, depending on your kernel version.
If the total size of the data in the paged fragments is less than skb->data_len, then the rest of the data is in skb fragments: a list of skbs attached to this skb at skb_shinfo(skb)->frag_list (see skb_walk_frags() in the kernel).
Note that there may be no data in the linear part and/or no data in the paged fragments. You just need to process the data piece by piece in the order just described.
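A minimal sketch of that traversal (not from the original answer; dump_bytes() is a hypothetical stand-in for whatever you do with each chunk, and it assumes the fragment pages can be addressed with skb_frag_address() on your kernel version; skb_copy_bits() is a simpler alternative that copies an arbitrary byte range out of any skb for you):
static void dump_skb_data(const struct sk_buff *skb)
{
    const struct sk_buff *frag_iter;
    int i;

    /* 1. Linear part: skb->data, length skb_headlen(skb) */
    dump_bytes(skb->data, skb_headlen(skb)); /* dump_bytes() is hypothetical */

    /* 2. Paged fragments */
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        dump_bytes(skb_frag_address(frag), skb_frag_size(frag));
    }

    /* 3. The frag_list: each entry may itself be nonlinear, so recurse */
    skb_walk_frags(skb, frag_iter)
        dump_skb_data(frag_iter);
}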

Related

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffer should not be reserved at boot time; therefore, it may not be physically contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate a buffer consisting of 2MB hugepages in user space (reducing the maximum number of segments by a factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare an intermediate list of page structures, 262144 pages long per 1GB of buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but it would still be good to avoid it.
In fact I don't need to keep the pages mapped for the kernel, as the hugepages are protected against being swapped out, and they are mapped for the user-space application that will process the received data.
So what I'm looking for is a function sg_alloc_table_from_user_hugepages that could take such a user-space address of a hugepages-backed memory buffer and turn it directly into the right scatterlist, without performing an unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?
At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
                struct sg_table *sgt, struct page ***a_pages,
                int *a_n_pages)
{
    int res = 0;
    int n_pages;
    struct page **pages = NULL;
    const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE - 1);

    //Calculate the number of pages
    n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
    printk(KERN_ALERT "n_pages: %d", n_pages);
    //Allocate the table for the pages
    pages = vzalloc(sizeof(*pages) * n_pages);
    printk(KERN_ALERT "pages: %p", pages);
    if (pages == NULL) {
        res = -ENOMEM;
        goto sglm_err1;
    }
    //Now pin the pages
    res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
    printk(KERN_ALERT "gupf: %d", res);
    if (res < n_pages) {
        int i;

        for (i = 0; i < res; i++)
            put_page(pages[i]);
        res = -ENOMEM;
        goto sglm_err1;
    }
    //Now create the sg-list
    res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
    printk(KERN_ALERT "satf: %d", res);
    if (res < 0)
        goto sglm_err2;
    *a_pages = pages;
    *a_n_pages = n_pages;
    return res;
sglm_err2:
    //Here we jump if we know that the pages are pinned
    {
        int i;

        for (i = 0; i < n_pages; i++)
            put_page(pages[i]);
    }
sglm_err1:
    if (sgt)
        sg_free_table(sgt);
    if (pages)
        vfree(pages); //allocated with vzalloc, so vfree, not kfree
    *a_pages = NULL;
    *a_n_pages = 0;
    return res;
}

void sgt_destroy(struct sg_table *sgt, struct page **pages, int n_pages)
{
    int i;

    //Free the sg list
    if (sgt->sgl)
        sg_free_table(sgt);
    //Unpin the pages
    for (i = 0; i < n_pages; i++) {
        set_page_dirty(pages[i]);
        put_page(pages[i]);
    }
}
void sgt_destroy(struct sg_table * sgt, struct page ** pages, int n_pages)
{
int i;
//Free the sg list
if(sgt->sgl)
sg_free_table(sgt);
//Unpin pages
for(i=0; i < n_pages; i++) {
set_page_dirty(pages[i]);
put_page(pages[i]);
}
}
The sgt_prepare function builds the sg_table structure sgt that I can use to create the DMA mapping. I have verified that it contains a number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of the pages is created (allocated and returned via the a_pages pointer argument), and kept as long as the buffer is used.
Therefore, I really dislike that solution. Right now I have 256 2MB hugepages used as a DMA buffer, which means I have to create and keep 128*1024 unnecessary page structures. I also waste 512 MB of kernel address space for an unnecessary kernel mapping.
The interesting question is whether a_pages may be kept only temporarily (until the sg-list is created)? In theory it should be possible, as the pages are still locked...
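A sketch of that idea, assuming the array is otherwise needed only for the put_page() calls at teardown: sg_alloc_table_from_pages() copies the page pointers into the scatterlist (coalescing physically contiguous runs into single entries), so the temporary array can be freed right away and the pins dropped later by walking the sg entries. Untested:
res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
if (res == 0)
    vfree(pages); //the scatterlist now holds the page pointers

//... later, at teardown, unpin by walking the (possibly coalesced) entries:
struct scatterlist *sg;
int i;

for_each_sg(sgt->sgl, sg, sgt->orig_nents, i) {
    struct page *page = sg_page(sg);
    unsigned int n = PAGE_ALIGN(sg->offset + sg->length) >> PAGE_SHIFT;
    unsigned int j;

    for (j = 0; j < n; j++) {
        set_page_dirty(nth_page(page, j));
        put_page(nth_page(page, j));
    }
}
sg_free_table(sgt);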

Linux: How to mmap a sequence of physically contiguous areas into user space?

In my driver I have a certain number of physically contiguous DMA buffers (e.g. 4MB long each) to receive data from a device. They are handled by hardware using an SG list. As the received data will be subjected to intensive processing, I don't want to switch off caching, and I will use dma_sync_single_for_cpu after each buffer is filled by DMA.
To simplify data processing, I want those buffers to appear as a single huge, contiguous, circular buffer in the user space.
In case of a single buffer I simply use remap_pfn_range or dma_mmap_coherent. However, I can't use those functions multiple times to map consecutive buffers.
Of course, I can implement the fault operation in the vm_operations so that it finds the pfn of the corresponding page in the right buffer, and inserts it into the vma with vm_insert_pfn.
The acquisition will be really fast, so I can't handle the mapping when the real data arrive. But this can be solved easily: to have the whole mapping ready before the acquisition starts, I can simply read the entire mmapped buffer in my application beforehand, so that all pages are already inserted when the first data arrive.
The fault-based trick should work, but maybe there is something more elegant? Just a single function that may be called multiple times to build the whole mapping incrementally?
Additional difficulty is that the solution should be applicable (with minimal adjustments) to kernels starting from 2.6.32 to the newest one.
PS. I have seen that annoying post. Is there a danger that, if the application attempts to write something to the mmapped buffer (just doing in-place processing of the data), my carefully built mapping will be destroyed by COW?
Below is my solution, which works for buffers allocated with dmam_alloc_noncoherent.
Allocation of the buffers:
[...]
for (i = 0; i < DMA_NOFBUFS; i++) {
    my_dev->buf_addr[i] = dmam_alloc_noncoherent(&my_dev->dev, DMA_BUFLEN,
                                                 &my_dev->buf_dma_t[i], GFP_USER);
    if (my_dev->buf_addr[i] == NULL) {
        res = -ENOMEM;
        goto err1;
    }
    //Make the buffer ready for filling by the device
    dma_sync_single_range_for_device(&my_dev->dev, my_dev->buf_dma_t[i],
                                     0, DMA_BUFLEN, DMA_FROM_DEVICE);
}
[...]
Mapping of the buffers:
void swz_mmap_open(struct vm_area_struct *vma)
{
}

void swz_mmap_close(struct vm_area_struct *vma)
{
}

static int swz_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    long offset;
    char *buffer = NULL;
    int buf_num = 0;

    //Calculate the offset (according to the info in
    //https://lxr.missinglinkelectronics.com/linux+v2.6.32/drivers/gpu/drm/i915/i915_gem.c#L1195
    //it is better not to use vmf->pgoff)
    offset = (unsigned long)(vmf->virtual_address - vma->vm_start);
    buf_num = offset / DMA_BUFLEN;
    if (buf_num >= DMA_NOFBUFS) { //>=, not >: buf_num == DMA_NOFBUFS is already out of range
        printk(KERN_ERR "Access outside the buffer\n");
        return VM_FAULT_SIGBUS; //fault handlers return VM_FAULT_* codes, not -EFAULT
    }
    offset = offset - buf_num * DMA_BUFLEN;
    buffer = my_dev->buf_addr[buf_num];
    vm_insert_pfn(vma, (unsigned long)(vmf->virtual_address),
                  virt_to_phys(&buffer[offset]) >> PAGE_SHIFT);
    return VM_FAULT_NOPAGE;
}

struct vm_operations_struct swz_mmap_vm_ops =
{
    .open = swz_mmap_open,
    .close = swz_mmap_close,
    .fault = swz_mmap_fault,
};

static int char_sgdma_wz_mmap(struct file *file, struct vm_area_struct *vma)
{
    vma->vm_ops = &swz_mmap_vm_ops;
    vma->vm_flags |= VM_IO | VM_RESERVED | VM_CAN_NONLINEAR | VM_PFNMAP;
    swz_mmap_open(vma);
    return 0;
}
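For newer kernels the same handler needs only minor adjustments; here is a sketch under the post-4.11 API, where .fault takes just the vm_fault, vmf->virtual_address became vmf->address, and vmf_insert_pfn() replaces vm_insert_pfn() and returns a vm_fault_t directly:
static vm_fault_t swz_mmap_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    unsigned long offset = vmf->address - vma->vm_start;
    int buf_num = offset / DMA_BUFLEN;
    char *buffer;

    if (buf_num >= DMA_NOFBUFS)
        return VM_FAULT_SIGBUS;

    offset -= buf_num * DMA_BUFLEN;
    buffer = my_dev->buf_addr[buf_num];
    //VM_FAULT_NOPAGE is returned on success
    return vmf_insert_pfn(vma, vmf->address,
                          virt_to_phys(&buffer[offset]) >> PAGE_SHIFT);
}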

SPI Linux driver

I am trying to learn how to write a basic SPI driver, and below is the probe function that I wrote.
What I am trying to do here is set up the SPI device for the FRAM (datasheet) and use the spi_sync_transfer() API (description) to get the manufacturer's ID from the chip.
When I execute this code, I can see the data on the SPI bus using a logic analyzer, but I am unable to read it from the rx buffer. Am I missing something here? Could someone please help me with this?
static int fram_probe(struct spi_device *spi)
{
    int err;
    unsigned char ch16[] = {0x9F, 0x00, 0x00, 0x00}; // 0x9F => 10011111
    unsigned char rx16[] = {0x00, 0x00, 0x00, 0x00};

    printk("[FRAM DRIVER] fram_probe called\n");
    spi->max_speed_hz = 1000000;
    spi->bits_per_word = 8;
    spi->mode = (3);
    err = spi_setup(spi);
    if (err < 0) {
        printk("[FRAM DRIVER]::fram_probe spi_setup failed!\n");
        return err;
    }
    printk("[FRAM DRIVER] spi_setup ok, cs: %d\n", spi->chip_select);
    spi_element[0].tx_buf = ch16;
    spi_element[1].rx_buf = rx16;
    err = spi_sync_transfer(spi, spi_element, ARRAY_SIZE(spi_element)/2);
    printk("rx16=%x %x %x %x\n", rx16[0], rx16[1], rx16[2], rx16[3]);
    if (err < 0) {
        printk("[FRAM DRIVER]::fram_probe spi_sync_transfer failed!\n");
        return err;
    }
    return 0;
}
spi_element is not declared in this example. You should show that, and also how all the elements of that array are filled. But just from the code that's there, I can see a couple of mistakes.
You need to set the len parameter of spi_transfer. You've assigned the TX or RX buffer to ch16 or rx16 but not set the length of the buffer in either case.
You should zero out all the fields not used in the spi_transfer.
If you set the length to four, you would not be sending the proper command according to the datasheet. RDID expects a one-byte command, after which four bytes of output data will follow. You are writing a four-byte command in your first transfer and then reading four bytes of data. The tx_buf in the first transfer should be just one byte.
And finally, the number of transfers specified as the last argument to spi_sync_transfer() is incorrect. It should be 2 in this case, because you have defined two transfers, spi_element[0] and spi_element[1]. You could use ARRAY_SIZE() if spi_element were declared as an array and you wanted to send all the transfers in it.
Consider this as a better way to fill in the spi_transfers. It takes care of zeroing out the fields that are not used, defines the transfers in an easy-to-read way, and changing the buffer sizes or the number of transfers is automatically accounted for in the remaining code.
const char ch16[] = { 0x9f }; //RDID opcode
char rx16[4];
struct spi_transfer rdid[] = {
    { .tx_buf = ch16, .len = sizeof(ch16) },
    { .rx_buf = rx16, .len = sizeof(rx16) },
};

spi_sync_transfer(spi, rdid, ARRAY_SIZE(rdid));
Since you have a scope, be sure to check that this operation happens under a single chip select pulse. I have found more than one Linux SPI driver to have a bug that pulses chip select when it should not. In some cases switching from TX to RX (like done above) will trigger a CS pulse. In other cases a CS pulse is generated for every word (8 bits here) of data.
Another thing you should change is to use dev_info(&spi->dev, "device version %d", id) and also dev_err() to print messages. This inserts the device name in a standard way, instead of your hard-coded, non-standard, and inconsistent "[FRAM DRIVER]::" text, and it sets the level of the message appropriately.
Also, consider supporting the device tree in your driver to read device properties. Then you can do things like change the SPI bus frequency for this device without rebuilding the kernel driver.
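For example, a minimal sketch of a device-tree match table (the compatible string "cypress,fm25" is only illustrative; check the bindings for your actual part):
static const struct of_device_id fram_of_match[] = {
    { .compatible = "cypress,fm25" }, //illustrative; use your part's binding
    { }
};
MODULE_DEVICE_TABLE(of, fram_of_match);

static struct spi_driver fram_driver = {
    .driver = {
        .name = "fram",
        .of_match_table = fram_of_match,
    },
    .probe = fram_probe,
};
module_spi_driver(fram_driver);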

ping pong DMA in kernel

I am using an SoC to DMA data from the programmable logic side (PL) to the processor side (PS). My overall scheme is to allocate two separate DMA buffers and ping-pong data between these buffers so that the user-space application can access the data without collisions. I have successfully done this using single DMA transactions in a loop (in a kthread), with each transaction on either the first or the second buffer, and I was able to notify a poll() method of which buffer was ready.
Now I would like to investigate scatter-gather and/or cyclic DMA.
struct scatterlist sg[2];
struct dma_async_tx_descriptor *chan_desc;
struct dma_slave_config config;

sg_init_table(sg, ARRAY_SIZE(sg));
addr_0 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_0, GFP_KERNEL);
addr_1 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_1, GFP_KERNEL);
sg_dma_address(&sg[0]) = handle_0;
sg_dma_len(&sg[0]) = dma_size;
sg_dma_address(&sg[1]) = handle_1;
sg_dma_len(&sg[1]) = dma_size;

memset(&config, 0, sizeof(config));
config.direction = DMA_DEV_TO_MEM;
config.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
config.src_maxburst = DMA_MAX_BURST_SIZE / DMA_SLAVE_BUSWIDTH_4_BYTES;
dmaengine_slave_config(dma_dev->chan, &config);

chan_desc = dmaengine_prep_slave_sg(dma_dev->chan, &sg[0], 2, DMA_DEV_TO_MEM,
                                    DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
chan_desc->callback = sync_callback;
chan_desc->callback_param = dma_dev->cmp;
init_completion(dma_dev->cmp);
dmaengine_submit(chan_desc);
dma_async_issue_pending(dma_dev->chan);

dma_free_coherent(NULL, dma_size, addr_0, handle_0);
dma_free_coherent(NULL, dma_size, addr_1, handle_1);
This works fine for a single run through the scatterlist, after which the callback at sync_callback is invoked. My thought was to chain the scatterlist in a loop, but then I don't get the callback.
Is there a way to have a callback for each descriptor in the scatterlist? I wondered if I could use dmaengine_prep_dma_cyclic (which invokes the callback after every period), but from reviewing dmaengine.h it looks like it is meant for a single buffer. Looking at the DMA Engine API Guide, there is another option, dmaengine_prep_interleaved_dma, using a dma_interleaved_template, which sounds interesting but is hard to find information about.
In the end I just want to signal user space in some manner as to which buffer is ready.
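One possibility, sketched under the assumption that the two buffers can be carved out of a single coherent allocation: dmaengine_prep_dma_cyclic() fires its callback once per period, so two periods over one double-sized buffer give exactly the per-buffer notification described above (whether this works depends on your DMA controller's driver):
//Sketch: one coherent allocation treated as two ping-pong halves
addr = dma_alloc_coherent(dev, 2 * dma_size, &handle, GFP_KERNEL);

chan_desc = dmaengine_prep_dma_cyclic(dma_dev->chan, handle,
                                      2 * dma_size, //total buffer length
                                      dma_size,     //period: one callback per half
                                      DMA_DEV_TO_MEM,
                                      DMA_PREP_INTERRUPT);
chan_desc->callback = sync_callback; //called after each half is filled
chan_desc->callback_param = dma_dev;
dmaengine_submit(chan_desc);
dma_async_issue_pending(dma_dev->chan); //runs until dmaengine_terminate_*()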

Transferring an Image using TCP Sockets in Linux

I am trying to transfer an image using TCP sockets on Linux. I have used this code many times to transfer small amounts of data, but as soon as I tried to transfer the image, it only transferred the first third. Is it possible that there is a maximum buffer size for TCP sockets in Linux? If so, how can I increase it? Is there a function that does this programmatically?
I would guess that the problem is on the receiving side, when you read from the socket. TCP is a stream-based protocol with no notion of packets or message boundaries.
This means that when you do a read, you may get fewer bytes than you requested. If your image is 128k, for example, you may get only 24k on your first read, requiring you to read again to get the rest of the data. The fact that it's an image is irrelevant; data is data.
For example:
int read_image(int sock, int size, unsigned char *buf)
{
    int bytes_read = 0, len = 0;

    while (bytes_read < size &&
           (len = recv(sock, buf + bytes_read, size - bytes_read, 0)) > 0) {
        bytes_read += len;
    }
    if (len <= 0)
        doerror();
    return bytes_read;
}
TCP sends the data in pieces, so you're not guaranteed to get it all at once with a single read (although it is guaranteed to arrive in the order you sent it). You basically have to read multiple times until you get all the data. The receiver also doesn't know how much data you sent. Normally, you send a fixed-size "length" field first (always 8 bytes, for example) so the receiver knows how much data there is. Then you keep reading and building up a buffer until you have that many bytes.
So the sender would look something like this (pseudocode):
int imageLength;
char *imageData;
// set imageLength and imageData
send(&imageLength, sizeof(int));
send(imageData, imageLength);
And the receiver would look like this (pseudocode):
int imageLength;
char *imageData;
guaranteed_read(&imageLength, sizeof(int));
imageData = new char[imageLength];
guaranteed_read(imageData, imageLength);
void guaranteed_read(char *destBuf, int length)
{
    int totalRead = 0, numRead;

    while (totalRead < length)
    {
        int remaining = length - totalRead;

        numRead = read(&destBuf[totalRead], remaining);
        if (numRead > 0)
        {
            totalRead += numRead;
        }
        else
        {
            // error reading from socket
        }
    }
}
Obviously I left out the actual socket descriptors, and you would need to add a lot of error checking to all of that. It wasn't meant to be complete, more to show the idea.
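In real C, with the byte-order issue handled (a sketch; send_all() is a helper written here to mirror guaranteed_read(), since send() can also return short counts):
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

static int send_all(int sock, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n <= 0)
            return -1; //error or connection closed
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int send_image(int sock, const char *data, uint32_t len)
{
    uint32_t net_len = htonl(len); //fixed-size, endian-safe length prefix

    if (send_all(sock, &net_len, sizeof(net_len)) < 0)
        return -1;
    return send_all(sock, data, len);
}
The receiver reads the 4-byte prefix first, converts it back with ntohl(), and then loops exactly as guaranteed_read() does above.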
The maximum size of a single IP packet is 65535 bytes, which is extremely close to the number you are hitting. I doubt that is a coincidence.
