Is there a known (or maybe unknown) bug regarding the size of packets in the AF-XDP socket framework (+ libbpf)?
I am experiencing a strange packet loss for my application:
IPv4/UDP/RTP packet stream with all packets being the same size (1442 bytes): no packet loss
IPv4/UDP/RTP packet stream where pretty much all packets are the same size (1492 bytes), except for special "marker" packets (only 357 bytes, but also regular IPv4/UDP packets): all marker packets get lost
I added a bpf_printk statement to my XDP kernel program:
const int len = bpf_ntohs(iph->tot_len);
if (len < 400) {
    bpf_printk("FOUND PACKET LEN < 400: %d.\n", len);
}
This output is never observed via sudo cat /sys/kernel/debug/tracing/trace_pipe. So these small RTP marker packets never even reach my XDP kernel program - no wonder I don't receive them in userspace.
ethtool -S <if> shows a counter named rx_256_to_511_bytes_phy, and it increases at roughly the rate the marker packets should arrive (about 30/s). So the NIC does receive the packets, but my XDP program doesn't - why?
Any idea what could be the cause of this problem?
First, bpf_printk() doesn't always work for me. You may want to take a look at this snippet (kernel-space code):
// A nicer way to call bpf_trace_printk()
#define bpf_custom_printk(fmt, ...)                    \
({                                                     \
    char ____fmt[] = fmt;                              \
    bpf_trace_printk(____fmt, sizeof(____fmt),         \
                     ##__VA_ARGS__);                   \
})
// print:
bpf_custom_printk("This year is %d\n", 2020);
// output: sudo cat /sys/kernel/debug/tracing/trace_pipe
Second: maybe the packets arrived on a different NIC queue. You may want to take the vanilla code from xdp-tutorial, add the kernel tracing from the snippet above to print the packet size, then compile and run the example program with, for example, -q 1 for queue number 1.
A way to get the size of a packet (note that the subtraction is data_end - data):
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
int size_pkt = data_end - data;
bpf_custom_printk("Packet size %d\n", size_pkt);
Related
I'm writing a netfilter module that deeply inspects packets. However, during tests I found that the netfilter module is not receiving the full packet.
To verify this, I wrote the following code to dump packets destined for port 80 and write the result to the dmesg buffer:
const struct iphdr *ip_header = ip_hdr(skb);
if (ip_header->protocol == IPPROTO_TCP)
{
    const struct tcphdr *tcp_header = tcp_hdr(skb);
    char *buff;
    if (ntohs(tcp_header->dest) != 80)
    {
        return NF_ACCEPT;
    }
    /* 2 hex chars per byte plus terminator; *10 is more than enough */
    buff = kzalloc(skb->len * 10, GFP_KERNEL);
    if (buff != NULL)
    {
        int pos = 0, i = 0;
        for (i = 0; i < skb->len; i++)
        {
            pos += sprintf(buff + pos, "%02X", skb->data[i] & 0xFF);
        }
        pr_info("(%pI4):%d --> (%pI4):%d, len=%d, data=%s\n",
                &ip_header->saddr,
                ntohs(tcp_header->source),
                &ip_header->daddr,
                ntohs(tcp_header->dest),
                skb->len,
                buff);
        kfree(buff);
    }
}
In a virtual machine running locally, I can retrieve the full HTTP request; on Alibaba Cloud and some other OpenStack-based VPS providers, the packet is cut in the middle.
To verify this, I executed curl http://VPS_IP on another VPS and got the following output in the dmesg buffer:
[ 1163.370483] (XXXX):5007 --> (XXXX):80, len=237, data=451600ED000040003106E3983D87A950AC11D273138F00505A468086B44CE19E80180804269300000101080A1D07500A000D2D90474554202F20485454502F312E310D0A486F73743A2033392E3130372E32342E37370D0A4163636570743A202A2F2A0D0A557365722D4167656E743A204D012000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000001E798090F5FFFF8C0000007B00000000E0678090F5FFFF823000003E00000040AE798090F5FFFF8C0000003E000000000000000000000000000000000000000000000000000000000000
When decoded, it's totally weird: everything after User-Agent: M is "gone" or zeroed out. Although skb->len is 237, half of the packet is missing.
Any ideas? Tried both PRE_ROUTING and LOCAL_IN, no changes.
It appears that sometimes you are getting a linear skb, and sometimes your skb is not linear. In the latter case you are not reading the full data contents of an skb.
If skb->data_len is zero, then your skb is linear and the full data contents of the skb are in skb->data. If skb->data_len is not zero, then your skb is not linear, and skb->data contains just the first (linear) part of the data. The length of this area is skb->len - skb->data_len. The skb_headlen() helper function calculates that for convenience, and the skb_is_nonlinear() helper function tells you whether an skb is linear or not.
The rest of the data can be in paged fragments and in skb fragments, in that order.
skb_shinfo(skb)->nr_frags tells you the number of paged fragments. Each paged fragment is described by an entry in the array skb_shinfo(skb)->frags[0 .. skb_shinfo(skb)->nr_frags - 1]. The skb_frag_size() and skb_frag_address() helper functions help deal with this data; they take the address of the structure that describes a paged fragment. There are other useful helper functions, depending on your kernel version.
If the total size of the data in paged fragments is less than skb->data_len, then the rest of the data is in skb fragments: a list of skbs attached to this skb at skb_shinfo(skb)->frag_list (see skb_walk_frags() in the kernel).
Please note that there may be no data in the linear part and/or no data in the paged fragments. You just need to process the data piece by piece in the order just described.
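As a concrete illustration, here is a minimal, untested sketch (not from the original answer) that walks all three areas in the order just described; handle_bytes() is a hypothetical callback you would supply, and the paged fragments are assumed to be addressable via skb_frag_address():

#include <linux/skbuff.h>

/* Walk every data area of a possibly non-linear skb, in order. */
static void walk_skb_data(struct sk_buff *skb,
                          void (*handle_bytes)(const u8 *data, unsigned int len))
{
    struct sk_buff *frag_iter;
    int i;

    /* 1. Linear part: skb->data, skb_headlen(skb) bytes. */
    handle_bytes(skb->data, skb_headlen(skb));

    /* 2. Paged fragments. */
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        handle_bytes(skb_frag_address(frag), skb_frag_size(frag));
    }

    /* 3. Fragment list: each entry is itself an skb, so recurse. */
    skb_walk_frags(skb, frag_iter)
        walk_skb_data(frag_iter, handle_bytes);
}

Alternatively, skb_copy_bits() copies an arbitrary byte range of an skb into a linear buffer regardless of how the skb is laid out, which is often the simplest option for inspection code like this.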
On Linux 2.6.32, I'm looking at /proc/net/tcp and wondering what the unit of tx_queue and rx_queue is.
I can't find this information about receive-queue and transmit-queue in https://www.kernel.org/doc/Documentation/networking/proc_net_tcp.txt
Nor in man 5 proc, which shows only:
The "tx_queue" and "rx_queue" are the outgoing and incoming data queue
in terms of kernel memory usage.
Is it bytes? A number of buffers? Or did I miss some good documentation about this?
Thanks
Short answer: these count bytes. By running netperf TCP_RR with different sizes you can see exactly what one count corresponds to (with only one packet in flight at a given time); the value always matches the packet size.
More info:
According to this post:
tx_queue:rx_queue
The size of the transmit and receive queues.
This is per socket. For TCP, the values are updated in the get_tcp4_sock() function. It is a bit different between 2.6.32 and 4.14, but the idea is the same. Depending on the socket state, the rx_queue value is set to either sk->sk_ack_backlog or tp->rcv_nxt - tp->copied_seq. The second value might be negative, and later kernels clamp it to 0 in that case. sk_ack_backlog counts unacked segments, which is a bit strange since that doesn't seem to be in bytes; I'm probably missing something here.
From tcp.h:
struct tcp_sock {
...
u32 rcv_nxt; /* What we want to receive next */
u32 copied_seq; /* Head of yet unread data */
Both count in bytes, so tp->rcv_nxt - tp->copied_seq is counting the pending bytes in the receive buffer for incoming packets.
tx_queue is set to tp->write_seq - tp->snd_una. Again from tcp.h:
struct tcp_sock {
...
u32 snd_una; /* First byte we want an ack for */
u32 write_seq; /* Tail(+1) of data held in tcp send buffer */
Here it is a bit clearer that the count is in bytes.
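As a quick sanity check of the TCP values, here is a minimal user-space sketch (not from the original answer) that reads /proc/net/tcp and prints the tx_queue:rx_queue fields as byte counts; it assumes the field layout quoted above (sl, local_address, rem_address, st, then tx_queue:rx_queue):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/tcp", "r");
    char line[512], local[64], remote[64];
    unsigned int sl, state;
    unsigned long tx, rx;

    if (!f)
        return 1;
    fgets(line, sizeof(line), f);                  /* skip the header line */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%u: %63s %63s %x %lx:%lx",
                   &sl, local, remote, &state, &tx, &rx) == 6)
            printf("%s -> %s  tx_queue=%lu bytes  rx_queue=%lu bytes\n",
                   local, remote, tx, rx);
    }
    fclose(f);
    return 0;
}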
For UDP, it is simpler. The values are updated in udp4_format_sock():
static void udp4_format_sock(struct sock *sp, struct seq_file *f,
                             int bucket)
{
    ...
    seq_printf(f, "%5d: %08X:%04X %08X:%04X"
               " %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d",
               bucket, src, srcp, dest, destp, sp->sk_state,
               sk_wmem_alloc_get(sp),
               sk_rmem_alloc_get(sp),
               ...
sk_wmem_alloc_get() and sk_rmem_alloc_get() return sk_wmem_alloc and sk_rmem_alloc respectively; both are in bytes.
Hope this helps.
From this example
void process_packet(u_char *args, const struct pcap_pkthdr *header, const u_char *buffer)
{
    int size = header->len;
    //Get the IP header part of this packet, excluding the ethernet header
    struct iphdr *iph = (struct iphdr*)(buffer + sizeof(struct ethhdr));
    ++total;
    switch (iph->protocol) //Check the protocol and act accordingly
    {
        case 1: //ICMP Protocol
            ++icmp;
            print_icmp_packet(buffer, size);
            break;
        case 2: //IGMP Protocol
            ++igmp;
            break;
        case 6: //TCP Protocol
            ++tcp;
            print_tcp_packet(buffer, size);
            break;
        case 17: //UDP Protocol
            ++udp;
            print_udp_packet(buffer, size);
            break;
        default: //Some other IP protocol
            ++others;
            break;
    }
    printf("TCP : %d UDP : %d ICMP : %d IGMP : %d Others : %d Total : %d\r", tcp, udp, icmp, igmp, others, total);
}
The variable size is, I guess, the size of the header. How do I get the size of the whole packet?
Also, how do I convert uint32_t IP addresses to human readable IP addresses of the form xxx.xxx.xxx.xxx?
The variable size is, I guess, the size of the header.
You have guessed incorrectly.
To quote the pcap man page:
Packets are read with pcap_dispatch() or pcap_loop(), which
process one or more packets, calling a callback routine for each
packet, or with pcap_next() or pcap_next_ex(), which return the
next packet. The callback for pcap_dispatch() and pcap_loop() is
supplied a pointer to a struct pcap_pkthdr, which includes the
following members:
ts a struct timeval containing the time when the packet
was captured
caplen a bpf_u_int32 giving the number of bytes of the
packet that are available from the capture
len a bpf_u_int32 giving the length of the packet, in
bytes (which might be more than the number of bytes
available from the capture, if the length of the
packet is larger than the maximum number of bytes to
capture).
so "len" is the total length of the packet. However, there may not be "len" bytes of data available; if the capture was done with a "snapshot length", for example with tcpdump, dumpcap, or TShark using the -s option, the packet could have been cut short, and "caplen" would indicate how many bytes of data you actually have.
Note, however, that Ethernet packets have a minimum length of 60 bytes (not counting the 4-byte FCS at the end, which you probably won't get in your capture), including the 14-byte Ethernet header; this means that short packets must be padded. 60-14 = 46, so if a host sends, over Ethernet, an IP packet that's less than 46 bytes long, it must pad the Ethernet packet.
This means that the "len" field gives the total length of the Ethernet packet, but if you subtract the 14 bytes of Ethernet header from "len", you won't necessarily get the length of the IP packet. To get that, you'll need to look in the IP header at the "total length" field. (Don't assume it'll be less than or equal to the value of "len" - 14 - a machine might have sent an invalid IP packet.)
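As an illustration of that last point, here is a small sketch (not from the original answer) that checks caplen before touching the IP header and then takes the length from the IP header's "total length" field rather than from len; it assumes an Ethernet link layer, as in the question's code:

#include <pcap.h>
#include <linux/if_ether.h>   /* struct ethhdr */
#include <netinet/ip.h>       /* struct iphdr */
#include <arpa/inet.h>        /* ntohs() */

/* Return the IP packet length (Ethernet padding excluded), or -1 if the
 * capture does not even contain a complete IP header. */
static int ip_packet_length(const struct pcap_pkthdr *header,
                            const u_char *buffer)
{
    const struct iphdr *iph;

    if (header->caplen < sizeof(struct ethhdr) + sizeof(struct iphdr))
        return -1;

    iph = (const struct iphdr *)(buffer + sizeof(struct ethhdr));
    return ntohs(iph->tot_len);
}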
Also, how do I convert uint32_t IP addresses to human readable IP addresses of the form xxx.xxx.xxx.xxx?
By calling routines such as inet_ntoa(), inet_ntoa_r(), or inet_ntop().
No, header->len is the length of this packet, which is just what you want.
See the header file pcap.h:
struct pcap_pkthdr {
struct timeval ts; /* time stamp */
bpf_u_int32 caplen; /* length of portion present */
bpf_u_int32 len; /* length this packet (off wire) */
};
You can use sprintf() on the individual bytes (or inet_ntop(), as in the other answer) to convert the uint32_t IP field to xxx.xxx.xxx.xxx.
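For the address conversion, here is a short sketch (not from the original answers) using inet_ntop(); iph is assumed to point at a valid IPv4 header in the captured buffer:

#include <arpa/inet.h>
#include <netinet/ip.h>
#include <stdio.h>

static void print_addresses(const struct iphdr *iph)
{
    char src[INET_ADDRSTRLEN], dst[INET_ADDRSTRLEN];

    /* saddr/daddr are already in network byte order, which is exactly
     * what inet_ntop() expects. */
    inet_ntop(AF_INET, &iph->saddr, src, sizeof(src));
    inet_ntop(AF_INET, &iph->daddr, dst, sizeof(dst));
    printf("%s -> %s\n", src, dst);
}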
I have an application which runs on Linux (2.6.38.8), using libpcap (>1.0) to capture packets streamed at it over Ethernet. My application uses close to 100% CPU and I am unsure whether I am using libpcap as efficiently as possible.
I am battling to find any correlation between the pcap tunables and performance.
Here is my simplified code (error checking etc. omitted):
// init libpcap
pcap_t *p = pcap_create("eth0", my_errbuf);
pcap_set_snaplen(p, 65535);
pcap_set_promisc(p, 0);
pcap_set_timeout(p, 1000);
pcap_set_buffer_size(p, 16<<20); // 16MB
pcap_activate(p);
// filter
struct bpf_program filter;
pcap_compile(p, &filter, "ether dst 00:11:22:33:44:55", 0, 0);
pcap_setfilter(p, &filter);
// do work
while (1) {
    int ret = pcap_dispatch(p, -1, my_callback, (unsigned char *) my_args);
    if (ret <= 0) {
        if (ret == -1) {
            printf("pcap_dispatch error: %s\n", pcap_geterr(p));
        } else if (ret == -2) {
            printf("pcap_dispatch broken loop\n");
        } else if (ret == 0) {
            printf("pcap_dispatch zero packets read\n");
        } else {
            printf("pcap_dispatch returned unexpectedly\n");
        }
    } else if (ret > 1) {
        printf("processed %d packets\n", ret);
    }
}
The result when using a timeout of 1000 milliseconds and a buffer size of 2 MB, 4 MB, or 16 MB is the same at high data rates (~200 1 kB packets/sec): pcap_dispatch consistently returns 2. According to the pcap_dispatch man page, I would expect pcap_dispatch to return either when the buffer is full or when the timeout expires. But with a return value of 2, neither of these conditions should be met, as only 2 kB of data has been read and only 2/200 of a second has passed.
If I slow the data rate down (~100 1 kB packets/sec), pcap_dispatch returns between 2 and 7, so halving the data rate affects how many packets are processed per pcap_dispatch call. (I think the more packets the better, as this means less context switching between the OS and userspace - is this true?)
The timeout value does not seem to make a difference either.
In all cases, my CPU usage is close to 100%.
I am starting to wonder if I should be trying the PF_RING version of libpcap, but from what I've read on SO and libpcap mailing lists, libpcap > 1.0 does the zero copy stuff anyway, so maybe no point.
Any ideas, pointers greatly appreciated!
G
I am trying to transfer an image over TCP sockets on Linux. I have used the code many times to transfer small amounts of data, but as soon as I tried to transfer the image it only transferred the first third. Is it possible that there is a maximum buffer size for TCP sockets in Linux? If so, how can I increase it? Is there a function that does this programmatically?
I would guess that the problem is on the receiving side when you read from the socket. TCP is a stream-based protocol with no notion of packets or message boundaries.
This means that when you do a read you may get fewer bytes than you requested. If your image is 128 kB, for example, you may only get 24 kB on your first read, requiring you to read again to get the rest of the data. The fact that it's an image is irrelevant; data is data.
For example:
int read_image(int sock, int size, unsigned char *buf) {
    int bytes_read = 0, len = 0;
    while (bytes_read < size &&
           (len = recv(sock, buf + bytes_read, size - bytes_read, 0)) > 0) {
        bytes_read += len;
    }
    if (len <= 0) doerror();  /* recv failed or the connection closed early */
    return bytes_read;
}
TCP sends the data in pieces, so you're not guaranteed to get it all at once with a single read (although it is guaranteed to arrive in the order you sent it). You basically have to read multiple times until you get all the data. The receiver also doesn't know how much data you sent. Normally, you send a fixed-size "length" field first (always 8 bytes, for example) so you know how much data there is. Then you keep reading and building up a buffer until you have that many bytes.
So the sender would look something like this (pseudocode)
int imageLength;
char *imageData;
// set imageLength and imageData
send(&imageLength, sizeof(int));
send(imageData, imageLength);
And the receiver would look like this (pseudocode)
int imageLength;
char *imageData;
guaranteed_read(&imageLength, sizeof(int));
imageData = new char[imageLength];
guaranteed_read(imageData, imageLength);
void guaranteed_read(char* destBuf, int length)
{
    int totalRead = 0, numRead;
    while (totalRead < length)
    {
        int remaining = length - totalRead;
        numRead = read(&destBuf[totalRead], remaining);
        if (numRead > 0)
        {
            totalRead += numRead;
        }
        else
        {
            // error reading from socket
        }
    }
}
Obviously I left off the actual socket descriptor and you need to add a lot of error checking to all of that. It wasn't meant to be complete, more to show the idea.
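For completeness, here is a short C sketch (not part of the original answer) of the matching write loop on the sender side, since send() can also accept fewer bytes than requested; sock is assumed to be a connected TCP socket:

#include <sys/types.h>
#include <sys/socket.h>

static int guaranteed_write(int sock, const char *buf, size_t length)
{
    size_t total_sent = 0;

    while (total_sent < length) {
        ssize_t n = send(sock, buf + total_sent, length - total_sent, 0);

        if (n <= 0)
            return -1;      /* error (or the peer closed the connection) */
        total_sent += n;
    }
    return 0;
}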
The maximum size of a single IP packet is 65535 bytes, which is extremely close to the number you are hitting. I doubt that is a coincidence.