how to find which packets got dropped - linux

I'm getting thousands of dropped packets from a Broadcom network card:
eth1 Link encap:Ethernet HWaddr 01:27:B0:14:DA:FE
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2746252626 errors:0 dropped:1151734 overruns:0 frame:0
TX packets:4109502155 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:427998700000 (408171.3 Mb) TX bytes:3530782240047 (3367216.3 Mb)
Interrupt:40 Memory:d8000000-d8012700
Here is the installed version:
filename: /lib/modules/2.6.27.54-0.2-default/kernel/drivers/net/bnx2.ko
version: 1.8.0
license: GPL
description: Broadcom NetXtreme II BCM5706/5708/5709 Driver
The packets get dropped in bursts ranging from 500 to 5000 packets several times an hour. The server (running Postgres) is running fine - just the drops are annoying.
After trying lots of different things, I'm asking: How may I find out where the packets came from and why were they dropped?

A dropped packet means that the buffer that is used to store the packet for forwarding/processing is full. The act of looking into the packet's data for information implies that you have the data to look at in the first place (which you don't, because there was no room to store it).
A nice way around this, so you can see what data is being dropped, is to look through a dump of your traffic for the duplicate ACKs leaving your server and the TCP retransmissions that follow them. When a TCP segment goes missing, for whatever reason, your server keeps acknowledging the last in-order data it received, which prompts the sender to re-send. The retransmit will give you the conversation context that you're looking for.
I'd actually suggest taking a look at the switch/router that your server is connected to. It will be able to give you a nice idea of the loss and throughput over the interface to your server, letting you diagnose, for example, if your card is too slow for the wire.
EDIT
This blog post cites a tool called dropwatch, which may give you some clues as well.
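Since the drops arrive in bursts, it can help to record when each burst happens so it can be lined up with a traffic dump or the switch counters. The following is a minimal sketch, assuming the usual /proc/net/dev layout (two header lines, then one line per interface with the RX drop count in the fourth receive column); the function names are made up for illustration.

```python
def parse_rx_dropped(proc_net_dev_text):
    """Return {interface: rx_dropped} parsed from the text of /proc/net/dev."""
    drops = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        # Receive columns are: bytes packets errs drop fifo frame ...
        drops[name.strip()] = int(fields[3])
    return drops


def drop_deltas(before, after):
    """Return interfaces whose RX drop counter grew between two samples."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }
```

Calling parse_rx_dropped(open("/proc/net/dev").read()) every few seconds and logging a timestamp whenever drop_deltas() is non-empty would show how the bursts are distributed over time.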

You may have run into the issue described at https://www.novell.com/support/kb/doc.php?id=7007165.
quote:
Beginning with kernel 2.6.37, it has been changed the meaning of dropped packet count. Before, dropped packets was most likely due to an error. Now, the rx_dropped counter shows statistics for dropped frames because of:
Softnet backlog full -- (Measured from /proc/net/softnet_stat)
Bad / Unintended VLAN tags
Unknown / Unregistered protocols
IPv6 frames when the server is not configured for IPv6
If any frames meet those conditions, they are dropped before the protocol stack and the rx_dropped counter is incremented.
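To check the first cause in that list, the softnet counters can be read directly. This is a rough sketch, assuming the common layout of /proc/net/softnet_stat: one line per CPU, hexadecimal columns, with the first three columns being processed, dropped and time_squeeze. The function name and dictionary keys are only illustrative.

```python
def parse_softnet_stat(text):
    """Parse the text of /proc/net/softnet_stat into one dict per line."""
    rows = []
    for row, line in enumerate(text.splitlines()):
        cols = [int(col, 16) for col in line.split()]  # all columns are hex
        if len(cols) < 3:
            continue  # skip blank or unexpected lines
        rows.append({
            "row": row,               # normally corresponds to a CPU
            "processed": cols[0],     # frames handled
            "dropped": cols[1],       # dropped because the backlog was full
            "time_squeeze": cols[2],  # polling ran out of budget or time
        })
    return rows
```

A "dropped" value that keeps growing between two reads would point at the softnet backlog rather than at the NIC itself.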

(For the benefit of those that come to this via a search) I've seen the same problem (also with a bnx2 module, IIRC).
You might try turning off the irqbalance service. In my case, it completely stopped the problem.
Please also note that not so long ago, there were plenty of updates (RHEL 6) for irqbalance. Firmware updates should also be checked for both main system and the ethernet board(s).
We were seeing this only on a very large subnet with a very large amount of broadcast/multicast activity. We weren't seeing this on the same equipment on a less noisy -- but still very active -- part of the network.
Potentially, setting the ethernet ring buffer size for the NIC can also be of use. I know there were some alterations for sysctl on that busy network...

Related

TCP_MAXSEG inaccurate? (Was: Linux path MTU probing not working on accept():ed socket if requested using setsockopt())

Question update: In addition to the problem below, it seems our client/server application using the Linux PLPMTUD mechanism arrives at a path MTU that is too large. Has anyone seen this, i.e. the actual path MTU being 1500, but getsockopt() with TCP_MAXSEG returning the MTUs of the endpoints, in our case 3000? I have tried turning off GRO, GSO and TSO with ethtool, but the error persists. Normal ping only manages to push through packets of 1472 bytes or smaller. Also worth mentioning is that PLPMTUD works perfectly for smaller MTUs. For example, with endpoints at 1500 MTU and one interface of the intermediate router set to e.g. 1200 MTU, the kernel TCP probes and reports the correct TCP_MAXSEG (1200 - headers).
I am using the Linux RFC4821-compliant packetization layer path MTU discovery in an application. Basically, the client does a setsockopt on a TCP socket:
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &sopt, sizeof(sopt));
with option value set to IP_PMTUDISC_PROBE. The setsockopt() does not return an error.
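For readers working in Python rather than C, the same per-socket option can be set as sketched below. The numeric fallbacks (10 and 3) are the Linux values from <linux/in.h> and are an assumption here, used only because not every Python build exposes these names on the socket module; the helper name is made up.

```python
import socket

# Fallback values are the Linux constants; treat them as an assumption
# if the socket module on your platform does not define the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_PROBE = getattr(socket, "IP_PMTUDISC_PROBE", 3)


def enable_pmtu_probe(sock):
    """Set IP_PMTUDISC_PROBE on a socket and return the value read back."""
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE)
    return sock.getsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER)
```

Reading the option back with getsockopt() only confirms that the kernel stored the setting; it does not show whether probing actually takes place on that connection.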
The client sends large tcp packets to a discard server, and the path MTU is calibrated by Linux kernel - tcpdump shows tcp packets with DF bit set being sent, packet size varies until the kernel knows the path MTU. However, to get this to work in the other direction (listening server accept:ing connections from clients, sending data and calibrating PMTU in direction from server to client) I have to set the global option for tcp path mtu discovery, /proc/sys/net/ipv4/tcp_mtu_probing. If I do not, server will stupidly continue to send too large packets, which get discarded by an intermediate router without ICMP sent back. Both endpoints have an MTU set to 3000, while the intermediate hops have MTU 1500.
I hope someone has an idea on what goes wrong. If more info is needed, let me know and I edit the question. Problems exist on both Linux kernel 4.2.0 and 3.19.0, both are stock Kubuntu LTS kernels. (x86/x86-64)
I do set the same socket option server-side as well, on all accept:ed sockets, before sending data in reverse direction.
FWIW, I have found workarounds/solutions for the problems, will do more testing but shortly describe my findings here, in case it helps someone else.
The problem with not being able to set path mtu discovery per socket was fixed, brute force, by enabling it system-wide during execution of my program, then disabling again.
The second problem, of incorrect path mtu returned by getsockopt TCP_MAXSEG, was fixed by waiting for TCP ACK of sent TCP data, also using getsockopt (tcp_info.tcpi_unacked). That way, I can be sure that probing has finished before I get TCP_MAXSEG.
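The waiting step described above might look roughly like this in Python. It is a sketch only: the offset of tcpi_unacked inside struct tcp_info (24 bytes, i.e. eight one-byte fields followed by tcpi_rto, tcpi_ato, tcpi_snd_mss and tcpi_rcv_mss) is an assumption based on the Linux header layout, and the helper names, timeout and poll interval are invented for illustration.

```python
import socket
import struct
import time

# Assumed layout of struct tcp_info on Linux: 8 bytes of small fields,
# then four 32-bit values, then tcpi_unacked.
TCPI_UNACKED_OFFSET = 24


def unacked_segments(sock):
    """Return tcpi_unacked (segments sent but not yet acknowledged)."""
    info = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    return struct.unpack_from("I", info, TCPI_UNACKED_OFFSET)[0]


def mss_after_acks(sock, timeout=5.0, poll=0.01):
    """Wait until all sent data is acknowledged, then read TCP_MAXSEG."""
    deadline = time.monotonic() + timeout
    while unacked_segments(sock) > 0:
        if time.monotonic() > deadline:
            raise TimeoutError("sent data was not acknowledged in time")
        time.sleep(poll)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
```

Polling is the simplest approach here; the value returned is whatever the kernel currently reports, so it is only as accurate as the probing that has run by then.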
Finally, there was a patchset for improving path mtu probing accuracy merged to mainline Linux kernel in March 2015. Without those patches, the probing is very imprecise. Patchset is part of 4.1.y-series kernels and onward.

One-to-many Inter Process Communication on Linux

I have a 'server' process that produces some logs. I want the user (or some other service) to be able to view that log stream (like tail -f), but I don't want to write those logs to the filesystem. Can I do this on Linux?
My first attempt was to use UDP on the loopback interface. The server sends packets to localhost on port 12345, and clients can bind to that port to receive them. That doesn't work, because only one client can bind to a given port at a time. Ah, you might say, use SO_REUSEADDR! That does let two clients bind to one port, but only one of them receives the messages.
Next up, I tried UDP multicast on the loopback interface. That one didn't get so far, as my kernel doesn't support multicast on the loopback interface. According to ifconfig:
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:186 errors:0 dropped:0 overruns:0 frame:0
TX packets:186 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11904 (11.6 KiB) TX bytes:11904 (11.6 KiB)
Note the lack of MULTICAST (or BROADCAST, indeed), above.
Does anyone have any ideas? Could I use named pipes, or Unix Domain Sockets to solve this?
I'd like to avoid anything that allows the (unprivileged) listeners to affect the (privileged) server. I'd rather drop logs than block the server, for example.
I'm doing all this in Python, if that makes any difference at all.
You could take a look at ZeroMQ. What you're describing is a need for a publisher/subscriber pattern, which is exactly the kind of thing ZeroMQ does really quite elegantly. It has the added advantage of being very flexible about what sort of transport is used underneath: IPC, TCP, etc. That makes putting bits of your program elsewhere on a network quite simple. Using ZeroMQ you will end up with very simple source code, the complexity all being hidden inside the zmq library. You could start by taking a look at this part of the guide.
You could also consider NanoMSG (the up and coming ZeroMQ-done-better), though I'm not sure that's got Python bindings as yet.
I can think of two generic approaches.
1) Using POSIX shared memory objects.
See the shm_open(3) manual page for more information.
Your application would create a shared memory object, where it will write its log messages, and any client application can open a shared memory object, and read it. Although the POSIX shared memory API looks like a filesystem-based API, it's not.
Now, bear in mind, that you're going to get just a chunk of memory, of some size that you request. You'll have to figure out how your application will structure, and manage this chunk of memory in some meaningful way that your client applications can parse, and poll for changed contents.
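Since the question mentions Python, here is one very small way to structure such a chunk, using the standard multiprocessing.shared_memory module (Python 3.8+), which is backed by POSIX shared memory on Linux. The layout (a sequence number and a length, followed by a single message slot) and the function names are assumptions made for this sketch. It keeps only the latest message and does nothing to guard against a reader seeing a half-written update.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("II")  # sequence number, payload length


def publish(shm, seq, message):
    """Write one log message into the segment, replacing the previous one."""
    data = message.encode("utf-8")
    if HEADER.size + len(data) > shm.size:
        raise ValueError("message does not fit in the shared segment")
    shm.buf[HEADER.size:HEADER.size + len(data)] = data
    HEADER.pack_into(shm.buf, 0, seq, len(data))  # header is written last


def read_latest(name):
    """Attach to the segment by name and return (sequence, message)."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        seq, length = HEADER.unpack_from(shm.buf, 0)
        payload = bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()
    return seq, payload.decode("utf-8")
```

A client would poll read_latest() and act only when the sequence number changes. Note that this wrapper maps the segment read-write, so by itself it does not keep listeners from modifying the contents.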
2) Your application bears the burden of opening and listening on a localhost socket, or a filesystem domain socket, that any client can connect to, and your application will simply write its log messages to every client connection that currently exists.
This is a bit tricky to get right. Your application needs to keep accepting new connections from clients whenever they come in, and write each message to all current connections. It also has to detect when a client gets stuck and stops reading from its end of the socket: the socket's internal buffers then fill up, and a blocking write would hang the main application. Hence all writes must be non-blocking writes, and the application should automatically close any socket whose buffer becomes full, and so on.
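A bare-bones sketch of that second approach over a Unix domain socket could look like the following. The class and method names are invented for illustration, and the policy of dropping a client on any failed or partial send is just one possible choice; a real server would integrate this with its own event loop.

```python
import socket


class LogFanout:
    """Write log data to every connected client without ever blocking."""

    def __init__(self, path):
        self.listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.listener.bind(path)
        self.listener.listen(16)
        self.listener.setblocking(False)
        self.clients = []

    def accept_pending(self):
        """Accept all connections that are currently waiting."""
        while True:
            try:
                conn, _ = self.listener.accept()
            except BlockingIOError:
                return  # nothing more to accept right now
            conn.setblocking(False)
            self.clients.append(conn)

    def publish(self, data):
        """Send data to each client; drop clients that cannot take it all."""
        alive = []
        for conn in self.clients:
            try:
                sent = conn.send(data)
            except OSError:  # buffer full (would block) or peer gone
                sent = None
            if sent == len(data):
                alive.append(conn)
            else:
                conn.close()
        self.clients = alive

    def close(self):
        for conn in self.clients:
            conn.close()
        self.listener.close()
```

The server would call accept_pending() and publish() from its main loop; because every socket is non-blocking, a slow or stuck listener loses its connection instead of stalling the server.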
Take a look at the message queue pattern; popular solutions are RabbitMQ and Redis.
Both have Python clients.

Python 3 - Cannot receive IPv6 packets (UDP - linux)

I have a script that is trying to receive IPv6 packets, but it fails to receive any.
First off, here is my ethernet configuration from ifconfig.
eth1 Link encap:Ethernet HWaddr f8:b1:56:9a:cf:ef
inet addr:192.168.1.90 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::fab1:56ff:fe9a:cfef/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:751359199 errors:38 dropped:10874 overruns:0 frame:35
TX packets:23407 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1033523557150 (1.0 TB) TX bytes:2002869 (2.0 MB)
Interrupt:20 Memory:ef400000-ef420000
I have two network cards, but am using one for internet and one for testing. The second card is connected to a device that sends ethernet packets. I am configuring that device to send IPv6 packets to address fe80::fab1:56ff:fe9a:cfef and port 46780 (however, I can configure it to send to any IPv6 address and any port). I wrote a python script to receive these packets, but I either get an error, or my script doesn't find the packets. I confirmed these packets through wireshark, and through using a raw python socket.
Here is a list of things I have tried and the various errors/problems I encounter.
If I bind to address "::1", I am able to bind to the address. However, I never receive any IPv6 packets.
I tried using socket.getaddrinfo() and then use the returned information and bind to that, however when I try to do so I get the error "Invalid argument"
info = socket.getaddrinfo(host_ipv6_addr, port_num, socket.AF_INET6,
socket.SOCK_DGRAM, 0, socket.AI_PASSIVE)
rtp_socket.bind(info[0][4])
socket.getaddrinfo returns [(10, 2, 17, '', ('fe80::fab1:56ff:fe9a:cfef', 46780, 0, 0))]
If I try to bind directly to my IPv6 address, I also receive "Invalid argument". However, when I changed the scope from 0 to 5, I instead received the error "Cannot assign requested address".
rtp_socket.bind( (host_ipv6_addr, port_num, 0, 5))
Any insight would be greatly appreciated. I'm guessing at this point that I don't have my ethernet card setup properly or something.
UPDATE:
Using Michael Hampton's answer, I solved my problem by using the information from socket.getaddrinfo with the IP address being "fe80::fab1:56ff:fe9a:cfef%eth1" and sticking the results into rtp_socket.bind(). The scope ID went from 0 to 3.
You're trying to bind to a link-local address but you have forgotten to include the scope ID (in this case, %eth1).
So you should be binding to address fe80::fab1:56ff:fe9a:cfef%eth1.
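An equivalent way to supply the scope is to put the interface index into the fourth element of the address tuple yourself. The helper below is a small sketch with a made-up name; it relies only on socket.if_nametoindex() from the standard library.

```python
import socket


def link_local_sockaddr(address, port, ifname):
    """Build an AF_INET6 address tuple scoped to the given interface."""
    scope_id = socket.if_nametoindex(ifname)  # e.g. the index of eth1
    # Tuple layout is (host, port, flowinfo, scope_id).
    return (address, port, 0, scope_id)
```

With the values from the question this would be used as rtp_socket.bind(link_local_sockaddr("fe80::fab1:56ff:fe9a:cfef", 46780, "eth1")), which should match what getaddrinfo() returns for the address with the %eth1 suffix.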

Do multiple programs listening to multicast cause more network traffic?

I have several programs listening to the same multicast stream. I'm wondering whether this doubles the traffic compared with only one program listening, or whether the traffic/bandwidth usage is the same. Thanks!
The short answer is no, the amount of traffic is the same. I'll caveat that with "in most cases". Multicast packets are written to the wire using a MAC address constructed from the multicast group address. Joining a multicast group is essentially telling the NIC to listen to the appropriate MAC address. This makes each listener receive the same ethernet frame. The caveat has to do with how multicast routing may or may not work. If you have a multicast aware router then multicast traffic may traverse the router onto other networks if someone has joined the group on another subnet.
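To make the "MAC address constructed from the multicast group address" concrete: for IPv4, the Ethernet destination is the fixed prefix 01:00:5e followed by the low 23 bits of the group address. A short sketch of that mapping (the function name is just for illustration):

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group address to its Ethernet MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF  # only the low 23 bits are carried over
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{octet:02x}" for octet in octets)
```

Because only 23 bits are kept, several groups share one MAC address (224.0.0.1 and 224.128.0.1, for instance), but each frame still appears on the wire once no matter how many local programs have joined the group.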
I recommend reading "TCP/IP Illustrated, Volume 1" if you plan on doing a lot of network programming. This is the best way to really understand how all of the protocols fit together.
Are the clients on the same network?
For wireless 802.11 multicast, it depends on the implementation of Multicast at the wireless access point.
Some wireless access points do multicast to unicast conversion at the datalink layer and thus send the data separately to EACH client that has joined the multicast group.
If the AP is not doing unicast conversion, generally, your network utilization does not increase.

UDP packet greater than 1500 bytes dropped

I'm developing a tftp client and server and I want to dynamically select the udp payload size to boost transfer performance.
I have tested it with two linux machines ( one has a gigabit ethernet card, the other a fast ethernet one ). I changed the MTU of the gigabit card to 2048 bytes and left the other to 1500.
I have used setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &optval, sizeof(optval)) to set the MTU_DISCOVER flag to IP_PMTUDISC_DO.
From what I have read, this option should set the DF bit to one, so it should be possible to find the minimum MTU of the network ( the MTU of the host that has the lowest MTU ). However, it only gives me an error when I send a packet whose size is bigger than the MTU of the machine from which I'm sending packets.
Also the other machine ( the server in this case ) doesn't receive the oversized packets ( the server has a MTU of 1500 ). All the UDP packets are dropped, the only way is to send packets of 1472 bytes.
Why do the hosts do this? From what I have read, if I send a packet larger than the MTU, the IP layer should fragment it.
I fail to see the problem. You are setting the "don't fragment" bit, and you send a packet smaller than the sending host's MTU, but larger than the receiving host's MTU. Of course nobody will fragment here (doing so would violate the DF bit). Instead, the sending host should get an ICMP message back.
Edit: IP specifies that an ICMP error message type 3 (destination unreachable) code 4 (Fragmentation Required but DF Bit Is Set) is sent to the originating host at the point where the fragmentation would have occurred. The TCP layer handles this on its own for PMTU discovery. On connection-less sockets, Linux reports the error in the socket's error queue if the IP_RECVERR option is activated; see ip(7).
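The 1472-byte limit observed in the question follows directly from the headers: with DF set, the whole IP packet has to fit in the smallest MTU on the path, and a plain IPv4 header (20 bytes, no options) plus the UDP header (8 bytes) come out of that budget. A tiny sketch of the arithmetic, with illustrative function names:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def max_udp_payload(mtu, ip_header=IPV4_HEADER):
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - UDP_HEADER


def path_payload_limit(*mtus):
    """Payload limit for a path, given the MTU of every hop on it."""
    return max_udp_payload(min(mtus))
```

For the two hosts in the question, path_payload_limit(2048, 1500) gives 1472, while the sender's own interface alone would allow max_udp_payload(2048), i.e. 2020 bytes.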
That "DF bit" you're setting, stands for "Don't Fragment". The IP layer should not be expected to fragment packets when you've told it not to.
It is not correct to run hosts with different interface MTUs on the same subnet [1].
This is a host/network misconfiguration, and IP path MTU discovery is not expected to work correctly in this situation.
If you wish to test your application's path MTU discovery, you will need to set up multiple subnets connected by a router [2], with different MTUs. In this situation, the router is the device that will pick up the MTU mismatch, and send back an ICMP "Fragmentation Needed" error.
[1] Well, technically, same broadcast domain.
[2] The devices sold as "home routers" are really router/switches - they route between the WAN and the LAN, but switch between the ethernet ports on the LAN. This isn't sufficient to separate networks with different MTUs.
