What are the meanings of the fields in /proc/net/dev?

The Linux file /proc/net/dev reads like this:
[me#host ~]$ cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
What do the fields drop and errs mean?
Are some errs packets also counted among the drop packets?
Why is a packet counted under errs? Is it because it has a checksum error?
Why is a packet dropped? Is it because the system does not have enough buffer space, or because there is a burst of traffic on the NIC?
Do the two fields take into account packets that are destined for another host (e.g. when the NIC is working in promiscuous mode)?

You can have a look at net/core/dev.c in the kernel source tree to see what it means (excerpt; arguments not shown are marked with /* ... */):
    seq_printf(seq, "%6s:%8lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
               "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
               /* ... */
               stats->rx_dropped + stats->rx_missed_errors,       /* receive "drop" column */
               /* ... */
               stats->rx_length_errors + stats->rx_over_errors +  /* receive "frame" column */
               stats->rx_crc_errors + stats->rx_frame_errors,
               /* ... */
               stats->tx_carrier_errors + stats->tx_aborted_errors +  /* transmit "carrier" column */
               stats->tx_window_errors + stats->tx_heartbeat_errors,
               /* ... */
Receive errors are any kind of invalid packet, e.g. one with an invalid length or an invalid checksum.
Transmit errors are:
carrier errors
aborted errors
window errors
heartbeat errors
(whatever they all mean)
And yes, I think drop counts packets that the device dropped because it ran out of buffer space.

According to http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html, the meanings of each of the columns are:
bytes The total number of bytes of data transmitted or received by the interface.
packets The total number of packets of data transmitted or received by the interface.
errs The total number of transmit or receive errors detected by the device
drop The total number of packets dropped by the device driver.
fifo The number of FIFO buffer errors.
frame The number of packet framing errors.
colls The number of collisions detected on the interface.
compressed The number of compressed packets transmitted or received by the device driver. (This appears to be unused in the 2.2.15 kernel.)
carrier The number of carrier losses detected by the device driver.
multicast The number of multicast frames transmitted or received by the device driver.
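To make the column layout concrete, here is a minimal Python sketch that splits /proc/net/dev-style text into the receive and transmit columns listed above. The function name parse_net_dev and the sample counter values are made up for illustration; on a real Linux machine you would pass in the contents of /proc/net/dev instead of SAMPLE.

```python
# Column names in the order they appear in the /proc/net/dev header.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]


def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in text.splitlines()[2:]:        # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        values = [int(v) for v in counters.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),     # first 8 counters: receive
            "tx": dict(zip(TX_FIELDS, values[8:16])),   # next 8 counters: transmit
        }
    return stats


# Hypothetical sample: the counter values are invented, not real measurements.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  123456     789    3    5    0     1          0        12    65432     321    0    2    0     0       4          0
"""

if __name__ == "__main__":
    for iface, s in parse_net_dev(SAMPLE).items():
        print(iface, "rx errs:", s["rx"]["errs"], "rx drop:", s["rx"]["drop"])
```

Keeping rx and tx in separate dictionaries avoids confusion between the columns that appear on both sides (bytes, packets, errs, drop, fifo, compressed).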

Since no one has answered for almost six months, I feel free to speculate:
I don't think errs and drop overlap. I also think that errs counts checksum failures or other bad data in a received packet (e.g. not enough data to constitute a whole packet). Further, I believe drop only applies to outgoing packets; how would the system know about packets dropped somewhere else?


When using recv(n) with n greater than the MTU, are you guaranteed to read at least a whole layer 2 frame?

I was wondering: suppose there is no data to read from a TCP socket, and then a whole (full) frame of 1492 bytes arrives. Your code (C or any language supporting TCP) calls recv for, say, 4096 bytes. Will the OS guarantee that the recv reads the whole 1492 bytes, or is it possible that the loading of the frame into memory and the recv are "interleaved", so that the recv may get less?
TCP is a stream-oriented protocol. Data is received in order, but you must not make any assumption about how many times you have to call recv before you have received all your data.
It is up to your application to repeat the calls to recv until you know you have received what you need.
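A minimal sketch of such a loop, in Python for brevity. The helper name recv_exact is hypothetical, and the demo uses socket.socketpair() as a local stand-in for a TCP connection; the same loop applies to any stream socket.

```python
import socket


def recv_exact(sock, n):
    """Call recv repeatedly until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)      # may return fewer bytes than requested
        if not chunk:                     # empty result: the peer closed the connection
            raise ConnectionError("connection closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    a, b = socket.socketpair()            # connected stream sockets (stand-in for TCP)
    a.sendall(b"x" * 1000)                # the sender writes in two pieces...
    a.sendall(b"y" * 492)
    data = recv_exact(b, 1492)            # ...the receiver still gets all 1492 bytes
    print(len(data))
    a.close()
    b.close()
```

The loop only works if the application knows how many bytes to expect, for example from a fixed-size header or a length prefix defined by its own protocol.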
(1) TCP is a stream-oriented protocol. This means that it accepts a stream of data from the upper layer on the sender and returns the stream of data to the upper layer on the receiver. TCP itself receives packets from the IP layer and then reconstructs the stream; that is, at some point packets cease to exist. In theory it is possible that, while the stream is being reconstructed, only half of an incoming packet has been copied into the buffer, but it seems pretty unlikely to me that this would happen.
Now, the Linux man page states:
The receive calls normally return any data available up to the requested amount,
I would interpret that as "if one packet has arrived (correctly, in order, etc.), you will get the whole packet's worth of data". But there is no guarantee.
On the other hand, the Windows docs state:
recv will return as much data as is currently available—up to the size of the buffer specified.
That sounds more like a guarantee.
Note, however, that the data will only be returned if the packet is received correctly, and it is next in-order packet (with next expected sequence numbers).
(2) Now, TCP layer works on complete packets. It is actually impossible for it to do interleaving or anything. Ethernet has a checksum, which cannot be computed unless the packet was received completely. Packets with incorrect Ethernet checksum should be filtered out by the network card. TCP also has a checksum which requires all packet data to compute. So, if the network card has passed the packet to your OS, then data should be available.
(3) I don't think you can assume that a received packet is immediately available. A pretty common feature of network cards is large receive offload (LRO, the receive-side counterpart of TCP segmentation offload), which reconstructs part of the stream, so that the network card passes up one TCP packet that was built from multiple TCP packets. There are other mechanisms that reduce the number of interrupts, which more or less result in several packets arriving at once. So the more likely situation is that you will see some delay and then receive data from several packets at once.
The point is, the opposite of what you described is likely to happen. However, I still would not write an application that makes any assumptions about how large a chunk of data is available at a time. This negates the concept of a stream.

TCP Sockets send buffer size efficiency

When working with WinSock or POSIX TCP sockets (in C/C++, so no extra Java/Python/etc. wrapping), are there any efficiency pros or cons to building up a larger buffer (say, up to 4 KB) in user space and then making as few calls to send as possible, versus making multiple smaller calls directly with the bits of data (say, 1-1000 bytes)? That is, other than the fact that for non-blocking/asynchronous sockets the single buffer is potentially easier for me to manage.
I know that small buffers are not recommended with recv, but I couldn't find anything for sending.
E.g. does each send call on common platforms go into kernel mode? Could a 1-byte send actually result in a 1-byte packet being transmitted under normal conditions?
As explained in TCP/IP Illustrated, Volume 1, by W. Richard Stevens, TCP divides the send buffer into near-optimum segments that fit the maximum packet size along the path to the other TCP peer. That means it tries never to send segments that would have to be fragmented by IP along the route to the destination (when a packet that must not be fragmented is too large for some IP router, the router sends back an ICMP "fragmentation needed" message, and TCP takes it into account to reduce the MSS for this connection). That said, there is no need for a buffer larger than the maximum packet size of the link-level interfaces along the path. Having one that is, say, two or three times larger ensures that TCP will not stall for lack of data in its buffer as soon as it receives an acknowledgment from the remote peer.
Consider that the usual interface type is Ethernet, which has a maximum payload size (MTU) of 1500 bytes, so normally TCP doesn't send a segment larger than this. It also normally has an internal buffer of 8 KB per connection, so there's little sense in adding buffer size in kernel space for that (if this is the only reason to have a buffer in kernel space).
Of course, there are other factors that force you to use a buffer in user space (for example, you want to store the data to send to your peer process somewhere, as there is only 8 KB of kernel space to buffer it, and you will need more space to be able to do other processing). An example: ircd (the Internet Relay Chat daemon) uses write buffers of up to 100 KB before dropping a connection because the other side is not receiving/acknowledging that data. If you only write(2) to the connection, you'll be put to sleep once the kernel buffer is full, and perhaps that's not what you want.
The reason to have buffers in user space is that TCP also performs flow control, so when it is not able to send data, the data has to be kept somewhere in the meantime. You'll have to decide whether your process needs to save that data up to a limit, or whether you can block sending until the receiver is able to receive again. The buffer size in kernel space is limited and normally outside the control of the user/developer. Buffer size in user space is limited only by the resources available to the process.
Receiving/sending small chunks of data over a TCP connection is not recommended because of the overhead that TCP handshaking and headers impose. Take a telnet connection in which each character is sent separately: a TCP header and an IP header are added to it (20 bytes minimum for TCP, 20 bytes minimum for IP), plus 14 bytes of Ethernet header and 4 bytes of Ethernet CRC, which makes 58 bytes of overhead to transmit only one character. And normally each TCP segment is acknowledged individually, so it takes a full round trip to send a segment and get the acknowledgment (just to be able to free the buffer resources and consider this character transmitted).
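The header sizes quoted above can be tallied in a few lines of Python. This is only a rough sketch: it assumes minimum-size TCP and IPv4 headers with no options, and it ignores Ethernet padding and preamble. The function name payload_efficiency is made up for illustration.

```python
TCP_HEADER = 20   # minimum TCP header, bytes
IP_HEADER = 20    # minimum IPv4 header, bytes
ETH_HEADER = 14   # Ethernet header, bytes
ETH_CRC = 4       # Ethernet CRC, bytes


def payload_efficiency(payload_bytes):
    """Return (overhead, total, fraction of the frame that is payload)."""
    overhead = TCP_HEADER + IP_HEADER + ETH_HEADER + ETH_CRC
    total = overhead + payload_bytes
    return overhead, total, payload_bytes / total


if __name__ == "__main__":
    for size in (1, 100, 1460):
        overhead, total, fraction = payload_efficiency(size)
        print("%5d payload bytes: %d overhead, %d total, %.1f%% payload"
              % (size, overhead, total, 100 * fraction))
```

With a 1-byte payload less than 2% of the frame is useful data, while a full 1460-byte segment is about 96% payload.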
So, finally, what's the limit? It depends on your application. If you can cope with the kernel resources available and don't need more buffering, you can get by without buffers in user space. If you need more, you'll need to implement buffers and feed the kernel buffer from your own buffer whenever space becomes available.
Yes, a one-byte send can, under very normal conditions, result in sending a TCP packet with only a single byte of payload. Send coalescing in TCP is normally done by Nagle's algorithm. With Nagle's algorithm, sending data is delayed iff there is data that has already been sent but not yet acknowledged.
Conversely, data will be sent immediately if there is no unacknowledged data. This is usually the case in the following situations:
The connection has just been opened
The connection has been idle for some time
The connection only received data but nothing was sent for some time
In that case the first send call that your application performs will cause a packet to be sent immediately, no matter how small. So starting communication with two or more small sends is usually a bad idea because it increases overhead and delay.
The infamous "send send recv" pattern can also cause really large delays (e.g. on Windows typically 200ms). This happens if the local TCP stack uses Nagle's algorithm (which will usually delay the second send) and the remote stack uses delayed acknowledgment (which can delay the acknowledgment of the first packet).
Since most TCP stack implementations use both Nagle's algorithm and delayed acknowledgment, this pattern is best avoided.
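One way to avoid the "send send recv" pattern is to join the small pieces in user space and hand them to the stack in a single call. A minimal Python sketch of that idea follows; the helper name send_coalesced and the request bytes are hypothetical, and socket.socketpair() stands in for a real TCP connection.

```python
import socket


def send_coalesced(sock, pieces):
    """Join small pieces in user space and pass them to the kernel in one call."""
    buf = b"".join(pieces)
    sock.sendall(buf)          # one call instead of one send per piece
    return len(buf)


if __name__ == "__main__":
    a, b = socket.socketpair()                     # stand-in for a TCP connection
    header = b"LEN 5\n"                            # made-up protocol header
    body = b"hello"
    sent = send_coalesced(a, [header, body])       # instead of send(header); send(body)
    print(sent, "bytes queued with a single call")
    a.close()
    b.close()
```

This does not disable Nagle's algorithm; it simply gives the stack the whole message at once, so there is no small second write waiting for an acknowledgment of the first.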

Linux kernel and realtek rtl8139 driver

I'm trying to write a driver for the rtl8139 for Linux 2.6 from scratch. I've already written the TX path, but I have some problems with RX.
I put RX into promiscuous mode and I am receiving RX IRQs. I set RBSTART to the physical address of memory allocated by kmalloc.
I don't know how to find out how many received packets there are and how long they are.
I thought that the ERBCR, CAPR and CBR registers would tell me, but they are == 0.
Maybe I'm doing something wrong? How can I find out anything about received packets?
Answering my own question:
The received packets are located starting at RBSTART. The first two bytes of each received packet are status bytes, and the next two are the length of the frame, including the 4 bytes of CRC.
Maybe someone will find this info helpful.
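The layout described above (two status bytes, two length bytes, then the frame) can be illustrated by decoding a byte buffer in Python. This is a user-space sketch, not driver code. It assumes both fields are little-endian 16-bit values and that bit 0 of the status word means "receive OK"; it also ignores wrap-around at the end of the ring buffer. The names parse_rx_packet and RX_STATUS_OK are made up for illustration.

```python
import struct

RX_HEADER = struct.Struct("<HH")   # assumed: 16-bit status, 16-bit length, little-endian
CRC_LEN = 4                        # the length field includes the 4-byte CRC
RX_STATUS_OK = 0x0001              # assumed meaning of status bit 0: receive OK


def parse_rx_packet(ring, offset=0):
    """Decode one packet header at `offset`; return (status, length, frame)."""
    status, length = RX_HEADER.unpack_from(ring, offset)
    start = offset + RX_HEADER.size
    frame = bytes(ring[start:start + length - CRC_LEN])   # frame without the CRC
    return status, length, frame


if __name__ == "__main__":
    # Fake receive buffer: header, a 60-byte frame, then 4 dummy CRC bytes.
    frame = bytes(range(60))
    ring = RX_HEADER.pack(RX_STATUS_OK, len(frame) + CRC_LEN) + frame + b"\x00" * CRC_LEN
    status, length, payload = parse_rx_packet(ring)
    print(hex(status), length, len(payload))
```

In a real driver the same header would be read from the DMA buffer that RBSTART points to, and the read pointer would then be advanced past the packet.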
On receiving a packet, the data received from the line is stored in the receive FIFO. When the Early Receive Threshold is met, the data is moved from the FIFO to the Receive Buffer.
So, once you get an interrupt, you need to check the Interrupt Status Register for ROK. Then check the Early Rx Status Register, which gives you the status of the packet received. If EROK is set, then check the receive buffer status for ROK. Check for any errors in the ISR and ERSR. Also check your Rx Configuration Register for the threshold configuration of the Rx FIFO and the Rx buffer length.

Less throughput with small packets

Why is the throughput of my machine so much worse with small packets (i.e. 64 bytes) than with 1500-byte packets?
I have a gigabit NIC and am able to transmit at 80 MB/s with 1500-byte packets, but with 64-byte packets I can hardly reach 25 MB/s.
I know that with 1500-byte packets I need to send around 80k PPS to reach line rate, and with 64-byte packets around 1.4 million PPS.
But why is there such a huge variation in throughput for small packets?
EDIT: I am using memory mapping to pass the packets from user space to kernel space in Linux and then writing directly into the network driver to transmit. I see that my CPU utilization is very low, and about the same for 64-byte and 1500-byte packets.
But why is there such a huge variation in throughput for small packets?
CPU strain. Independent of its size, each packet that gets out passes through a lot of processing before reaching the interface. Put another way, the "costs" of transmitting a small packet and a large packet are comparable.
If you're interested in this, you might want to look into "GSO" and "UFO" in the Linux kernel; they were developed specifically for this.
It takes time to send packet headers. It takes time to setup DMA buffers, process packet headers, etc. All that extra work reduces the amount of actual payload that can be sent.
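To put numbers on the per-packet cost, here is a small Python sketch that converts between throughput and packets per second. It assumes gigabit Ethernet, 20 extra bytes on the wire per frame (8 bytes of preamble plus a 12-byte inter-frame gap), and MB meaning 10^6 bytes; the function names are made up for illustration.

```python
LINK_BPS = 1_000_000_000   # gigabit Ethernet, bits per second
WIRE_OVERHEAD = 20         # assumed per-frame preamble (8) + inter-frame gap (12), bytes


def line_rate_pps(frame_bytes, link_bps=LINK_BPS):
    """Packets per second needed to saturate the link with frames of this size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD) * 8)


def achieved_pps(bytes_per_second, frame_bytes):
    """Packets per second implied by a measured throughput."""
    return bytes_per_second / frame_bytes


if __name__ == "__main__":
    for size, measured in ((1500, 80e6), (64, 25e6)):   # figures from the question
        print("%4d bytes: line rate %.0f pps, measured %.0f pps"
              % (size, line_rate_pps(size), achieved_pps(measured, size)))
```

Under these assumptions the 64-byte case from the question is already pushing several times more packets per second than the 1500-byte case, even though its byte throughput is lower, which is consistent with the cost being per packet rather than per byte.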
Think about this: each packet has a header that contains the size of the payload (data) and some general data. Let's say the header is 16 bytes.
If you send 1000 packets of 64 bytes you send 1000 * (64 + 16) = 64000 + 16000 bytes.
If you send it in one shot it is only 64000+16 bytes.

UDP IP Fragmentation and MTU

I'm trying to understand some behavior I'm seeing in the context of sending UDP packets.
I have two little Java programs: one that transmits UDP packets, and the other that receives them. I'm running them locally on my network between two computers that are connected via a single switch.
The MTU setting (reported by /sbin/ifconfig) is 1500 on both network adapters.
If I send packets with a size < 1500, I receive them. Expected.
If I send packets with 1500 < size < 24258 I receive them. Expected. I have confirmed via wireshark that the IP layer is fragmenting them.
If I send packets with size > 24258, they are lost. Not Expected. When I run wireshark on the receiving side, I don't see any of these packets.
I was able to see similar behavior with ping -s.
ping -s 24258 hostA works but
ping -s 24259 hostA fails.
Does anyone understand what may be happening, or have ideas of what I should be looking for?
Both computers are running CentOS 5 64-bit. I'm using a 1.6 JDK, but I don't really think it's a programming problem, it's a networking or maybe OS problem.
Implementations of the IP protocol are not required to be capable of handling arbitrarily large packets. In theory, the maximum possible IP packet size is 65,535 octets, but the standard only requires that implementations support at least 576 octets.
It would appear that your host's implementation supports a maximum size much greater than 576, but still significantly smaller than the maximum theoretical size of 65,535. (I don't think the switch should be a problem, because it shouldn't need to do any defragmentation -- it's not even operating at the IP layer).
The IP standard further recommends that hosts not send packets larger than 576 bytes unless they are certain that the receiving host can handle the larger packet size. You should maybe consider whether it would be better for your program to send smaller packets. 24,258 seems awfully large to me. I think there is a real possibility that a lot of hosts won't handle packets that large.
Note that these packet size limits are entirely separate from MTU (the maximum frame size supported by the data link layer protocol).
I found the following which may be of interest:
Determine the maximum size of a UDP datagram packet on Linux
Set the DF bit in the IP header and send progressively larger packets to find the point at which a packet would have to be fragmented, as in Path MTU Discovery. A packet that is too large should then result in an ICMP type 3 message with code 4, indicating that the packet was too large to be sent without being fragmented.
Dan's answer is useful, but note that after headers you're really limited to 65507 bytes (65535 minus a 20-byte IP header and an 8-byte UDP header).
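The size limits discussed in this thread can be worked out with a little arithmetic. The Python sketch below assumes IPv4 with a 20-byte header and no options; the function name fragment_count is made up for illustration. It does not explain the 24258-byte cutoff seen in the question, it only shows how the 65507 limit and the number of fragments are derived.

```python
import math

IP_MAX = 65535      # maximum IPv4 packet size, bytes
IP_HEADER = 20      # assumed IPv4 header without options, bytes
UDP_HEADER = 8      # UDP header, bytes

MAX_UDP_PAYLOAD = IP_MAX - IP_HEADER - UDP_HEADER


def fragment_count(udp_payload, mtu=1500):
    """Number of IP fragments needed for one UDP datagram at the given MTU."""
    ip_payload = UDP_HEADER + udp_payload
    per_fragment = (mtu - IP_HEADER) // 8 * 8   # fragment data comes in multiples of 8 bytes
    return math.ceil(ip_payload / per_fragment)


if __name__ == "__main__":
    print("largest UDP payload:", MAX_UDP_PAYLOAD)
    for size in (1472, 1473, 24258):
        print(size, "bytes ->", fragment_count(size), "fragment(s)")
```

With an MTU of 1500, any UDP payload above 1472 bytes needs more than one fragment, and losing any single fragment means the whole datagram is lost.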
