The message buffer struct (mbuf): Linux equivalent

Is there an equivalent in Linux of the mbuf (message buffer) data structure that holds the actual packet data to be transmitted over the network? I assumed this was a generic UNIX structure, but apparently it's unique to FreeBSD.

There's sk_buff; I don't know enough to say how closely it matches mbuf in practice: Networking: sk_buff.

So it turns out that the mbuf (message buffer) and pbuf (packet buffer) structures belong to other network stacks (BSD and lwIP, respectively), not to Linux. The sk_buff (socket buffer) is the Linux equivalent of mbuf and contains all the information about the message data being transmitted as well as the packet structure itself.
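For a rough picture of what an sk_buff carries, here is a heavily simplified, illustrative sketch. Field names mirror include/linux/skbuff.h, but kernel-specific types are replaced with plain C types so the sketch stands alone; the real structure has many more fields and changes between kernel versions.

#include <stdint.h>

struct sk_buff_sketch {
    struct sk_buff_sketch *next;     /* sk_buffs can be chained into lists */
    struct sk_buff_sketch *prev;
    void                  *sk;       /* owning socket, if any */
    void                  *dev;      /* net device it arrived on / leaves by */

    unsigned int           len;      /* total data length (linear + paged) */
    unsigned int           data_len; /* length of the paged (non-linear) part */
    uint16_t               protocol; /* packet protocol as seen by the driver */

    uint16_t               transport_header;  /* header offsets in the buffer */
    uint16_t               network_header;
    uint16_t               mac_header;

    /* Packet bytes live between head and end; data and tail delimit the
     * currently valid region, so headers can be pushed/pulled without
     * copying the payload. */
    unsigned char         *head;
    unsigned char         *data;
    unsigned char         *tail;
    unsigned char         *end;
};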

Related

Receive TCP ACKs on application level [duplicate]

Linux has the ioctl SIOCOUTQ, described in the tcp(7) man page, which returns the amount of unsent data in the socket buffer. If I understand the kernel code correctly, all non-ACKed data is counted as "unsent". The ioctl has been available at least since 2.4.x.
Is there anything similar for {Free,Net,Open,*}BSD, Solaris, or Windows?
There are (at least) two different pieces of information you might want: the amount of data that hasn't been sent yet, and the amount of data that's been sent-but-not-ACK-ed.
On Linux: SIOCOUTQ is documented to give the amount of unsent data, but actually gives the sum of (unsent data + sent-but-not-ACK-ed data). A recent patch (Feb 2016) made it possible to get the actual unsent amount from the tcpi_notsent_bytes field of struct tcp_info (retrieved with getsockopt and TCP_INFO).
On macOS and iOS: getsockopt(fd, SOL_SOCKET, SO_NWRITE, ...) is just like SIOCOUTQ: it's documented to give the amount of unsent data, but actually gives the sum of (unsent data + sent-but-not-ACK-ed data). I don't know any way to get more fine-grained information.
On Windows: GetPerTcpConnectionEStats with the TcpConnectionEstatsSendBuff option gives you both unsent data and sent-but-not-ACK-ed data as two separate numbers.
I don't know how to get this information on other operating systems.
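On Linux, both queries can be made from user space. Here is a minimal sketch, assuming a kernel and UAPI headers recent enough to expose tcpi_notsent_bytes; the connected TCP socket descriptor is assumed to come from elsewhere.

#include <stdio.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>      /* IPPROTO_TCP */
#include <linux/tcp.h>       /* struct tcp_info (incl. tcpi_notsent_bytes), TCP_INFO */
#include <linux/sockios.h>   /* SIOCOUTQ */

/* Print both send-queue figures for a connected TCP socket. */
static void print_send_queue(int fd)
{
    int outq = 0;
    if (ioctl(fd, SIOCOUTQ, &outq) == 0)
        printf("unsent + sent-but-not-ACKed: %d bytes\n", outq);

    struct tcp_info ti;
    socklen_t len = sizeof(ti);
    /* Only trust tcpi_notsent_bytes if the running kernel filled it in. */
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0 &&
        len >= offsetof(struct tcp_info, tcpi_notsent_bytes) +
               sizeof(ti.tcpi_notsent_bytes))
        printf("not yet sent:                %u bytes\n", ti.tcpi_notsent_bytes);
}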
On systems where TCP/IP is implemented as a STREAMS device, it might be possible to take a kernel dive and read queue->q_count (the number of bytes on the queue).

What is the relationship between RX/TX ring and sk_buff?

I know that each NIC has RX/TX rings in RAM that the OS uses to receive and transmit packets, and that one item in a ring (a packet descriptor) includes the physical address of a packet, the length of the packet, and so on. Does this descriptor point to an sk_buff? And what happens if the packet is a GSO packet? Is it true that one descriptor in the ring = one packet = one sk_buff?
Does this descriptor point to an sk_buff?
Not exactly. An sk_buff is a software construct: roughly, a data structure containing metadata that describes some chunk of network data and points to the data itself. So a NIC descriptor doesn't need to point to an sk_buff; it may point only to a data buffer (its DMA/physical address is used).
And what happens if the packet is a GSO packet?
This is a somewhat ambiguous question, since such offloads may be implemented in software (say, by the network stack) or done in hardware.
In the former case there is nothing to discuss in terms of NIC descriptors: the upper-layer application provides a contiguous chunk of data, the network stack produces smaller packets from it, and the sk_buffs handed over to the network driver already describe small packets.
In the latter case (hardware offload), the network driver is handed large chunks of data (as single sk_buffs or sk_buff chains) and in turn posts appropriate descriptors to the NIC. That may be one descriptor pointing to one big chunk of data, or a handful of descriptors pointing to smaller parts of the same contiguous buffer; it doesn't matter much, because the offload magic happens in hardware: the overall data chunk is sliced up and packet headers are prepended accordingly, yielding many smaller network packets to be put on the wire.
Is it true that one descriptor in the ring = one packet = one sk_buff?
Strictly speaking, no. It depends. Your network driver may be asked to transmit one sk_buff describing one data buffer. However, under certain circumstances the driver may decide to post multiple descriptors pointing to the same chunk of data at different offsets, i.e. the submission is done in parts and there are multiple descriptors in the NIC's ring related to a single sk_buff (see the sketch below). Also, one packet is not always the same as one sk_buff: a packet may be presented as a handful of segments, each described by a separate sk_buff, forming an sk_buff chain (see the next and prev fields in sk_buff).
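To make the "several descriptors per sk_buff" case concrete, below is a hedged sketch of a generic transmit path. The descriptor layout (my_tx_desc), the MY_DESC_EOP flag and the ring handling are invented for illustration and belong to no real driver; skb_headlen(), skb_shinfo(), skb_frag_size(), dma_map_single() and skb_frag_dma_map() are real kernel helpers. Error handling, ring wrap-around and DMA unmapping are omitted.

/* Illustrative only: one sk_buff becomes one descriptor for its linear
 * head plus one descriptor per page fragment. */
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

struct my_tx_desc {                     /* hypothetical NIC descriptor */
        __le64 addr;                    /* DMA address of this piece */
        __le16 len;                     /* length of this piece */
        __le16 flags;                   /* e.g. end-of-packet marker */
};

#define MY_DESC_EOP 0x0001              /* hypothetical end-of-packet flag */

static void my_queue_skb(struct device *dev, struct my_tx_desc *ring,
                         unsigned int *tail, struct sk_buff *skb)
{
        unsigned int i, nfrags = skb_shinfo(skb)->nr_frags;
        dma_addr_t dma;

        /* Descriptor for the linear part of the sk_buff ... */
        dma = dma_map_single(dev, skb->data, skb_headlen(skb), DMA_TO_DEVICE);
        ring[*tail].addr  = cpu_to_le64(dma);
        ring[*tail].len   = cpu_to_le16(skb_headlen(skb));
        ring[*tail].flags = nfrags ? 0 : cpu_to_le16(MY_DESC_EOP);
        (*tail)++;

        /* ... and one per page fragment, so a single sk_buff may occupy
         * several slots in the NIC ring. */
        for (i = 0; i < nfrags; i++) {
                skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

                dma = skb_frag_dma_map(dev, frag, 0, skb_frag_size(frag),
                                       DMA_TO_DEVICE);
                ring[*tail].addr  = cpu_to_le64(dma);
                ring[*tail].len   = cpu_to_le16(skb_frag_size(frag));
                ring[*tail].flags = (i == nfrags - 1)
                                    ? cpu_to_le16(MY_DESC_EOP) : 0;
                (*tail)++;
        }
}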
The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, it invokes the DMA engine to place the packet into the kernel memory via empty sk_buff's stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, packet data remains in the same kernel memory, avoiding any extra memory copies.
http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf
That last sentence seems to indicate that incoming packet data is kept in kernel memory in sk_buff structs without redundancy. So I'd say the answer to your question is yes, that descriptor would point to an sk_buff. And yes, each packet is put in its own sk_buff in rx_ring.
sk_buff has nothing to do with physical network interfaces (not directly, at least). sk_buff lists store data as seen by the socket-accessing software and by the kernel protocol handlers (which manipulate those lists to add/remove headers and/or alter data, e.g. when encryption is employed).
It is the responsibility of the low-level driver to translate the sk_buff list contents into something the physical network adapter will understand. In particular, network hardware can be really dumb (as when doing networking over serial lines), in which case the driver basically reads the sk_buff lists byte by byte and sends those bytes over the wire.
More advanced adapters are usually capable of scatter/gather DMA: given a list of addresses in RAM, they can access each address and either fetch packet data from it or write received data back to it. However, the exact details of this mechanism are very much adapter specific and in many cases are not even consistent across a single vendor's products.
Does this descriptor point to an sk_buff?
The answer is YES. This avoids copying memory from one place (the rx_ring DMA buffer) to another (the sk_buff).
You can check the implementation of the b44 NIC driver (in drivers/net/ethernet/broadcom/b44.c): the function b44_init_rings pre-allocates a constant number of sk_buffs for the rx_ring, and their data areas are also used as DMA buffers for the NIC.
static void b44_init_rings(struct b44 *bp)
{
        int i;

        /* Release any buffers left over from a previous run. */
        b44_free_rings(bp);

        /* Clear the descriptor rings themselves. */
        memset(bp->rx_ring, 0, B44_RX_RING_BYTES);
        memset(bp->tx_ring, 0, B44_TX_RING_BYTES);

        if (bp->flags & B44_FLAG_RX_RING_HACK)
                dma_sync_single_for_device(bp->sdev->dma_dev, bp->rx_ring_dma,
                                           DMA_TABLE_BYTES, DMA_BIDIRECTIONAL);

        if (bp->flags & B44_FLAG_TX_RING_HACK)
                dma_sync_single_for_device(bp->sdev->dma_dev, bp->tx_ring_dma,
                                           DMA_TABLE_BYTES, DMA_TO_DEVICE);

        /* Pre-allocate one sk_buff per RX descriptor slot; their data areas
         * double as the DMA buffers the NIC writes incoming packets into. */
        for (i = 0; i < bp->rx_pending; i++) {
                if (b44_alloc_rx_skb(bp, -1, i) < 0)
                        break;
        }
}

Is ethernet checksum exposed via AF_PACKET?

As implied by this question, it seems that the checksum is calculated and verified by the Ethernet hardware, so it seems highly unlikely that it must be generated in software when sending frames through an AF_PACKET socket, as seen here and here. Also, I don't think it can be received from the socket or obtained by any simple means, since even Wireshark doesn't display it.
So, can anyone confirm this? Do I really need to send the checksum myself, as shown in the last two links? Will the checksum be created and checked automatically by the Ethernet adapter?
No, you do not need to include the CRC.
When using a packet socket in Linux using socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL) ), you must provide the layer 2 header when sending. This is defined by struct ether_header in netinet/if_ether.h and includes the destination host, source host, and type. The frame check sequence is not included, nor is the preamble, start of frame delimiter, or trailer. These are added by the hardware.
Similarly, on Linux with socket(AF_PACKET, SOCK_RAW, htobe16(ETH_P_ALL)), you don't need to calculate the Ethernet checksum; the NIC hardware/driver will do it for you. That means you supply the whole data-link-layer frame, except the checksum, before sending it to the raw socket.
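For reference, here is a minimal userspace sketch of sending a frame this way (not taken from the linked answers): the interface name eth0, the MAC addresses and the experimental EtherType 0x88B5 are made-up examples, and running it requires root or CAP_NET_RAW.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>          /* htons */
#include <net/if.h>             /* if_nametoindex */
#include <netinet/if_ether.h>   /* struct ether_header, ETH_ALEN, ETH_P_ALL, ETH_ZLEN */
#include <linux/if_packet.h>    /* struct sockaddr_ll */
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Header + payload only: no preamble, SFD or FCS. */
    unsigned char frame[ETH_ZLEN] = {0};            /* 60-byte minimum frame */
    struct ether_header *eh = (struct ether_header *)frame;
    memcpy(eh->ether_dhost, "\xff\xff\xff\xff\xff\xff", ETH_ALEN); /* broadcast */
    memcpy(eh->ether_shost, "\x02\x00\x00\x00\x00\x01", ETH_ALEN); /* example MAC */
    eh->ether_type = htons(0x88B5);                 /* experimental EtherType */

    struct sockaddr_ll addr = {0};
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(0x88B5);
    addr.sll_ifindex  = if_nametoindex("eth0");     /* assumed interface name */
    addr.sll_halen    = ETH_ALEN;
    memcpy(addr.sll_addr, eh->ether_dhost, ETH_ALEN);

    /* The NIC appends the frame check sequence itself. */
    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}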

How do I get amount of queued data for UDP socket?

To see how well I'm doing in processing incoming data, I'd like to measure the queue length at my TCP and UDP sockets.
I know that I can get the buffer capacity via the SO_RCVBUF socket option, and that ioctl(<sockfd>, SIOCINQ, &<some_int>) gives me this information for TCP sockets. But for UDP, the SIOCINQ/FIONREAD ioctl returns only the size of the next pending datagram. Is there a way to get the queue size for UDP without having to parse system tables such as /proc/net/udp?
FWIW, I did some experiments to map out the behavior of FIONREAD on different platforms.
Platforms where FIONREAD returns all the data pending in a SOCK_DGRAM socket:
Mac OS X, NetBSD, FreeBSD, Solaris, HP-UX, AIX, Windows
Platforms where FIONREAD returns only the bytes for the first pending datagram:
Linux
It might also be worth noting that some implementations include headers or other overhead bytes in the count, while others only count the payload bytes. Linux appears to return the payload size, not including IP headers.
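A small self-contained experiment makes this easy to observe (error checks are mostly omitted and the port number 9999 is arbitrary): two datagrams are queued on a loopback socket and FIONREAD is queried.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>          /* FIONREAD */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>          /* htons, htonl */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family      = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port        = htons(9999);               /* arbitrary test port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("bind");
        return 1;
    }

    /* Queue two datagrams of different sizes on our own socket. */
    sendto(fd, "12345", 5, 0, (struct sockaddr *)&sa, sizeof(sa));
    sendto(fd, "1234567890", 10, 0, (struct sockaddr *)&sa, sizeof(sa));

    int pending = 0;
    if (ioctl(fd, FIONREAD, &pending) == 0)
        /* Linux prints 5 (next datagram only); BSD-derived stacks report the
         * total queued amount, possibly including per-datagram overhead. */
        printf("FIONREAD reports %d bytes\n", pending);

    close(fd);
    return 0;
}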
As ldx mentioned, it is not supported through ioctl or getsockopt.
It seems to me that the current implementation of SIOCINQ is aimed at telling you how much buffer space is needed to read the waiting data (though I guess it is not very useful even for that, since the value can change between the ioctl and the actual read).
There are many other telemetry values that are not exposed through such system calls; I guess there is no real need for them in normal production usage.
You can check drops/errors with "netstat -su", or better via SNMP (udpInErrors), if you just want to monitor the machine's state.
BTW: you always have the option to hack the kernel code and export this value (or others).
