Discard incoming UDP packet without reading - linux

In some cases, I'd like to explicitly discard packets waiting on the socket with as little overhead as possible. It seems there's no explicit "drop udp buffer" system call, but maybe I'm wrong?
The next best way would be probably to recv the packet to a temporary buffer and just drop it. It seems I can't receive 0 bytes, since man says about recv: The return value will be 0 when the peer has performed an orderly shutdown. So 1 is the minimum in this case.
Is there any other way to handle this?
Just in case - this is not a premature optimisation. The only thing this server is doing is forwarding / dispatching the UDP packets in a specific way - although recv with len=1 won't kill me, I'd rather just discard the whole queue in one go with some more specific function (hopefully lowering the latency).

You can have the kernel discard your UDP packets by setting the UDP receive buffer to 0.
int UdpBufSize = 0;
socklen_t optlen = sizeof(UdpBufSize);
setsockopt(socket, SOL_SOCKET, SO_RCVBUF, &UdpBufSize, optlen);
Whenever you see fit to receive packets, you can then set the buffer to, for example, 4096 bytes.

I'd rather just discard the whole queue in one go
Since this is UDP we are talking here: close(udp_server_socket) and socket()/bind() again?
To my understanding should work.

man says about recv: the return value will be 0 when the peer has performed an orderly shutdown.
That doesn't apply to UDP. There is no "connection" to shut down in UDP. A return value of 0 is perfectly valid, it just means datagram with no payload was received (i.e. the IP and UDP headers only.)
Not sure that helps your problem or not. I really don't understand where you are going with the len=1 stuff.

Related

When using recv(n), with n greather than the MTU are you guaranteed to read at least a whole layer 2 frame?

I was wondering, imagine if there is no data to read from a TCP socket, then a whole frame of 1492 bytes arrives (full). In your code (C or any language supporting TCP) you have let's say recv 4096 bytes, will the OS guarantee that the recv reads the whole 1492 bytes, or is it possible that the loading of the frame in memory and recv are "interleaved", so the recv may get less ?
TCP is a stream oriented protocol. Data are received in order but you must not do any assumption about how many times you have to call recv until you receive all your data.
It is up to your application to repeat the calls to recv until you know you have received what you need.
(1) TCP is stream-oriented protocol. This means that it accepts a stream of data from the upper layer on the sender and returns the stream of data to the upper layer on the receiver. TCP itself receives packets from IP layer, and then reconstructs the stream. That is at some points packets cease to exist. In theory it is possible that somewhere during this reconstructed stream, only half of the incomming packet is copied in buffer, but it seems to me pretty unlikely that this would happen.
Now, linux man page states
The receive calls normally return any data available up to the requested amount,
I would interpret it as "if one packet has arrived (correctly, in order, etc), you will get the whole packet worth of data". But there is no guarantee.
On the other hand Windows docs states:
recv will return as much data as is currently available—up to the size of the buffer specified.
Which sounds more like the guarantee.
Note, however, that the data will only be returned if the packet is received correctly, and it is next in-order packet (with next expected sequence numbers).
(2) Now, TCP layer works on complete packets. It is actually impossible for it to do interleaving or anything. Ethernet has a checksum, which cannot be computed unless the packet was received completely. Packets with incorrect Ethernet checksum should be filtered out by the network card. TCP also has a checksum which requires all packet data to compute. So, if the network card has passed the packet to your OS, then data should be available.
(3) I don't think you can assume that if the packet is received, it is immediatelly available. A pretty common feature of network cards is TCP segmentation offload, which reconstructs part of the stream and results in network card passing one TCP packet that was reconstructed from multiple TCP packets. There are other things that can be in place to reduce the number of interrupts, which more or less result in several packets comming at once. So, the more likely situation is that you will have maybe some delay and then receive data from several packets at once.
The point is, the opposite of what you described is likely to happen. However, I still would not write an application that makes any assumptions about how large a chunk of data is available at a time. This negates the concept of a stream.

UDP sockets and MSG_PEEK

When we use recvfrom() to read a packet from UDP socket, we cannot read it partially. Because if we read a small part of UDP packet first(by passing a small buffer), the reminder of packet is dropped as mentioned here:
All receive operations return only one packet. When the packet is
smaller than the passed buffer, only that much data is returned; when
it is bigger, the packet is truncated and the MSG_TRUNC flag is set.
But I wonder if the same thing happens if we just inspect packet using MSG_PEEK flag. Will reminder of packet is dropped if I just peek the UDP message?
Will reminder of packet is dropped if I just peek the UDP message?
Nothing will be dropped, since with the MSG_PEEK flag set, the state of the socket's incoming-data-buffer is not modified; the entire packet will remain in the socket's buffer.
Only the first part of the packet's data will be copied into your too-small destination-data-buffer, of course.

nonblocking send()/write() and pending data dealing

when the send(or write) buffer is going to be full, let me say, only theres is only 500 bytes space. if I have a NONBLOCKING fd, and do
n = send(fd, buf, 1000,0)
here I wll get n<0, and I can get EWOULDBLOCK or EAGAIN error. my questions are:
1 here, the send write 500 bytes into the send buffer or 0 bytes to the send buffer?
2 if 500 bytes are sent to the buffer and if the fd is a UDP socket, then the datagram is split into 2 parts?
3 I need to use the fd to send many datagrams, if this time the send buffer is full(if there is EWOULDBLOCK or EAGAIN error), I need to make a pending list of datagrams(a FIFO queue). And everytime I want to send some datagram, I will have to check the pending list to see whether it is empty or not. if it is not empty, send the datagram in the pending list first. It seems to me that this design is a bit troublesome. And the design is similar to extending the kernel(BTW, is it in the kernel?) send buffer by a userspace pending list. Is there better solution for this?
thanks!
The below only applies to UDP (and only in linux, though others are similar), but that seems to be what you're asking about.
Setting non-blocking mode on a UDP socket is completely irrelevant (for sending) as a send will never block -- it immediately sends the packet, without any buffering.
It IS possible (if the machine is very busy) for there to be a buffer space problem (run out of transient packet buffers for packet processing), but in that case call will return ENOBUFS, regardless of whether the fd is blocking or non-blocking. This should be extremely rare.
There's a potential problem if you're generating packets faster than the network can take them (fairly easy to do on a fast machine and a 10Mbit ethernet port), in which case the kernel will start dropping the outgoing packets. Unfortunately there's no easy way to detect when this happens (you can check the interface for TX dropped packets, but that won't tell you which packets were dropped).
Its also possible to have a problem if you use the UDP_CORK socket option, which buffers data written to the socket instead of sending a packet, and only sends a single packet when the CORK option is unset. In this case, if the buffer grows too big you'll get EMSGSIZE (and again, the NONBLOCKING setting is irrelevant).
If you are talking about UDP, you are completely off point here - for UDP the value of the SO_SNDBUF socket option puts a limit on the size of a datagram you can send. In other words, there's no real per-socket send buffer (though data is still queued in the kernel to be sent out by appropriate network controller). You would get EMSGSIZE if you try to send(2) more in one shot.
For TCP though, you would only get EWOULDBLOCK when there's no space in the send buffer at all, i.e. no data has been copied from user to the kernel. Otherwise sent(2)'s return value tells you exactly how many bytes have been copyed.

sendto on Tru64 is returning ENOBUF

I am currently running an old system on Tru64 which involves lots of UDP sockets using the sendto() function. The sockets are used in our code to send messages to/from various processes and then eventually on to a thick client app that is connected remotely. Occasionally the socket to the thick client gets stuck, this can cause some of these messages to get built up. My question is how can I determine the current buffer size, and how do I determine the maximum message buffer. The code below gives a snippet of how I set up the port and use the sendto function.
/* need to adjust the maximum size we can send on this */
/* as it needs to be able to cope with the biggest */
/* messages we send */
lenlen = sizeof(len) ;
/* allow double for when the system is under load */
int lenlen, len ;
lenlen = sizeof(len) ;
len = 2 * 32000;
msg_socket = socket( AF_UNIX,SOCK_DGRAM, 0);
result = setsockopt(msg_socket, SOL_SOCKET, SO_SNDBUF, (char *)&len, lenlen) ;
result = sendto( msg_socket,
(char *)message,
(int)message_len,
flags,
dest_addr,
addrlen);
Note. We have ported this application to Linux and the problem does not seem to appear there.
Any help would be greatly appreciated.
Regards
UDP send buffer size is different from TCP - it just limits the size of the datagram. Quoting Stevens UNP Vol. 1:
...
A UDP socket has a send buffer size (which we can change with SO_SNDBUF socket option, Section 7.5), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned. Since UDP is unreliable, it does not need to keep a copy of the application's data and does not need an actual send buffer. (The application data is normally copied into a kernel buffer of some form as it passes down the protocol stack, but this copy is discarded by the datalink layer after the data is transmitted.)
UDP simply prepends 8-byte header and passes the datagram to IP. IPv4 or IPv6 prepends its header, determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue. If a UDO application sends large datagrams (say 2,000-byte datagrams), there's a much higher probability of fragmentation than with TCP. because TCP breaks the application data into MSS-sized chunks, something that has no counterpart in UDP.
The successful return from write to a UDP socket tells us that either the datagram or all fragments of the datagram have been added to the datalink output queue. If there is no room on the queue for the datagram or one of its fragments, ENOBUFS is often returned to the application.
Unfortunately, some implementations do not return this error, giving the application no indication that the datagram was discarded without even being transmitted.
The last footnote needs attention - but it looks like Tru64 has this error code listed in the manual page.
The proper way of doing it though is to queue your outstanding messages in the application itself and to carefully check return values and the errno after each system call. This still does not guarantee delivery (since UDP receivers might drop the packets without any notice to the senders). Check the UDP packet discard counters with netstat -s on both/all sides, see if they are growing. There is really no way around this besides switching to TCP or implementing your own timeout/ack and re-transmission logic.
You should probably be using some sort of congestion control to avoid overloading the network. By far the easiest way to do this is to use TCP instead of UDP.
It fails less often on Linux because UDP sockets wait for space in the local network interface queue on Linux (unless you set them non-blocking). However, with any operating system, if the overfull queue is not in the local system, the packet will be dropped silently.

Disable TCP Delayed ACKs

I have an application that receives relatively sparse traffic over TCP with no application-level responses. I believe the TCP stack is sending delayed ACKs (based on glancing at a network packet capture). What is the recommended way to disable delayed-ACK in the network stack for a single socket? I've looked at TCP_QUICKACK, but it seems that the stack will change it under my feet anyways.
This is running on a Linux 2.6 kernel, and I am not worried about portability.
You could setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK, (int[]){1}, sizeof(int)) after every recv you perform. It appears that TCP_QUICKACK is only reset when there is data being sent or received; if you're not sending any data, then it will only get reset when you receive data, in which case you can simply set it again.
You can check this in the 14th field of /proc/net/tcp; if it is not 1, ACKs should be sent immediately... if I'm reading the TCP code correctly. (I'm not an expert at this either.)
I believe using the setsockopt() function you can use the TCP_NODELAY which will disable the Nagle algorithm.
Edit Found a link: http://www.ibm.com/developerworks/linux/library/l-hisock.html
Edit 2 Tom is correct. Nagle does not affect Delayed ACKs.
The accepted answer has a bug, it should be
flags = 0;
flglen = sizeof(flags);
setsockopt(sfd, SOL_TCP, TCP_QUICKACK, &flags, flglen)
(SOL_TCP, not IPPROTO_TCP)
Post here in case someone need it.
As #damolp suggested, it is not a bug. Keep it here for record.

Resources