Use of Recv-Q and Send-Q - Linux

What is the use of the Recv-Q and Send-Q columns in netstat's output? How do I use this in a realistic scenario?
On my system, both of the columns are always shown as zero. What does that mean?

From my man page:
Recv-Q
Established: The count of bytes not copied by the user program
connected to this socket.
Listening: Since Kernel 2.6.18 this column contains the current syn
backlog.
Send-Q
Established: The count of bytes not acknowledged by the remote
host.
Listening: Since Kernel 2.6.18 this column contains the maximum size
of the syn backlog.
If these are stuck at 0, it just means that your applications, on both sides of the connection, and the network between them, are doing OK. Actual instantaneous values may be different from 0, but in such a transient, fleeting manner that you don't get a chance to actually observe it.
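As a side note, you can also read the same counters from inside an application: on Linux, the SIOCINQ and SIOCOUTQ ioctls report them per socket (see tcp(7)). A minimal Linux-only sketch, assuming an established TCP socket fd:

/* Linux-specific sketch: SIOCINQ corresponds to Recv-Q (bytes received but
 * not yet read by the application), SIOCOUTQ to Send-Q (bytes queued but not
 * yet ACKnowledged by the peer) for an established TCP socket. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>   /* SIOCINQ, SIOCOUTQ */

static void print_queues(int fd)
{
    int recvq = 0, sendq = 0;

    if (ioctl(fd, SIOCINQ, &recvq) == -1)
        perror("SIOCINQ");
    if (ioctl(fd, SIOCOUTQ, &sendq) == -1)
        perror("SIOCOUTQ");

    printf("Recv-Q: %d bytes, Send-Q: %d bytes\n", recvq, sendq);
}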
Example of a real-life scenario where this might be different from 0 (on established connections, but I think you'll get the idea):
I recently worked on a Linux embedded device talking to a (poorly designed) third-party device. On this third-party device, the application clearly got stuck sometimes, not reading the data it received on the TCP connection, resulting in its TCP window going down to 0 and staying stuck there for tens of seconds (a phenomenon observed via Wireshark on a mirrored port between the 2 hosts). In such a case:
Recv-Q: running netstat on the third-party device (which I had no means to do) might have shown an increasing Recv-Q, up to some ceiling value at which the other side (me) stops sending data because the window has gone down to 0: the application does not read the data available on its socket, so this data stays buffered in the TCP implementation in the OS, never reaching the stuck application -> from the receiver side, an application issue.
Send-Q: running netstat on my side (which I did not try, because 1/ the problem was clear from Wireshark and was the first case above, and 2/ it was not 100% reproducible) might have shown a non-zero Send-Q, if the other side's TCP implementation at the OS level had become stuck and stopped ACKnowledging my data -> from the sender side, an issue in the receiving TCP implementation (typically at the OS level).
Note that the Send-Q scenario depicted above may also be a sending-side issue (my side) if my Linux TCP implementation were misbehaving and continued to send data after the TCP window went down to 0: the receiving side then has no more room for this data -> it does not ACKnowledge it.
Note also that a Send-Q issue may be caused not by the receiver, but by some routing issue somewhere between the sender and the receiver. Some packets are "in flight" between the 2 hosts, but not ACKnowledged yet. On the other hand, a Recv-Q issue is definitely on a host: packets received, ACKnowledged, but not yet read by the application.
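To make the scenario concrete, here is a small self-contained sketch (my own illustration on localhost, not the original device code; error handling omitted, port 5555 chosen arbitrarily) that reproduces the stuck-reader situation: the parent accepts a connection and never reads, the child keeps writing. While it runs, netstat -tn shows Recv-Q growing on the accepted socket and, once the window has closed, a non-zero Send-Q on the writer's socket.

/* Stuck-reader reproduction sketch (localhost, arbitrary port 5555).
 * Parent: listens, accepts, then sleeps instead of read()ing.
 * Child:  connects and write()s as fast as it can.
 * Run it, then `netstat -tn | grep 5555` in another terminal. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5555);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int ls = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    bind(ls, (struct sockaddr *)&addr, sizeof addr);
    listen(ls, 1);

    if (fork() == 0) {                          /* child: the sender */
        int cs = socket(AF_INET, SOCK_STREAM, 0);
        char buf[4096];
        memset(buf, 'x', sizeof buf);
        connect(cs, (struct sockaddr *)&addr, sizeof addr);
        for (;;)
            if (write(cs, buf, sizeof buf) < 0) /* blocks once buffers are full */
                return 0;
    }

    accept(ls, NULL, NULL);                     /* parent: the stuck reader */
    sleep(60);                                  /* never read(): Recv-Q fills up */
    return 0;
}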
EDIT:
In real life, with non-crappy hosts and applications as you can reasonably expect, I'd bet that a non-zero Send-Q is most often caused by routing issues or poor network performance between the sending and receiving side. The "in flight" state of packets should never be forgotten:
The packet may be on the network between the sender and the receiver,
(or received but the ACK not sent yet, see above)
or the ACK may be on the network between the receiver and the sender.
It takes one RTT (round-trip time) for a packet to be sent and then ACKed.

Regarding the accepted answer by jbm, which quotes:
Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
These are not the syn backlog; they are the listen backlog.

Related

ArangoDB Could not connect

arangod is running for some time without any problems, but at some point no more connections can be made.
arangosh then shows the following error message:
Error message 'Could not connect to 'tcp://127.0.0.1:8529' 'connect() failed with #99 - Cannot assign requested address''
In the log file arangod still writes more trace information.
After restarting arangod it runs without problems again, until the problem suddenly reoccurs.
Why is this happening?
Since this question was sort of answered by time, I'll use this answer to elaborate on how to dig into such a situation and to get a valuable analysis of which operating system parameters to look at. I'll base this on Linux targets.
First we need to find out what's currently going on, using the netstat tool as the root user (we only care about TCP ports):
netstat -alnpt
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
...
tcp 0 0 0.0.0.0:8529 0.0.0.0:* LISTEN 3478/arangod
tcp 0 0 127.0.0.1:45218 127.0.0.1:8529 ESTABLISHED 6902/arangosh
tcp 1 0 127.0.0.1:46985 127.0.0.1:8529 CLOSE_WAIT 485/arangosh
We see an overview of the 3 possible value groups:
LISTEN: These are daemon processes offering TCP services to remote ends, in this case the arangod process with its server socket. It binds port 8529 on all available IPv4 addresses of the system (0.0.0.0) and accepts connections from any remote location (0.0.0.0:*).
ESTABLISHED: This is an active TCP connection, in this case between arangosh and arangod; arangosh has its client port (45218) in the higher range and connects to arangod on port 8529.
CLOSE_WAIT: This is a connection in a termination state: the remote end has closed its side and the local application has not closed the socket yet. It's normal to see a few of these; the TCP stack of the operating system also keeps fully closed connections around for a while (in TIME_WAIT) so it knows where to sort in stray TCP packets that may have been sent but did not arrive on time.
As you can see, TCP ports are 16-bit unsigned integers, ranging from 0 to 65535. Server sockets start from the lower end, and most operating systems require processes to be running as root to bind ports below 1024. Client sockets start from the upper end and range down to a specified limit on the client. Since multiple clients can connect to one server, while the server port range seems narrow, it's usually the client-side ports that wear out. If the client frequently closes and reopens connections you may see many sockets lingering in TIME_WAIT (or CLOSE_WAIT) state; as many discussions across the net hint, these are the symptoms of your system eventually running out of resources. In general, the solution to this problem is to re-use existing connections through the keep-alive feature.
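To see this client-port exhaustion in isolation (independent of ArangoDB), a hypothetical loop like the following, which connects to a local listener and closes immediately without keep-alive, will eventually fail with errno 99 (EADDRNOTAVAIL, "Cannot assign requested address") once the ephemeral port range is consumed by TIME_WAIT sockets; exact behaviour depends on kernel defaults such as tcp_tw_reuse:

/* Hypothetical demonstration of ephemeral-port exhaustion.
 * Assumes a TCP listener on 127.0.0.1:8529 (e.g. arangod).
 * Each close() leaves the client socket in TIME_WAIT, pinning one
 * ephemeral port; eventually connect() fails with EADDRNOTAVAIL (99). */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in srv = { 0 };
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8529);
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (unsigned long i = 0; ; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd, (struct sockaddr *)&srv, sizeof srv) == -1) {
            fprintf(stderr, "connect failed after %lu connections: %s (errno %d)\n",
                    i, strerror(errno), errno);
            close(fd);
            return 1;
        }
        close(fd);   /* the closing side goes to TIME_WAIT */
    }
}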
The Solaris ndd command explains thoroughly which parameters it can modify, and with which consequences, in the Solaris kernel; the terms explained there are fairly generic to TCP sockets and can be found on many other operating systems in other ways - on Linux, which we focus on here, through the /proc/sys/net filesystem.
Some valuable switches there are:
ipv4/ip_local_port_range This is the range for local (client) ports. You can try to narrow it, and use arangob --keep-alive false to explore what happens if your system runs out of these.
time wait (often shortened to tw) is the section that controls what the TCP stack should do with already closed sockets in TIME_WAIT state. The Linux kernel can do a trick here - it can instantly re-use connections in that state for new connections. Vincent Bernat explains very nicely which screws to turn and what the different parameters in the kernel mean.
So once you have decided to change some of your values in /proc so your host scales better to the given situation, you need to make them persistent across reboots - since /proc is volatile and won't remember values across reboots.
Most Linux systems therefore offer the /etc/sysctl.[d|conf] file; it maps slashes in the proc filesystem to dots, so /proc/sys/net/ipv4/tcp_tw_reuse translates into net.ipv4.tcp_tw_reuse.

Server reports a ZeroWindow but there's no data in the receive buffer

I have a Linux-based server application that reports a TCP ZeroWindow on a socket connection, to indicate that it is closing the receive window. This was confirmed with Wireshark, and it should also be noted that window scaling is disabled.
The weird thing is that when looking at this connection with netstat, the connection shows the following:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 192.168.0.201:1344 192.168.0.101:35340 ESTABLISHED
The reason why I find this weird is that Recv-Q reports a value of 0, which means the socket could receive more data, yet the connection still reports a ZeroWindow, telling the client not to send anything.
Because of this, the connection gets stuck in this state where no data flows anymore except the client that regularly sends special probe segments to the server. The purpose of these probes is to prompt the server to send back a segment containing the current window size; however, the window never reopens.
With this said, I was looking for an explanation of how the connection could get stuck in this state: ZeroWindow with Recv-Q also zero.
Thanks!
Possibly the server has decided that it is finished reading data and has called shutdown(fd, SHUT_RD) on the socket. This should not send an RST to the client yet, as the server is still allowed to send data. The Linux implementation might set the window size to 0 after the shutdown(), as the process has indicated it won't read any more data anyway (just an assumption, I never tested whether Linux really behaves that way). If you can reproduce the issue, try running strace on the server process before the connection enters this state.
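For reference, that hypothesis corresponds to server code along these lines (a hedged sketch; whether the kernel then keeps the advertised window at zero is exactly what would need to be verified, e.g. with Wireshark):

/* Hypothetical server behaviour matching the guess above: after accept(),
 * declare "I will not read any more" while keeping the write side open. */
#include <sys/socket.h>

void stop_reading_but_keep_writing(int conn_fd)
{
    shutdown(conn_fd, SHUT_RD);   /* no more read()s; send() is still allowed */
    /* ... the application keeps send()ing responses on conn_fd ... */
}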

how is TCP's checksum calculated when we use tcpdump to capture packets which we send out

I am trying to generate a series of packets to simulate the TCP 3-way handshake procedure. My first step is to capture the real connection packets and try to re-send the same packets from the same machine, but it didn't work at first.
Finally I found out that the packet I captured with tcpdump is not exactly what my computer sent out: the TCP checksum field is changed, which led me to think that I can establish a TCP connection even if the TCP checksum is incorrect.
So my question is: how is the checksum field calculated? Is it modified by tcpdump or by hardware? Why is it changed? Is it a bug in tcpdump, or is the calculation simply omitted?
The following is the screenshot I captured from my host machine and a virtual machine; you can see that the same packet captured on different machines is identical except for the TCP checksum.
The small window is my virtual machine; I used the command "ssh 10.82.25.138" from the host to generate these packets.
What you are seeing may be the result of checksum offloading. To quote from the wireshark wiki (http://wiki.wireshark.org/CaptureSetup/Offloading):
Most modern operating systems support some form of network offloading,
where some network processing happens on the NIC instead of the CPU.
Normally this is a great thing. It can free up resources on the rest
of the system and let it handle more connections. If you're trying to
capture traffic it can result in false errors and strange or even
missing traffic.
On systems that support checksum offloading, IP, TCP, and UDP
checksums are calculated on the NIC just before they're transmitted on
the wire. In Wireshark these show up as outgoing packets marked black
with red Text and the note [incorrect, should be xxxx (maybe caused by
"TCP checksum offload"?)].
Wireshark captures packets before they are sent to the network
adapter. It won't see the correct checksum because it has not been
calculated yet. Even worse, most OSes don't bother initialize this
data so you're probably seeing little chunks of memory that you
shouldn't.
Although this is written for Wireshark, the same principle applies to tcpdump. On your host machine, you see the wrong checksum because it just hasn't been filled in yet. It looks right on the guest, because before it's sent out on the "wire" it has been filled in. Try disabling checksum offloading on the interface which is handling this traffic, e.g.:
ethtool -K eth0 rx off tx off
if it's eth0.
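For reference on the first part of the question (how the checksum is computed, whoever ends up filling it in): it is the standard Internet checksum from RFC 793/1071, a 16-bit one's-complement sum over a pseudo-header (source address, destination address, protocol, TCP length) plus the TCP header and payload, with the checksum field itself treated as zero. A sketch of that arithmetic for IPv4 (not how the kernel or NIC implements it internally):

/* Internet checksum (RFC 1071) over the IPv4 pseudo-header + TCP segment.
 * A sketch for reference only: `seg` points at the TCP header (checksum
 * field zeroed) followed by the payload, `seg_len` is their total length,
 * `src`/`dst` are the IPv4 addresses as they appear in the IP header. */
#include <stddef.h>
#include <stdint.h>

static uint32_t add_words(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {                       /* sum 16-bit big-endian words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                                /* odd trailing byte, pad with zero */
        sum += (uint32_t)p[0] << 8;
    return sum;
}

uint16_t tcp_checksum(const uint8_t src[4], const uint8_t dst[4],
                      const uint8_t *seg, uint16_t seg_len)
{
    /* Pseudo-header: src addr, dst addr, zero byte, protocol (6 = TCP), TCP length. */
    uint8_t pseudo[12] = {
        src[0], src[1], src[2], src[3],
        dst[0], dst[1], dst[2], dst[3],
        0, 6, (uint8_t)(seg_len >> 8), (uint8_t)(seg_len & 0xff),
    };

    uint32_t sum = add_words(pseudo, sizeof pseudo, 0);
    sum = add_words(seg, seg_len, sum);

    while (sum >> 16)                       /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;                  /* one's complement of the sum */
}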

Dropping of connections with tcp_tw_recycle

Summary of the problem
We have a setup with a lot (800 to 2400 per second) of incoming connections to a Linux box, and there is a NAT device between the clients and the server.
As a result, many TIME_WAIT sockets are left in the system.
To overcome that we set tcp_tw_recycle to 1, but that led to incoming connections being dropped.
After browsing through the net we did find references explaining why frames get dropped when tcp_tw_recycle is used together with a NAT device.
Resolution tried
We then tried setting tcp_tw_reuse to 1; it worked fine, without any issues, with the same setup and configuration.
But the documentation says that tcp_tw_recycle and tcp_tw_reuse should not be used when connections go through TCP-state-aware nodes, such as firewalls, NAT devices or load balancers, because frames may be dropped. The more connections there are, the more likely you will see this issue.
Queries
1) Can tcp_tw_reuse be used in this type of scenario?
2) If not, which part of the Linux code prevents tcp_tw_reuse from being used in such a scenario?
3) Generally, what is the difference between tcp_tw_recycle and tcp_tw_reuse?
By default, when both tcp_tw_reuse and tcp_tw_recycle are disabled, the kernel will make sure that sockets in TIME_WAIT state will remain in that state long enough -- long enough to be sure that packets belonging to future connections will not be mistaken for late packets of the old connection.
When you enable tcp_tw_reuse, sockets in TIME_WAIT state can be used before they expire, and the kernel will try to make sure that there is no collision regarding TCP sequence numbers. If you enable tcp_timestamps (a.k.a. PAWS, for Protection Against Wrapped Sequence Numbers), it will make sure that those collisions cannot happen. However, you need TCP timestamps to be enabled on both ends (at least, that's my understanding). See the definition of tcp_twsk_unique for the gory details.
When you enable tcp_tw_recycle, the kernel becomes much more aggressive, and will make assumptions on the timestamps used by remote hosts. It will track the last timestamp used by each remote host having a connection in TIME_WAIT state, and allow a socket to be re-used if the timestamp has correctly increased. However, if the timestamp used by the host changes (i.e. warps back in time), the SYN packet will be silently dropped, and the connection won't establish (you will see an error similar to "connect timeout"). If you want to dive into kernel code, the definition of tcp_timewait_state_process might be a good starting point.
Now, timestamps should never go back in time, unless:
the host is rebooted (but then, by the time it comes back up, TIME_WAIT socket will probably have expired, so it will be a non issue);
the IP address is quickly reused by something else (TIME_WAIT connections will stay a bit, but other connections will probably be struck by TCP RST and that will free up some space);
network address translation (or a smarty-pants firewall) is involved in the middle of the connection.
In the latter case, you can have multiple hosts behind the same IP address, and therefore, different sequences of timestamps (or, said timestamps are randomized at each connection by the firewall). In that case, some hosts will be randomly unable to connect, because they are mapped to a port for which the TIME_WAIT bucket of the server has a newer timestamp. That's why the docs tell you that "NAT devices or load balancers may start to drop frames because of the setting".
Some people recommend to leave tcp_tw_recycle alone, but enable tcp_tw_reuse and lower tcp_fin_timeout. I concur :-)

What is the meaning of SO_REUSEADDR (setsockopt option) - Linux? [duplicate]

This question already has answers here: How do SO_REUSEADDR and SO_REUSEPORT differ?
From the man page:
SO_REUSEADDR Specifies that the rules
used in validating addresses supplied
to bind() should allow reuse of local
addresses, if this is supported by the
protocol. This option takes an int
value. This is a Boolean option
When should I use it? What does "reuse of local addresses" give?
TCP's primary design goal is to allow reliable data communication in the face of packet loss, packet reordering, and — key, here — packet duplication.
It's fairly obvious how a TCP/IP network stack deals with all this while the connection is up, but there's an edge case that happens just after the connection closes. What happens if a packet sent right at the end of the conversation is duplicated and delayed, such that the 4-way shutdown packets get to the receiver before the delayed packet? The stack dutifully closes down its connection. Then later, the delayed duplicate packet shows up. What should the stack do?
More importantly, what should it do if a program with open sockets on a given IP address + TCP port combo closes its sockets, and then a brief time later, a program comes along and wants to listen on that same IP address and TCP port number? (Typical case: A program is killed and is quickly restarted.)
There are a couple of choices:
Disallow reuse of that IP/port combo for at least 2 times the maximum time a packet could be in flight. In TCP, this is usually called the 2×MSL delay. You sometimes also see 2×RTT, which is roughly equivalent.
This is the default behavior of all common TCP/IP stacks. 2×MSL is typically between 30 and 120 seconds, and it shows up in netstat output as the TIME_WAIT period. After that time, the stack assumes that any rogue packets have been dropped en route due to expired TTLs, so that socket leaves the TIME_WAIT state, allowing that IP/port combo to be reused.
Allow the new program to re-bind to that IP/port combo. In stacks with BSD sockets interfaces — essentially all Unixes and Unix-like systems, plus Windows via Winsock — you have to ask for this behavior by setting the SO_REUSEADDR option via setsockopt() before you call bind().
SO_REUSEADDR is most commonly set in network server programs, since a common usage pattern is to make a configuration change, then be required to restart that program to make the change take effect. Without SO_REUSEADDR, the bind() call in the restarted program's new instance will fail if there were connections open to the previous instance when you killed it. Those connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds, so you fall into case 1 above.
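In code, that looks something like the following server-side sketch (error handling trimmed); the important detail is that the option is set before bind():

/* Typical server-side use of SO_REUSEADDR: set it before bind() so a
 * restarted server can re-bind its port even while connections from the
 * previous instance are still in TIME_WAIT. Sketch only. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Must happen before bind(); afterwards it has no effect on this address. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) == -1)
        perror("setsockopt(SO_REUSEADDR)");

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        perror("bind");   /* without SO_REUSEADDR this is EADDRINUSE after a quick restart */
        exit(1);
    }
    listen(fd, 128);
    return fd;
}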
The risk in setting SO_REUSEADDR is that it creates an ambiguity: the metadata in a TCP packet's headers isn't sufficiently unique that the stack can reliably tell whether a packet is stale (clearly intended for a now-dead listener) and so should be dropped, rather than delivered to the new listener's socket.
If you don't see that that is true, here's all the listening machine's TCP/IP stack has to work with per-connection to make that decision:
Local IP: Not unique per-conn. In fact, our problem definition here says we're reusing the local IP, on purpose.
Local TCP port: Ditto.
Remote IP: The machine causing the ambiguity could re-connect, so that doesn't help disambiguate the packet's proper destination.
Remote port: In well-behaved network stacks, the remote port of an outgoing connection isn't reused quickly, but it's only 16 bits, so you've got 30-120 seconds to force the stack to get through a few tens of thousands of choices and reuse the port. Computers could do work that fast back in the 1960s.
If your answer to that is that the remote stack should do something like TIME_WAIT on its side to disallow ephemeral TCP port reuse, that solution assumes that the remote host is benign. A malicious actor is free to reuse that remote port.
I suppose the listener's stack could choose to strictly disallow connections from the TCP 4-tuple only, so that during the TIME_WAIT state a given remote host is prevented from reconnecting with the same remote ephemeral port, but I'm not aware of any TCP stack with that particular refinement.
Local and remote TCP sequence numbers: These are also not sufficiently unique that a new remote program couldn't come up with the same values.
If we were re-designing TCP today, I think we'd integrate TLS or something like it as a non-optional feature, one effect of which is to make this sort of inadvertent and malicious connection hijacking impossible. That requires adding large fields (128 bits and up) which wasn't at all practical back in 1981, when the document for the current version of TCP (RFC 793) was published.
Without such hardening, the ambiguity created by allowing re-binding during TIME_WAIT means you can either a) have stale data intended for the old listener be misdelivered to a socket belonging to the new listener, thereby either breaking the listener's protocol or incorrectly injecting stale data into the connection; or b) new data for the new listener's socket mistakenly assigned to the old listener's socket and thus inadvertently dropped.
The safe thing to do is wait out the TIME_WAIT period.
Ultimately, it comes down to a choice of costs: wait out the TIME_WAIT period or take on the risk of unwanted data loss or inadvertent data injection.
Many server programs take this risk, deciding that it's better to get the server back up immediately so as to not miss any more incoming connections than necessary.
This is not a universal choice. Many programs — even server programs requiring a restart to apply a settings change — choose instead to leave SO_REUSEADDR alone. The programmer may know these risks and is choosing to leave the default alone, or they may be ignorant of the issues but are getting the benefit of a wise default.
Some network programs offer the user a choice among the configuration options, fobbing the responsibility off on the end user or sysadmin.
SO_REUSEADDR allows your server to
bind to an address which is in a
TIME_WAIT state.
This socket option tells the kernel that even if this port is busy (in the TIME_WAIT state), go ahead and reuse it anyway. If it is busy, but with another state, you will still get an address already in use error. It is useful if your server has been shut down, and then restarted right away while sockets are still active on its port.
From unixguide.net
When you create a socket, you don't really own it. The OS (TCP stack) creates it for you and gives you a handle (file descriptor) to access it. When your socket is closed, it takes time for the OS to "fully close it" while it goes through several states. As EJP mentioned in the comments, the longest delay is usually from the TIME_WAIT state. This extra delay is required to handle edge cases at the very end of the termination sequence and to make sure the last termination acknowledgement either got through or had the other side reset itself because of a timeout. Here you can find some extra considerations about this state. The main considerations are pointed out as follows:
Remember that TCP guarantees all data transmitted will be delivered,
if at all possible. When you close a socket, the server goes into a
TIME_WAIT state, just to be really really sure that all the data has
gone through. When a socket is closed, both sides agree by sending
messages to each other that they will send no more data. This, it
seemed to me was good enough, and after the handshaking is done, the
socket should be closed. The problem is two-fold. First, there is no
way to be sure that the last ack was communicated successfully.
Second, there may be "wandering duplicates" left on the net that must
be dealt with if they are delivered.
If you try to create multiple sockets with the same ip:port pair in quick succession, you get the "address already in use" error because the earlier socket will not have been fully released. Using SO_REUSEADDR will get rid of this error, as it overrides the checks for any previous instance.
