Select system call hangs indefinitely in a networking application - Linux

We have a networking application that is used inside various scripts to communicate with other systems.
Occasionally the scripts hang on a call to this application. We recently experienced such a hang, and I tried to debug the hung process.
The application consists of a client and a server (a daemon); the hang occurs on the client side.
Strace output showed me that it's hung on a select system call.
> strace -p 34567
select(4, [3], NULL, NULL, NULL
As you can see, no timeout is given to the select() call, so it can block indefinitely if file descriptor 3 never becomes ready for reading.
lsof output showed that fd '3' is in FIN_WAIT2 state.
> lsof -p 34567
client 34567 user 3u IPv4 55184032 TCP client-box:smar-se-port2->server:daemon (FIN_WAIT2)
Does the above information imply something, in particular the FIN_WAIT2 state? I checked the server side (where the corresponding daemon process should be running), but no daemon process is running there. My guess is that the daemon ran successfully and sent its output to the client, so that output should be available on fd 3 for reading, yet the select() call on the client never returns and keeps waiting for something to happen.
I am not sure why it never returns from the select() call. This only happens occasionally; most of the time the application works fine.
Any clues?
Both the server and the client run SuSE Linux.

FIN_WAIT2 means your app has sent a FIN packet to the peer, but has not received a FIN from the peer yet. In TCP, a graceful close requires a FIN from both parties. The fact that the server daemon is not running means the daemon exited (or was killed) without notifying its peer (you). So your select() is waiting for packets it will no longer receive, and has to wait for the OS to invalidate the socket using an internal timeout, which can take a long time. This kind of situation is exactly why you should never use infinite timeouts. Use an appropriate timeout and act accordingly if the timeout elapses.
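For illustration, a minimal sketch (the 30-second timeout and the helper name are made up) of how the client's wait could be bounded:
#include <stdio.h>
#include <sys/select.h>

/* Wait up to 30 seconds for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    tv.tv_sec  = 30;   /* illustrative timeout */
    tv.tv_usec = 0;

    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc == 0)
        fprintf(stderr, "timed out waiting for the server reply\n");
    return rc;
}
On timeout, the client can retry, report an error, or tear the connection down instead of hanging forever.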

Established Linux processes

We are running an IBM MDM server (Initiate) which connects to an Oracle DB server through a connection pooling mechanism. The pool size has been configured as 32. We also have a custom Java process that submits data to this MDM server through an API the MDM server exposes. Once our custom Java process (which does not open any DB connections directly) terminates, we see that the number of processes between the MDM server and the DB server has risen to some number greater than 32. After each nightly run the number of processes keeps on increasing, until it finally reaches the limit set by the Oracle DB (700); the DB then won't let any more connections be opened to it, and our process fails that night. We are trying to figure out why the processes aren't getting terminated and why they are still in ESTABLISHED mode (as per the netstat command).
There are several reasons why the number of processes could increase and sockets could remain in ESTABLISHED state.
A typical mistake is spawning a child process for each message/connect/register instead of reusing the existing connection, especially when timer callbacks are involved,
e.g.,
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on the receive socket
instead it should be (a rough sketch follows below):
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on the receive socket
c - set the initialized flag
c - register for timer callback -> server
c -> if initialized, do not spawn a process to receive the reply
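A rough C sketch of that guard, where send_registration_to_server() and spawn_receiver() are hypothetical placeholders for the application's own routines:
#include <stdbool.h>

/* Placeholders for the application's own routines. */
void send_registration_to_server(void);
void spawn_receiver(void);

static bool receiver_initialized = false;

/* Called from every timer callback registration. */
void register_timer_callback(void)
{
    send_registration_to_server();

    if (!receiver_initialized) {
        spawn_receiver();              /* spawn the receiver only once */
        receiver_initialized = true;
    }
}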
Does the system throw any exception after reaching the maximum limit?
Are the created processes still active?
Are these processes establishing the DB connection but not getting terminated?
Does top output show the active processes?
1) Clear the old logs.
2) Collect lsof data. This operating system command tells us which descriptors are being used by the app server process:
lsof -p PID > lsof.out
3) Collect the ulimits. These are the operating system resource limits:
ulimit -a > ulimits.out
Please check whether the code that opens the connections also closes them once they have been used.
Also check the lsof output and the state and type of each connection.
I am working for IBM as a Java Service Engineer. Please answer the questions above so that we can help you better.

SO_KEEPALIVE behavior is enabled by default on Linux?

I have a client/server application written in C using TCP sockets. I wanted to know dead server processes using SO_KEEPALIVE option enabled on client socket. I am using Linux.
I modified the default time from 2 hours to 10 minutes.
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
I enabled SO_KEEPALIVE on the client socket using setsockopt(). Then I intentionally killed (kill -9) the server process while it was sending data to the client.
As expected, after the 10 minute timeout (plus additional time for the probes), the client socket got notified: read(socket, ...) returned zero.
However, to my surprise, even if I disable this option on the client socket, it still gets notified after the specified timeout (read() returns zero).
Is this behavior by default enabled in Linux?
Also, I felt read() returning zero was inappropriate; shouldn't read() return some error when the peer is dead?
Keepalive causes a connection reset. The only thing that causes read() to return zero is receiving a FIN. Ergo, you received a FIN, not a keepalive termination, and ergo this doesn't show that keepalive is enabled by default in Linux. It would be a violation of RFC 1122.
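For reference, keepalive has to be enabled explicitly per socket, and on Linux the probe timing can also be tuned per socket rather than via /proc. A sketch (the timing values below are illustrative):
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */
#include <sys/socket.h>

/* Enable keepalive on an already connected TCP socket. */
static int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 600;    /* start probing after 10 minutes of idleness */
    int intvl = 60;    /* seconds between probes */
    int cnt = 5;       /* unanswered probes before the connection is dropped */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;
    return 0;
}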

What constitutes "readable" (kqueue/epoll)

I know that if the remote host gracefully shuts down a connection, epoll will report EPOLLIN, and calling read or recv will not block, and will return 0 bytes (i.e. end of stream).
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
I've tried to find documentation on this behaviour, but have not succeeded, and while I could test it, I'm not interested in what happens on a specific distribution with a specific kernel version.
It is indeed not entirely obvious from the specification, but it works as follows for poll():
If there is data available to be read, even if the connection is closed, POLLIN is returned.
If neither reading nor writing is possible because of a closed connection, POLLHUP or POLLERR is returned.
If reading is no longer possible but writing is (such as if the other side did shutdown(SHUT_WR)), POLLIN is returned and POLLHUP and POLLERR are not returned. (This allows waiting for POLLOUT normally.)
The simple thing to do is to try a read when any of POLLIN, POLLHUP and POLLERR are set.
In kqueue(), there is just an EVFILT_READ filter that may be triggered. This is described in the man page and should be clear enough.
Note that if you don't enable TCP keepalives (FreeBSD enables them by default but most other operating systems do not), waiting for data to read may get stuck forever if the network breaks in certain ways. Even if TCP keepalives are on, it tends to take a few hours to detect a broken connection.
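A minimal sketch of that "just try to read" approach with poll() (single descriptor, infinite timeout, error handling trimmed):
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the number of bytes read, 0 on orderly shutdown, -1 on error. */
static ssize_t wait_and_read(int fd, char *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (poll(&pfd, 1, -1) < 0)          /* -1 means no timeout */
        return -1;

    /* Read on POLLIN, POLLHUP or POLLERR; the read() result
     * (0 for EOF, -1 plus errno for errors) tells us what happened. */
    if (pfd.revents & (POLLIN | POLLHUP | POLLERR))
        return read(fd, buf, len);

    return -1;
}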
It may not return EPOLLIN when the peer machine goes down unexpectedly. In the past, I encountered this kind of phenomenon with VirtualBox, using the following steps:
Launch server on one VM.
Launch client on the other VM, connect to the server and keep the connection open without doing anything.
Save client VM state (something like hibernate).
I then saw that the connection was still ESTABLISHED on the server VM, using
netstat -anp --tcp
In other words, EPOLLIN was not triggered in server.
http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ says that the connection will be kept for about 7200 seconds by default.
Of course, you can change the keepalive timeout value via setsockopt() or kernel parameters.
But some books say the better solution is to detect this in the application layer, e.g. design the protocol so that some dummy messages are sent periodically to probe the connection state (a rough sketch follows below).
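A very rough sketch of such an application-level heartbeat (the message format, interval handling, and function name are made up):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a dummy "PING\n" every interval_sec seconds. */
static int heartbeat_loop(int fd, unsigned interval_sec)
{
    const char ping[] = "PING\n";

    for (;;) {
        if (send(fd, ping, sizeof(ping) - 1, MSG_NOSIGNAL) < 0) {
            fprintf(stderr, "peer gone: %s\n", strerror(errno));
            return -1;
        }
        /* A real protocol would also wait for a "PONG" with a deadline;
         * a successful send() alone does not prove the peer is alive
         * (see the send(2) question further down). */
        sleep(interval_sec);
    }
}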
epoll() is basically poll(), but it scales better as you increase the number of fds. I am not sure what it does when you use it as an edge-triggered interface, but for level-triggered use, yes: it will always return EPOLLIN, provided you are listening for this event, if end of stream is detected.
Though you must know TCP is not perfect. If the connection is terminated abnormally by the other side (e.g. the physical link goes down), your side may never detect this until you write to the socket. TCP keepalive may help, but not much.
However, if the connection is not closed gracefully, and a write or send operation fails, does this cause epoll to subsequently return EPOLLIN for that socket, producing the same/similar end of stream scenario?
No. That would imply receipt of a FIN, which means normal termination of the connection, which didn't happen. I would expect you would get an EPOLLERR or maybe EPOLLHUP.
But I'm curious why you wouldn't have already closed the socket on getting the write error, and why you would still be polling it. That's not correct behaviour.
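For what it's worth, the cleanup this answer has in mind might look roughly like the following, assuming the socket was registered with an epoll instance epfd:
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* On a failed write, stop polling the socket and close it. */
static void drop_connection(int epfd, int fd)
{
    /* EPOLL_CTL_DEL ignores the event argument on current kernels,
     * but passing one keeps pre-2.6.9 kernels happy. */
    struct epoll_event ev = { 0 };
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
        perror("epoll_ctl(EPOLL_CTL_DEL)");
    close(fd);
}

static void handle_write(int epfd, int fd, const void *buf, size_t len)
{
    if (send(fd, buf, len, MSG_NOSIGNAL) < 0) {
        fprintf(stderr, "write failed (%s), dropping connection\n",
                strerror(errno));
        drop_connection(epfd, fd);
    }
}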

Linux TCP weirdly unresponsive when under heavy load

I'm trying to get an HTTP server I'm writing on to behave well when under heavy load, but I'm getting some weird behavior that I cannot quite understand.
My testing consists of using ab (the Apache benchmark program) over the loopback interface at a concurrency level of 1000 (ab -n 50000 -c 1000 http://localhost:8080/apa), while stracing the server process. strace both slows processing down enough for the problem to be readily reproducible and allows me to debug the server internals to some extent after completion. I also capture the network traffic with tcpdump while the test is running.
What happens is that ab stops running a while into the test, complaining that a connection returned ECONNRESET, which I find a bit weird. I could easily buy into a connection timing out since the server might simply not have the bandwidth to process them all, but shouldn't that reasonably return ETIMEDOUT or even ECONNREFUSED if not all connections can be accepted?
I used Wireshark to extract the packets constituting the first connection to return ECONNRESET, and its brief packet list looks like this:
(The entire tcpdump file of this connection is available here.)
As you can see from this dump, the connection is accepted (after a few SYN retransmissions), then the request is retransmitted a few times, and then the server resets the connection. I'm wondering what could cause this to happen. Normally, Linux's TCP implementation ACKs data before the reading process even chooses to receive it, as long as there is space in the TCP window, so why doesn't it do that here? Are there some kind of shared buffers that are running out? Most importantly, why is the kernel responding with a RST packet all of a sudden instead of simply waiting and letting the client retransmit further?
For the record, the strace of the process indicates that it never even accepts a connection from the port in this connection (port 56946), so this seems to be something Linux does on its own. It is also worth noting that the server works perfectly well as long as ab's concurrency level is low enough (it works perfectly well up to about 100, and then starts failing intermittently somewhere between 100-500), and that its request throughput is rather constant regardless of the concurrency level (it processes somewhere between 6000-7000 requests per second as long as it isn't being straced). I have not found any particular correlation between the frequency of the problem occurring and my backlog setting to listen() (I'm currently using 128, but I've tried up to 1024 without it seeming to make a difference).
In case it matters, I'm running Linux 3.2.0 on this AMD64 box.
The backlog queue filled up: hence the SYN retransmissions.
Then a slot became available: hence the SYN/ACK.
Then the GET was sent, followed by four retransmissions, which I can't account for.
Then the server gave up and reset the connection.
I suspect you have a concurrency or throughput problem in your server which is preventing you from accepting connections rapidly enough. You should have a thread that is dedicated to doing nothing else but calling accept(), and either starting another thread to handle the accepted socket or queueing a job for a thread pool to handle it. I would then speculate that Linux resets connections which are in the backlog queue and which are receiving I/O retries, but that's only a guess.
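A rough sketch of that dedicated accept loop, using a detached thread per connection for brevity (a real server would more likely hand the fd to a pool):
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static void *handle_client(void *arg)
{
    int fd = (int)(intptr_t)arg;
    /* ... read the request, write the response ... */
    close(fd);
    return NULL;
}

/* Accept loop: does nothing but accept() and hand off. */
static void accept_loop(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;                       /* EINTR, EMFILE, ... */

        pthread_t tid;
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (pthread_create(&tid, &attr, handle_client,
                           (void *)(intptr_t)fd) != 0)
            close(fd);                      /* could not hand off */
        pthread_attr_destroy(&attr);
    }
}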

send(2) succeeds with an established connection on an unreachable network

I have some trouble understanding the send(2) syscall on my Linux x86 box.
Suppose I have established an SSH connection in my app with another host on the LAN. Then I take the network down (e.g. unplug the cable) and call the function (from my app) that sends some SSH packets through the connection. Internally this function calls send like
w = send(s->fd_out,buffer, len, 0);
In the debugger I found that send returns len (i.e. w == len after the call).
How can this be if the network is unreachable? When I call netstat it says my SSH connection is in the ESTABLISHED state even though the network is down.
I can't understand why send executes normally and doesn't return any error (like EPIPE or ECONNRESET). Maybe an SSH connection lives on for some time after the network goes down?
Thanks to all.
It's due to the implementation of TCP (and ssh uses TCP). Your send() just writes to a socket, which is just a file descriptor; a successful return means only that this write operation succeeded, not that the data has been sent. A file descriptor is just some pointer with state inside the kernel, after all. The kernel keeps the TCP state around for a while longer before failing the session; in fact, the kernel is allowed to keep this session indefinitely until you explicitly call close() or kill your process. So your data is actually buffered in kernel space for the network stack to deliver later.
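On Linux you can actually observe that buffering: the SIOCOUTQ ioctl (documented in tcp(7)) reports how many bytes are still sitting in the socket's send queue. A small sketch:
#include <linux/sockios.h>   /* SIOCOUTQ */
#include <stdio.h>
#include <sys/ioctl.h>

/* Print how many bytes send() has accepted but TCP has not yet
 * managed to get acknowledged and off the send queue. */
static void print_unsent(int fd)
{
    int unsent = 0;
    if (ioctl(fd, SIOCOUTQ, &unsent) == 0)
        printf("%d bytes still queued in the kernel\n", unsent);
    else
        perror("ioctl(SIOCOUTQ)");
}
With the cable unplugged, this number stays non-zero while TCP keeps retransmitting.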
Here is a quick experiment you can do:
Write a server that keeps receiving messages after establishing a connection
socket();
bind();
listen();
accept();          /* accept once */
while (1) {
    recv();        /* keep receiving on the same connection */
}
Write a client that establishes a connection, reads lines from standard input, and sends a message to the server whenever you hit return.
socket();
connect();
while (1) {
getline();
send();
}
Be careful that you NEVER call close() in the while loop on either side. Now, if you unplug your cable AFTER you've established a connection, send a message, reconnect the cable, and send another message, you will find both messages on the server side.
What you will NEVER observe is receiving the second message before the first one. You either lose them all, or receive them in order.
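For reference, a more complete, runnable version of this experiment might look like the sketch below (the port number and buffer sizes are arbitrary, and error handling is minimal):
/* Build: cc -o tcptest tcptest.c
 * Run:   ./tcptest server        (on one machine)
 *        ./tcptest client <ip>   (on the other)
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT 5000   /* arbitrary test port */

static void server(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port = htons(PORT),
                             .sin_addr.s_addr = htonl(INADDR_ANY) };
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&a, sizeof(a));
    listen(lfd, 1);

    int fd = accept(lfd, NULL, NULL);   /* accept once, then keep reading */
    char buf[256];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[n] = '\0';
        printf("got: %s", buf);
        fflush(stdout);
    }
}

static void client(const char *ip)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port = htons(PORT) };
    inet_pton(AF_INET, ip, &a.sin_addr);
    connect(fd, (struct sockaddr *)&a, sizeof(a));

    char line[256];
    while (fgets(line, sizeof(line), stdin))   /* one message per line */
        send(fd, line, strlen(line), 0);       /* never close() here   */
}

int main(int argc, char **argv)
{
    if (argc >= 2 && strcmp(argv[1], "server") == 0)
        server();
    else if (argc >= 3 && strcmp(argv[1], "client") == 0)
        client(argv[2]);
    else
        fprintf(stderr, "usage: %s server | client <ip>\n", argv[0]);
    return 0;
}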
Now let me explain why it behaves like this. This is the state diagram of a TCP session.
https://dl.dropbox.com/u/17011409/TCP_State.png
You can see clearly that until you explicitly call close(), the connection will always be in the established state. That's expected behavior of TCP. Establishing a TCP connection is expensive, and keeping a session alive is good for performance. (That's partially how TCP DoS attacks work: attackers keep establishing connections until the server runs out of resources to keep the TCP state information.)
In this state, your send() is delegated to the kernel for the actual sending. TCP guarantees in-order, reliable delivery, but the network can lose packets at any time, so TCP HAS TO buffer your packets and keep trying. There are algorithms to throttle these retries, but data is buffered for quite a long time before TCP declares failure. The initial retransmission timeout is short (on the order of seconds), but after a loss TCP retries, and then retries again after progressively longer intervals. The fact that you unplugged your cable is, from TCP's point of view, the same situation as a packet being lost on the way to the destination. Once you plug the cable back in, a retry succeeds, and TCP starts sending the remaining messages in order.
I know I must have failed to explain it thoroughly. You really need to know the details of TCP to reason about this behavior; it is required for the properties TCP is giving you, and it's not acceptable to expose internal implementation details to the programmer. (How about a send call that sometimes returns within milliseconds and sometimes after 10 seconds? I bet nobody would want this performance bomb in their code. The point of a TCP stack is exactly to hide this ugly nature of networks.) In fact, you need to understand multiple RFCs and the algorithms by which TCP realizes in-order, reliable delivery over a lossy network. Congestion control also comes into play in how long the data stays buffered. Wikipedia is a good starting point, but it's a full semester's undergraduate course if you really want to understand the details.
With a zero flags argument, send() is equivalent to write(2), and it just writes your data to the file descriptor (the data is stored in kernel space for later delivery).
You may have to use other flags; MSG_CONFIRM may help you.
