Sockets & File Descriptor Reuse (or lack thereof) - linux

I am getting the error "Too many open files" after the call to socket() in the server code below. This code is called repeatedly, and the error only occurs just after server_SD gets the value 1022, so I am assuming that I am hitting the limit of 1024 set by "ulimit -n". What I don't understand is that I am closing the socket, which should make the fd reusable, but this seems not to be happening.
Notes: I am using Linux, and yes, the client is closed as well. I am not a root user, so raising the limits is not an option. I should have a maximum of 20 (or so) sockets open at any one time. Over the lifetime of my program I would expect to open and close close to 1,000,000 sockets (hence the strong need for reuse).
server_SD = socket (AF_INET, SOCK_STREAM, 0);
bind (server_SD, (struct sockaddr *) &server_address, server_len);
listen (server_SD, 1);
client_SD = accept (server_SD, (struct sockaddr *) &client_address, &client_len);
// read, write etc...
shutdown (server_SD, 2);
close (server_SD);
Does anyone know how to guarantee closure & re-usability ?
Thanks.

Run your program under valgrind with the --track-fds=yes option:
valgrind --track-fds=yes myserver
You may also need --trace-children=yes if your program uses a wrapper or it puts itself in the background.
If it doesn't exit on its own, interrupt it or kill the process with "kill pid" (not -9) after it accumulates some leaked file descriptors. On exit, valgrind will show the file descriptors that are still open and the stack trace corresponding to where they were created.
Running your program under strace to log all system calls may also be helpful. Another helpful command is /usr/sbin/lsof -p pid to display all currently used file descriptors and what they are being used for.

From your description it looks like you are opening a server socket for each accept(2). That is not necessary. Create the server socket once, bind(2) it, listen(2), then call accept(2) on it in a loop (or better yet, give it to poll(2)).
Edit 0:
By the way, shutdown(2) on a listening socket is totally meaningless; it's intended for connected sockets only.
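A minimal sketch of that structure, in case it helps. The function names (make_listener, serve_one) are made up for illustration, and binding to the loopback address is just an assumption to keep the example self-contained. The point is that the listening socket is created once, and only the descriptor returned by accept() is shut down and closed per connection.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create, bind and listen ONCE. Returns the listening fd, or -1 on error.
 * Port 0 asks the kernel for any free port (handy for experiments). */
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* assumption: local only */
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Accept one client, serve it, and close the CLIENT descriptor.
 * Returns the descriptor number that was used (already closed), or -1. */
int serve_one(int listen_fd)
{
    int client_fd = accept(listen_fd, NULL, NULL);
    if (client_fd < 0)
        return -1;

    /* read(), write() etc. would go here */

    shutdown(client_fd, SHUT_RDWR); /* shutdown applies to connected sockets */
    close(client_fd);               /* frees the descriptor number for reuse */
    return client_fd;
}
```

A server would call make_listener() once at startup and then serve_one() in a loop. Because every accepted descriptor is closed, the same small descriptor numbers should keep coming back instead of creeping towards the ulimit.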

Perhaps your problem is that you're not specifying the SO_REUSEADDR flag?
From the socket manpage:
SO_REUSEADDR
Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses. For PF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address. When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address.

Are you using fork()? If so, your children may be inheriting the open file descriptors.
If this is the case, you should have the child close any fds that don't belong to it.

This looks like you might have a "TIME_WAIT" problem. IIRC, TIME_WAIT is one of the states a TCP socket can be in. It's entered when both sides have closed the connection, but the system keeps the socket around for a while so that delayed packets are not accepted as proper payload for subsequent connections.
You should maybe have a look at this (bottom of page 99 and top of 100). And maybe that other question.

One needs to close the client before closing the server (reverse order to my code above!)
Thanks all who offered suggestions !

Related

How to merge three TCP streams in realtime

I have three bits of networked realtime data logging equipment that output lines of ASCII text via TCP sockets. They essentially just broadcast the data that they are logging - there are no requests for data from other machines on the network. Each piece of equipment is at a different location on my network and each has a unique IP address.
I'd like to combine these three streams into one so that I can log it to a file for replay or forward it onto another device to view in realtime.
At the moment I have a PHP script looping over each IP/port combination, listening for up to 64 KB of data. As soon as the data is received or it gets an EOL, it forwards that on to whatever is listening to the combined stream.
This works reasonably well, but one of the data loggers outputs far more data than the others and tends to swamp the other machines, so I'm pretty sure that I'm missing data, presumably because the script is not listening in parallel.
I've also tried three separate PHP processes writing to a shared file in memory (on /dev/shm) which is read and written out by a fourth process. Using file locking this seems to work but introduces a delay of a few seconds which I'd rather avoid.
I did find a PHP library that allows true multithreading using Pthreads called (I think) Amp but I'm still not sure how to combine the output. A file in RAM doesn't seem quick enough.
I've had a good look around on Google and can't see an obvious solution. There certainly doesn't seem to be a way to do this on Linux using command line tools that I've found unless I've missed something obvious.
I'm not too familiar with other languages but are there other languages that might be better suited to this problem ?
Based on the suggested solution below I've got the following code almost working however I get an error 'socket_read(): unable to read from socket [107]: Transport endpoint is not connected'. This is odd as I've set the socket to accept connections and made it non-blocking. What am I doing wrong ?:
// Script to mix inputs from multiple sockets
// Run forever
set_time_limit (0);
// Define address and ports that we will listen on
$localAddress='';
// Define inbound ports
$inPort1=36000;
$inPort2=36001;
// Create sockets for inbound data
$inSocket1=createSocket($localAddress, $inPort1);
$inSocket2=createSocket($localAddress, $inPort2);
// Define buffer of data to read and write
$buffer="";
// Repeat forever
while (true) {
    // Build array streams to monitor
    $readSockets=array($inSocket1, $inSocket2);
    $writeSockets=NULL;
    $exceptions=NULL;
    $t=NULL;
    // Count number of stream that have been modified
    $modifiedCount=socket_select($readSockets, $writeSockets, $exceptions, $t);
    if ($modifiedCount>0) {
        // Process inbound arrays first
        foreach ($readSockets as $socket) {
            // Get up to 64 Kb from this socket
            $buffer.=socket_read($socket, 65536, PHP_BINARY_READ);
        }
        // Process outbound socket array
        foreach ($writeSockets as $socket) {
            // Get up to 64 Kb from this socket and add it to any other data that we need to write out
            //socket_write($socket, $buffer, strlen($buffer));
            echo $buffer;
        }
        // Reset buffer
        $buffer="";
    } else {
        echo ("Nothing to read\r\n");
    }
}
function createSocket($address, $port) {
    // Function to create and listen on a socket
    // Create socket
    $socket=socket_create(AF_INET, SOCK_STREAM, 0);
    echo ("SOCKET_CREATE: " . socket_strerror(socket_last_error($socket)) . "\r\n");
    // Allow the socket to be reused otherwise we'll get errors
    socket_set_option($socket, SOL_SOCKET, SO_REUSEADDR, 1);
    echo ("SOCKET_OPTION: " . socket_strerror(socket_last_error($socket)) . "\r\n");
    // Bind it to the address and port that we will listen on
    $bind=socket_bind($socket, $address, $port);
    echo ("SOCKET_BIND: " . socket_strerror(socket_last_error($socket)) . " $address:$port\r\n");
    // Tell socket to listen for connections
    socket_listen($socket);
    echo ("SOCKET_LISTEN: " . socket_strerror(socket_last_error($socket)) . "\r\n");
    // Make this socket non-blocking
    socket_set_nonblock($socket);
    // Accept inbound connections on this socket
    socket_accept($socket);
    return $socket;
}
You don't necessarily need to switch languages; it just sounds like you're not familiar with the concept of IO multiplexing. Check out some documentation for the PHP select call (socket_select) here
The concept of listening to multiple data inputs and not knowing which one some data will come from next is a common one and has standard solutions. There are variations on exactly how it's implemented but the basic idea is the same: you tell the system that you're interested in receiving data from multiple sources simultaneously (TCP sockets in your case), and run a loop waiting for this data. On every iteration of the loop the system tells you which source is ready for reading. In your case that means you can piecemeal-read from all 3 of your sources without waiting for an individual one to reach 64KB before moving on to the next.
This can be done in lots of languages, including PHP.
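To illustrate the idea outside PHP, here is a rough C sketch of one iteration of such a loop using poll(2). The function name merge_once and the 16-descriptor cap are arbitrary choices for the example, not part of any library. It waits for any of the given descriptors to become readable and appends whatever each ready one currently has to a single output buffer.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until at least one of the nfds descriptors is readable (or the
 * timeout expires), then append whatever each ready descriptor offers to
 * out[]. Returns the number of bytes appended, or -1 on failure.
 * Handles at most 16 descriptors (arbitrary limit for this sketch). */
ssize_t merge_once(const int *fds, size_t nfds, char *out, size_t cap,
                   int timeout_ms)
{
    struct pollfd pfds[16];
    if (nfds > 16)
        return -1;

    for (size_t i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, nfds, timeout_ms) < 0)
        return -1;

    size_t used = 0;
    for (size_t i = 0; i < nfds && used < cap; i++) {
        if (!(pfds[i].revents & (POLLIN | POLLHUP)))
            continue;           /* nothing from this source right now */
        ssize_t n = read(pfds[i].fd, out + used, cap - used);
        if (n > 0)
            used += (size_t)n;  /* n == 0 means that source was closed */
    }
    return (ssize_t)used;
}
```

Calling this in a loop and writing out[] to the log file or the forwarding socket after each call gives every source a turn on each pass, so a chatty logger cannot starve the quiet ones.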
UPDATE: Looking at the code you posted in your update, the issue that remains is that you're trying to read from the wrong thing, namely from the listening socket rather than the connection socket. You are ignoring the return value of socket_accept in your createSocket function which is wrong.
Remove these lines from createSocket:
// Accept inbound connections on this socket
socket_accept($socket);
Change your global socket creation code to:
// Create sockets for inbound data
$listenSocket1=createSocket($localAddress, $inPort1);
$listenSocket2=createSocket($localAddress, $inPort2);
$inSocket1=socket_accept($listenSocket1);
$inSocket2=socket_accept($listenSocket2);
Then your code should work.
Explanation: when you create a socket for binding and listening, its sole function then becomes to accept incoming connections and it cannot be read from or written to. When you accept a connection a new socket is created, and this is the socket that represents the connection and can be read/written. The listening socket in the meantime continues listening and can potentially accept other connections (this is why a single server running on one http port can accept multiple client connections).

Why are hanging SSH commands waiting for output from a pipe with both ends open in 'sshd' on the server?

This is on StackOverflow as opposed to SuperUser/ServerFault since it has to do with the syscalls and OS interactions being performed by sshd, not the problem I'm having using SSH (though assistance with that is appreciated, too :p).
Context:
I invoke a complex series of scripts via SSH, e.g. ssh user@host -- /my/command. The remote command does a lot of complex forking and execcing and eventually results in a backgrounded daemon process running on the remote host. Occasionally (I'm slowly going mad trying to find out reliable reproduction conditions), the ssh command will never return control to the client shell. In those situations, I can go onto the target host and see an sshd: user@notty process with no children hanging indefinitely.
Fixing that issue is not what this question is about. This question is about what that sshd process is doing.
The SSH implementation is OpenSSH, and the version is 5.3p1-112.el6_7.
The problem:
If I find one of those stuck sshds and strace it, I can see it's doing a select on two handles, e.g. select(12, [3 6], [], NULL, NULL) or similar. lsof tells me that one of those handles is the TCP socket connecting back to the SSH client. The other is a pipe, the other end of which is only open in the same sshd process. If I search for that pipe by ID using the answer to this SuperUser question, the only process that contains references to that pipe is the same process. lsof confirms this: both the read and write ends of the pipe are open in the same process, e.g. (for pipe 788422703 and sshd PID 22744):
sshd 22744 user 6r FIFO 0,8 0t0 788422703 pipe
sshd 22744 user 7w FIFO 0,8 0t0 788422703 pipe
Questions:
What is SSH waiting for? If the pipe isn't connected to anything and there are no child processes, I can't imagine what event it could be expecting.
What is that "looped" pipe/what does it represent? My only theory is that maybe if STDIN isn't supplied to the SSH client, the target host sshd opens a dummy STDIN pipe so some of its internal child-management code can be more uniform? But that seems pretty tenuous.
How does SSH get into this situation?
What I've Tried/Additional Info:
Initially, I thought this was a handle leak to a daemon. It's possible to create a waiting, child-less sshd process by issuing a command that backgrounds itself, e.g. ssh user@host -- 'sleep 60 &'; sshd will wait for the streams to be closed to the daemonized process; not just the exit of its immediate child. Since the scripts in question eventually result (way down the process tree) in a daemon being started, it initially seemed possible that the daemon was holding onto a handle. However, that doesn't seem to hold up--using the sleep 60 & command as an example, sshd processes communicating with daemons hold and select on four open pipes, not just two, and at least two of the pipes are connected from sshd to the daemon process, not looped. Unless there's a method of tracking/pointing to a pipe I don't know about (and there likely is--for example, I have no idea how duped filehandles play into close() semaphore waiting or piping), I don't think the pipe-to-self situation represents a waiting-on-daemon case.
sshd periodically receives communication on the TCP socket/ssh connection itself, which wakes it up out of the selects for a brief period of communication (during which strace shows it blocking SIGCHLD), and then it goes back to waiting on the same FDs.
It's possible that I'm being affected by this race condition (SIGCHLD getting delivered before the kernel makes data available in the pipe). However, that seems unlikely, both given the rate at which this condition manifests, and the fact that the processes being run on the target host are Perl scripts, and the Perl runtime closes and flushes open file descriptors on shutdown.
It seems that you're describing the notify pipe. The OpenSSH sshd main loop calls select() to wait until it has something to do. The file descriptors being polled include the TCP connection to the client and any descriptors used to service active channels.
sshd wants to be able to interrupt the select() call when a SIGCHLD signal is received. To do that, sshd installs a signal handler for SIGCHLD and it creates a pipe. When a SIGCHLD signal is received, the signal handler writes a byte into the pipe. The read end of the pipe is included in the list of file descriptors polled by select(). The act of writing to the pipe would cause the select() call to return with an indication that the notify pipe is readable.
All of the code is in serverloop.c:
/*
 * we write to this pipe if a SIGCHLD is caught in order to avoid
 * the race between select() and child_terminated
 */
static int notify_pipe[2];
static void
notify_setup(void)
{
    if (pipe(notify_pipe) < 0) {
        error("pipe(notify_pipe) failed %s", strerror(errno));
    } else if ((fcntl(notify_pipe[0], F_SETFD, 1) == -1) ||
        (fcntl(notify_pipe[1], F_SETFD, 1) == -1)) {
        error("fcntl(notify_pipe, F_SETFD) failed %s", strerror(errno));
        close(notify_pipe[0]);
        close(notify_pipe[1]);
    } else {
        set_nonblock(notify_pipe[0]);
        set_nonblock(notify_pipe[1]);
        return;
    }
    notify_pipe[0] = -1;    /* read end */
    notify_pipe[1] = -1;    /* write end */
}
static void
notify_parent(void)
{
    if (notify_pipe[1] != -1)
        write(notify_pipe[1], "", 1);
}
[...]
/*ARGSUSED*/
static void
sigchld_handler(int sig)
{
    int save_errno = errno;
    child_terminated = 1;
#ifndef _UNICOS
    mysignal(SIGCHLD, sigchld_handler);
#endif
    notify_parent();
    errno = save_errno;
}
The code to set up and perform the select call is in another function called wait_until_can_do_something(). It's fairly long so I won't include it here. OpenSSH is open source, and this page describes how to download the source code.

what will happen if we close a closed socket

I wonder what will happen if we close a closed socket or a non-existing socket?
Will the exception affect the other sockets which are sending/receiving packets?
Edit:
Sorry, I didn't say it clearly. I mean I know what the close or shutdown function will return and what the return value means, but I don't know how it affects the existing sockets.
Potentially, yes. If you call close on a random integer which used to be an fd, you might break some other part of your code that's just opened another connection that got given the same fd number. Therefore, you should never double-close an fd: although it's perfectly safe from the kernel's point of view (you harmlessly get EBADF), it can seriously mess up your application.
As for close() itself: per http://pubs.opengroup.org/onlinepubs/000095399/functions/close.html
it will return -1 and set errno to EBADF ("The fildes argument is not a valid file descriptor.").
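Both outcomes can be seen in a small C sketch. The helper names below (close_once, stale_close_hits_new_fd) are made up for illustration. The first is one common way to make a double close harmless by invalidating the variable. The second reproduces the dangerous case, where the number has been handed out again before the stale close happens.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close *fd once and mark it invalid, so a second call is a no-op.
 * Returns 0 on success or if already closed, -1 with errno set otherwise. */
int close_once(int *fd)
{
    if (*fd < 0)
        return 0;
    int rc = close(*fd);
    *fd = -1;   /* never retry: the number may soon belong to someone else */
    return rc;
}

/* Demonstrates the hazard. Returns 1 if a stale second close() destroyed an
 * unrelated descriptor that had been given the same number, 0 if not,
 * -1 if the demonstration could not be set up. */
int stale_close_hits_new_fd(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int stale = p[0];
    close(stale);               /* first close: legitimate */

    int fresh = dup(p[1]);      /* kernel hands out the lowest free number */
    if (fresh < 0) {
        close(p[1]);
        return -1;
    }

    close(stale);               /* buggy second close */

    /* F_GETFD fails with EBADF if 'fresh' is no longer an open descriptor */
    int destroyed = (fcntl(fresh, F_GETFD) == -1 && errno == EBADF);
    if (!destroyed)
        close(fresh);
    close(p[1]);
    return destroyed;
}
```

In a single-threaded program the second function normally returns 1, because dup() reuses the number that was just freed. In a multi-threaded server the same thing happens by accident whenever another thread opens a connection between the two close() calls.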

Where does Linux kernel do process and TCP connections cleanup after process dies?

I am trying to find place in the linux kernel where it does cleanup after process dies. Specifically, I want to see if/how it handles open TCP connections after process is killed with -9 signal. I am pretty sure it closes all connections, but I want to see details, and if there is any chance that connections are not closed properly.
Pointers to linux kernel sources are welcome.
The meat of process termination is handled by exit.c:do_exit(). This function calls exit_files(), which in turn calls put_files_struct(), which calls close_files().
close_files() loops over all file descriptors the process has open (which includes all sockets), calling filp_close() on each one, which calls fput() on the struct file object. When the last reference to the struct file has been put, fput() calls the file object's .release() method, which for sockets, is the sock_close() function in net/socket.c.
I'm pretty sure the socket cleanup is more of a side effect of releasing all the file descriptors after the process dies, and not directly done by the process cleanup.
I'm going to go out on a limb though, and assume you're hitting a common pitfall with network programming. If I am correct in guessing that your problem is that you get an "Address in use" error (EADDRINUSE) when trying to bind to an address after a process is killed, then you are running into the socket's TIME_WAIT.
If this is the case, you can either wait for the timeout, usually 60 seconds, or you can modify the socket to allow immediate reuse like so:
int sock, ret, on;
struct sockaddr_in servaddr;
sock = socket( AF_INET, SOCK_STREAM, 0 );
/* Enable address reuse (must be set before calling bind()) */
on = 1;
ret = setsockopt( sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on) );
[EDIT]
From your comments, it sounds like you are having issues with half-open connections, and don't fully understand how TCP works. TCP has no way of knowing if a client is dead, or just idle. If you kill -9 a client process, the four-way closing handshake never completes. This shouldn't be leaving open connections on your server though, so you still may need to get a network dump to be sure of what's going on.
I can't say for sure how you should handle this without knowing exactly what you are doing, but you can read about TCP Keepalive here. A couple other options are sending empty or null messages periodically to the client (may require modifying your protocol), or setting hard timers on idle connections (may result in dropped valid connections).
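As a rough illustration of the keepalive option on Linux, something along these lines should work. The function name enable_keepalive is invented for the example and the timing values are up to the caller. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific socket options.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket (Linux).
 *   idle_s     seconds of silence before the first probe is sent
 *   interval_s seconds between unanswered probes
 *   count      unanswered probes before the connection is dropped
 * Returns 0 on success, -1 with errno set on failure. */
int enable_keepalive(int fd, int idle_s, int interval_s, int count)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_s, sizeof(interval_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

With, say, enable_keepalive(fd, 60, 10, 5), a peer that has silently vanished would be detected after roughly 60 + 5 * 10 seconds of silence, at which point a blocked read on the socket fails instead of hanging forever.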

Question about epoll and splice

My application is going to send huge amount of data over network, so I decided (because I'm using Linux) to use epoll and splice. Here's how I see it (pseudocode):
epoll_ctl (file_fd, EPOLL_CTL_ADD); // waiting for EPOLLIN event
while(1)
{
    epoll_wait (tmp_structure);
    if (tmp_structure->fd == file_fd)
    {
        epoll_ctl (file_fd, EPOLL_CTL_DEL);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_ADD); // wait for EPOLLOUT event
    }
    if (tmp_structure->fd == tcp_socket_fd)
    {
        splice (file_fd, tcp_socket_fd);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_DEL);
        epoll_ctl (file_fd, EPOLL_CTL_ADD); // waiting for EPOLLIN event
    }
}
I assume that my application will open up to 2000 TCP sockets. I want to ask you about two things:
There will be quite a lot of epoll_ctl calls; won't it be slow when I have so many sockets?
File descriptor has to become readable first and there will be some interval before socket will become writable. Can I be sure, that at the moment when socket becomes writable file descriptor is still readable (to avoid blocking call)?
1st question
You can use edge-triggered rather than level-triggered polling, so you do not have to delete the socket each time.
You can also use EPOLLONESHOT to avoid removing the socket (it is then re-armed with EPOLL_CTL_MOD).
File descriptor has to become readable first and there will be some interval before socket will become writable.
What kind of file descriptor? If it is a file on a file system, you can't use select/poll or similar tools for this purpose: the file will always be reported as readable or writable regardless of the state of the disk and cache (and epoll_ctl() rejects regular files with EPERM). If you need to do things asynchronously you may use the aio_* API, but generally just read from or write to the file and assume it is non-blocking.
If it is a TCP socket then it will be writable most of the time. It is better to use non-blocking calls and put the socket into epoll only when you get EWOULDBLOCK.
Consider using EPOLLET flag. This is definitely for that case. When using this flag you can use event loop in a proper way without deregistering (or modifying mode on) file descriptors since first registration in epoll. :) enjoy!

Resources