Network map/mon tool clogs up telnet daemon / login processes - linux

We have an embedded Linux (Kernel 2.6.x / Busybox) system (IP camera/web server) which is being tripped over by a network mapping/monitoring tool (specifically The Dude but I think the problem is a general one) repeatedly probing the Telnet port.
The sequence of events is this:
The tool probes port 23
Our system's Telnet daemon (busybox telnetd) spawns a new /bin/login process
The tool, having satisfied itself there's something there, skips merrily on its way (it neither logs in nor closes the connection)
This keeps happening (every N seconds) until there are so many sockets open that our system can no longer serve a web page for lack of sockets, and there are hundreds of /bin/login processes hanging around.
Apologies for vagueness, full logs & wireshark captures are on a different PC at this moment.
As I see it, we need to do a couple of things:
Put some sort of timeout on the telnetd session / /bin/login process if no login attempt is made
Put some sort of limit on the number of connections telnetd can have open at any time
Kill off hanging / zombie sockets (TCP timeout / keepalive config?)
I'm not 100% clear on the correct approach to these three, given that the device is also serving web pages and streaming video, so changes to system globals may impact the other services. I was a little surprised that Busybox seems to be open to what's effectively the world's slowest DoS attack.
Edit:
I've worked out what I think is a reasonable way around this, and started a new question for ironing out the wrinkles in that idea. Basically, login exits as soon as someone successfully logs in, so we can kill lingering login instances with (relative) impunity whenever a new one is launched.

The tool, having satisfied itself there's something there, skips merrily on its way (it neither logs in nor closes the connection)
That is an issue, and in fact a type of DoS attack.
This keeps happening (every N seconds) until there are so many sockets open that our system can no longer serve a web page through lack of sockets, and there are hundreds of bin/login processes hanging around.
Apart from stopping the DoS attack at its source, you might want to mitigate it using some tools. In this specific case you could configure short TCP keepalive timeouts, so that sockets left idle by the other end are closed soon after they are opened; when a socket closes, its login process should be terminated as well.
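As a sketch of those short keepalive timeouts (the values below are illustrative, and note these knobs are system-wide, so they also affect the web and video services):

```
# /etc/sysctl.conf fragment -- values are illustrative:
net.ipv4.tcp_keepalive_time = 60     # start probing after 60s idle
net.ipv4.tcp_keepalive_intvl = 10    # probe every 10s
net.ipv4.tcp_keepalive_probes = 3    # give up (and close) after 3 failed probes
```

One caveat: keepalive probes only fire on sockets that enable SO_KEEPALIVE. If busybox telnetd doesn't set that option on its listening/accepted sockets, these settings won't help, and a per-session timeout approach is needed instead.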

Well, let me answer my own question by linking to myself answering my own question on killing zombie logins each time a new one is spawned.
The solution is forcing telnetd to run a script instead of /bin/login, where the script kills any other instances of /bin/login before running a new one. Not ideal, but it should solve the problem.
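A minimal sketch of that wrapper, assuming busybox pidof is available on the box (the wrapper path is just an example):

```shell
# Generate the wrapper script; busybox telnetd would then be started with
#   telnetd -l /tmp/login-wrapper
cat > /tmp/login-wrapper <<'EOF'
#!/bin/sh
# Kill any /bin/login instances left behind by abandoned probes,
# then exec a fresh login for this connection.
for pid in $(pidof login); do
    kill "$pid" 2>/dev/null
done
exec /bin/login
EOF
chmod +x /tmp/login-wrapper
```

Since login exits once a user has logged in, the instances this reaps should only be the ones stuck at the login prompt.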

Related

Shortening the time between process crash and shooting the server in the head?

I have a routine that crashes Linux and forces a reboot using a system function.
Now I need to crash Linux when a certain process dies. Wrapping the process in a script that reboots the server when the process exits is not appropriate, since it adds some milliseconds of delay.
Another idea is to spawn the "shooting" process alongside the monitored one and have it poll a counter; if the counter is not incremented, it reboots the server.
This would give an almost instant reaction.
Now the question is what a good timeframe would be. I have no idea what update rate the Linux scheduler can guarantee for such a counter, and therefore what a good timeout would be.
I would also like to hear some alternatives to spawning this second process. Is there a way to tell Linux to run a certain routine when a given process crashes, or a listener mechanism for being notified of problems with a given process?
The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically, unless you're running a kernel like linux-rt, timeouts lower than 100 ms can be dangerous on a system under heavy load, especially if the check needs to poll another application.
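For the user-defined test, the watchdog daemon's configuration might look roughly like this (the health-check script path is an example; directive names per the watchdog man pages):

```
# /etc/watchdog.conf fragment (illustrative)
interval = 1                                  # run checks every second
test-binary = /usr/local/bin/check-daemon.sh  # user-defined health check
test-timeout = 5                              # check must finish within 5s
```

If the test binary fails or exceeds its timeout, the watchdog triggers the reboot for you, so no hand-rolled counter-polling process is needed.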
In cases of application crashes, you can handle them if your init supports notifications. For example both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).
But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system in that time, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds and you probably don't want to miss that step. Instead of just killing the host, you could make sure the affected app isn't working (SIGABRT?) and kill all networking (flush iptables, change default to DROP).
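A minimal sketch of that isolate-instead-of-reboot idea (SIGABRT for the core dump and the iptables default-DROP are per the suggestion above; the paths are examples):

```shell
# Write the isolation handler to a script file.
cat > /tmp/isolate.sh <<'EOF'
#!/bin/sh
APP_PID=$1
kill -ABRT "$APP_PID"   # abort the app so it leaves a core dump
sync                    # flush dirty pages before cutting the network
iptables -F             # flush all rules...
iptables -P INPUT DROP  # ...then default-deny traffic in both directions
iptables -P OUTPUT DROP
EOF
chmod +x /tmp/isolate.sh
```

The host stays up for post-mortem inspection, but from the outside it looks dead, which is usually what "shooting it in the head" was meant to achieve.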

Deal with a large number of outgoing TCP connections in linux

I'm building a service that periodically connects to a large number of devices (thousands) via TCP sockets. These connections are all established at the same time, a couple of api commands are sent through each, then the sockets are closed. All devices are in the same subnet.
The trouble starts after about 1000 devices. New connections are not established. After waiting for a couple of minutes, everything is back to normal. My first guess was that the maximum number of socket connections had been reached and after reading through many similar questions and tutorials, I modified some kernel networking parameters like various cache sizes, max number of open files, tcp_tw_reuse, somaxconn. Unfortunately, it had little to no effect.
The problem does not seem to be burst-related: the first time I run the script, it works fine, but when I start it again a couple of minutes later, I start seeing these errors. My best guess was that the number of open sockets builds up over time, possibly in the TIME_WAIT state. On the other hand, setting the tcp_tw_reuse parameter (which seems perfect for this scenario) did not have any noticeable effect. I close the sockets via Python's socket.close().
It may be important to stress that this question is not about a high-load server, but a high-load client! The connections are outgoing. I saw many server-related questions that were answered with the solutions I described above.
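One quick way to test the TIME_WAIT theory is to count sockets per state between runs. Parsing /proc/net/tcp avoids depending on ss or netstat being installed; a client making thousands of short outgoing connections also competes for the ephemeral port range while old sockets linger (TIME_WAIT holds a port for about 60 seconds by default), which would match the "fine after a couple of minutes" symptom:

```shell
# Column 4 of /proc/net/tcp is the hex connection state
# (01 = ESTABLISHED, 06 = TIME_WAIT, ...); count sockets in each state.
awk 'NR > 1 { count[$4]++ } END { for (s in count) print s, count[s] }' /proc/net/tcp

# The pool of source ports available to outgoing connections:
cat /proc/sys/net/ipv4/ip_local_port_range
```

If the TIME_WAIT count shortly after a run approaches the size of the port range, that's the bottleneck rather than any server-side limit.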

Would it be considered bad practice to restart apache (with graceful) every 1 / 5 / 10 minutes?

I have a server running virtual hosts that get changed quite often. Rather than someone actually going to the server and typing in the apache restart command I was thinking of making a cron (every 1, 5 or 10 minutes, maybe only during working hours, when changes to the virtual hosts are actually made) to restart apache gracefully.
sudo apachectl graceful
I found an explanation here on stackoverflow that goes like this:
Graceful does not wait for active connections to die before doing a "full restart". It is the same as doing a HUP against the master process. Apache keeps children (processes) with active connections alive, whilst bringing up new children with new configuration (or nicely cleared caches) for each new connection. As the old connections die off, those child processes are killed as well to make way for the new ones.
Would this mean that there would be little to no impact on the visitor's experience (long wait times), or should I just stick with manually restarting apache?
Thanks!
Sorry, but I don't consider that a good idea.
If you're planning on restarting Apache every X minutes, even though it may not need it, I see plenty of downside there but no upside.
If you're just checking and restarting when needed, such as with a process running which can detect when a change is needed, that might be okay.
Personally, I wouldn't even do that, since I'd rather keep control over deployment changes. For example, you might want to get a whole lot of stuff installed during the working day, ready for restart, but not actually activate it until a quiet time.
Of course, in a robust environment, you'd be running multiple servers so you could offline them one at a time for changes, without affecting anyone.
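If you do go the check-and-reload route, a minimal sketch of a cron job that reloads only when the vhost configs actually changed (the config directory, stamp path, and use of md5sum are assumptions about your setup):

```shell
# Write a "reload only if the vhost config changed" check.
cat > /tmp/reload-if-changed.sh <<'EOF'
#!/bin/sh
STAMP=/var/run/apache-vhosts.sum
NEW=$(find /etc/apache2/sites-enabled -type f -exec md5sum {} + | sort | md5sum)
OLD=$(cat "$STAMP" 2>/dev/null)
if [ "$NEW" != "$OLD" ]; then
    # only reload on a config that actually parses
    apachectl configtest && apachectl graceful
    echo "$NEW" > "$STAMP"
fi
EOF
chmod +x /tmp/reload-if-changed.sh
```

Run from cron every few minutes, this is a no-op most of the time, and the configtest guard stops a half-edited vhost file from taking the server down.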

How to monitor open file descriptors in Ruby on Rails?

Background:
I had an issue with my Rails server recently where it would stop responding, requiring a bounce to get it back up and running. This issue was due to a controller that does some forking upon receiving a POST, to do some heavy-weight concurrent processing -- server response time kept increasing until the server completely stopped responding. I'm pretty sure I have fixed that issue (DB connections copied upon fork weren't getting closed in child processes), but it would be great to authoritatively test that.
Question:
Is there a way to monitor open file descriptors from inside my Rails app? It's running on Linux, so I've been mucking around with the proc filesystem and the lsof command to observe the open file descriptors; this is messy, because it only gives you a snapshot of the current processes. Ideally I would like to print the open file descriptors in the parent and child processes before, during, and after the processing, to ensure that file descriptors don't stay open past their welcome.
One method to consider (probably the simplest) is using a background worker of some sort, such as with Workling, having it run lsof at intervals and capturing the output:
`lsof | grep something` # shell command example.
Programs like lsof can really hurt performance if run too frequently. Perhaps every 10 to 30 seconds; maybe down to 5 seconds, but that's really pushing it. I'm assuming you have a dedicated server or a beefy virtual machine.
In your background worker, you can store these command results into a variable, or grep it down to what you're really looking for (as demonstrated), and access/manipulate the data as you please.
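A lighter-weight alternative to parsing full lsof output is counting entries in /proc/&lt;pid&gt;/fd, which is cheap enough to log before, during, and after the fork (Linux-specific; `$$` below is the current shell, just as a stand-in for the Rails/child PID):

```shell
# Count open file descriptors of a process by listing /proc/<pid>/fd.
fd_count() { ls "/proc/$1/fd" | wc -l; }

fd_count $$   # substitute the parent or child PID in practice
```

Comparing the parent's count before the fork with the child's count after closing the copied DB connections gives the before/during/after picture the question asks for.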

Debugging utilities for Linux process hang issues?

I have a daemon process that does configuration management; all the other processes interact with this daemon for their functioning. When I execute a large action, after a few hours the daemon process becomes unresponsive for 2 to 3 hours, and after that it works normally again.
How can I find out at what point the Linux process hangs?
strace can show the last system calls and their results
lsof can show open files
the system log can be very effective when log messages are written to track progress. It lets you box the problem into smaller areas. Also correlate log messages with messages from other systems; this often turns up interesting results
wireshark, if the apps use sockets, makes the wire chatter visible
ps ax and top can show whether your app is in a busy loop (i.e. running all the time), sleeping, or blocked on IO, and how much CPU and memory it is consuming
Each of these may give a little bit of information which together build up a picture of the issue.
When using gdb, it might be useful to trigger a core dump when the app is blocked. Then you have a static snapshot which you can analyze using post-mortem debugging at your leisure. You can have these triggered by a script; then you quickly build up a set of snapshots which can be used to test your theories.
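A minimal sketch of such a script (gcore ships with gdb; the output path is an example). Each run leaves a timestamped core without killing the process:

```shell
# Write a snapshot helper that dumps core of a given PID non-fatally.
cat > /tmp/snap.sh <<'EOF'
#!/bin/sh
gcore -o "/tmp/core.$(date +%s)" "$1"
EOF
chmod +x /tmp/snap.sh
# Usage: /tmp/snap.sh <pid of blocked daemon>, e.g. from cron or a loop.
```

Running it a few times while the daemon is unresponsive gives a series of snapshots whose stack traces can be compared to see where it is stuck.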
One option is to use gdb's attach command to attach to a running process. You will need to load a file containing the symbols of the executable in question (using the file command).
There are a number of different ways to do this:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file at a preselected path. An external application can look at the timestamp of the file, and if it is stale, it can assume that the application is dead or deadlocked.
You can use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire.
You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
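The heartbeat-file idea from the list above can be sketched like this (the path and the 10-second staleness threshold are assumptions):

```shell
# The monitored app would touch this file periodically; here we touch it
# ourselves as a stand-in, then check its age the way a watcher would.
HEARTBEAT=/tmp/app.heartbeat
touch "$HEARTBEAT"

now=$(date +%s)
mtime=$(stat -c %Y "$HEARTBEAT")
if [ $(( now - mtime )) -gt 10 ]; then
    echo "stale heartbeat: application presumed hung"
fi
```

The watcher half would run from cron or a loop; on a stale heartbeat it can escalate however you like (restart, reboot, or the network-isolation approach).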
