My system includes a task that opens a network socket, receives pushed data from the network, processes it, and either writes it out to disk or pings other machines, depending on the messages. The task is intended to run forever and the service is designed around it always being up. But sometimes it crashes.
What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.
Some obvious ideas include having a watchdog process that checks to make sure the process is still running. The watchdog could be triggered by cron. But how does it know whether the process is alive? Write a pidfile? Touch a heartbeat file? An ideal solution wouldn't keep spinning up more processes if the machine gets so bogged down that the watchdog runs faster than the heartbeat.
Are there standard linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure if that's a good idea or not.
Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper that starts your task in a fork().
The wrapper can then do a waitpid() on the child and restart it when it terminates.
Whether this requires touching the task's own source depends on how it is launched: if the wrapper simply fork()s and exec()s the existing binary, the task itself does not need to change.
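As a rough illustration only, here is a minimal wrapper sketch in Python, using os.fork and os.waitpid as described (plus os.execv to launch the existing binary); the binary path is a placeholder for whatever your task actually is:

    import os
    import time

    TASK = ["/bin/my-network-task"]   # placeholder: path to the real task binary

    while True:
        pid = os.fork()
        if pid == 0:
            try:
                os.execv(TASK[0], TASK)   # child: replace ourselves with the task
            finally:
                os._exit(127)             # reached only if execv fails
        # Parent: block until the child terminates, then restart it.
        _, status = os.waitpid(pid, 0)
        print("task exited with status", status, "- restarting")
        time.sleep(1)                     # avoid a tight respawn loop if it crashes instantly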
sysvinit will restart processes that die if they are added to /etc/inittab with the respawn action.
If you're worried about the process hanging without actually crashing, you can use a heartbeat and hard-kill the hung instance, letting init restart it.
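A small sketch of that heartbeat idea (the file paths and the 30-second threshold are assumptions, not part of the question's setup): the task touches a heartbeat file on every loop iteration, e.g. with os.utime(HEARTBEAT, None), and a cron-driven checker hard-kills it when the file goes stale so that init can respawn it.

    import os
    import signal
    import time

    HEARTBEAT = "/var/run/my-network-task.heartbeat"  # assumed path, touched by the task itself
    PIDFILE = "/var/run/my-network-task.pid"          # assumed path, written by the task on startup
    MAX_AGE = 30                                      # seconds of silence we tolerate

    def check():
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
            pid = int(open(PIDFILE).read().strip())
        except (OSError, ValueError):
            return  # files missing or unreadable: nothing sensible to do here
        if age > MAX_AGE:
            os.kill(pid, signal.SIGKILL)  # hard-kill the hung instance; init restarts it

    if __name__ == "__main__":
        check()

Note that cron only fires once per minute, so a purely cron-driven checker cannot quite meet the 30-second budget mentioned in the question.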
You could use monit along with daemonize. There are lots of tools for this in the *nix world.
Supervisor was designed precisely for this task. From the project website:
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.
The number of options is quite extensive; have a look at the docs for a complete list. In your case, the relevant configuration section might be something like this:
[program:my-network-task]
; where your binary lives
command=/bin/my-network-task
; start when supervisord starts?
autostart=true
; restart automatically when it stops?
autorestart=true
; consider the start successful after how many seconds?
startsecs=10
; try starting how many times?
startretries=3
I have used Supervisor myself and it worked really well once everything was set up. It does require Python, which should not be a big deal in most environments, but might be in yours.
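Once that program section is in place, supervisorctl reread followed by supervisorctl update loads it, and supervisorctl status my-network-task or supervisorctl restart my-network-task lets you check on the process or bounce it by hand.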
Related
I have a node.js script that runs once a day on an Ubuntu EC2 instance. The script pulls data from a few hundred thousand remote APIs and saves it to our local database. Is there any way we can monitor this node.js script on the remote server? There have been a few instances where the script crashed for some reason and we were unable to figure out why without SSHing into the instance and checking the logs. After the first few crashes I did build a small system that sends us an email whenever the script crashes due to an uncaught exception, and also when it completes execution.
However, we need a better system where we can monitor the progress of the script via the web interface of our admin application, which is deployed on another instance, and also trigger start/stop of the script via this interface. What are the possible options for achieving this?
If you'd like to stay in Node.js, there are several process monitoring tools:
PM2 comes with lots of other features besides monitoring processes. You can monitor your processes via the CLI or the official web interface: https://keymetrics.io/. A quick search on npm also turns up a bunch of nice unofficial GUI tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature-rich as PM2, but it handles the basic process operations, and a couple of GUIs are also available on npm.
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
However, when things don't go smoothly, it helps to have much more granularity over what you are scheduling and how it is executed. It also gives you visibility into each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script pushes the individual jobs that need to be executed onto a queue
One or more worker processes consume that queue and execute the pending jobs
You can monitor both the progress of the jobs, as well as the state of each worker (# of crashes, failures, jobs completed, etc.). The other tools mentioned above are good candidates for this (forever, pm2, etc.)
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent about different types of failures. Network error? Retry 5 times. Rate limited? Back off gradually. Crash? Don't retry, and notify via email. And so on (see the sketch below).
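A hedged sketch of that last point, with purely illustrative exception names, retry limits and notification hook (your queue library and error types will differ):

    import time

    class NetworkError(Exception): pass      # illustrative: raised by your HTTP layer
    class RateLimitedError(Exception): pass  # illustrative: raised when an API throttles you

    def notify_by_email(exc):
        print("would email ops about:", exc)  # placeholder for real alerting

    def run_with_policy(job, max_retries=5):
        """Run one job, applying a different policy per failure type."""
        for attempt in range(max_retries):
            try:
                return job()
            except NetworkError:
                time.sleep(1)                 # transient: retry a few times, cheaply
            except RateLimitedError:
                time.sleep(2 ** attempt)      # rate limited: gradual (exponential) back-off
            except Exception as exc:
                notify_by_email(exc)          # crash: don't retry, tell a human
                raise
        raise RuntimeError("job failed after %d retries" % max_retries)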
I have tried this with pm2: you can get info about the task and then cat or tail the log files. Or you could use a logging server; see also: https://github.com/papertrail/remote_syslog2
I have a routine that crashes Linux and forces a reboot using a system function.
Now I need to trigger that crash/reboot when a certain process dies. Wrapping the process in a script and rebooting the server when the script exits is not appropriate, since that adds some milliseconds of delay.
Another idea is to spawn the "shooting" process alongside it and have it poll a counter; if the counter stops being incremented, it reboots the server.
This would give an almost instant reaction.
The question is what a good time frame would be. I have no idea what update rate the Linux scheduler can guarantee for such a counter, or what a sensible timeout would be.
I would also like to hear alternatives to spawning this second process. Is there a way to tell Linux to run a certain routine when the given process crashes, or a listener mechanism for the event of problems with a given process?
The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically, unless you're running a kernel like linux-rt, having timeouts lower than 100 ms can be dangerous on a system under heavy load, especially if the check needs to poll another application.
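For example, the watchdog daemon can run a user-defined test (see the test-binary option in watchdog.conf for the exact wiring); a non-zero exit code makes it take its repair/reboot action. A minimal test sketch, with an assumed heartbeat path and threshold:

    #!/usr/bin/env python3
    # Assumed to be registered as a user-defined test for the watchdog daemon;
    # exiting non-zero tells it the monitored process is in trouble.
    import os
    import sys
    import time

    HEARTBEAT = "/var/run/critical-process.heartbeat"  # assumed path, touched by the process
    MAX_AGE = 1.0                                      # assumed threshold in seconds

    try:
        age = time.time() - os.path.getmtime(HEARTBEAT)
    except OSError:
        sys.exit(1)   # heartbeat file missing: treat as dead

    sys.exit(0 if age <= MAX_AGE else 1)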
In the case of application crashes, you can handle them if your init supports notifications. For example, both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).
But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system that quickly, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds, and you probably don't want to skip that step. Instead of killing the host outright, you could make sure the affected app isn't working (SIGABRT?) and cut all networking (flush iptables, change the default policy to DROP).
We have a custom setup with several daemons (web apps + background tasks) running. I am looking for a service that helps us monitor those daemons and restart them if their resource consumption exceeds a certain level.
I would appreciate any insight on when one is better than the other. As I understand it, monit spins up a new process while supervisord starts a subprocess. What are the pros and cons of each approach?
I will also be using upstart to monitor monit or supervisord itself. The webapp deployment will be done using capistrano.
I haven't used monit but there are some significant flaws with supervisord.
Programs should run in the foreground
This means you can't just execute /etc/init.d/apache2 start. Most of the time you can just write a one-liner, e.g. "source /etc/apache2/envvars && exec /usr/sbin/apache2 -DFOREGROUND", but sometimes you need your own wrapper script. The problem with wrapper scripts is that you end up with two processes, a parent and a child. See the next flaw...
supervisord does not manage child processes
If your program starts child processes, supervisord won't detect them. If the parent process dies (or is restarted using supervisorctl), the child processes keep running; they are "adopted" by the init process and stay running. This might prevent future invocations of your program from running or consume additional resources. The recent config options stopasgroup and killasgroup are supposed to fix this, but didn't work for me.
supervisord has no dependency management - see #122
I recently set up squid with qlproxy. qlproxyd needs to start first, otherwise squid can fail. Even though both programs were managed by supervisord, there was no way to ensure this. I had to write a start script for squid that made it wait for the qlproxyd process, and adding that start script brought back the orphaned-process problem described in flaw 2.
supervisord doesn't allow you to control the delay between startretries
Sometimes when a process fails to start (or crashes), it's because it can't get access to another resource, possibly due to a network wobble. Supervisor can be set to restart the process a number of times. Between restarts the process will enter a "BACKOFF" state but there's no documentation or control over the duration of the backoff.
In its defence, supervisord does meet our needs 80% of the time. The configuration is sensible and the documentation is pretty good.
If you additionally want to monitor resources, you should go with monit. Besides just checking whether a process is running (availability), monit can also perform checks of resource usage (performance, capacity usage), load levels, and even basic security checks (md5sum of a binary file, config file, etc.). It has a rule-based config which is quite easy to understand, and there are a lot of ready-to-use configs: http://mmonit.com/wiki/Monit/ConfigurationExamples
Monit requires processes to create PID files, which can be a drawback: if a process does not create a PID file, you have to write a wrapper around it. See http://mmonit.com/wiki/Monit/FAQ#pidfile
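One cheap form such a wrapper can take, sketched here in Python with assumed paths: because exec() keeps the same PID, the pidfile written just before the exec stays valid for monit to check, as long as the wrapped program itself stays in the foreground and does not daemonize.

    import os

    PIDFILE = "/var/run/my-service.pid"      # assumed path; must match the monit check
    PROGRAM = ["/usr/local/bin/my-service"]  # assumed path to the program that lacks a pidfile

    with open(PIDFILE, "w") as f:
        f.write(str(os.getpid()))            # this PID survives the exec below

    os.execv(PROGRAM[0], PROGRAM)            # replace the wrapper with the real program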
Supervisord, on the other hand, is more tightly bound to the process, since it spawns it itself. It cannot do resource-based checks the way monit can. It does have a nice CLI (supervisorctl) and a web GUI, though.
I have a dashboard and I want a process to run when the user clicks on a button. That process might take a long time to complete.
My options so far:
using popen or something similar to execute the process
having a daemon monitor a directory. When this directory is changed (a file created) the daemon will do the job and then delete the file before idling again.
using cron, running every 5 seconds and also monitoring some directory.
Which one is more Linux-friendly? Are there any options I have not considered?
This is what task queueing systems like Celery and Redis Queue are for.
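For instance, with Redis Queue the button handler only enqueues the work and returns immediately, while a separately started worker process (rq worker) executes it; the function and argument here are placeholders:

    # tasks.py -- the work itself, in an importable module so the rq worker can find it
    def long_running_job(report_id):
        print("generating report", report_id)   # placeholder for the real work

    # dashboard side -- runs inside the web request; requires the redis and rq packages
    from redis import Redis
    from rq import Queue
    from tasks import long_running_job

    q = Queue(connection=Redis())
    job = q.enqueue(long_running_job, 42)        # returns at once; the worker picks it up
    print("queued as", job.id)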
Another option is to have a daemon (as in your 2nd option) that listens on a socket. Your WSGI application can then just connect and send a command. There are many possibilities for how the communication over the socket could work; choosing the right one depends a lot on the actual use case.
This has the advantage that you can eventually run the two applications (the WSGI app and the daemon) on different machines or VMs at some point.
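A minimal sketch of that daemon-plus-socket idea (the socket path and the one-line text protocol are made up for illustration):

    # daemon side: accepts one-line commands on a Unix domain socket
    import os
    import socket

    SOCK_PATH = "/tmp/jobd.sock"   # assumed path

    def serve():
        if os.path.exists(SOCK_PATH):
            os.unlink(SOCK_PATH)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(SOCK_PATH)
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                command = conn.recv(1024).decode().strip()
                print("starting long job for:", command)   # kick off the real work here
                conn.sendall(b"started\n")

    # WSGI/dashboard side: called from the button handler
    def trigger(command="run-report"):
        cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        cli.connect(SOCK_PATH)
        cli.sendall(command.encode() + b"\n")
        return cli.recv(1024).decode()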
I have a daemon process which does configuration management; all the other processes interact with this daemon to do their work. But when I execute a large action, after a few hours the daemon process becomes unresponsive for 2 to 3 hours, and after that it works normally again.
What debugging utilities are there for Linux process hang issues?
How can I find out at what point the Linux process hangs?
strace can show the last system calls and their result
lsof can show open files
the system log can be very effective when log messages are written to track progress. It lets you box the problem into smaller areas. Also correlate your log messages with messages from other systems; this often turns up interesting results
wireshark, if the apps use sockets, to make the wire chatter visible.
ps ax + top can show whether your app is in a busy loop (running all the time), sleeping, or blocked on IO, and how much CPU and memory it is consuming.
Each of these may give a little bit of information which together build up a picture of the issue.
When using gdb, it might be useful to trigger a core dump while the app is blocked. Then you have a static snapshot which you can analyze with post-mortem debugging at your leisure. You can have these triggered by a script, so you quickly build up a set of snapshots that can be used to test your theories.
One option is to use gdb and its attach command to attach to the running process. You will need to load a file containing the symbols of the executable in question (using the file command).
There are a number of different ways to do this:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look at the timestamp of the file, and if it is stale, it can assume that the application is dead or deadlocked.
You can use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire (see the sketch below).
You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
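Relating to the alarm() suggestion above, a minimal self-watchdog sketch (Python's signal module wraps the same mechanism; the 5-second timeout and the abort-on-timeout choice are arbitrary):

    import os
    import signal
    import time

    TIMEOUT = 5   # seconds without re-arming before the process kills itself (arbitrary)

    def watchdog_fired(signum, frame):
        os.abort()   # die hard so init/a supervisor notices and restarts us

    signal.signal(signal.SIGALRM, watchdog_fired)

    while True:
        signal.alarm(TIMEOUT)   # re-arm; only fires if we stop reaching this line
        time.sleep(1)           # stand-in for one iteration of real work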