Deploying app when long-lasting cronjobs/workers are running? - cron

Let's suppose that we have some long-lasting cronjob/worker, running or some production servers. We have new application code release and ready to deploy it into production.
What should we do with this cronjob/workers?
Wait for cronjob/worker complete it's job and do not start next on until deployment is over + restrict all other cronjobs/workers start until deployment is over
Implement SIGTERM handlers in cronjob/worker & send SIGTERM to all nessesary processes before deploy? (sometimes it is rather difficult to implement this kind of hanglers)
Split long lasting cronjob/worker into parts, push into queue and forget about this problem?
Any ideas?

I'm assuming you are worried about writing files that at the time of deployment are being executed by cron or changing files that those jobs may use?
You could have job.lock files that each job touches when the job runs and deletes once it's done.
At the start of a deployment, you could
create a deploy.lock file that all jobs/workers watch and don't start if present
have your deployment process watch the job.lock files, and only proceed if there's none
deletes deploy.lock once done
In essence, the same thing that you suggest in your first bullet point.
I've read this question not long ago and it discusses the exact concern
Can a shell script delete or overwrite itself?
I'm not that confident about shell read and disk write buffers so I don't want to talk nonsense. Maybe my approach is too simplistic anyway or isn't effective at all, maybe you overwriting a file that is being executed is harmless. Curious to find out too.

Related

Monitor node.js scripts running on ubuntu instance

I have a node.js script that run once in a day on ubuntu EC2 instance. This script pulls data from some hundered thousand remote APIs and save to our local database. Is there any way we can monitor this node.js script on remote server? There have been few instances where script crashed due to some reason and we were unable to figure it out without SSHing into instance and checking the logs. I have however created a small system after first few crashes which send us an email whenever script crashes due to some uncaught exception and also when script completes execution.
However, we need to develop a better system where we can monitor the progress of script via web interface of our admin application which is deployed over some other instance and also trigger start/stop of script via this interface. What are possible options for achieving this?
If you like to stay in Node.js, then there are several process monitoring tools:
PM2 comes with lots of other features besides monitoring processes. You can monitor your processes via CLI or their official web interface: https://keymetrics.io/. A quick search on npm also gives a bunch of nice unofficial gui tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature rich as PM2 but will do the basic process operations and couple of gui are also available in npm.
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
However, when things don't go smoothly, it helps to have a lot more granularity over what you are scheduling, and how it is executed. This would also give you the visibility over each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script generates several jobs that need to be executed into a queue
A worker process (or n worker processes) consume that queue and execute pending jobs
You can monitor both the progress of the jobs, as well as the state of each worker (# of crashes, failures, jobs completed, etc.). The other tools mentioned above are good candidates for this (forever, pm2, etc.)
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent with different types of failures. Network error? Retry 5 times. Rated limited? Gradual back-off. Crash? Don't retry and notify via email. etc
I have tried this with pm2, you can get the info of the task, then cat out or grab the log files. Or you could have a logging server, see also: https://github.com/papertrail/remote_syslog2

Shorting the time between process crash and shooting server in the head?

I have a routine that crashes linux and force a reboot using a system function.
Now I have the problem that I need to crash linux when a certain process dies. Using a script starting the process and if the script ends restart the server is not appropriate since it takes some ms.
Another idea is spawning the shooting processes alongside and use polling of a counter and if the counter is not incremented reboot the server would be another idea.
This would result in an almost instant reaction.
Now the question is what would be a good timeframe. I have no idea how the scheduler of linux would guarantee a certain update of any such counter and what a good timeout would be.
Also I would like to hear some alternatives to this second process spawning. Is there a possibility to advice linux to run a certain routine in case of a crash of the given process or a listener meachanism for the even of problems with a given process?
The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically unless you're trying to run kernel like linux-rt, having timeouts lower than 100ms can be dangerous on a system with heavy load - especially if the check needs to poll another application.
In cases of application crashes, you can handle them if your init supports notifications. For example both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).
But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system in that time, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds and you probably don't want to miss that step. Instead of just killing the host, you could make sure the affected app isn't working (SIGABRT?) and kill all networking (flush iptables, change default to DROP).

Heroku node timeout because of enormous task

Our node app gets quite big and one job takes quite some time to execute. We run this job with a cronjob, but by calling the URL. Now Heroku has problems with this, because the job takes more than 30 seconds to finish. So we receive a time-out and after that it tries to execute it immediately again, and again, till our Memory quota is about 300% and the app crashes.
Now I want to fix this. Locally we don't have any problems running this script at all. It takes about a minute (for now, but in the future if we have more users it may take more time) to finish and memory stays stable.
Now running this script on the background should fix the problem according https://devcenter.heroku.com/articles/request-timeout#debugging-request-timeouts
Overe here https://devcenter.heroku.com/articles/asynchronous-web-worker-model-using-rabbitmq-in-node#getting-started I read about JackRabbit. But it seems like it's used for systems like RabbitMQ https://github.com/hunterloftis/jackrabbit
So my question: anyone who has experience with background tasks in node? Can and should I use JackRabbit for my background tasks, or are there better solutions? My background task just contains a very complex ExpressJS task, which takes some time to execute so....
I'm the Node.js platform owner at Heroku (and I actually wrote the web worker article you referenced).
Your use case sounds like it may fit the scheduler very well:
https://devcenter.heroku.com/articles/scheduler
It's a great replacement for cron-type jobs.

Don't let deployments cancel file uploads

We have an application that accepts file uploads from the user.
Whenever we deploy our application we stop the application process and start it again. All lengthy processing is done before we actually stop the application so the actual downtime is fairly small (a few seconds).
However, when stopping the process we also kill active requests to our application (i.e. file uploads).
What would be a good way to handle this? I have a few ideas:
Extract the file upload handler into a separate service?
Make the restart more "intelligent" and tell the processes to not accept any new requests and wait for the currently active requests to stop before killing the process
You've just listed two of essentially three solutions I can think of :-)
The third would be a multi-tier deployment with a smart load balancer and deploy process smart enough to know what node to restart and when.
If it is a smaller scale app with no significant impact, I would go to what seems to me simpler version: track active downloads and monitor this on restart. Maintain just one app, you know? But it makes the upload logic more complex.
However, if the uploads are important enough, and they seem to be, it may be worth it to extract it to a separate service. Not just because of new deploy, but also to protect you from unexpected crashes and shutdowns. You would then have to decide a way to communicate completed uploads from the service though, and also handle the client response etc.
On my view, one app to maintain and deploy is simpler then two, but of course also a bit less robust.
So the answer really depends on your needs and resources, right?

Maintaining a long-running task on Linux

My system includes a task which opens a network socket, receives pushed data from the network, processes it, and writes it out to disk or pings other machines depending on the messages. This task is intended to run forever, and the service is designed to have this task always running. But sometimes it crashes.
What's the best practice for keeping a task like this alive? Assume it's okay for the task to be dead for up to 30 seconds before we restart it.
Some obvious ideas include having a watchdog process that checks to make sure the process is still running. Watchdog could be triggered by cron. But how does it know if the process is alive or not? Write a pidfile? touch a heartbeat file? An ideal solution wouldn't continuously spin up more processes if the machine gets bogged down to the point where the watchdog is running faster than the heartbeat.
Are there standard linux tools for this? I can imagine a solution that uses a message queue, but I'm not sure if that's a good idea or not.
Depending on the nature of the task that you wish to monitor, one method is to write a simple wrapper to start up your task in a fork().
The wrapper task can then do a waitpid() on the child and restart it if it is terminated.
This does depend on modifying the source for the task that you wish to run.
sysvinit will restart processes that die, if added to inittab.
If you're worried about the process freezing without crashing and ending the process, you can use a heartbeat and hard kill the active instance, letting init restart it.
You could use monit along with daemonize. There are lots of tools for this in the *nix world.
Supervisor was designed precisely for this task. From the project website:
Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.
It runs as a daemon (supervisord) controlled by a command line tool, supervisorctl. The configuration file contains a list of programs it is supposed to monitor, among other settings.
The number of options is quite extensive, -- have a look at the docs for a complete list. In your case, the relevant configuration section might be something like this:
[program:my-network-task]
command=/bin/my-network-task # where your binary lives
autostart=true # start when supervisor starts?
autorestart=true # restart automatically when stopped?
startsecs=10 # consider start successful after how many secs?
startretries=3 # try starting how many times?
I have used Supervisor myself and it worked really well once everything was set up. It requires Python, which should not be a big deal in most environments but might be.

Resources