Process is not respawning when an associated post-stop process is killed - linux

I have a process xyz, whose upstart script is as below
description "Run the xyz daemon"
author "xyz"
import SETTINGS
start on (
start-ap-services SETTINGS
)
stop on (
stop-ap-services SETTINGS
) or stopping system-services
respawn
oom score 0
script
  . /usr/share/settings.sh
  # directory for data persisted between boots
  XYZ_DIR="/var/lib/xyz"
  mkdir -p "${XYZ_DIR}"
  chown -R xyz:xyz "${XYZ_DIR}"
  if [ "${SETTING}" = 1 ]; then
    ARGS="$ARGS --enable_stats=true"
    # CAP_NET_BIND_SERVICE, CAP_DAC_OVERRIDE.
    exec /sbin/minijail0 -p -c 0x0402 -u xyz -g xyz \
      -G /usr/bin/xyz ${ARGS}
  else
    exec sleep inf
  fi
end script
# Prevent the job from respawning too quickly.
post-stop exec sleep 3
Now, due to an OOM issue, xyz is killed based on its OOM score and gets respawned as expected. After several restarts of xyz, however, the post-stop sleep is killed, after which xyz is never respawned.
How can this be prevented, or is there another solution?
Note: xyz is a dummy process name, used only to illustrate my actual question.
I haven't worked on upstart scripts before. Any help would be much appreciated.

Upstart can get confused when post-stop, pre-start, and post-start sections remain running across respawns.
I prefer to keep any command that takes longer than a few hundred milliseconds in a main job section, using auxiliary jobs if necessary.
For example, this job, saved as its own .conf file, will stall a job xyz that is being respawned or otherwise stopped:
start on stopping xyz RESULT='ok'
task
exec sleep 3
This has the same effect as your post-stop stanza, except that Upstart can better handle the state tracking for the simplified main job.
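Concretely, the three lines above become their own job file and the post-stop stanza disappears from the main job; a minimal sketch, assuming the file name /etc/init/xyz-holdoff.conf:
# /etc/init/xyz-holdoff.conf (the name is an assumption; any job file works)
# Runs as a short-lived task whenever xyz stops cleanly, holding its
# stopping state for 3 seconds so Upstart cannot respawn it immediately.
start on stopping xyz RESULT=ok
task
exec sleep 3
With that file in place, delete the post-stop exec sleep 3 line from the main xyz job.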

Related

How to debug an upstart script that intermittently fails?

I have a process that I want to start as soon as my system is rebooted by whatever means, so I was using an upstart script for that. Sometimes, though, my process doesn't get started during a hard reboot (unplugging the machine and starting it again), so I think my upstart script is not kicking in after a hard reboot. I believe there is no runlevel for a hard reboot.
I am confused about why it sometimes works during a reboot and sometimes doesn't. How can I debug this?
Below is my upstart script:
# sudo start helper
# sudo stop helper
# sudo status helper
start on runlevel [2345]
stop on runlevel [!2345]
chdir /data
respawn
pre-start script
  echo "[`date`] Agent Starting" >> /data/agent.log
  sleep 30
end script
post-stop script
  echo "[`date`] Agent Stopping" >> /data/agent.log
  sleep 30
end script
limit core unlimited unlimited
limit nofile 100000 100000
setuid goldy
exec python helper.py
Is there any way to debug this out what's happening? I can easily reproduce this I believe. Any pointers on what I can do here?
Note:
During reboot I sometimes see the logging from my pre-start script, but sometimes I don't see the logging at all after reboot, which means my upstart script was not triggered. Is there anything I need to change about the runlevels to make it work?
I have a VM running in a hypervisor, and I am working with Ubuntu.
Your process is running nicely, but during system startup many things happen in parallel.
If the mount that makes the /data folder available runs later than your pre-start script, you will not see the "results" of the pre-start script.
I suggest moving sleep 30 earlier (by the way, 30 seconds seems too long):
pre-start script
  sleep 30 # sleep 10 should be enough
  echo "[`date`] Agent Starting" >> /data/agent.log
end script
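If the race really is with the /data mount, waiting for the mount point directly is more robust than a fixed sleep; a sketch using mountpoint(1), with the 60-second cap being an assumption:
pre-start script
  # Wait up to 60 seconds for /data to be mounted before logging to it.
  for i in $(seq 1 60); do
    mountpoint -q /data && break
    sleep 1
  done
  echo "[`date`] Agent Starting" >> /data/agent.log
end script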

Check if process runs if not execute script.sh

I am trying to find a way to monitor a process. If the process is not running, it should be checked again to make sure it has really crashed. If it has really crashed, run a script (start.sh).
I have tried monit with no success. I have also tried adding this script to crontab, after making it executable with chmod +x monitor.sh.
The actual program is called program1:
case "$(pidof program | wc -w)" in
0) echo "Restarting program1: $(date)" >> /var/log/program1_log.txt
/home/user/files/start.sh &
;;
1) # all ok
;;
*) echo "Removed double program1: $(date)" >> /var/log/program1_log.txt
kill $(pidof program1 | awk '{print $1}')
;;
esac
The problem is that this script does not work. I added it to crontab and set it to run every 2 minutes, but if I close the program it won't restart.
Is there any other way to check a process, and run start.sh when it has crashed?
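For reference, the crontab entry looks roughly like this (run every 2 minutes; the log path here is illustrative), using an absolute path since cron runs with a minimal PATH:
*/2 * * * * /home/user/files/monitor.sh >> /var/log/program1_monitor.log 2>&1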
Not to be rude, but have you considered a more obvious solution?
When a shell (e.g. bash or tcsh) starts a subprocess, by default it waits for that subprocess to complete.
So why not have a shell that runs your process in a while(1) loop? Whenever the process terminates, for any reason, legitimate or not, it will automatically restart your process.
I ran into this same problem with mythtv. The backend keeps crashing on me. It's a Heisenbug. Happens like once a month (on average). Very hard to track down. So I just wrote a little script that I run in an xterm.
The, ahh, onintr business means that control-c will terminate the subprocess and not my (parent-process) script. Similarly, the sleep is in there so I can control-c several times to kill the subprocess and then kill the parent-process script while it's sleeping...
Coredumpsize is limited just because I don't want to fill up my disk with corefiles that I cannot use.
#!/bin/tcsh -f
limit coredumpsize 0
while( 1 )
  echo "`date`: Running mythtv-backend"
  # Now we cannot control-c this (tcsh) process...
  onintr -
  # This will let /bin/ls directory-sort my logfiles based on day & time.
  # It also keeps the logfile names pretty unique.
  mythbackend |& tee /....../mythbackend.log.`date "+%Y.%m.%d.%H.%M.%S"`
  # Now we can control-c this (tcsh) process.
  onintr
  echo "`date`: mythtv-backend exited. Sleeping for 30 seconds, then restarting..."
  sleep 30
end
p.s. That sleep will also save you in the event your subprocess dies immediately. Otherwise the constant respawning without delay will drive your IO and CPU through the roof, making it difficult to correct the problem.
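For an environment without tcsh, the same pattern works in plain POSIX sh; a minimal sketch, assuming start.sh runs program1 in the foreground:
#!/bin/sh
# Restart the program whenever it exits, for any reason.
while true; do
  echo "$(date): starting program1" >> /var/log/program1_log.txt
  /home/user/files/start.sh
  # Throttle respawns so a fast-crashing program cannot spin the CPU.
  echo "$(date): program1 exited; restarting in 30 seconds" >> /var/log/program1_log.txt
  sleep 30
done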

How can I make a working upstart job with yas3fs?

I've got a very simple upstart config for maintaining a yas3fs mount.
start on filesystem
stop on runlevel [!2345]
respawn
kill timeout 15
oom never
expect fork
script
  . /etc/s3.env
  export AWS_ACCESS_KEY_ID
  export AWS_SECRET_ACCESS_KEY
  exec /opt/yas3fs/yas3fs.py /mnt/something --url=s3://something --cache-path=/mnt/s3fs-cache --mp-size=5120 --mp-num=8
end script
What happens is that I get two copies of yas3fs.py running. One appears to mount the s3 bucket correctly, but the other is CONSTANTLY respawned by upstart (presumably because it errors due to the other one running).
If I throw in an "expect fork", the job never starts correctly. I just want to be able to have this simple mount safely able to be restarted, stopped, etc as an upstart job. Ideas?
I'm not an upstart expert, but this script should work:
start on (filesystem and net-device-up IFACE=eth0)
stop on runlevel [!2345]
env S3URL="s3://BUCKET[/PREFIX]"
env MOUNTPOINT="/SOME/PATH"
respawn
kill timeout 15
oom never
script
  MOUNTED=$(mount | grep " $MOUNTPOINT " | wc -l)
  if [ "$MOUNTED" = "1" ]; then
    umount "$MOUNTPOINT"
  fi
  exec /opt/yas3fs/yas3fs.py "$MOUNTPOINT" --url="$S3URL" --mp-size=5120 --mp-num=8 -f
end script
pre-stop script
  umount "$MOUNTPOINT"
end script
The trick is to leave yas3fs in the foreground with the '-f' option; it seems there are too many forks to manage otherwise.
I added a check to clean up (i.e. unmount) the mount point if yas3fs dies in some unclean way (e.g. kill -9).
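Assuming the job file is saved as /etc/init/yas3fs.conf (the name is an assumption), the mount can then be managed like any other job:
sudo start yas3fs
sudo status yas3fs
sudo stop yas3fs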

Upstart: Error when using command substitution in post-start script stanza during startup sequence

I'm seeing an issue in upstart where using command substitution inside a post-start script stanza causes an error (syslog reports "terminated with status 1"), but only during the initial system startup.
I've tried using just about every startup event hook under the sun. local-filesystems and net-device-up worked without error in about 1 of 100 tries, so it looks like a race condition. It works just fine on manual start/stop. The command substitutions I've seen trigger the error are a simple cat or date, and I've tried both the $() form and the backtick form. I've also tried using sleep in pre-start to beat the race condition, but that did nothing.
I'm running Ubuntu 11.10 on VMware with a Win7 host. I've spent too many hours troubleshooting this already... Anyone got any ideas?
Here is my .conf file for reference:
start on runlevel [2345]
stop on runlevel [016]
env NODE_ENV=production
env MYAPP_PIDFILE=/var/run/myapp.pid
respawn
exec start-stop-daemon --start --make-pidfile --pidfile $MYAPP_PIDFILE --chuid node-svc --exec /usr/local/n/versions/0.6.14/bin/node /opt/myapp/live/app.js >> /var/log/myapp/audit.node.log 2>&1
post-start script
  MYAPP_PID=`cat $MYAPP_PIDFILE`
  echo "[`date -u +%Y-%m-%dT%T.%3NZ`] + Started $UPSTART_JOB [$MYAPP_PID]: PROCESS=$PROCESS UPSTART_EVENTS=$UPSTART_EVENTS" >> /var/log/myapp/audit.upstart.log
end script
post-stop script
  MYAPP_PID=`cat $MYAPP_PIDFILE`
  echo "[`date -u +%Y-%m-%dT%T.%3NZ`] - Stopped $UPSTART_JOB [$MYAPP_PID]: PROCESS=$PROCESS UPSTART_STOP_EVENTS=$UPSTART_STOP_EVENTS EXIT_SIGNAL=$EXIT_SIGNAL EXIT_STATUS=$EXIT_STATUS" >> /var/log/myapp/audit.upstart.log
end script
The most likely scenario I can think of is that $MYAPP_PIDFILE has not been created yet.
Because you have not specified an 'expect' stanza, the post-start is run as soon as the main process has forked and execed. So, as you suspected, there is probably a race between start-stop-daemon running node and writing that pidfile and /bin/sh forking, execing, and forking again to exec cat $MYAPP_PIDFILE.
The right way to handle this is to rewrite your post-start like this:
post-start script
  for i in 1 2 3 4 5; do
    if [ -f $MYAPP_PIDFILE ]; then
      echo ...
      exit 0
    fi
    sleep 1
  done
  echo "timed out waiting for pidfile"
  exit 1
end script
It's worth noting that in Upstart 1.4 (first included in Ubuntu 12.04), Upstart added logging ability, so there's no need to redirect output into a special log file. All console output defaults to /var/log/upstart/$UPSTART_JOB.log (which is rotated by logrotate), so those echoes could just be bare echoes.
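With that in place, the redirection on the exec line can be dropped; a sketch (the console log stanza is shown explicitly, though it may already be the default):
console log
exec start-stop-daemon --start --make-pidfile --pidfile $MYAPP_PIDFILE --chuid node-svc --exec /usr/local/n/versions/0.6.14/bin/node /opt/myapp/live/app.js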

Linux Daemon Stopping start-stop-daemon

I have a daemon I am creating in Linux. I created the init.d file and have successfully started the daemon process using
/etc/init.d/mydaemon start
When I try to stop it (with /etc/init.d/mydaemon stop), however, it stops successfully, but start-stop-daemon never seems to complete, as evidenced by no echoes occurring immediately after the call to start-stop-daemon.
Verbose mode shows that it stopped the process, and looking at system monitor, it does stop the process.
Stopped mydaemon (pid 13292 13310).
Here is my stop function of the init.d file.
do_stop()
{
    # Return
    #   0 if daemon has been stopped
    #   1 if daemon was already stopped
    #   2 if daemon could not be stopped
    #   other if a failure occurred
    start-stop-daemon --stop --name $NAME -v
    RETVAL="$?"
    echo "stopped" # This is never printed, and the script never formally gives the shell back.
    [ "$RETVAL" = 2 ] && return 2
    # Wait for children to finish too if this is a daemon that forks
    # and if the daemon is only ever run from this initscript.
    # If the above conditions are not satisfied then add some other code
    # that waits for the process to drop all resources that could be
    # needed by services started subsequently. A last resort is to
    # sleep for some time.
    start-stop-daemon --stop --quiet --oknodo --retry=0/30/KILL/5 --exec $DAEMON
    [ "$?" = 2 ] && return 2
    # Many daemons don't delete their pidfiles when they exit.
    return "$RETVAL"
}
I am running this on a virtual machine; does this affect anything?
Running on a virtual machine shouldn't affect this.
And I have no idea why this is happening or how it is taking over control of the parent script.
However, I just encountered this issue and discovered that if I do:
start-stop-daemon ... && echo -n
it will work as expected and relinquish control of the shell.
I have no idea why this works, but it seems to work.
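Applied to the do_stop above, the workaround would look like this (a sketch; since echo -n exits 0, a success status stays 0 and a failure status short-circuits past the echo):
start-stop-daemon --stop --name $NAME -v && echo -n
RETVAL="$?"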
