Upstart task hangs after it finishes successfully - linux

I've got an Upstart task that starts multiple instances of a service, based on Starting multiple upstart instances automatically and Restarting Upstart instance processes. It works and starts all the instances, but after it has started them successfully it just hangs. If I Ctrl-C out and then check the instances, either with service status or by looking in ps, they're all running, so I don't know what it's doing while it hangs.
Here's my script:
description "all-my-workers"
start on runlevel [2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
pre-start script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
When I do service all-my-workers start I get this:
vagrant@vagrant-service:/etc/init$ sudo service all-my-workers start
And then it just hangs there and doesn't prompt me again. As I said I can Ctrl-C out and see the running workers:
vagrant@vagrant-service:/etc/init$ sudo service all-my-workers status
all-my-workers start/running
vagrant@vagrant-service:/etc/init$ sudo service my-worker status N=1
my-worker (1) start/running, process 21938
And in ps:
worker 21938 0.0 0.1 4392 612 ? Ss 21:46 0:00 /bin/sh -e /proc/self/fd/9
worker 21941 0.2 7.3 174076 27616 ? Sl 21:46 0:00 python /var/lib/my-system/script/start_worker.py
I don't think the problem is in the my-worker.conf but just in case:
description "my-worker"
stop on stopping all-my-workers
setuid worker
setgid worker
respawn
instance $N
console log
env SCRIPT_PATH="/var/lib/my-system/script/"
script
export PROVIDER=vagrant
export REGION=all
export ENVIRONMENT=cert
. /var/lib/my-system/.virtualenvs/my-system/bin/activate
python $SCRIPT_PATH/start_worker.py
END
end script
Thanks a bunch!

How Do I Fix It?
I'm going to assume that my-worker is a long-lived process, and you want an easy way to spin up and tear down multiple parallel instances of my-worker.
If this is the case, you probably don't want all-my-workers to be a task. You'd want the following instead:
description "all-my-workers"
start on runlevel [2345]
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
pre-start script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
pre-stop script
for i in `seq 1 $NUM_INSTANCES`;
do
stop my-worker N=$i PORT=$(($STARTING_PORT + $i)) || true
done
end script
Then you can run start all-my-workers to start all of the my-worker instances and run stop all-my-workers to stop them. Effectively, all-my-workers becomes a parent job that manages starting and stopping its child jobs.
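For example, a quick manual check might look like this (output omitted; the instance query works the same way as before):
sudo start all-my-workers      # pre-start launches every my-worker instance
sudo status my-worker N=1      # instances can still be inspected individually
sudo stop all-my-workers       # pre-stop tears the instances down again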
Why?
You cited two SO answers showing this idea of a parent job managing child jobs. They show:
A task with a script stanza
A job with a pre-start stanza
Your parent job is a task with a pre-start stanza, and that's why you're running into this odd behavior.
script vs pre-start
From this Ask Ubuntu answer which cites this deprecated documentation, there are two very important statements (with emphasis added):
All job files must have either an exec or script stanza. This specifies what will be run for the job.
Additional shell code can be given to be run before or after the binary or script specified with exec or script. These are not expected to start the process, in fact, they can't. They are intended for preparing the environment and cleaning up afterwards.
In summary, any background processes spawned by the pre-start stanza are ignored (i.e., not monitored) by Upstart. Instead, you must use exec or script to spawn a process which Upstart will monitor.
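As a minimal sketch (with a hypothetical daemon path), the shape Upstart expects is: pre-start only prepares the environment, and exec (or script) supplies the process that actually gets monitored:
description "example-daemon"
start on runlevel [2345]
pre-start script
    mkdir -p /var/run/example-daemon   # preparation only; nothing started here is tracked
end script
# exec supplies the process that Upstart actually monitors
exec /usr/bin/example-daemon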
What happens if you omit the exec/script stanza? Upstart will sit and wait for a process to be spawned. Thus, you might as well have written a while-true loop:
script
while true; do
true
done
end script
The only difference is that the while-true loop is a live-lock whereas an empty stanza results in a dead-lock.
Jobs vs. Tasks
Knowing the above, the Upstart documentation for tasks finally leads us to what's going on:
Without the 'task' keyword, the events that cause the job to start will be unblocked as soon as the job is started. This means the job has emitted a starting(7) event, run its pre-start, begun its script/exec, and post-start, and emitted its started(7) event.
With task, the events that lead to this job starting will be blocked until the job has completely transitioned back to stopped. This means that the job has run up to the previously mentioned started(7) event, and has also completed its post-stop, and emitted its stopped(7) event.
(Some of the specifics about events and states will make more sense if you read the documentation about starting and stopping jobs).
In simpler terms:
With a normal Upstart job, the exec/script stanza is expected to block indefinitely because it's launching a long-lived process. Thus, Upstart stops blocking once it has finished the pre-start stanza.
With a task, the exec/script stanza is expected to block for a "finite" period because it's launching a short-lived process. Thus, Upstart blocks until after the exec/script stanza has completed.
But what happens if there is no exec/script stanza? Upstart sits and waits indefinitely for something to be launched, but that's never going to happen.
In the case of a job, that's fine because Upstart doesn't block while waiting for a process to spawn, and calling stop is apparently enough to make it stop waiting.
In the case of a task, though, Upstart will just sit and hang forever -- or until you interrupt it. However, because it still hasn't found a spawned process, it is still technically running. That is why you're able to query the status after interrupting and see all-my-workers start/running.
For Interest's Sake
If, for some reason, you really wanted to make your parent job into a task, you would actually need two tasks: one to start the my-worker instances and one to stop them. You would also need to delete the stop on stopping all-my-workers stanza from my-worker.
start-all-my-workers:
description "starts all-my-workers"
start on runlevel [2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
stop-all-my-workers:
description "stops all-my-workers"
start on runlevel [!2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
script
for i in `seq 1 $NUM_INSTANCES`;
do
stop my-worker N=$i PORT=$(($STARTING_PORT + $i)) || true
done
end script

Related

How do I stop a script running in the background in Linux?

Let's say I have a silly script:
while true;do
touch ~/test_file
sleep 3
done
And I start the script into the background and leave the terminal:
chmod u+x silly_script.sh
./silly_script.sh &
exit
Is there a way for me to identify and stop that script now? The way I see it, every command is started in its own process, and I might be able to catch and kill one command like the 'sleep 3', but not the execution of the entire script. Am I mistaken? I expected a process to appear with the script's name, but it does not. If I start the script with 'source silly_script.sh' I can't find a process by the name of 'source'. Do I need to identify the instance of bash that is executing the script? How would I do that?
EDIT: There have been a few creative solutions, but so far they require the PID of the script execution to be stored right away, or the bash session not to be left with ^D or exit. I understand that this way of running scripts should maybe be avoided, but I find it hard to believe that a low-privilege user could, even by accident, start an annoying script in the background that, for instance, fills the drive with garbage files or repeatedly starts new instances of some software, and even the admin has no option other than to restart the server, because a simple script can hide its identity without even trying.
With the help of the fine people here I was able to derive the answer I needed:
It is true that the script runs every command in its own process, so for instance killing the sleep 3 command won't do anything to the script being run. But through a command like that sleep 3 you can find the bash instance running the script, by looking for its parent process:
Run ps axf to show all processes in tree form. You will then find this section:
18660 ? S 0:00 /bin/bash
18696 ? S 0:00 \_ sleep 3
Now you have found the bash instance that is running the script, and you can stop it: kill 18660
(Of course your PID will be different from mine)
The jobs command will show you all running background jobs.
You can kill background jobs by id using kill, e.g.:
$ sleep 9999 &
[1] 58730
$ jobs
[1]+ Running sleep 9999 &
$ kill %1
[1]+ Terminated sleep 9999
$ jobs
$
58730 is the PID of the backgrounded task, and 1 is its job id. In this case kill 58730 and kill %1 would have the same effect.
See the JOB CONTROL section of man bash for more info.
When you exit, the backgrounded job will get a kill signal and die (assuming that's how it handles the signal - in your simple example it is), unless you disown it first.
That kill will propagate to the sleep process, which may well ignore it and continue sleeping. If this is the case you'll still see it in ps -e output, but with a parent pid of 1, indicating its original parent no longer exists.
You can use ps -o ppid= <pid> to find the parent of a process, or pstree -ap to visualise the job hierarchy and find the parent visually.
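For example (PIDs illustrative), starting from the sleep that the script is currently running:
$ pgrep -f 'sleep 3'        # find the sleep launched by the script
18696
$ ps -o ppid= -p 18696      # its parent is the shell executing silly_script.sh
18660
$ kill 18660                # stop the script by killing that shell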

How to debug an upstart script that intermittently fails?

I have a process that I want to start as soon as my system is rebooted, by whatever means, so I'm using an upstart script for that. But sometimes my process doesn't get started during a hard reboot (pulling the plug and powering the machine back on), so I think my upstart script is not kicking in after a hard reboot. I believe there is no runlevel for a hard reboot.
I am confused about why it works on some reboots but not on others. How can I debug this?
Below is my upstart script:
# sudo start helper
# sudo stop helper
# sudo status helper
start on runlevel [2345]
stop on runlevel [!2345]
chdir /data
respawn
pre-start script
echo "[`date`] Agent Starting" >> /data/agent.log
sleep 30
end script
post-stop script
echo "[`date`] Agent Stopping" >> /data/agent.log
sleep 30
end script
limit core unlimited unlimited
limit nofile 100000 100000
setuid goldy
exec python helper.py
Is there any way to debug what's happening? I believe I can easily reproduce this. Any pointers on what I can do here?
Note:
Sometimes during reboot I see the logging that I have in the pre-start script, but sometimes I don't see the logging at all after the reboot, which means my upstart script was not triggered. Is there anything I need to change about the runlevels to make it work?
I have a VM which is running in a Hypervisor and I am working with Ubuntu.
Your process is probably starting fine, but during system startup many things run in parallel.
If the mount that makes the /data folder available runs later than your pre-start script, you will not see the "results" of the pre-start script.
I suggest moving the sleep 30 earlier (by the way, 30 seconds seems too long):
pre-start script
sleep 30 # sleep 10 should be enough
echo "[`date`] Agent Starting" >> /data/agent.log
end script
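If the real concern is that /data may not be mounted yet, a more targeted sketch (assuming /data is a separate mount point and the mountpoint(1) utility is available) is to wait for the mount instead of sleeping a fixed time:
pre-start script
    # wait up to roughly 30 seconds for /data to become a mount point instead of sleeping blindly
    for i in `seq 1 30`; do
        mountpoint -q /data && break
        sleep 1
    done
    echo "[`date`] Agent Starting" >> /data/agent.log
end script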

systemd: SIGTERM immediately after start

I am trying systemd for the first time. I want to start a process at system bootup, and I'm having a problem getting it up and running.
systemd should run a script (start.sh). This script starts a process (let's call it P) in the background and exits with code 0.
P keeps running forever until it receives a signal.
If I run start.sh manually, all is OK.
If systemd starts it, P receives a SIGTERM immediately after starting and terminates.
So it does get started, but what about the signal?
It terminates P, and I am not sure where it comes from or why.
Maybe my unit is wrong but I have no idea how to set it for my needs.
I tried the service types simple, idle and oneshot.
Thanks for help!
Chris
Here is my unit.
[Unit]
Description=Test
After=sshd.service
[Service]
Type=oneshot
ExecStart=/home/max/start.sh start
Restart=no
User=root
SuccessExitStatus=0
[Install]
WantedBy=multi-user.target
That's the status.
Loaded: loaded (/etc/systemd/system/test.service; enabled)
Active: inactive (dead) since Die 2016-02-23 20:56:59 CET; 20min ago
Process: 1046 ExecStart=/home/max/test.sh start (code=exited, status=0/SUCCESS)
When start.sh finishes, systemd kills everything in the same cgroup as start.sh
Your options are:
setting KillMode in the [Service] section to process (the default is control-group). That will cause systemd to kill only the process it started directly.
not making start.sh start something in the background and exit, but executing the process right there in the foreground
I think in your situation option 2 is viable and more straightforward.
Source: https://unix.stackexchange.com/a/231201/45329
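As a sketch of option 2 (the program path below is a placeholder for whatever P actually is): have start.sh run the process in the foreground and switch the unit from Type=oneshot to Type=simple, so systemd tracks P directly:
#!/bin/sh
# start.sh: do any setup here, then replace this shell with the long-lived process
# instead of backgrounding it with "&"
exec /usr/local/bin/my-long-running-process --some-option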
Although changing the KillMode to process like below will work in your situation, it is not the recommended solution.
[Service]
KillMode=process
...
The problem with KillMode set to process is that systemd loses control over all the children of the process it started. That means, if anything happens and one of your processes does not die for some reason, it will continue to linger around.
A better solution in your situation would be to create all the processes, keep their pid and then wait on them.
The wait command that you use in your shell script may vary depending on which shell you are using (the link I proposed is for bash). Having the shell script wait for all the children is in effect the same as starting one child, which does not get detached, in the foreground.
So something like this, more or less:
#!/bin/bash
# Start your various processes
process1 &
PROCESS1_PID=$!
process2 &
PROCESS2_PID=$!
process3 &
PROCESS3_PID=$!
# Wait on your processes
wait $PROCESS1_PID $PROCESS2_PID $PROCESS3_PID
# OR, if I'm correct, bash also allows you to wait on all children
# with just a plain wait like so:
wait
# reach here only after children 1, 2, and 3 died

upstart conf file wait for process

I have an upstart script that looks like this:
description "my script"
# Start and stop runlevels
start on runlevel [2345]
stop on runlevel [!2345]
# Automatically respawn
respawn
respawn limit 15 5
script
exec /home/myscript.sh
end script
This script launches a VPN, so the process takes some time to come up. When I look at the logs in /var/log/upstart/myscript.sh I see that the process is being launched continually, so it never finishes launching. What can I do to make upstart wait for the process to finish launching?
Your upstart script has a respawn stanza but no expect stanza. Perhaps upstart is tracking the wrong process.
The documentation has the following warning "If you are creating a new Job Configuration File, do not specify the respawn stanza until you are fully satisfied you have specified the expect stanza correctly. If you do, you will find the behaviour potentially very confusing."
http://upstart.ubuntu.com/cookbook/#respawn
http://upstart.ubuntu.com/cookbook/#how-to-establish-fork-count
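For instance, if /home/myscript.sh ends up daemonizing before it settles, you would tell Upstart about it so it follows the right PID. This is only a sketch, not a drop-in fix; you would need to confirm how many times your process actually forks first:
description "my script"
start on runlevel [2345]
stop on runlevel [!2345]
# "expect fork" for a single fork, "expect daemon" for a double fork;
# verify the fork count before re-enabling respawn
expect daemon
respawn
respawn limit 15 5
exec /home/myscript.sh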

Can upstart expect/respawn be used on processes that fork more than twice?

I am using upstart to start/stop/automatically restart daemons. One of the daemons forks 4 times. The upstart cookbook states that it only supports forking twice. Is there a workaround?
How it fails
If I try to use expect daemon or expect fork, upstart uses the pid of the second fork. When I try to stop the job, nobody responds to upstart's SIGKILL signal, and it hangs until you exhaust the pid space and loop back around. It gets worse if you add respawn: upstart thinks the job died and immediately starts another one.
Bug acknowledged by upstream
A bug has been entered for upstart. The solutions presented are to stick with the old sysvinit, rewrite your daemon, or wait for a rewrite of upstart. RHEL is close to 2 years behind the latest upstart package, so by the time the rewrite is released and we get updated, the wait will probably be 4 years. The daemon is written by a subcontractor of a subcontractor of a contractor, so it will not be fixed any time soon either.
I came up with an ugly hack to make this work. It works for my application on my system. YMMV.
start the application in the pre-start section
in the script section run a script that runs as long as the application runs. The pid of this script is what upstart will track.
in the post-stop section kill the application
example
env DAEMON=/usr/bin/forky-application
pre-start script
su -s /bin/sh -c "$DAEMON" joeuseraccount
end script
script
sleepWhileAppIsUp(){
while pidof $1 >/dev/null; do
sleep 1
done
}
sleepWhileAppIsUp $DAEMON
end script
post-stop script
if pidof $DAEMON;
then
kill `pidof $DAEMON`
#pkill $DAEMON # post-stop process (19300) terminated with status 1
fi
end script
a similar approach could be taken with pid files.
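A rough sketch of that pid-file variant (path hypothetical): have the application write /var/run/forky-application.pid in pre-start, then keep the tracked script alive as long as that PID exists:
script
    PIDFILE=/var/run/forky-application.pid
    # wait for the application (started in pre-start) to write its pid file
    while [ ! -f "$PIDFILE" ]; do sleep 1; done
    # then stay alive as long as that pid is alive, so upstart has something to track
    while kill -0 "`cat $PIDFILE`" 2>/dev/null; do
        sleep 1
    done
end script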

Resources