upstart conf file wait for process - linux

I have an upstart script that looks like this:
description "my script"
# Start and stop runlevels
start on runlevel [2345]
stop on runlevel [!2345]
# Automatically respawn
respawn
respawn limit 15 5
script
exec /home/myscript.sh
end script
This script launches a vpn, so in the process it takes some time to come up,when I see the logs on /var/log/upstart/myscript.sh I see that the process is being launched continually so it never finishes launching. What can I do to make upstart wait for the process to finish launching ?

Your upstart script has a respawn stansa but no expect stansa. Perhaps upstart is tracking the wrong process.
The documentation has the following warning "If you are creating a new Job Configuration File, do not specify the respawn stanza until you are fully satisfied you have specified the expect stanza correctly. If you do, you will find the behaviour potentially very confusing."
http://upstart.ubuntu.com/cookbook/#respawn
http://upstart.ubuntu.com/cookbook/#how-to-establish-fork-count

Related

How to debug an upstart script that intermittently fails?

I have a process that I want to start as soon my system is rebooted by whatever means so I was using upstart script for that but sometimes what I am noticing is my process doesn't get started up during hard reboot (plugging off and starting the machine) so I think my upstart script is not getting kicked in after hard reboot. I believe there is no runlevel for Hard Reboot.
I am confuse that why sometimes during reboot it works, but sometimes it doesn't work. And how can I debug this out?
Below is my upstart script:
# sudo start helper
# sudo stop helper
# sudo status helper
start on runlevel [2345]
stop on runlevel [!2345]
chdir /data
respawn
pre-start script
echo "[`date`] Agent Starting" >> /data/agent.log
sleep 30
end script
post-stop script
echo "[`date`] Agent Stopping" >> /data/agent.log
sleep 30
end script
limit core unlimited unlimited
limit nofile 100000 100000
setuid goldy
exec python helper.py
Is there any way to debug this out what's happening? I can easily reproduce this I believe. Any pointers on what I can do here?
Note:
During reboot sometimes I see the logging that I have in pre-start script but sometimes I don't see the logging at all after reboot and that means my upstart script was not triggered. Is there anything I need to change on runlevel to make it work?
I have a VM which is running in a Hypervisor and I am working with Ubuntu.
Your process running nicely, BUT during system startup many things go parallel.
IF mount (which makes available the /data folder) runs later than your pre-start script you will not see the "results" of pre-start script.
I suggest to move sleep 30 earlier (BTW 30 secs seems too looong):
pre-start script
sleep 30 # sleep 10 should be enough
echo "[`date`] Agent Starting" >> /data/agent.log
end script

systemd: SIGTERM immediately after start

I am trying systemd for the first time. I want to start a process at system bootup. And I have a problem in getting it up and running.
systemd should run a script (start.sh). This script starts a processes (lets call it P) in the background and exits with code 0.
P keeps running forever till a signal happends.
If I run start.sh manually all is ok.
If I let it start by systemd P gets immediately after the start a SIGTERM and terminates.
So it get started but what about the signal??
It terminates P and I am not sure whats its origin and the reason for it.
Maybe my unit is wrong but I have no idea how to set it for my needs.
I tried service-type simple, idle and oneshot.
Thanks for help!
Chris
Here is my unit.
[Unit]
Description=Test
After=sshd.service
[Service]
Type=oneshot
ExecStart=/home/max/start.sh start
Restart=no
User=root
SuccessExitStatus=0
[Install]
WantedBy=multi-user.target
Thats the status.
Loaded: loaded (/etc/systemd/system/test.service; enabled)
Active: inactive (dead) since Die 2016-02-23 20:56:59 CET; 20min ago
Process: 1046 ExecStart=/home/max/test.sh start (code=exited, status=0/SUCCESS)
When start.sh finishes, systemd kills everything in the same cgroup as start.sh
Your options are:
setting KillMode in the Unit section to process (the default is control-group). That will cause systemd to only kill the process which it directly fired.
to not make start.sh start something in the background and exit but to execute it right there in the foreground
I think in your situation option 2 is viable and more straightforward.
Source: https://unix.stackexchange.com/a/231201/45329
Although changing the KillMode to process like below will work in your situation, it is not the recommended solution.
[Service]
KillMode=process
...
The problem with KillMode set to process is that systemd loses control over all the children of the process it started. That means, if anything happens and one of your processes does not die for some reason, it will continue to linger around.
A better solution in your situation would be to create all the processes, keep their pid and then wait on them.
The wait command that you use in your shell script may vary depending on which shell you are using (the link I proposed is for bash). Having the shell script wait for all the children is in effect the same as starting one child, which does not get detached, in the foreground.
So something like this, more or less:
#!/bin/bash
# Start your various processes
process1 &
PROCESS1_PID=$!
process2 &
PROCESS2_PID=$!
process3 &
PROCESS3_PID=$!
# Wait on your processes
wait $PROCESS1_PID $PROCESS2_PID $PROCESS3_PID
# OR, if I'm correct, bash also allows you to wait on all children
# with just a plain wait like so:
wait
# reach here only after children 1, 2, and 3 died

Upstart task hangs after it finishes successfully

I've got an Upstart task that starts multiple instances of a service based on Starting multiple upstart instances automatically and Restarting Upstart instance processes. It's working and it starts all instances but after it successfully starts them it just hangs. If I Ctrl-C out and then check the instances with either service status or looking in ps they're all successfully started, so I don't know what it's doing when it's hanging.
Here's my script:
description "all-my-workers"
start on runlevel [2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
pre-start script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
When I do service start all-my-workers I get this:
vagrant#vagrant-service:/etc/init$ sudo service all-my-workers start
And then it just hangs there and doesn't prompt me again. As I said I can Ctrl-C out and see the running workers:
vagrant#vagrant-service:/etc/init$ sudo service all-my-workers status
all-my-workers start/running
vagrant#vagrant-service:/etc/init$ sudo service my-worker status N=1
my-worker (1) start/running, process 21938
And in ps:
worker 21938 0.0 0.1 4392 612 ? Ss 21:46 0:00 /bin/sh -e /proc/self/fd/9
worker 21941 0.2 7.3 174076 27616 ? Sl 21:46 0:00 python /var/lib/my-system/script/start_worker.py
I don't think the problem is in the my-worker.conf but just in case:
description "my-worker"
stop on stopping all-my-workers
setuid worker
setgid worker
respawn
instance $N
console log
env SCRIPT_PATH="/var/lib/my-system/script/"
script
export PROVIDER=vagrant
export REGION=all
export ENVIRONMENT=cert
. /var/lib/my-system/.virtualenvs/my-system/bin/activate
python $SCRIPT_PATH/start_worker.py
END
end script
Thanks a bunch!
How Do I Fix It?
I'm going to assume that my-worker is a long-lived process, and you want to have any easy way to spin up & tear down multiple parallel instances of my-worker.
If this is the case, you probably don't want all-my-workers to be a task. You'd want the following instead:
description "all-my-workers"
start on runlevel [2345]
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
pre-start script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
pre-stop script
for i in `seq 1 $NUM_INSTANCES`;
do
stop my-worker N=$i PORT=$(($STARTING_PORT + $i)) || true
done
end script
Then you can run start all-my-workers to start all of the my-worker instances and then run stop all-my-workers to stop them. Effectively, all-my-workers becomes a parent job that manages starting and stoping it's child jobs.
Why?
You cited two SO answers showing this idea of a parent job managing child jobs. They show:
A task with a script stanza
A job with a pre-start stanza
Your parent job is a task with a pre-start stanza, and that's why you're running into this odd behavior.
script vs pre-start
From this Ask Ubuntu answer which cites this deprecated documentation, there are two very important statements (with emphasis added):
All job files must have either an exec or script stanza. This specifies what will be run for the job.
Additional shell code can be given to be run before or after the binary or script specified with exec or script. These are not expected to start the process, in fact, they can't. They are intended for preparing the environment and cleaning up afterwards.
In summary, any background processes spawned by the pre-start stanza are ignored (i.e., not monitored) by Upstart. Instead, you must use exec or script to spawn a process which Upstart will monitor.
What happens if you omit the exec/script stanza? Upstart will sit and wait for a process to be spawned. Thus, you might as well have written a while-true loop:
script
while true; do
true
done
end script
The only difference is that the while-true loop is a live-lock whereas an empty stanza results in a dead-lock.
Jobs vs. Tasks
Knowing the above, the Upstart documentation for tasks finally leads us to what's going on:
Without the 'task' keyword, the events that cause the job to start will be unblocked as soon as the job is started. This means the job has emitted a starting(7) event, run its pre-start, begun its script/exec, and post-start, and emitted its started(7) event.
With task, the events that lead to this job starting will be blocked until the job has completely transitioned back to stopped. This means that the job has run up to the previously mentioned started(7) event, and has also completed its post-stop, and emitted its stopped(7) event.
(Some of the specifics about events and states will make more sense if you read the documenation about starting and stopping jobs).
In simpiler terms:
With a normal Upstart job, the exec/script stanza is expected to block indefinitely because it's launching a long-lived process. Thus, Upstart stops blocking once it has finished the pre-start stanza.
With a task, the exec/script stanza is expected to block for a "finite" period because it's launching a short-lived process. Thus, Ubstart blocks until after the exec/script stanza has completed.
But what happens if there is no exec/script stanza? Upstart sits and waits indefinitely for something to be launched, but that's never going to happen.
In the case of a job, that's fine because Upstart doesn't block while waiting for a process to spawn, and calling stop is apparently enough to make it stop waiting.
In the case of a task, though, Upstart will just sit and hang forever -- or until you interrupt it. However, because it still hasn't found a spawned process, it is still technically running. That's is why you're able to query the status after interrupting and see all-my-workers start/running.
For Interest's Sake
If, for some reason, you really wanted to make your parent job into a task, you would actually need two tasks: one to start the my-worker instances and one to stop them. You would also need to delete the stop on stopping all-my-workers stanza from my-worker.
start-all-my-workers:
description "starts all-my-workers"
start on runlevel [2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
script
for i in `seq 1 $NUM_INSTANCES`;
do
start my-worker N=$i PORT=$(($STARTING_PORT + $i))
done
end script
stop-all-my-workers:
description "stops all-my-workers"
start on runlevel [!2345]
task
console log
env NUM_INSTANCES=1
env STARTING_PORT=42002
script
for i in `seq 1 $NUM_INSTANCES`;
do
stop my-worker N=$i PORT=$(($STARTING_PORT + $i)) || true
done
end script

How can I make a working upstart job with yas3fs?

I've got a very simple upstart config for maintaining a yas3fs mount.
start on filesystem
stop on runlevel [!2345]
respawn
kill timeout 15
oom never
expect fork
script
. /etc/s3.env
export AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY
exec /opt/yas3fs/yas3fs.py /mnt/something --url=s3://something --cache-path=/mnt/s3fs-cache --mp-size=5120 --mp-num=8
end script'
What happens is that I get two copies of yas3fs.py running. One appears to mount the s3 bucket correctly, but the other is CONSTANTLY respawned by upstart (presumably because it errors due to the other one running).
If I throw in an "expect fork", the job never starts correctly. I just want to be able to have this simple mount safely able to be restarted, stopped, etc as an upstart job. Ideas?
I'm not an upstart expert, but this script should work:
start on (filesystem and net-device-up IFACE=eth0)
stop on runlevel [!2345]
env S3URL="s3://BUCKET[/PREFIX]"
env MOUNTPOINT="/SOME/PATH"
respawn
kill timeout 15
oom never
script
MOUNTED=$(mount|grep " $MOUNTPOINT "|wc -l)
if [ $MOUNTED = "1" ]; then
umount "$MOUNTPOINT"
fi
exec /opt/yas3fs/yas3fs.py "$MOUNTPOINT" --url="$S3URL" --mp-size=5120 --mp-num=8 -f
end script
pre-stop script
umount "$MOUNTPOINT"
end script
The trick is to leave yas3fs in foreground with the '-f' option, it seems there are too many forks to manage otherwise.
I added a check to clean (i.e. unmount) the mount point if yas3fs dies in some wrong way (e.g. "kill -9").

Can upstart expect/respawn be used on processes that fork more than twice?

I am using upstart to start/stop/automatically restart daemons. One of the daemons forks 4 times. The upstart cookbook states that it only supports forking twice. Is there a workaround?
How it fails
If I try to use expect daemon or expect fork, upstart uses the pid of the second fork. When I try to stop the job, nobody responds to upstarts SIGKILL signal and it hangs until you exhaust the pid space and loop back around. It gets worse if you add respawn. Upstart thinks the job died and immediately starts another one.
Bug acknowledged by upstream
A bug has been entered for upstart. The solutions presented are stick with the old sysvinit, rewrite your daemon, or wait for a re-write. RHEL is close to 2 years behind the latest upstart package, so by the time the rewrite is released and we get updated the wait will probably be 4 years. The daemon is written by a subcontractor of a subcontractor of a contractor so it will not be fixed any time soon either.
I came up with an ugly hack to make this work. It works for my application on my system. YMMV.
start the application in the pre-start section
in the script section run a script that runs as long as the application runs. The pid of this script is what upstart will track.
in the post-stop section kill the application
example
env DAEMON=/usr/bin/forky-application
pre-start script
su -s /bin/sh -c "$DAEMON" joeuseraccount
end script
script
sleepWhileAppIsUp(){
while pidof $1 >/dev/null; do
sleep 1
done
}
sleepWhileAppIsUp $DAEMON
end script
post-stop script
if pidof $DAEMON;
then
kill `pidof $DAEMON`
#pkill $DAEMON # post-stop process (19300) terminated with status 1
fi
end script
a similar approach could be taken with pid files.

Resources