Upstart tracking wrong PID of Bluepill - linux

I have bluepill setup to monitor my delayed_job processes.
Using Ubuntu 12.04.
I am starting and monitoring the bluepill service itself using Ubuntu's upstart. My upstart config is below (/etc/init/bluepill.conf).
description "Start up the bluepill service"
start on runlevel [2]
stop on runlevel [016]
expect fork
exec sudo /home/deploy/.rvm/wrappers/<app_name>/bluepill load /home/deploy/websites/<app_name>/current/config/server/staging/delayed_job.bluepill
# Restart the process if it dies with a signal
# or exit code not given by the 'normal exit' stanza.
respawn
I have also tried with expect daemon instead of expect fork. I have also tried removing the expect... line completely.
When the machine boots, bluepill starts up fine.
$ ps aux | grep blue
root 1154 0.6 0.8 206416 17372 ? Sl 21:19 0:00 bluepilld: <app_name>
The PID of the bluepill process is 1154 here. But upstart seems to be tracking the wrong PID.
$ initctl status bluepill
bluepill start/running, process 990
This is preventing the bluepill process from getting respawned if I forcefully kill bluepill using kill -9.
Moreover, I think because of the wrong PID being tracked, reboot / shutdown just hangs and I have to hard reset the machine every time.
What could be the issue here?

Clearly, upstart tracks the wrong PID. From looking at the bluepill source code, it uses the daemons gem to daemonize, which in turn forks twice. So expect daemon in the upstart config should track the correct PID -- but you've already tried that.
If possible, run bluepill in the foreground and drop the expect stanza from your upstart config entirely.
From the bluepill documentation:
Bluepill.application("app_name", :foreground => true) do |app|
# ...
end
will run bluepill in the foreground.
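With bluepill kept in the foreground, the upstart job could then be reduced to something like the following sketch (based on the config in the question; no expect stanza is needed because the process no longer forks, so upstart tracks the exec'd PID directly):

```conf
description "Start up the bluepill service"
start on runlevel [2]
stop on runlevel [016]
respawn
# No 'expect' stanza: bluepill stays in the foreground,
# so the PID upstart tracks is the PID of bluepill itself.
exec /home/deploy/.rvm/wrappers/<app_name>/bluepill load /home/deploy/websites/<app_name>/current/config/server/staging/delayed_job.bluepill
```

With this setup, kill -9 on the bluepill process should trigger the respawn, and shutdown no longer waits on a PID that doesn't exist.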

Related

how to run process from batch script

I have a simple shell script on Debian GNU/Linux 6.0 that stops a process, deletes its log files, and starts the process again:
#!/bin/bash
killall -KILL rsyslogd
sleep 5s
rm /var/log/syslog
rm /var/log/messages
rm /var/log/kern.log
sleep 3s
rsyslogd
exit
The process name is rsyslogd. I have to stop it before deleting the log files so that Linux actually frees the space on disk.
I see that killall -KILL stops the process by its name, but what is the opposite, the command to start it again?
Calling it by its name alone does not seem to work. I would be glad for any tips, thank you.
Debian uses systemd to manage processes. You should, therefore, use systemd's commands to stop and start rsyslogd.
systemctl stop rsyslog
and
systemctl start rsyslog
If you are using a really old version of Debian (so old that you should upgrade), it may be that SysV init is still in use. In that case, there is a file under /etc/init.d called rc.rsyslog or something comparable (use ls /etc/init.d to find the exact name), and the commands would be
sudo /etc/init.d/rc.rsyslog stop
and
sudo /etc/init.d/rc.rsyslog start
Or your systemd package may be broken. In that case, the package can be reinstalled:
apt-get --reinstall install systemd
To start rsyslogd:
systemctl start rsyslog
To stop it:
systemctl stop rsyslog
If you want to do both, use
systemctl restart rsyslog
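As an aside, if the goal is only to free the disk space, you often don't need to stop rsyslogd at all: deleting an open file doesn't release its space while the daemon still holds the file descriptor, but truncating the file in place does. A minimal sketch (using a temporary file as a stand-in for /var/log/syslog, so it is safe to run anywhere):

```shell
#!/bin/bash
# Truncating an open log file frees its space without restarting the daemon.
# A temp file stands in for /var/log/syslog here.
logfile=$(mktemp)
echo "some log data" > "$logfile"

truncate -s 0 "$logfile"   # empty the file in place; open handles stay valid

stat -c %s "$logfile"      # prints 0: the space has been reclaimed
rm -f "$logfile"
```

Combined with systemctl restart rsyslog for the cases where a restart really is wanted, this avoids the killall/sleep dance entirely.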

systemd: SIGTERM immediately after start

I am trying systemd for the first time. I want to start a process at system bootup. And I have a problem in getting it up and running.
systemd should run a script (start.sh). This script starts a process (let's call it P) in the background and exits with code 0.
P keeps running until it receives a signal.
If I run start.sh manually, everything is fine.
If systemd starts it, P receives a SIGTERM immediately after starting and terminates.
So it does get started, but what about the signal?
It terminates P, and I am not sure of its origin or the reason for it.
Maybe my unit is wrong, but I have no idea how to set it up for my needs.
I tried the service types simple, idle and oneshot.
Thanks for help!
Chris
Here is my unit.
[Unit]
Description=Test
After=sshd.service
[Service]
Type=oneshot
ExecStart=/home/max/start.sh start
Restart=no
User=root
SuccessExitStatus=0
[Install]
WantedBy=multi-user.target
That's the status:
Loaded: loaded (/etc/systemd/system/test.service; enabled)
Active: inactive (dead) since Die 2016-02-23 20:56:59 CET; 20min ago
Process: 1046 ExecStart=/home/max/test.sh start (code=exited, status=0/SUCCESS)
When start.sh finishes, systemd kills everything in the same cgroup as start.sh.
Your options are:
setting KillMode in the [Service] section to process (the default is control-group). That will cause systemd to kill only the process it directly fired.
not making start.sh start something in the background and exit, but executing the process right there in the foreground
I think in your situation option 2 is viable and more straightforward.
Source: https://unix.stackexchange.com/a/231201/45329
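For option 2, the unit from the question could be reduced to something along these lines (a sketch, assuming start.sh can be changed to run P in the foreground; Type=simple is systemd's default and expects exactly this behavior):

```ini
[Unit]
Description=Test
After=sshd.service

[Service]
# 'simple' assumes ExecStart stays in the foreground; systemd then
# tracks that process directly as the main PID of the service.
Type=simple
ExecStart=/home/max/start.sh start
User=root

[Install]
WantedBy=multi-user.target
```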
Although changing the KillMode to process like below will work in your situation, it is not the recommended solution.
[Service]
KillMode=process
...
The problem with KillMode set to process is that systemd loses control over all the children of the process it started. That means if anything goes wrong and one of your processes does not die for some reason, it will linger around.
A better solution in your situation would be to create all the processes, keep their PIDs, and then wait on them.
The wait command you use in your shell script may vary depending on which shell you are using (the example here is for bash). Having the shell script wait for all the children is in effect the same as starting one child, which does not get detached, in the foreground.
So something like this, more or less:
#!/bin/bash
# Start your various processes
process1 &
PROCESS1_PID=$!
process2 &
PROCESS2_PID=$!
process3 &
PROCESS3_PID=$!
# Wait on your processes
wait $PROCESS1_PID $PROCESS2_PID $PROCESS3_PID
# OR, if I'm correct, bash also allows you to wait on all children
# with just a plain wait like so:
wait
# reach here only after children 1, 2, and 3 died

How to launch a process outside a systemd control group

I have a server process (launched from systemd) which can launch an update process. The update process self-daemonizes itself and then (in theory) kills the server with SIGTERM. My problem is that the SIGTERM propagates to the update process and its children.
For debugging purposes, the update process just sleeps, and I send the kill by hand.
Sample PS output before the kill:
1 1869 1869 1869 ? -1 Ss 0 0:00 /usr/local/bin/state_controller --start
1869 1873 1869 1869 ? -1 Sl 0 0:00 \_ ProcessWebController --start
1869 1886 1869 1869 ? -1 Z 0 0:00 \_ [UpdateSystem] <defunct>
1 1900 1900 1900 ? -1 Ss 0 0:00 /bin/bash /usr/local/bin/UpdateSystem refork /var/ttm/update.bin
1900 1905 1900 1900 ? -1 S 0 0:00 \_ sleep 10000
Note that UpdateSystem is in a separate PGID and TPGID. (The <defunct> process is a result of the daemonization, and is not (I think) a problem.)
UpdateSystem is a bash script (although I can easily make it a C program if that will help). After the daemonization code taken from https://stackoverflow.com/a/29107686/771073, the interesting bit is:
#############################################
trap "echo Ignoring SIGTERM" SIGTERM
sleep 10000
echo Awoken from sleep - presumably by the SIGTERM
exit 0
When I kill 1869 (which sends SIGTERM to the state_controller server process), my logfile contains:
Terminating
Ignoring SIGTERM
Awoken from sleep - presumably by the SIGTERM
I really want to prevent SIGTERM being sent to the sleep process.
(Actually, I really want to stop it being sent to apt-get upgrade which is stopping the system via the moral equivalent of systemctl stop ttm.service and the ExecStop is specified as /bin/kill $MAINPID - just in case that changes anyone's answer.)
This question is similar, but the accepted answer (use KillMode=process) doesn't work well for me - I want to kill some of the child processes, just not the update process:
Can't detach child process when main process is started from systemd
A completely different approach is for the upgrade process to remove itself from the service group by updating the /sys/fs/cgroup/systemd filesystem. Specifically in bash:
echo $$ > /sys/fs/cgroup/systemd/tasks
A process belongs to exactly one control group. Writing its PID to the root tasks file adds it to the root control group and removes it from the service's control group.
We were having exactly the same problem. What we ended up doing is launching the update process as transient cgroup with systemd-run:
systemd-run --unit=my_system_upgrade --scope --slice=my_system_upgrade_slice -E setsid nohup start-the-upgrade &> /tmp/some-logs.log &
That way, the update process runs in a different cgroup and will not be terminated. Additionally, we use setsid + nohup to make sure the process has its own group and session and that its parent is the init process.
The approach we have decided to take is to launch the update process in a separate (single-shot) service. As such, it automatically belongs to a separate control group, so killing the main service doesn't kill it.
There is a wrinkle to this, though. The package installs ttm.service and ttm.template.update.service. To run the updater, we copy ttm.template.update.service to ttm.update.service, run systemctl daemon-reload, and then run systemctl start ttm.update.service. Why the copy? Because when the updater installs a new version of ttm.template.update.service, it forcibly terminates any processes running as that service. KillMode=none appears to offer a way around that, but although it seems to work, a subsequent call to apt-get yields a nasty error about dpkg having been interrupted.
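For illustration, the copied ttm.update.service could be a minimal one-shot unit along these lines (a sketch; the ExecStart line reuses the updater invocation visible in the ps output above):

```ini
[Unit]
Description=TTM system updater

[Service]
# As its own unit, the updater gets its own cgroup, so stopping
# ttm.service no longer signals the update process or its children.
Type=oneshot
ExecStart=/usr/local/bin/UpdateSystem refork /var/ttm/update.bin
```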
Are you sure it is not systemd sending the TERM signal to the child process?
Depending on the service type, if your main process dies, systemd will do a cleanup and terminate all the child processes under the same cgroup.
This is defined by the KillMode= property, which defaults to control-group. You could set it to none or process. https://www.freedesktop.org/software/systemd/man/systemd.kill.html
I had the same situation as you.
The upgrade process is a child of the parent process, and the parent process is launched by a service.
The main point is not the cgroup, it is MAINPID.
If you use PIDFile= to specify the MAINPID and set the service Type=forking, the problem is solved.
[Service]
Type=forking
PIDFile=/run/test.pid

Ubuntu upstart gets incorrect PID from Play 1.3

The Upstart script using start-stop-daemon that we had been using with Play 1.2.7 can no longer stop/restart Play since Play 1.3, because Upstart records an incorrect PID.
Framework version: 1.3.0 on Ubuntu 12.04.5 LTS
Reproduction steps:
Set up an upstart script (playframework.conf) for a Play application.
The Play application starts successfully on server reboot.
Running 'sudo status playframework' returns playframework start/running, process 28912. At this point process 28912 doesn't exist.
vi {playapplicationfolder}/server.pid shows 28927.
'stop playframework' then fails due to unknown PID 28912; 'status playframework' results in playframework stop/killed, process 28912.
The only way to restart the Play framework after this point is to find the actual process and kill it, then start Play manually with the usual 'play start' command, or to restart the server.
This has broken our deployment scripts, as we used to install the new version of our app and then do a play restart before reconnecting to the load balancer.
Upstart Script:
#Upstart script for a play application that binds to an unprivileged user.
# put this into a file like /etc/init/playframework
# you can then start/stop it using either initctl or start/stop/restart
# e.g.
# start playframework
description "PlayApp"
author "-----"
version "1.0"
env PLAY_BINARY=/opt/play/play
env JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
env HOME=/opt/myapp/latest
env USER=ubuntu
env GROUP=admin
env PROFILE=prod
start on (filesystem and net-device-up IFACE=lo) or runlevel [2345]
stop on runlevel [!2345]
limit nofile 65536 65536
respawn
respawn limit 10 5
umask 022
expect fork
pre-start script
test -x $PLAY_BINARY || { stop; exit 0; }
test -c /dev/null || { stop; exit 0; }
chdir ${HOME}
rm ${HOME}/server.pid || true
/opt/configurer.sh
end script
pre-stop script
exec $PLAY_BINARY stop $HOME
end script
post-stop script
rm ${HOME}/server.pid || true
end script
script
exec start-stop-daemon --start --exec $PLAY_BINARY --chuid $USER:$GROUP --chdir $HOME -- start $HOME -javaagent:/opt/newrelic/newrelic.jar --%$PROFILE -Dprecompiled=true --http.port=8080 --https.port=4443
end script
We've tried specifying the PID file in start-stop-daemon as per http://man.he.net/man8/start-stop-daemon, but this didn't seem to have any effect.
I have found threads on similar issues (https://askubuntu.com/questions/319199/upstart-tracking-wrong-pid-of-process-not-respawning) but have been unable to find a way around this so far. I have tried changing fork to daemon, but the same issue remains. I also can't see what changed between Play 1.2.7 and 1.3 to cause this.
Another SO post has also asked a similar question but not had an answer as yet: https://stackoverflow.com/questions/23117345/upstart-gets-wrong-pid-after-launching-celery-with-start-stop-daemon
This is because getJavaVersion() spawns a subprocess, which bumps the PID count, which breaks Upstart: the latter expects Play to fork exactly zero, one, or two times, depending on which expect stanza you use.
I've fixed this in a pull request.

Can upstart expect/respawn be used on processes that fork more than twice?

I am using upstart to start/stop/automatically restart daemons. One of the daemons forks 4 times. The upstart cookbook states that it only supports forking twice. Is there a workaround?
How it fails
If I try to use expect daemon or expect fork, upstart uses the PID of the second fork. When I try to stop the job, nobody responds to upstart's SIGKILL signal and it hangs until you exhaust the PID space and loop back around. It gets worse if you add respawn: upstart thinks the job died and immediately starts another one.
Bug acknowledged by upstream
A bug has been filed against upstart. The solutions presented are to stick with the old sysvinit, rewrite your daemon, or wait for a rewrite of upstart. RHEL is close to two years behind the latest upstart package, so by the time the rewrite is released and we get the update, the wait will probably be four years. The daemon is written by a subcontractor of a subcontractor of a contractor, so it will not be fixed any time soon either.
I came up with an ugly hack to make this work. It works for my application on my system. YMMV.
start the application in the pre-start section
in the script section run a script that runs as long as the application runs. The pid of this script is what upstart will track.
in the post-stop section kill the application
example
env DAEMON=/usr/bin/forky-application
pre-start script
su -s /bin/sh -c "$DAEMON" joeuseraccount
end script
script
  sleepWhileAppIsUp(){
    while pidof "$1" >/dev/null; do
      sleep 1
    done
  }
  sleepWhileAppIsUp "$DAEMON"
end script
post-stop script
  if pidof "$DAEMON" >/dev/null; then
    kill $(pidof "$DAEMON")
    #pkill $DAEMON # post-stop process (19300) terminated with status 1
  fi
end script
A similar approach could be taken with pid files.
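A sketch of that pid-file variant (hypothetical names; a background sleep stands in for the daemon and a temp file for something like /var/run/forky-application.pid, so the script is self-contained and safe to run):

```shell
#!/bin/bash
# The script upstart tracks polls the PID from the daemon's pid file
# instead of using pidof on the executable name.
pidfile=$(mktemp)        # stand-in for /var/run/forky-application.pid
sleep 2 &                # stand-in for the daemon
echo $! > "$pidfile"

sleepWhilePidIsUp() {
  # kill -0 only checks that the process exists; it sends no signal
  while kill -0 "$(cat "$1")" 2>/dev/null; do
    sleep 1
  done
}

sleepWhilePidIsUp "$pidfile"
echo "daemon exited"     # reached only after the daemon has gone away
rm -f "$pidfile"
```

One caveat of polling a pid file: if the daemon dies and the PID is recycled by an unrelated process, the loop can keep sleeping, so pidof on a unique executable name is the more robust of the two hacks.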
