Erlang: How to make connected external OS process automatically die when controlling Erlang process crashes? - linux

I am using an Erlang port to read the output of a Linux process. I'd like the Linux process to be automatically killed whenever my connected Erlang process dies. From the docs, it seems to me that this should happen automatically, but it does not.
Minimal example. Put this in the file test.erl:
-module(test).
-export([start/0, spawn/0]).

start() ->
    Pid = spawn_link(?MODULE, spawn, []),
    register(test, Pid).

spawn() ->
    Port = open_port({spawn, "watch date"}, [stream, exit_status]),
    loop([{port, Port}]).

loop(State) ->
    receive
        die ->
            error("died");
        Any ->
            io:fwrite("Received: ~p~n", [Any]),
            loop(State)
    end.
Then, in erl shell:
1> c(test).
{ok,test}
2> test:start().
true
The process starts and prints some data received from the Linux "watch" command every 2 seconds.
Then, I make the Erlang process crash:
3> test ! die.
=ERROR REPORT==== 26-May-2021::13:24:01.057065 ===
Error in process <0.95.0> with exit value:
{"died",[{test,loop,1,[{file,"test.erl"},{line,15}]}]}
** exception exit: "died"
in function test:loop/1 (test.erl, line 15)
The Erlang process dies as expected and the data from "watch" stops appearing, but the watch process keeps running in the background, as can be seen in a Linux (not erl) terminal:
fuxoft@frantisek:~$ pidof watch
1880127
In my real-life scenario, I am not using the "watch" command but another process that outputs data and accepts no input. How can I make it automatically die when my connected Erlang process crashes? I can do this with an Erlang supervisor that manually issues a "kill" command when the Erlang process crashes, but I thought this could be done in an easier and cleaner way.

The open_port function creates a port() and links it to the calling process. If the owning process dies, the port() closes.
In order to communicate with the externally spawned command, Erlang creates several pipes, which are by default tied to the stdin and stdout (file descriptors) of the external process. Anything that the external process writes through the stdout will arrive as a message to the owning process.
When the Port is closed, the pipes attaching it to the external process are broken, and so trying to read or write to them will give you a SIGPIPE/EPIPE.
You can detect this in the external process when writing to or reading from those file descriptors, and exit the process at that point.
E.g.: with your current code, you can retrieve the OS pid of the external process with proplists:get_value(os_pid, erlang:port_info(Port)). If you strace it, you will see:
write(1, ..., 38) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=31297, si_uid=1001} ---
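If you control the source of the external program, a minimal sketch of that check in C could look like the following (the output line and the 2-second sleep are only placeholders for whatever your real program does):

#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Ignore SIGPIPE explicitly so write() reliably returns -1 with errno EPIPE
       (when started from an Erlang port it is already ignored; see the next section). */
    signal(SIGPIPE, SIG_IGN);
    for (;;) {
        const char msg[] = "some output\n";   /* placeholder for real output */
        if (write(STDOUT_FILENO, msg, sizeof msg - 1) < 0 && errno == EPIPE)
            exit(1);                          /* the Erlang side is gone: quit */
        sleep(2);
    }
}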
SIGPIPE in ports and Erlang
It seems that although the default action for SIGPIPE is to terminate the process, Erlang sets it to ignore the signal (and child processes inherit this configuration).
If you're unable to modify the external process code to detect the EPIPE, you can use this C wrapper to reset the default action:
#include <unistd.h>
#include <signal.h>

int main(int argc, char* argv[]) {
    if (signal(SIGPIPE, SIG_DFL) == SIG_ERR)   /* restore the default: terminate on SIGPIPE */
        return 1;
    if (argc < 2)
        return 2;
    execv(argv[1], &argv[1]);                  /* replace this process with the real program */
    return 3;                                  /* only reached if execv fails */
}
Just compile it and spawn it from open_port as wrapper path-to-executable [arg1 [arg2 [...]]].

Related

Does adding '&' make it run as a daemon?

I am aware that adding '&' at the end makes it run in the background, but does it also mean that it runs as a daemon?
Like:
celery -A project worker -l info &
celery -A project worker -l info --detach
I am sure that the first one runs in the background; the second, as stated in the documentation, runs in the background as a daemon.
I would love to know the main difference between the commands above.
They are different!
"&" version is background , but not run as daemon, daemon process will detach with terminal.
in C language ,daemon can write in code :
if (fork() > 0) exit(0);   /* parent exits; the child keeps running in the background */
setsid();                  /* start a new session with no controlling terminal */
close(0);                  /* close the inherited stdio descriptors ... */
close(1);
close(2);
open("/dev/null", O_RDWR); /* ... and reopen /dev/null as fd 0, 1 and 2 */
dup(0);
dup(0);
if (fork() > 0) exit(0);   /* give up session leadership so a terminal can never be reacquired */
This ensures that the process is no longer in the same process group as the terminal and thus won't be killed together with it. The I/O redirection is to make output not appear on the terminal (see: https://unix.stackexchange.com/questions/56495/whats-the-difference-between-running-a-program-as-a-daemon-and-forking-it-into).
A daemon runs in its own session, is not attached to a terminal, has no file descriptors inherited from the parent open to anything, has no parent caring for it (other than init), and has its current directory set to / so as not to prevent a umount, while the "&" version does none of this.
Yes, the process will be run as a daemon, or background process; they both do the same thing.
You can verify this by looking at the opt parser in the source code (if you really want to verify this):
.. cmdoption:: --detach

    Detach and run in the background as a daemon.
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/bin/beat.py#L12
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/platforms.py#L365
Ultimately, the code below is what detaches it in the DaemonContext. Notice the fork and exit calls:
def _detach(self):
    if os.fork() == 0:      # first child
        os.setsid()         # create new session
        if os.fork() > 0:   # pragma: no cover
            # second child
            os._exit(0)
    else:
        os._exit(0)
    return self
Not really. The process started with & runs in the background, but is attached to the shell that started it, and the process output goes to the terminal.
Meaning, if the shell dies or is killed (or the terminal is closed), that process will be sent a HUP signal and will die as well (if it doesn't catch it, or if its output goes to the terminal).
The command nohup detaches a process (command) from the shell and redirects its I/O, and prevents it from dying when the parent process (shell) dies.
Example:
You can see that by opening two terminals. In one run
sleep 500 &
in the other one run ps -ef to see the list of processes, and near the bottom something like
me    1234  1201  ...  sleep 500
      ^     ^
      |     parent process id (the shell)
      process id
Close the terminal in which sleep is sleeping in the background, then run ps -ef again: the sleep process is gone.
A daemon job is usually started by the system, via init or upstart (its owner may then be changed to a regular user).

Redirecting stderr, stdout and stdin to pipe is blocking the process from running

I have written a simple application which has two threads. First thread is printing to stdout and second thread is reading from stdin.
I have redirected stdin, stdout and stderr of the process to 3 different pipes as below.
mkfifo pipe_in && mkfifo pipe_out && mkfifo pipe_err
./a.out < pipe_in 2> pipe_err 1> pipe_out &
The problem is that this application (./a.out) is blocked from running until I do the following:
cat < pipe_out &
cat < pipe_err &
cat > pipe_in
Why is this application blocked? Is it because nobody on the other side has opened the pipe?
What is the workaround so that my application is not blocked completely? I want only the thread that is waiting for user input to block, and the other thread to continue execution.
This application is started at boot, so it should run without blocking on user input. The user can run "cat > pipe_in" at any time to start providing input and get some details about the application.
Redirection is done by the shell, before starting the application program. Thus a.out does not start, and cannot create any threads that do anything, until the opens of all three pipes complete: opening a FIFO for writing blocks until someone opens it for reading, and opening it for reading blocks until someone opens it for writing.
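One possible workaround, assuming you can change a.out to open the FIFO itself instead of relying on shell redirection (a sketch; pipe_in is the name from the question): on Linux, opening a FIFO with O_RDWR returns immediately even when nobody has opened the other end, so the input thread blocks in read() rather than keeping the whole program from starting.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Linux-specific: O_RDWR on a FIFO does not block in open(), even if the
       other end is not open yet. Because this process also holds a write end,
       read() waits for data instead of returning end-of-file. */
    int in = open("pipe_in", O_RDWR);
    if (in < 0) { perror("open pipe_in"); return 1; }

    char buf[256];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);   /* placeholder: handle the input */
    return 0;
}

The same trick works for the output FIFOs, with the caveat that writes will eventually block once the pipe buffer fills up if nothing is reading from them.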

Does linux kill background processes if we close the terminal from which it has started?

I have an embedded system on which I telnet in and then run an application in the background:
./app_name &
Now if I close my terminal, telnet in from another terminal, and check, I can see that this process is still running.
To check this I have written a small program:
#include <stdio.h>

int main()
{
    while (1);
}
I ran this program on my local Linux PC in the background and closed the terminal.
When I checked for this process from another terminal, I found that it had been killed.
My question is:
Why the different behavior for the same type of process?
What does it depend on?
Is it dependent on the version of Linux?
Who should kill jobs?
Normally, foreground and background jobs are killed by SIGHUP sent by kernel or shell in different circumstances.
When does kernel send SIGHUP?
Kernel sends SIGHUP to controlling process:
for real (hardware) terminal: when disconnect is detected in a terminal driver, e.g. on hang-up on modem line;
for pseudoterminal (pty): when last descriptor referencing master side of pty is closed, e.g. when you close terminal window.
Kernel sends SIGHUP to other process groups:
to foreground process group, when controlling process terminates;
to orphaned process group, when it becomes orphaned and it has stopped members.
Controlling process is the session leader that established the connection to the controlling terminal.
Typically, the controlling process is your shell. So, to sum up:
kernel sends SIGHUP to the shell when real or pseudoterminal is disconnected/closed;
kernel sends SIGHUP to foreground process group when the shell terminates;
kernel sends SIGHUP to orphaned process group if it contains stopped processes.
Note that kernel does not send SIGHUP to background process group if it contains no stopped processes.
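A small self-contained demo of the pty case (a sketch, not taken from any of the sources above: the parent plays the role of the terminal emulator closing its window, the child plays the shell):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0) return 1;
    char *slave_name = ptsname(master);

    pid_t pid = fork();
    if (pid == 0) {                      /* child: plays the role of the shell */
        close(master);                   /* keep only the slave side */
        setsid();                        /* become session leader without a controlling tty */
        open(slave_name, O_RDWR);        /* first tty opened becomes the controlling tty */
        pause();                         /* wait; the default SIGHUP action will kill us */
        _exit(0);
    }

    sleep(1);                            /* give the child time to set up */
    close(master);                       /* "close the terminal window" */

    int status;
    wait(&status);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGHUP)
        puts("child was terminated by SIGHUP");
    return 0;
}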
When does bash send SIGHUP?
Bash sends SIGHUP to all jobs (foreground and background):
when it receives SIGHUP, and it is an interactive shell (and job control support is enabled at compile-time);
when it exits, it is an interactive login shell, and huponexit option is set (and job control support is enabled at compile-time).
See more details here.
Notes:
bash does not send SIGHUP to jobs removed from job list using disown;
processes started using nohup ignore SIGHUP.
More details here.
What about other shells?
Usually, shells propagate SIGHUP. Generating SIGHUP at normal exit is less common.
Telnet or SSH
Under telnet or SSH, the following should happen when connection is closed (e.g. when you close telnet window on PC):
client is killed;
server detects that client connection is closed;
server closes master side of pty;
kernel detects that master pty is closed and sends SIGHUP to bash;
bash receives SIGHUP, sends SIGHUP to all jobs and terminates;
each job receives SIGHUP and terminates.
Problem
I can reproduce your issue using bash and telnetd from busybox or dropbear SSH server: sometimes a background job doesn't receive SIGHUP (and doesn't terminate) when the client connection is closed.
It seems that a race condition occurs when the server (telnetd or dropbear) closes the master side of the pty:
normally, bash receives SIGHUP and immediately kills background jobs (as expected) and terminates;
but sometimes, bash detects EOF on slave side of pty before handling SIGHUP.
When bash detects EOF, it by default terminates immediately without sending SIGHUP. And background job remains running!
Solution
It is possible to configure bash to send SIGHUP on normal exit (including EOF) too:
Ensure that bash is started as login shell. The huponexit works only for login shells, AFAIK.
Login shell is enabled by -l option or leading hyphen in argv[0]. You can configure telnetd to run /bin/bash -l or better /bin/login which invokes /bin/sh in login shell mode.
E.g.:
telnetd -l /bin/login
Enable huponexit option.
E.g.:
shopt -s huponexit
Type this in bash session every time or add it to .bashrc or /etc/profile.
Why does the race occur?
bash unblocks signals only when it's safe, and blocks them when some code section can't be safely interrupted by a signal handler.
Such critical sections invoke interruption points from time to time, and if a signal is received while a critical section is executing, its handler is delayed until the next interruption point is reached or the critical section is exited.
You can start digging from quit.h in the source code.
Thus, it seems that in our case bash sometimes receives SIGHUP when it's in a critical section. SIGHUP handler execution is delayed, and bash reads EOF and terminates before exiting critical section or calling next interruption point.
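As a generic illustration of that pattern (this is not bash's actual code, just a sketch of blocking a signal across a critical section and acting on it at the next safe point):

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_sighup = 0;

static void on_sighup(int sig) { (void)sig; got_sighup = 1; }

int main(void) {
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_sighup;
    sigaction(SIGHUP, &sa, NULL);

    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGHUP);

    for (;;) {
        sigprocmask(SIG_BLOCK, &block, &old);   /* enter the "critical section" */
        /* ... work that must not be interrupted by the handler ... */
        sigprocmask(SIG_SETMASK, &old, NULL);   /* interruption point: a pending SIGHUP is delivered here */

        if (got_sighup) {                       /* act on it only at a safe point */
            puts("SIGHUP handled at a safe point");
            return 0;
        }
        sleep(1);
    }
}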
Reference
"Job Control" section in official Glibc manual.
Chapter 34 "Process Groups, Sessions, and Job Control" of "The Linux Programming Interface" book.
When you close the terminal, the shell sends SIGHUP to all background processes, and that kills them. This can be suppressed in several ways, most notably:
nohup
When you run a program with nohup, it makes the program ignore SIGHUP and redirects its output.
$ nohup app &
disown
disown tells the shell not to send SIGHUP to that job.
$ app &
$ disown
Is it dependent on the version of Linux?
It is dependent on your shell. The above applies at least to bash.
AFAIK in both cases the process should be killed. In order to avoid this you have to issue a nohup like the following:
> nohup ./my_app &
This way your process will continue executing. Probably the telnet part is due to a BUG similar to this one:
https://bugzilla.redhat.com/show_bug.cgi?id=89653
In order to completely understand what's happening, you need to get into Unix internals a little bit.
When you are running a command like this
./app_name &
app_name is run in a background process group. You can read about Unix process groups here.
When you close bash with a normal exit, it sends the SIGHUP hangup signal to all of its jobs. Some information on Unix job control is here.
In order to keep your app running when you exit bash, you need to make it immune to the hangup signal with the nohup utility.
nohup - run a command immune to hangups, with output to a non-tty
And finally this is how you need to do it.
nohup app_name 2> /dev/null &
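Roughly, what nohup does before exec'ing the command can be sketched in C like this (simplified; the real utility handles more cases):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 125;
    }

    signal(SIGHUP, SIG_IGN);                   /* SIG_IGN survives exec, so the command ignores hangups */
    if (isatty(STDOUT_FILENO)) {
        int fd = open("nohup.out", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd >= 0) dup2(fd, STDOUT_FILENO);  /* keep output away from the soon-to-vanish tty */
    }
    if (isatty(STDERR_FILENO))
        dup2(STDOUT_FILENO, STDERR_FILENO);

    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 127;
}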
In modern Linux--that is, Linux with systemd--there is an additional reason this might happen which you should be aware of: "linger".
systemd kills processes left running from a login shell, even if the process is properly daemonized and protected from HUP. This is the default behavior in modern configurations of systemd.
If you run
loginctl enable-linger $USER
you can disable this behavior, allowing background processes to keep running. The mechanisms covered by the other answers still apply, however, and you should also protect your process against them.
enable-linger is permanent until it is re-disabled. You can check it with
ls /var/lib/systemd/linger
This may have files, one per username, for users who have enable-linger. Any user listed in the directory has the ability to leave background processes running at logout.

Init script won't "stop" forked C program

I have a C program that has a "daemon" mode, so that I can have it run in the background. When it is run with "-d" it forks using the following code:
if (daemon_mode == 1)
{
    int i = fork();
    if (i < 0) exit(1); // error
    if (i > 0) exit(0); // parent
}
I created an init script, and when I manually run the init script to start my daemon, it starts OK. However, if I run it with "stop", the daemon isn't stopped.
I imagine the issue is that the PID has changed due to forking. What am I doing wrong and how do I fix it?
If you are using a pid file to control the process, then you are likely correct that changing the pid is causing a problem. Just write the pid file after you have daemonized rather than before.
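A minimal sketch of that ordering (the /var/run/app_name.pid path and the helper name are just examples, not from the question):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Example path; use whatever your init script reads with "stop". */
#define PIDFILE "/var/run/app_name.pid"

static void write_pidfile(void) {
    FILE *f = fopen(PIDFILE, "w");
    if (f) {
        fprintf(f, "%d\n", (int)getpid());  /* this pid belongs to the surviving child */
        fclose(f);
    }
}

int main(void) {
    int daemon_mode = 1;                    /* stand-in for the -d flag parsing */
    if (daemon_mode == 1) {
        int i = fork();
        if (i < 0) exit(1);                 /* error */
        if (i > 0) exit(0);                 /* parent exits, as in the question */
    }
    write_pidfile();                        /* only now does getpid() match the daemon */
    for (;;) pause();                       /* stand-in for the real work */
    return 0;
}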

Bash script: can not properly handle SIGTSTP

I have a bash script that mounts and unmounts a device, performing some read operations in between. Since the device is very slow, the script takes about 15 seconds to complete (the mount taking at least 5-6 seconds). Since leaving this device mounted can cause other problems, I don't want this script to be interrupted.
Having said that, I can correctly handle SIGINT (Ctrl+c), but when I try to handle SIGTSTP (Ctrl+z), the script freezes. Which means the signal is trapped but the handler doesn't run.
#!/bin/sh

cleanup()
{
    # Don't worry about unmounting yet. Just checking if trap works.
    echo "Quitting..." > /dev/tty
    exit 0
}

trap 'cleanup' SIGTSTP
...
I manually have to send the KILL signal to the process. Any idea why this is happening and how I can fix it?
The shell does not execute the trap until the currently executing process terminates. (at least, that is the behavior of bash 3.00.15). If you send SIGINT via ^c, it is sent to all processes in the foreground process group; if the program currently executing receives it and terminates then bash can execute the trap. Similarly with SIGTSTP via ^z; bash receives the signal but does not execute the trap until the program that was being run terminates, which it does not do if it takes the default behavior and is suspended. Try replacing ... with a simple read f and note that the trap executes immediately.
