nohup "does not work" MPIrun - linux

I am trying to use the "nohup" command to avoid killing a background process when exiting the terminal on linux MATE.
The process I want to run is a MPIrun process and I use the following command:
nohup mpirun -np 8 solverName -parallel >log 2>&1
When I leave the terminal, the processes running on the different cores are killed.
Another thing I noticed in the log file: if I just run the following command
mpirun -np 8 solverName -parallel >log 2>&1
and then press CTRL+Z (to stop the process), the log file shows:
Forwarding signal 20 to job
and I am unable to actually stop the mpirun command. So I guess there is something I don't understand in what I am doing

The job run in the background is still owned by your login shell (the nohup command doesn't exit until the mpirun command terminates), so it gets signalled when you disconnect. This script (I call it bk) is what I use:
#!/bin/sh
#
# @(#)$Id: bk.sh,v 1.9 2008/06/25 16:43:25 jleffler Exp $"
#
# Run process in background
# Immune from logoffs -- output to file log
(
echo "Date: `date`"
echo "Command: $*"
nice nohup "$@"
echo "Completed: `date`"
echo
) >>${LOGFILE:=log} 2>&1 &
(If you're into curiosities, note the careful use of $* and "$@". The nice runs the job at a lower priority when I'm not there. And version 1.1 was checked into version control, SCCS at the time, on 1987-08-10.)
For your process, you'd run:
$ bk mpirun -np 8 solverName -parallel
$
The prompt returns almost immediately. The key differences between what is in that code and what you do direct from the command line are:
There's a sub-process for the shell script, which terminates promptly.
The script itself runs the command in a sub-shell in background.
Between them, these mean that the process is not interfered with by your login shell; it doesn't know about the grandchild process.
Running direct on the command line, you'd write:
(nohup mpirun -np 8 solverName -parallel >log 2>&1 &)
The parentheses start a subshell; the sub-shell runs nohup in the background with I/O redirection and terminates. The continuing command is a grandchild of your login shell and is not interfered with by your login shell.
I'm not an expert in mpirun, never having used it, so there's a chance it does something I'm not expecting. My impression from the manual page is that it acts more or less like a regular process even though it can run multiple other processes, possibly on multiple nodes. That is, it runs the other processes but monitors and coordinates them and only exits when its children are complete. If that's correct, then what I've outlined is accurate enough.
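As a minimal sketch of the subshell-plus-background pattern described above, the following uses sleep 5 as a stand-in for the real mpirun command, and a hypothetical pidfile helper so the job can be found again afterwards. It only illustrates that the job ends up with a parent other than the launching shell; it assumes a ps that accepts -o ppid= and -p.

```shell
#!/bin/sh
# Sketch only: "sleep 5" stands in for the real mpirun command, and
# pidfile is a hypothetical helper used to locate the job afterwards.
pidfile=$(mktemp)

# Subshell + background: the subshell exits at once, orphaning the job.
( nohup sh -c 'echo "$$" > "$1"; exec sleep 5' sh "$pidfile" >/dev/null 2>&1 & )

sleep 1                                   # give the job time to record its PID
job_pid=$(cat "$pidfile")
parent_pid=$(ps -o ppid= -p "$job_pid" | tr -d ' ')
echo "job $job_pid: parent is $parent_pid, this shell is $$"

kill "$job_pid"                           # clean up the stand-in job
rm -f "$pidfile"
```

The reported parent is whatever process adopts orphans on the system (often PID 1), not the shell that typed the command.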

To kill the process, first list your jobs:
$ jobs -l
This gives you the PID of the process, like this:
[1]+ 47274 Running nohup mpirun -np 8 solverName -parallel >log 2>&1
Then kill the process using that PID (47274 in this example):
kill -9 47274
Note that Ctrl+Z does not kill the process; it suspends it.
For the first part of the question, I recommend trying this command to see whether it works:
nohup mpirun -n 8 --your_flags ./compiled_solver_name > Output.txt &
It worked for me.
tell us if it doesn't work for you.
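To illustrate the difference between suspending and killing mentioned above, here is a small sketch. It uses sleep 30 as a stand-in job and sends SIGSTOP rather than the SIGTSTP (signal 20 on Linux) that Ctrl+Z generates, because SIGSTOP cannot be caught or ignored; that substitution is an assumption made to keep the demo predictable.

```shell
#!/bin/sh
# Sketch only: "sleep 30" stands in for the job. SIGSTOP is used in place
# of the SIGTSTP sent by Ctrl+Z because it cannot be caught or ignored.
sleep 30 &
pid=$!

kill -STOP "$pid"                          # suspend the job
sleep 1
stopped_state=$(ps -o stat= -p "$pid" | tr -d ' ')

kill -CONT "$pid"                          # resume it, as fg or bg would
sleep 1
resumed_state=$(ps -o stat= -p "$pid" | tr -d ' ')

kill "$pid"                                # SIGTERM is what actually ends it
wait "$pid" 2>/dev/null
echo "while stopped: $stopped_state, after SIGCONT: $resumed_state"
```

A process state beginning with T means the process is stopped but still exists.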

Related

How to run gdb on httpd processes within a shell script

I would like to get all my httpd processes, put in an array, then run gdb on each process, run as a cron, save output to file. For instance:
#!/bin/bash
# Make a list of current httpd pid's and then run "gdb" on each one
pids=( $(pgrep 'httpd') )
for each in "${pids[@]}"
do
echo "$each"
gdb httpd $each >> gdbscipt.out
echo "Done with: $each"
done
When I run it, it only runs on the first pid.
# ./gdbscript
2046
Then it just stops after each pid is processed, because it seems there is a breakpoint within gdb after processing each pid.
I want to run it overnight a few times via cron.
Is there a better approach to running gdb on a list of active httpd processes via cron and outputting to a file(s)?
Thanks

Ending an mpirun process terminates a bash loop

I'm trying to schedule a series of mpi jobs on an Ubuntu 14.04 LTS machine using a bash script. Basically, I want a simulation to run on every core for a certain amount of time, then terminate and move on to the next case once that time has elapsed.
My issue arises when mpi exits at the end of the first job - it breaks the loop and returns the terminal to my control instead of heading onto the next iteration of the loop.
My script is included below. The file "case_names" is just a text file of directory names. I've tested the script with other commands and it works fine until I uncomment the mpirun call.
#!/bin/bash
while read line;
do
# Access case directory
cd $line
echo "Case $line accessed"
# Start simulation
echo "Case $line starting: $(date)"
mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &
# Wait for 10 hour runtime
sleep 36000
# Kill job
pkill mpirun > /dev/null
echo "Case $line terminated: $(date)"
# Return to parent directory
cd ..
done < case_names
Does anyone know of a way to stop mpirun from breaking the loop like this?
So far I've tried GNOME task scheduler and task-spooler, but neither have worked (likely due to aliases that have to be invoked before the commands I use become available). I'd really rather not have to resort to setting up slurm. I've also tried using the disown command to separate the mpi process from the shell I'm running the scheduling script in, and have even written a separate script just to kill processes which the scheduling script runs remotely.
Many thanks in advance!
I've managed to find a workaround that allows me to schedule tasks with a bash script like I wanted. Since this solves my issue, I'm posting it as an answer (although I would still welcome an explanation as to why mpi behaves in this way in loops).
The solution lay in writing a separate script for both calling and then killing mpi, which would itself be called by the scheduling script. Since this child bash process has no loops in it, there are no issues with mpi breaking them after being killed. Also, once this script has exited, the scheduling loop can continue unimpeded.
My (now working) code is included below.
Scheduling script:
while read line;
do
cd $line
echo "CWD: $(pwd)"
echo "Case $line accessed"
bash ../run_job
echo "Case $line terminated: $(date)"
cd ..
done < case_names
Execution script (run_job):
mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &
echo "Case $line starting: $(date)"
sleep 600
pkill mpirun
I hope someone will find this useful.
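A different technique that may be worth trying, and not the one used above, is to run each job in the foreground under the GNU coreutils timeout utility instead of backgrounding it and calling pkill by name. The sketch below assumes timeout is installed; case_list and sleep 30 are hypothetical stand-ins for case_names and the mpirun command, and whether mpirun shuts down its ranks cleanly on the SIGTERM that timeout sends is not something this thread confirms.

```shell
#!/bin/bash
# Sketch only: case_list and "sleep 30" are stand-ins for case_names and
# the mpirun command; timeout is the GNU coreutils utility.
case_list=$(mktemp)
printf 'caseA\ncaseB\n' > "$case_list"

completed=0
while read -r line; do
    echo "Case $line starting: $(date)"
    # Run in the foreground for at most 1 second; </dev/null keeps the
    # command from consuming the loop's input.
    timeout 1 sleep 30 < /dev/null
    status=$?                      # 124 means the time limit was reached
    echo "Case $line terminated: $(date) (status $status)"
    completed=$((completed + 1))
done < "$case_list"

rm -f "$case_list"
echo "$completed cases processed"
```

Because timeout signals only the command it started, other mpirun processes owned by the same user are left alone, which pkill mpirun does not guarantee.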

Kill ssh or\and remote process from bash script

I am trying to run the following command as part of a bash script that is supposed to open an ssh channel, run a program on the remote machine, save its output to a file for 10 seconds, kill the process that was writing to the file, and then give control back to the bash script.
#!/bin/bash
ssh hostname '/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null; sshpid=!$; sleep 10; kill -9 $sshpid 2>/dev/null &'
Unfortunately, what it seems to be doing is starting the program: nodes-listener remotely, but it never gets any further and it doesn't give control to the bash script. So, the only way to stop the execution is to do Ctrl+C.
Killing ssh doesn't help (or rather can't be executed) since the control is not with bash script as it waits for the command within the ssh session to complete, which of course never happens as it has to be killed to stop.
Here's the command line that you're running on the remote system:
/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null
sshpid=!$
sleep 10
kill -9 $sshpid 2>/dev/null &
You should change it to this:
/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null &   # <-- ampersand goes here
sshpid=$!
sleep 10
kill -9 $sshpid 2>/dev/null
You want to start nodes-listener and then kill it after ten seconds. To do this, you need to start nodes-listener as a background process, so that the shell executing this command line can move on to the next command after starting nodes-listener. The & in your command line is in the wrong place and applies only to the kill command. You need to apply it to the nodes-listener command.
I'll also note that your sshpid=!$ line was incorrect. You want sshpid=$!. $! is the process ID of the last command started in the background.
You need to place the ampersand after the first command, then put the remaining commands onto the next line:
ssh hostname -- '/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null &
sshpid=$!; sleep 10; kill $sshpid 2>/dev/null'
By the way, ssh returns after all the commands have been executed. This means it will close the allocated pty as well. If there are still background jobs running in that shell session, they will be killed by SIGHUP. This means you can probably omit the explicit kill command (depending on whether nodes-listener handles SIGHUP and SIGTERM differently). With that in mind, you could simplify the code to the following:
ssh hostname -- sh -c '/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null &
sleep 10'
I have resolved this by pushing the shell script to the remote machine and executing it there. It is actually less tidy and relies on space being available on the remote computer.
Since my remote machine is a small physical device, the issue of the space usage is important (even for the tiny amount of space required in this case).
/root/bin/nodes-listener > /tmp/nodesListener.out </dev/null &
sshpid=$!
sleep 20
sync
# killing nodes-listener process and giving control back to the base bash
killall -9 nodes-listener 2>/dev/null && echo "nodes-listener is killed"

How can I launch a new process that is NOT a child of the original process?

(OSX 10.7) An application we use lets us assign scripts to be called when certain activities occur within the application. I have assigned a bash script and it's being called. The problem is that I need to execute a few commands, wait 30 seconds, and then execute some more commands. If I have my bash script do a "sleep 30", the entire application freezes for those 30 seconds while waiting for my script to finish.
I tried putting the 30 second wait (and the second set of commands) into a separate script and calling "./secondScript &" but the application still sits there for 30 seconds doing nothing. I assume the application is waiting for the script and all child processes to terminate.
I've tried these variations for calling the second script from within the main script, they all have the same problem:
nohup ./secondScript &
( ( ./secondScript & ) & )
( ./secondScript & )
nohup script -q /dev/null secondScript &
I do not have the ability to change the application and tell it to launch my script and not wait for it to complete.
How can I launch a process (I would prefer the process to be in a scripting language) such that the new process is not a child of the current process?
Thanks,
Chris
p.s. I tried the "disown" command and it didn't help either. My main script looks like this:
[initial commands]
echo Launching second script
./secondScript &
echo Looking for jobs
jobs
echo Sleeping for 1 second
sleep 1
echo Calling disown
disown
echo Looking again for jobs
jobs
echo Main script complete
and what I get for output is this:
Launching second script
Looking for jobs
[1]+ Running ./secondScript &
Sleeping for 1 second
Calling disown
Looking again for jobs
Main script complete
and at this point the calling application sits there for 45 seconds, waiting for secondScript to finish.
p.p.s
If, at the top of the main script, I execute "ps" the only thing it returns is the process ID of the interactive bash session I have open in a separate terminal window.
The value of $SHELL is /bin/bash
If I execute "ps -p $$" it correctly tells me
PID TTY TIME CMD
26884 ?? 0:00.00 mainScript
If I execute "lsof -p $$" it gives me all kinds of results (I didn't paste all the columns here assuming they aren't relevant):
FD TYPE NAME
cwd DIR /private/tmp/blahblahblah
txt REG /bin/bash
txt REG /usr/lib/dyld
txt REG /private/var/db/dyld/dyld_shared_cache_x86_64
0 PIPE
1 PIPE -> 0xffff8041ea2d10
2 PIPE -> 0xffff 8017d21cb
3r DIR /private/tmp/blahblah
4r REG /Volumes/DATA/blahblah
255r REG /Volumes/DATA/blahblah
The typical way of doing this in Unix is to double fork. In bash, you can do this with
( sleep 30 & )
(..) creates a child process, and & creates a grandchild process. When the child process dies, the grandchild process is inherited by init.
If this doesn't work, then your application is not waiting for child processes.
Other things it may be waiting for include the session and open lock files:
To create a new session, Linux has a setsid. On OS X, you might be able to do it through script, which incidentally also creates a new session:
# Linux:
setsid sleep 30
# OS X:
nohup script -q /dev/null sleep 30 &
To find a list of inherited file descriptors, you can use lsof -p yourpid, which will output something like:
sleep 22479 user 0u CHR 136,32 0t0 35 /dev/pts/32
sleep 22479 user 1u CHR 136,32 0t0 35 /dev/pts/32
sleep 22479 user 2u CHR 136,32 0t0 35 /dev/pts/32
sleep 22479 user 5w REG 252,0 0 1048806 /tmp/lockfile
In this case, in addition to the standard FDs 0, 1 and 2, you also have a fd 5 open with a lock file that the parent can be waiting for.
To close fd 5, you can use exec 5>&-. If you think the lock file might be stdin/stdout/stderr themselves, you can use nohup to redirect them to something else.
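As a minimal sketch of the inherited-descriptor point above: the script below opens fd 5 on a hypothetical temporary lock file, then shows that a child sees the descriptor by default but not when it is closed for that command with 5>&-. The variable names are illustrative only.

```shell
#!/bin/bash
# Sketch only: lockfile is a hypothetical stand-in for /tmp/lockfile.
lockfile=$(mktemp)
exec 5>"$lockfile"        # this shell now holds fd 5 open

# A child inherits fd 5 by default, so writing to it succeeds.
if bash -c 'true >&5' 2>/dev/null; then inherited=yes; else inherited=no; fi

# Closing fd 5 for one command only: that child starts without it.
if bash -c 'true >&5' 5>&- 2>/dev/null; then kept=yes; else kept=no; fi

exec 5>&-                 # close it in this shell for everything started later
rm -f "$lockfile"
echo "inherited by default: $inherited, still open after 5>&-: $kept"
```

The per-command form (cmd 5>&-) leaves the descriptor open in the launching shell, while exec 5>&- closes it for the shell itself and all later children.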
Another way is to abandon the child
#!/bin/bash
yourprocess &
disown
As far as I understand, the application replaces the normal bash shell because it is still waiting for a process to finish even if init should have taken care of this child process.
It could be that the "application" intercepts the orphan handling which is normally done by init.
In that case, only a parallel process with some IPC can offer a solution (see my other answer)
I think it depends on how your parent process tries to detect if your child process has been finished.
In my case (my parent process was gnu make), I succeeded by closing stdout and stderr (slightly based on the answer of that other guy) like this:
sleep 30 >&- 2>&- &
You might also close stdin
sleep 30 <&- >&- 2>&- &
or additionally disown your child process (not for Mac)
sleep 30 <&- >&- 2>&- & disown
Currently tested only in bash on kubuntu 14.04 and Mac OSX.
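One plausible reading of why closing stdout and stderr helps (an assumption, not something this thread confirms) is that the caller reads the script's output pipe until every process holding it has closed it. The sketch below imitates such a caller with command substitution, which behaves that way, and uses sleep 3 as a stand-in for the slow second script.

```shell
#!/bin/sh
# Sketch only: command substitution plays the "application" that reads the
# script's output until the pipe closes; "sleep 3" is the slow second script.
start=$(date +%s)
out=$(sh -c 'sleep 3 & echo launched')            # sleep keeps the pipe open
held=$(( $(date +%s) - start ))

start=$(date +%s)
out=$(sh -c 'sleep 3 >&- 2>&- & echo launched')   # sleep gives up the pipe
released=$(( $(date +%s) - start ))

echo "pipe inherited: waited ${held}s; pipe closed: waited ${released}s"
```

In the first call the reader is held until sleep exits even though the launching shell has already finished; in the second it returns as soon as the launching shell exits.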
If all else fails:
Create a named pipe.
Start the "slow" script independently of the "application", and make sure it executes its task in an endless loop that begins by reading from the pipe. It will block when it tries to read.
From the application, start your other script. When it needs to invoke the "slow" script, it just writes some data to the pipe. The slow script is already running independently, so your script won't wait for it to finish.
So, to answer the question:
bash - how can I launch a new process that is NOT a child of the original process?
Simple: don't launch it but let an independent entity launch it during boot...like init or on the fly with the command at or batch
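The named-pipe steps above can be sketched as follows. All names here (workdir, trigger, result, the quit message) are hypothetical, sleep 1 stands in for the slow work, and for the sake of a self-contained demo the worker is started as a background subshell of the same script rather than by an independent entity as the answer recommends.

```shell
#!/bin/bash
# Sketch only: every name below is hypothetical, and "sleep 1" stands in
# for the slow task.
workdir=$(mktemp -d)
fifo=$workdir/trigger
result=$workdir/result
mkfifo "$fifo"

# Worker: loops forever, blocking on the pipe until a request arrives.
(
    while read -r request < "$fifo"; do
        [ "$request" = quit ] && break
        sleep 1                                  # the slow work
        echo "handled $request" >> "$result"
    done
) &
worker=$!

# Caller: the write returns once the worker has picked up the request;
# it does not wait for the slow work itself.
echo "job1" > "$fifo"
echo "caller continues while the worker is busy"

sleep 2                                          # demo only: let the work finish
echo quit > "$fifo"
wait "$worker"
handled=$(cat "$result")
rm -rf "$workdir"
echo "$handled"
```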
Here I have a shell
└─bash(13882)
Where I start a process like this:
$ (urxvt -e ssh somehost&)
I get a process tree (this output snipped from pstree -p):
├─urxvt(14181)───ssh(14182)
where the process is parented beneath pid 1 (systemd in my case).
However, had I instead done this (note where the & is) :
$ (urxvt -e ssh somehost)&
then the process would be a child of the shell:
└─bash(13882)───urxvt(14181)───ssh(14182)
In both cases the shell prompt is immediately returned and I can exit
without terminating the process tree that I started above.
For the latter case the process tree is reparented beneath pid 1 when
the shell exits, so it ends up the same as the first example.
├─urxvt(14181)───ssh(14182)
Either way, the result is a process tree that outlives the shell. The
only difference is the initial parenting of that process tree.
For reference, you can also use
nohup urxvt -e ssh somehost &
urxvt -e ssh somehost & disown $!
Both give the same process tree as the second example above.
└─bash(13882)───urxvt(14181)───ssh(14182)
When the shell is terminated the process tree is, like before, reparented
to pid 1.
nohup additionally redirects the process' standard output to a file
nohup.out so, if that is a useful trait, it may be a more useful choice.
Otherwise, with the first form above, you immediately have a completely
detached process tree.

background jobs change to daemon without nohup/disown?

Something seems strange to me.
I have a script, while.sh, whose content is:
while [ 1 ];do
sleep 1
echo `date`
done
I run it as $ while.sh >& while.log & (without nohup, disown, setsid, or a double fork()).
After I exit and log in again, I can see that the process still exists; its PPID is 1 and its TTY is ?.
My system is RHEL6 (RHEL5 behaves the same), bash
On CentOS 5.x, I must use nohup or disown, or do a double fork() in code.
What is happening on RHEL6?
Is the huponexit shell option set?
$ shopt
...
huponexit off
Bash will send a SIGHUP signal to its jobs if it receives a SIGHUP itself, but it won't signal them when it exits normally unless you enable this option.
For what it's worth this is disabled on both RHEL6 and RHEL5, at least on the systems I just tested. I tried this command:
$ sleep 1000 &
It was not killed when I logged out and logged back in unless I deliberately enabled shopt -s huponexit.
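As a small sketch of checking and toggling the option from a script: shopt -q reports the state through its exit status, without printing anything. The variable names are illustrative. Per the bash manual, huponexit only takes effect when an interactive login shell exits, so enabling it inside a script like this one is purely a demonstration.

```shell
#!/bin/bash
# Sketch only: report the option, enable it, then switch it back off.
if shopt -q huponexit; then initial=on; else initial=off; fi

shopt -s huponexit
if shopt -q huponexit; then enabled=on; else enabled=off; fi

shopt -u huponexit
echo "huponexit was $initial; after shopt -s it is $enabled"
```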
