process with openmpi is killed despite nohup command - openmpi

I have used GAMESS for quantum computation in serial mode. I started to use it in parallel mode and it looks successful. But when I close my ssh terminal, the process is killed despite nohup command. In the log file, there is message "mpirun noticed that process rank 1 with PID 2254 on node * exited on signal 1 (Hangup)". How can I solve this problem? :(

Related

Running two node servers with one bash script and receive console logs

I have a bash script:
node web/dist/web/src/app.js & node api/dist/api/src/app.js &
$SHELL
It successfully starts both my node servers. However:
I do not receive any output (from console.log etc) in my terminal window
If I cancel by (Ctrl +C) the processes are not exited, so then I annoyingly have to manually do a taskkill /F /PID etc afterwards.
Is there anyway around this?
The reason you can't stop your background jobs with Ctrl+C is because signals (SIGINT in this case) are received only by the foreground process.
When your foreground process (the non-interactive main script) exits, its children processes become orphans which are immediately adopted by the init process. To kill them, you need their PIDs. (When you run a background process in an interactive shell, it will receive the SIGHUP, and probably exit, when shell exits.)
The solution in your case is to make your script wait for its children, using the shell built-in wait command. wait will ensure your script receives the SIGINT, which you can then handle (with trap) and kill the background jobs (with kill 0):
#!/bin/bash
trap 'kill 0' EXIT
node app1.js &
node app2.js &
wait
By setting trap on EXIT (special pseudo-signal in bash), you'll ensure background processes will terminate whenever your main script exits (either by Ctrl+C/SIGINT, or by any other signal like SIGTERM, SIGHUP, SIGKILL). The kill 0 command kills all processes in the current process group.
Regarding the output -- on Linux, background processes will inherit the standard output/error from shell (if not redirected), and continue to write to your TTY/terminal. If that's not working on Windows, I'm not sure why not.
However, even if your background processes somehow lost their way to your TTY, you can, as a workaround, append to a log file:
node app1.js >>/path/to/file.log 2>&1 &
node app2.js >>/path/to/file.log 2>&1 &
and then tail -f that log file, either in this, or some other terminal:
tail -f /path/to/file.log

Does adding '&' makes it run as a daemon?

I am aware that adding a '&' in the end makes it run as a background but does it also mean that it runs as a daemon?
Like:
celery -A project worker -l info &
celery -A project worker -l info --detach
I am sure that the first one runs in a background however the second as stated in the document runs in the background as a daemon.
I would love to know the main difference of the commands above
They are different!
"&" version is background , but not run as daemon, daemon process will detach with terminal.
in C language ,daemon can write in code :
fork()
setsid()
close(0) /* and /dev/null as fd 0, 1 and 2 */
close(1)
close(2)
fork()
This ensures that the process is no longer in the same process group as the terminal and thus won't be killed together with it. The IO redirection is to make output not appear on the terminal.(see:https://unix.stackexchange.com/questions/56495/whats-the-difference-between-running-a-program-as-a-daemon-and-forking-it-into)
a daemon make it to be in its own session, not be attached to a terminal, not have any file descriptor inherited from the parent open to anything, not have a parent caring for you (other than init) have the current directory in / so as not to prevent a umount... while "&" version do not
Yes the process will be ran as a daemon, or background process; they both do the same thing.
You can verify this by looking at the opt parser in the source code (if you really want to verify this):
. cmdoption:: --detach
Detach and run in the background as a daemon.
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/bin/beat.py#L12
https://github.com/celery/celery/blob/d59518f5fb68957b2d179aa572af6f58cd02de40/celery/platforms.py#L365
Ultimately, the code below is what detaches it in the DaemonContext. Notice the fork and exit calls:
def _detach(self):
if os.fork() == 0: # first child
os.setsid() # create new session
if os.fork() > 0: # pragma: no cover
# second child
os._exit(0)
else:
os._exit(0)
return self
Not really. The process started with & runs in the background, but is attached to the shell that started it, and the process output goes to the terminal.
Meaning, if the shell dies or is killed (or the terminal is closed), that process will be sent a HUG signal and will die as well (if it doesn't catch it, or if its output goes to the terminal).
The command nohup detaches a process (command) from the shell and redirects its I/O, and prevents it from dying when the parent process (shell) dies.
Example:
You can see that by opening two terminals. In one run
sleep 500 &
in the other one run ps -ef to see the list of processes, and near the bottom something like
me 1234 1201 ... sleep 500
^ ^
process id parent process (shell)
close the terminal in which sleep sleeps in the background, and then do a ps -ef again, the sleep process is gone.
A daemon job is usually started by the system (its owner may be changed to a regular user) by upstart or init.

How to make killall close the terminal that the process is in?

So how can I close the terminal where the process is in with killall.
I have tried this:
In 1st terminal:
killall node
In 2nd terminal:
Ready
Terminated
But I want only the 2nd terminal to close after the node is killed.
You can use the -t option:
killall -t $(tty)
will call all processes started from the terminal session (even with nohup), including the shell. So, your terminal will get closed.
You need to also kill the process which runs the terminal, which is usually the parent process of the node process.
The question How do I get the parent process ID of a given child process? is a good place to start. You can find the PIDs of the node processes via How to find the Process ID of a running terminal program.

How to kill a process by its pid in linux

I'm new in linux and I'm building a program that receives the name of a process, gets its PID (i have no problem with that part) and then pass the PID to the kill command but its not working. It goes something like this:
read -p "Process to kill: " proceso
proid= pidof $proceso
echo "$proid"
kill $proid
Can someone tell me why it isn't killing it ? I know that there are some other ways to do it, even with the PID, but none of them seems to work for me. I believe it's some kind of problem with the Bash language (which I just started learning).
Instead of this:
proid= pidof $proceso
You probably meant this:
proid=$(pidof $proceso)
Even so,
the program might not get killed.
By default, kill PID sends the TERM signal to the specified process,
giving it a chance to shut down in an orderly manner,
for example clean up resources it's using.
The strongest signal to send a process to kill without graceful cleanup is KILL, using kill -KILL PID or kill -9 PID.
I believe it's some kind of problem with the bash language (which I just started learning).
The original line you posted, proid= pidof $proceso should raise an error,
and Bash would print an error message about it.
Debugging problems starts by reading and understanding the error messages the software is trying to tell you.
kill expects you to tell it **how to kill*, so there must be 64 different ways to kill your process :) They have names and numbers. The most lethal is -9. Some interesting ones include:
SIGKILL - The SIGKILL (also -9) signal forces the process to stop executing immediately. The program cannot ignore this signal. This process does not get to clean-up either.
SIGHUP - The SIGHUP signal disconnects a process from the parent process. This an also be used to restart processes. For example, "killall -SIGUP compiz" will restart Compiz. This is useful for daemons with memory leaks.
SIGINT - This signal is the same as pressing ctrl-c. On some systems, "delete" + "break" sends the same signal to the process. The process is interrupted and stopped. However, the process can ignore this signal.
SIGQUIT - This is like SIGINT with the ability to make the process produce a core dump.
use the following command to display the port and PID of the process:
sudo netstat -plten
AND THEN
kill -9 PID
Here is an example to kill a process running on port 8283 and has PID=25334
You have to send the SIGKILL flag with the kill statement.
kill -9 [pid]
If you don't the operating system will choose to kill the process at its convenience, SIGKILL (-9) will tell the os to kill the process NOW without ignoring the command until later.
Try this
kill -9
It will kill any process with PID given in brackets
Try "kill -9 $proid" or "kill -SIGKILL $proid" commands. If you want more information, click.
Based on what you have there, it looks like you aren't getting the actual PID in your proid variable. If you want to capture the output of pidof, you will need to enclose that command in backtics for the old form of command substitution ...
proid=`pidof $proceso`
... or like so for the new form of command substitution.
proid=$(pidof $proceso)
I had a similar problem, only wanting to run monitor (Video surveillance) for several hours a day.
Wrote two sh scripts;
cat startmotion.sh
#!/bin/sh
motion -c /home/username/.config/motion/motion.conf
And the second;
cat killmotion.sh
#!/bin/sh
OA=$(cat /var/run/motion/motion.pid)
kill -9 $OA
These were called from crontab at the scheduled time
ctontab -e
0 15 * * * /home/username/startmotion.sh
0 17 * * * /home/username/killmotion.sh
Very simple, but that's all I needed.

Simulating a process stuck in a blocking system call

I'm trying to test a behaviour which is hard to reproduce in a controlled environment.
Use case:
Linux system; usually Redhat EL 5 or 6 (we're just starting with RHEL 7 and systemd, so it's currently out of scope).
There're situations where I need to restart a service. The script we use for stopping the service usually works quite well; it sends a SIGTERM to the process, which is designed to handle it; if the process doesn't handle the SIGTERM within a timeout (usually a couple of minutes) the script sends a SIGKILL, then waits a couple minutes more.
The problem is: in some (rare) situations, the process doesn't exit after a SIGKILL; this usually happens when it's badly stuck on a system call, possibly because of a kernel-level issue (corrupt filesystem, or not-working NFS filesystem, or something equally bad requiring manual intervention).
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running; we're fixing this with a stronger locking system (so that at least the new process doesn't start if the old is running), but I find it difficult to test the whole thing because I haven't found a way to simulate an hard-stuck process.
So, the question is:
How can I manually simulate a process that doesn't exit when sending a SIGKILL to it, even as a privileged user?
If your process are stuck doing I/O, You can simulate your situation in this way:
lvcreate -n lvtest -L 2G vgtest
mkfs.ext3 -m0 /dev/vgtest/lvtest
mount /dev/vgtest/lvtest /mnt
dmsetup suspend /dev/vgtest/lvtest && dd if=/dev/zero of=/mnt/file.img bs=1M count=2048 &
In this way the dd process will stuck waiting for IO and will ignore every signal, I know the signals aren't ignore in the latest kernel when processes are waiting for IO on nfs filesystem.
Well... How about just not sending SIGKILL? So your env will behave like it was sent, but the process didn't quit.
Once a proces is in "D" state (or TASK_UNINTERRUPTIBLE) in a kernel code path where the execution can not be interrupted while a task is processed, which means sending any signals to the process would not be useful and would be ignored.
This can be caused due to device driver getting too many interrupts from the hardware, getting too many incoming network packets, data from NIC firmware or blocked on a HDD performing I/O. Normally if this happens very quickly and threads remain in this state for very short span of time.
Therefore what you need to be doing is look at the syslog and sar reports during the time when the process was stuck in D-state. If you find stack traces in the log, try to search kernel.bugzilla.org for similar issues or seek support from the Linux vendor.
I would code the opposite way. Have your server process write its pid in e.g. /var/run/yourserver.pid (this is common practice). Have the starting script read that file and test that the process does not exist e.g. with kill of signal 0, or with
yourserver_pid=$(cat /var/run/yourserver.pid)
if [ -f /proc/$yourserver_pid/exe ]; then
You could improve that by readlink /proc/$yourserver_pid/exe and comparing that to /usr/bin/yourserver
BTW, having a process still alive a few seconds after a SIGKILL is a serious situation (the common case when it could happen is if the process is stuck in a D state, waiting for some NFS server), and you probably should detect and syslog it (e.g. with logger in your script).
I also would try to first send SIGTERM, wait a few seconds, send SIGQUIT, wait a few seconds, and at last send SIGKILL and only a few seconds later test that the server process has gone
A bug arose when the script didn't realize that the "old" process hadn't actually exited and started a new process while the old was still running;
This is the bug in the OS/kernel level, not in your service script. The situation is rare and is hard to simulate because the OS is supposed to kill the process when SIGKILL signal happens. So I guess your goal is to let your script work well under a buggy kernel. Is that correct?
You can attach gdb to the process, SIGKILL won't remove such process from processlist but it will flag it as zombie, which might still be acceptable for your purpose.
void#tahr:~$ ping 8.8.8.8 > /tmp/ping.log &
[1] 3770
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 S 0:00 ping 8.8.8.8
void#tahr:~$ sudo gdb -p 3770
...
(gdb)
Other terminal
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 t 0:00 ping 8.8.8.8
sudo kill -9 3770
...
void#tahr:~$ ps 3770
PID TTY STAT TIME COMMAND
3770 pts/13 Z 0:00 [ping] <defunct>
First terminal again
(gdb) quit

Resources