`nohup` issue with submitting `SLURM` job - linux

I have a Python script main.py that runs a bash script; the bash script in turn submits a job, job.bash, and obtains its JOBID with echo $JOBID | awk '{print $4}'. If I run the Python script in a terminal, the bash script works and I am able to obtain and echo the JOBID as follows:
#!/bin/bash
# sbatch prints "Submitted batch job <id>"; tee keeps a copy in output.log
JOBID=$(sbatch ~/job.bash | tee output.log)
JOBID=$(echo $JOBID | awk '{print $4}')
echo $JOBID
Running the above from Python works in a terminal (python main.py), but with nohup python main.py & the echo neither prints nor stores JOBID.
Any reason for this?
I am submitting a Slurm job, so the JOBID is the job ID reported by Slurm.
(Update Jul 17) Looks like the issue is with the command sbatch ~/job.bash | tee output.log: under nohup the job doesn't get submitted, so JOBID never gets stored and echoed.
(Update Jul 18) As per the comments from @pynexj, adding set -x to the script results in:
nohup: ignoring input and redirecting stderr to stdout
+ date
Mon Jul 18 21:46:35 +03 2022
++ sbatch ~/job.bash
++ tee output.log
+ JOBID=
++ echo
++ awk '{print $4}'
+ JOBID=
+ echo
The issue still persists. It appears that nohup is incompatible with sbatch.
Question: Why should nohup prevent submission of a Slurm job? Its objective is merely to ignore the hangup signal, isn't it?

If this problem only happens with nohup present, you can get the benefits of nohup without actually using it with:
yourscript </dev/null >file.log 2>&1 & disown -h "$!"
This does the following:
Redirects stdin from /dev/null with </dev/null
Redirects stdout and stderr to a log file with >file.log 2>&1
Tells the shell not to forward HUP signals to the background process with disown -h "$!"
...which is everything nohup does.
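Applied to the question's wrapper, a minimal sketch (the script and log names here are illustrative):
# start the submit wrapper immune to hangups, without nohup
./submit_wrapper.sh </dev/null >submit.log 2>&1 &
disown -h "$!"
# inside the wrapper, the job id can still be captured in one line:
# JOBID=$(sbatch ~/job.bash | tee output.log | awk '{print $4}')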

Related

Modifying files via slurm epilog script is not effective

I'm on CentOS 6.9 running Slurm 17.11.7. I've modified my /gpfs0/export/slurm/conf/epilog script. Ultimately I would like to print job resource utilization information to the stdout file used by each user's job.
I've been testing it within the conditional at the end of the script for myself before I roll it out to other users. Below is my modified epilog script:
#!/bin/bash
# Clear out TMPDIR on the shared file system after job completes
exec >> /var/log/epilog.log
exec 2>> /var/log/epilog.log
if [ -z "$SLURM_JOB_ID" ]
then
    echo -e " This script should be executed from slurm."
    exit 1
fi
TMPDIR="/gpfs0/scratch/${SLURM_JOB_ID}"
rm -rf "$TMPDIR"
### My additions to the existing script ###
if [ "$USER" == "myuserid" ]
then
    STDOUT=$(scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}')
    # Regular stdout/stderr is not respected, must use python.
    python -c "import sys; stdout=sys.argv[1]; f=open(stdout, 'a'); f.write('sticks\n'); f.close();" ${STDOUT}
fi
exit 0
From the Prolog and Epilog section of the slurm.conf manual it seems that stdout/stderr are not respected, hence I modify the stdout file with Python.
I've picked the compute node node21 to run this job, so I logged into node21 and tried several things to get it to notice my changes to the epilog script.
Reconfiguring slurmd:
sudo scontrol reconfigure
Restart slurm daemon:
sudo service slurm stop
sudo service slurm start
Neither of these seems to pick up the changes to the epilog script when I submit jobs. Yet when I put the same conditional in a batch script, it runs flawlessly:
#!/bin/bash
#SBATCH --nodelist=node21
echo "Hello you!"
echo $HOSTNAME
if [ "$USER" == "myuserid" ]
then
    STDOUT=$(scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}')
    python -c "import sys; stdout=sys.argv[1]; f=open(stdout, 'a'); f.write('sticks\n'); f.close();" ${STDOUT}
    #echo "HELLO! ${USER}"
fi
QUESTION : Where am I going wrong?
EDIT: This is an MWE in the context of trying to print job resource utilization at the end of the output.
To get this, append the following to the end of the epilog script:
# writing job statistics into job output
OUT=$(scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}')
echo -e "sticks" >> ${OUT} 2>&1
There was no need to restart the slurm daemons. Additional commands can be added to it to get resource utilization, e.g.
sleep 5s ### Sleep to give chance for job to be written to slurm database for job statistics.
sacct --units M --format=jobid,user%5,state%7,CPUTime,ExitCode%4,MaxRSS,NodeList,Partition,ReqTRES%25,Submit,Start,End,Elapsed -j $SLURM_JOBID >> $OUT 2>&1
Basically, you can still append to the output file using >>. Evidently, it did not occur to me that regular output redirection still works. It is still unclear why the Python statement above did not work.
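Putting the pieces together, the appended portion of the epilog script would look something like this sketch (assembled from the lines above; the 5-second sleep is a site-dependent guess):
# append job statistics to the job's own stdout file
OUT=$(scontrol show jobid ${SLURM_JOB_ID} | grep StdOut | awk 'BEGIN{FS="="}{print $2}')
sleep 5s  # give the job a chance to be written to the slurm database
sacct --units M --format=jobid,user%5,state%7,CPUTime,ExitCode%4,MaxRSS,NodeList,Partition,ReqTRES%25,Submit,Start,End,Elapsed -j $SLURM_JOBID >> $OUT 2>&1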
According to this page, you can print to stdout from the Slurm prolog by prefacing your output with the 'print' command.
For example, instead of
echo "Starting prolog"
You need to do
echo "print Starting Prolog"
Unfortunately this only seems to work for the prolog, not the epilog.
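For completeness, a minimal prolog sketch using that convention (the message text is illustrative; verify the behavior against your Slurm version):
#!/bin/bash
# lines prefixed with "print " are forwarded to the job's stdout
echo "print Starting prolog for job $SLURM_JOB_ID"
exit 0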

I want to output "<PID> Killed ~" to a logfile when I kill -9 <PID>

I want the message /usr/local/ex1.sh: line xxx: <PID> Killed ex2.sh >> $LOG_FILE 2>&1 to go to a logfile.
However, ex1.sh prints that message to the console when I execute it there.
The result I want is that ex1.sh writes the message to the file, not to the console.
This is the source of ex1.sh:
ex2.sh >> $LOG_FILE 2>&1 &
PID=$(ps -ef | grep ex2.sh | grep -v grep | gawk '{print $2}')
/bin/kill -9 $PID >> $LOG_FILE 2>&1 &
Why does ex1.sh output this message to the console?
The reason is that the message /usr/local/ex1.sh: line xxx: <PID> Killed ex2.sh >> $LOG_FILE 2>&1 is printed by the bash shell itself, not by the kill command.
So if you redirect the kill command's output to a file, the message will not end up in the file.
If you run it as ./ex1.sh >> $LOG_FILE 2>&1, the message will be in the log file, because ./ex1.sh forks a new bash process and that process is what prints the message.
The output is in fact not written by the kill command or by ex2.sh. It is written by the shell that is executing the background process ex2.sh.
The shell running the script started ex2.sh in the background as a child process and is monitoring it. When the child is killed, the shell reports this by printing the message.
In your particular case the shell also knows which process was killed and which process ran kill, so it prints a rather verbose message.
If you start ex2.sh (without &) in terminal 1 and kill it from terminal 2, the shell in terminal 1 will just print "Killed".
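If the goal is simply to capture the message, one option that follows from this explanation is to do the killing and the waiting in a subshell whose stderr is redirected; a sketch, assuming the same $LOG_FILE as in the question:
# the "Killed" message is printed by the shell that monitors ex2.sh,
# so redirect that (sub)shell's stderr, not kill's
(
    ex2.sh &
    PID=$!
    /bin/kill -9 "$PID"
    wait "$PID"
) >> "$LOG_FILE" 2>&1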

Why does nohup output a process id?

When I run the command nohup sh script.sh & in Terminal I get the following output:
[1] 42603
appending output to nohup.out
where 42603 is the process id of this command, but I don't want to see it. What can I do?
P.S. I'm running OS X El Capitan, version 10.11.6.
You can run nohup in a subshell and redirect the subshell's output to /dev/null like this: (nohup sh script.sh &) >/dev/null (note that this will also hide any output from sh script.sh)
Something like this will mute that one line while keeping script.sh connected to stdout:
nohup sh script.sh 2>&1 | grep -v nohup.out &
The message is printed on stderr, so it must be merged into stdout before the pipe; note that the & has to come at the end of the whole pipeline, not in the middle.
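Combining the two suggestions, something like this hides both the job notice and nohup's message while keeping the script's output in a file (the log name is illustrative):
# the subshell suppresses the "[1] 42603" job notice;
# the redirection replaces nohup.out, so nohup prints no message
(nohup sh script.sh >script.log 2>&1 &)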

Can I wait for termination of a process that is not a child of the current shell?

I have a script that has to kill a resource managed by a high-availability middleware a certain number of times. It checks whether the resource is running and then kills it; I need the timestamp of when the process is really dead. So I have written this code:
#!/bin/bash
echo "$(date +"%T,%N") :New measures Run" > /home/hassan/logs/measures.log
for i in {1..50}
do
    echo "Iteration: $i"
    PID=$(ps -ef | grep "/home/hassan/Desktop/pcmAppBin pacemaker_app/MainController" | grep -v "grep" | awk '{print $2}')
    if [ -n "$PID" ]; then
        echo "$(date +"%T,%N") :Killing $PID" >> /home/hassan/logs/measures.log
        ps -ef | grep "/home/hassan/Desktop/pcmAppBin pacemaker_app/MainController" | grep -v "grep" | awk '{print "kill -9 " $2}' | sh
        wait $PID
    else
        PID=$(ps -ef | grep "/home/hassan/Desktop/pcmAppBin pacemaker_app/MainController" | grep -v "grep" | awk '{print $2}')
        until [ -n "$PID" ]; do
            sleep 2
            PID=$(ps -ef | grep "/home/hassan/Desktop/pcmAppBin pacemaker_app/MainController" | grep -v "grep" | awk '{print $2}')
        done
    fi
done
But with my wait command I get the following error message: wait: pid xxxx is not a child of this shell
I assume that you started the child processes from one bash session and then ran this script to wait for them. The problem is that those processes are not children of the bash instance running the script, but children of its parent!
If you want to run a script inside the current bash, you have to source it with . (dot).
An example: you start vim and then suspend it by pressing ^Z (later you can use fg to get back to vim). Then you can list the jobs with the jobs command.
$ jobs
[1]+ Stopped vim myfile
Then create a script called test.sh containing just the one command jobs. Make it executable (e.g. chmod 700 test.sh), then start it:
$ cat test.sh
jobs
~/dev/fi [3:1]$ ./test.sh
~/dev/fi [3:1]$ . ./test.sh
[1]+ Stopped vim myfile
As the first invocation creates a new bash process, no jobs are listed. But with . the script runs in the current bash, which has exactly one child process (namely vim). So launch your script with . so that no child bash is created.
Be aware that defining variables or changing directory (and a lot more) in a sourced script will affect your environment! E.g. PID will be visible to the calling bash!
Comments:
Do not use ...|grep ...|grep -v ...|awk pipe snakes! A single awk can do all of it; see the sketch below.
On most Linuxes you can use something like ps -o pid= -C pcmAppBin to get just the pid, so the whole pipeline can be avoided.
To call an external program from awk you could try the built-in system("mycmd").
I hope this helps a bit!
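For example, the pipe snake from the question could collapse to a single awk, roughly like this (the pattern is copied from the question):
# one awk matches the command line, skips its own ps entry, and prints the PID
PID=$(ps -ef | awk '/pcmAppBin pacemaker_app\/MainController/ && !/awk/ {print $2}')
# or, where supported, avoid the pipeline entirely:
PID=$(ps -o pid= -C pcmAppBin)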

Why does nohup not launch my script?

Here is my script.sh:
for ((i=1; i<=400000; i++))
do
    echo "loop $i"
    echo
    numberps=$(ps -ef | grep php | wc -l)
    echo $numberps
    if [ $numberps -lt 110 ]
    then
        php5 script.php &
        sleep 0.25
    else
        echo "too many processes"
        sleep 0.5
    fi
done
When I launch it with:
./script.sh > /dev/null 2>/dev/null &
that works, except that when I log out from SSH and log in again, I cannot stop the script with kill %1 and jobs -l is empty.
When I try to launch it with
nohup ./script.sh &
it just outputs
nohup: ignoring input and appending output to `nohup.out'
but no php5 processes are running: nohup has no effect at all.
I have two alternatives to solve my problem:
1) ./script.sh > /dev/null 2>/dev/null &
If I log out from SSH and log in again, how can I kill this job?
or
2) How can I make nohup run correctly?
Any idea ?
nohup is not supposed to allow you to use jobs -l or kill %1 to kill jobs after logging out and in again.
Instead, you can:
Run the script in the foreground in a GNU Screen or tmux session, which lets you log out, log in, reattach, and continue the same session; see the sketch below.
Use killall script.sh to kill all running instances of script.sh on the server.
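A minimal screen workflow, as a sketch (the session name is made up):
screen -S phploop        # start a named session
./script.sh              # run the loop in the foreground
# detach with Ctrl-A d, then log out; after logging back in:
screen -r phploop        # reattach; Ctrl-C stops the script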
