How do I find the current process running on a particular PBS job - linux

I am trying to write a script to provide diagnostics on processes. I have submitted a script to a job scheduling server using qsub. I can easily find the node that the job gets sent to. But I would like to be able to find what process is currently being run.
I.e., I have a list of different commands in the submitted script; how can I find the one that is currently running, and the arguments passed to it?
Example of commands in the script:
matlab -nodesktop -nosplash -r "display('here'),quit"
python runsomethings.py
I would like to see whether the node is currently executing the first or the second line.

When you submit a job, pbs_server passes your task to pbs_mom. The pbs_mom process (daemon) actually executes your script on the execution node. It
"creates a new session as identical user."
This means invoking a shell. You specify which shell at the top of the script with a shebang line (for example, #!/bin/bash).
It's clear that pbs_mom must store the PID of that shell process somewhere, both to kill the job and to detect when the job (the shell process) has finished.
UPD, based on @Dmitri Chubarov's comment: pbs_mom stores the subshell PID in memory after calling fork(), and also in the .TK file under the Torque installation directory (/var/spool/torque/mom_priv/jobs on my system).
Dumping file internals in decimal mode (<job_number>, <queue_name> should be your own values):
$ hexdump -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK
shows that, in my Torque installation, the PID is stored at offset
0x890 + 4*2 = 0x898 (the hex offset of the first byte of the PID in the .TK file, which is 2200 in decimal) and is 2 bytes long.
For example, for shell PID=27110 I have:
0000890 00001 00000 00001 00000 27110 00000 00000 00000
Let's recover PID from .TK file:
$ hexdump -s 2200 -n 2 -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK | tr -s ' ' | cut -s -d' ' -f 2
27110
This way you've found subshell PID.
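The offset arithmetic and the two-byte read can be wrapped in a small helper. This is only a sketch: it uses od instead of hexdump (an assumption about what is installed), read_uint_at is a hypothetical name, and the offset 0x890 + 4*2 is the value observed on the answerer's system, so it may differ for other Torque versions.

```bash
# Offset of the PID inside the .TK file, as observed above: 0x890 + 4*2 = 2200.
TK_PID_OFFSET=$(( 0x890 + 4 * 2 ))

# read_uint_at FILE OFFSET [NBYTES]   (hypothetical helper)
# Prints the unsigned integer stored at OFFSET, read in host byte order.
# NBYTES defaults to 2, matching the field length reported in the text.
read_uint_at() {
    local file=$1 offset=$2 nbytes=${3:-2}
    od -An -t "u$nbytes" -j "$offset" -N "$nbytes" "$file" | tr -d ' \n'
}

# Example (placeholders as in the text):
# read_uint_at /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK "$TK_PID_OFFSET"
```

If PIDs on your nodes can exceed 65535, the field is presumably wider than 2 bytes; passing 4 as the third argument would be the thing to try, but that is an assumption to verify against your own dump.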
Now monitor the process list on the execution node and find the names of the child processes (the getcpid function is a slightly modified version of one posted earlier on SO):
function getcpid() {
    # Recursively print the command name of every descendant of PID $1.
    local cpid
    for cpid in $(pgrep -P "$1"); do
        ps -p "$cpid" -o comm=
        getcpid "$cpid"
    done
}
At last,
getcpid <your_PID>
gives you the child processes' names (note that there will be some garbage lines, like task numbers). This way you will finally know what command is currently running on the execution node.
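Since the question also asks for the arguments, a variant that prints each descendant's full command line may help. The sketch below is an alternative to getcpid, not the answerer's method: list_descendants is a hypothetical name, and it assumes a Linux /proc filesystem, reading PPid from /proc/PID/status and the arguments from /proc/PID/cmdline instead of calling pgrep and ps.

```bash
# list_descendants PID   (hypothetical helper, Linux /proc only)
# Prints "PID<TAB>command line" for every descendant of PID, depth first.
list_descendants() {
    local parent=$1 status pid ppid
    for status in /proc/[0-9]*/status; do
        pid=${status#/proc/}
        pid=${pid%/status}
        # A process may exit while we scan; skip it if its status file is gone.
        ppid=$(awk '/^PPid:/ {print $2}' "$status" 2>/dev/null) || continue
        if [ "$ppid" = "$parent" ]; then
            # cmdline is NUL-separated; turn the separators into spaces.
            printf '%s\t%s\n' "$pid" "$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")"
            list_descendants "$pid"
        fi
    done
}

# Example: list_descendants <your_PID>
```

For the script in the question, the output would be expected to contain either the matlab line or the python line, together with its arguments, depending on which step is running.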
Of course, for each task monitored, you should obtain the PID and process name on the execution node after doing
ssh <your node>
You can automatically retrieve node name(s) in <node/proc+node/proc+...> format (process it further to obtain bare node names):
qstat -n <job number> | awk '{print $NF}' | grep <pattern_for_your_node_names>
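The "process it further" step can be sketched as a small filter. bare_node_names is a hypothetical helper name, and node01/node02 in the example are made-up node names; the only assumption taken from the text is the node/proc+node/proc+... format.

```bash
# bare_node_names   (hypothetical helper)
# Reads "node/proc+node/proc+..." on stdin and prints each node name once.
bare_node_names() {
    tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Example with made-up node names:
# echo 'node01/0+node01/1+node02/0' | bare_node_names
# Combined with the pipeline above:
# qstat -n <job number> | awk '{print $NF}' | grep <pattern_for_your_node_names> | bare_node_names
```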
Note:
The PID method is reliable and, I believe, optimal.
Searching by name is worse: it gives an unambiguous result only if you invoke different commands in your script and no other user runs the same software on the node.
ssh <your node>
ps aux | grep matlab
You will know if matlab runs.

A simple and elegant way to do it is to print to a log file:
`
ARGS=" $A $B $test "
echo "running MATLAB now with args: $ARGS" >> $LOGFILE
matlab -nodesktop -nosplash -r "display('here'),quit"
PYARGS="$X $Y"
echo "running Python now with args: $PYARGS" >> $LOGFILE
python runsomethings.py
`
And monitor the output of $LOGFILE using tail -f $LOGFILE
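To avoid repeating the echo before every step, the logging can be folded into a wrapper. This is a sketch under assumptions: run_logged is a hypothetical function name and job_progress.log is a made-up default for LOGFILE.

```bash
# Default log location is an assumption; set LOGFILE yourself to override it.
LOGFILE=${LOGFILE:-job_progress.log}

# run_logged COMMAND [ARGS...]   (hypothetical helper)
# Appends a timestamped line with the command and its arguments to $LOGFILE,
# then runs the command unchanged. "$*" is used only for the log line, so
# the original quoting of the arguments is not visible in the log.
run_logged() {
    printf '%s running: %s\n' "$(date '+%F %T')" "$*" >> "$LOGFILE"
    "$@"
}

# In the submitted script:
# run_logged matlab -nodesktop -nosplash -r "display('here'),quit"
# run_logged python runsomethings.py
```

The last line of the log then names the step that is currently executing.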

Related

Bash script to list all processes in the foreground process group of a terminal

How can I write a bash script to print out the PIDs of all processes in the foreground process group of a given terminal (which is different from the one in which I run the script)? I know that the C function tcgetpgrp can do the job, but I am wondering if there exist any command line utilities that can do this more easily.
To find the pids of all processes in the foreground process group of pts/29, you can do (on linux):
ps ao stat=,pid=,tty= | awk '$1 ~ /\+/ && $3 ~ /pts\/29/{ print $2}'
ps implementations differ between systems, so I am uncertain how portable that solution is.
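The same filter can take the terminal as a parameter instead of hard-coding pts/29. A minimal sketch, assuming the Linux ps output format used above; filter_fg_pids and fg_pids_of_tty are hypothetical names.

```bash
# filter_fg_pids TTY   (hypothetical helper)
# Reads "STAT PID TTY" lines on stdin and prints the PIDs whose STAT field
# contains '+' (foreground process group) and whose TTY equals the argument.
filter_fg_pids() {
    awk -v tty="$1" '$1 ~ /\+/ && $3 == tty { print $2 }'
}

# fg_pids_of_tty TTY   (hypothetical wrapper around the ps call from the answer)
fg_pids_of_tty() {
    ps ao stat=,pid=,tty= | filter_fg_pids "$1"
}

# Example: fg_pids_of_tty pts/29
```

Comparing the tty field for equality rather than with a regular expression avoids matching pts/290 when pts/29 is meant.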
You can use pgrep's -t flag, which lets you list the processes using a given tty.
For example:
# on a first ssh session, which gets pts/0 :
sleep 10
# on a second ssh session :
pgrep -t "pts/0"
1234 # the first session's bash process
5678 # the first session's sleep process

Linux bash script that kills a process (not started by me) after x amount of time

I'm pretty inexperienced with Linux bash. That being said, I have a CentOS7 machine that runs a COTS application server. This application server runs other processes that sometimes hang. Since I have no control over the start of these processes, I'm looking for a script that runs every 2 minutes that kills processes of the name "spicer" that have been running for longer than 10 minutes. I've looked around and have only been able to find answers for processes that are run and owned by me.
I use the command ps -eo pid,command,etime | grep spicer to get all the spicer processes. The output of this command looks like:
18216 spicer -l/opt/otmm-10.5/Spi 14:20
18415 spicer -l/opt/otmm-10.5/Spi 11:49
etc...
18588 grep --color=auto spicer
I don't know if there's a way to parse this directly in bash. I'm also not well-versed at all in other Linux tools. I know that awk (or gawk) could possibly help.
EDIT
I have no control over the data that the process is working on.
What about wrapping the spicer executable and starting it with the timeout command? Let's say it is installed in /usr/bin/spicer. Then issue:
cp /usr/bin/spicer{,.orig}
echo '#!/bin/bash' > /usr/bin/spicer
echo 'timeout 10m spicer.orig "$@"' >> /usr/bin/spicer
Another approach would be to create a cron job definition in /etc/cron.d/kill_spicer, like this:
* * * * * root kill $(ps --no-headers -C spicer -o pid,etimes | awk '$2>=600{print $1}')
The cron job runs every minute, uses ps to obtain a list of spicer processes that have been running for longer than 10 minutes, and passes their PIDs to kill.
You may even want kill -9 if the process is hanging.
You can use the -C option of ps to select processes by name.
ps --no-headers -C spicer -o pid,etime
Then you can use cut to filter the results, if the spacing is consistent. On my system the pid field takes up 8 characters, so I'd use
kill $(ps --no-headers -C spicer -o pid,etime | cut -c-8)
Note that this kills every spicer process regardless of its elapsed time, so filter on the etime column first if only long-running processes should go.
If the spacing is inconsistent (but if so, what kind of messed up ps are you using? :-P), you can use awk '{ print $1 }' instead of cut.
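Filtering on the etime column means converting values such as 14:20 into seconds first. The sketch below assumes the [[dd-]hh:]mm:ss layout that ps uses for etime; etime_to_seconds and pids_older_than are hypothetical helper names.

```bash
# etime_to_seconds ETIME   (hypothetical helper)
# Converts a ps etime value, [[dd-]hh:]mm:ss, into a number of seconds.
etime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        n = NF
        secs = $n + 60 * $(n - 1)
        if (n >= 3) secs += 3600 * $(n - 2)
        if (n >= 4) secs += 86400 * $(n - 3)
        print secs
    }'
}

# pids_older_than SECONDS   (hypothetical helper)
# Reads "PID ETIME" lines on stdin and prints the PIDs that have been
# running for at least SECONDS.
pids_older_than() {
    local limit=$1 pid etime
    while read -r pid etime; do
        [ -n "$etime" ] || continue
        if [ "$(etime_to_seconds "$etime")" -ge "$limit" ]; then
            echo "$pid"
        fi
    done
}

# Example: kill spicer processes older than 10 minutes.
# ps --no-headers -C spicer -o pid,etime | pids_older_than 600 | xargs -r kill
```

Where ps supports the etimes column, as in the cron answer above, the conversion is unnecessary; this version is for systems that only offer etime.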

Run a script in the same shell(bash)

My problem is specific to the running of SPECCPU2006(a benchmark suite).
After I installed the benchmark, I can invoke a command called "specinvoke" in a terminal to run a specific benchmark. I have another script, part of which looks like the following:
cd (specific benchmark directory)
specinvoke &
pid=$!
My goal is to get the PID of the running task. However, what I get this way is the PID of the "specinvoke" command itself, and the real running task has a different PID.
However, running specinvoke -n prints the actual commands that specinvoke would run to stdout. For example, for one benchmark, the output looks like this:
# specinvoke r6392
# Invoked as: specinvoke -n
# timer ticks over every 1000 ns
# Use another -n on the command line to see chdir commands and env dump
# Starting run for copy #0
../run_base_ref_gcc43-64bit.0000/milc_base.gcc43-64bit < su3imp.in > su3imp.out 2>> su3imp.err
Inside, it is running a binary. The command differs from benchmark to benchmark (depending on which benchmark directory it is invoked from). And because "specinvoke" is an installed binary and not just a script, I cannot use "source specinvoke".
So is there any clue? Is there a way to invoke the command directly in the same shell (so that it has the same PID), or should I dump the output of specinvoke -n and run the dumped commands?
You can still do something like:
cd (specific benchmark directory)
specinvoke &
pid=$(pgrep milc_base.gcc43-64bit)
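If there are several invocations of the milc_base.gcc43-64bit binary, you can use
pid=$(pgrep -n milc_base.gcc43-64bit)
which, according to the man page, does the following:
-n
Select only the newest (most recently started) of the matching
processes
One caveat: on Linux, pgrep matches the pattern against the process name, which is limited to 15 characters, so a pattern as long as this one may need -f (match against the full command line) to find anything.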
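When the process is a direct child of the subshell:
ps -o pid= --ppid $!
(ps selection options such as -C and --ppid are additive, so combining them would list processes matching either one; selecting by parent PID alone is enough here.)
When it is not a direct child, you could get the info from pstree:
pstree -p $! | grep -o 'milc_base.gcc43-64bit(.*)'
Output from the above (the PID is in parentheses): milc_base.gcc43-64bit(9837)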

count processes in shell script [duplicate]

Possible Duplicate:
Quick-and-dirty way to ensure only one instance of a shell script is running at a time
I am new to shell scripting.
What I want to do is avoid running multiple instances of a script.
I have this shell script cntps.sh
#!/bin/bash
cnt=`ps -e|grep "cntps"|grep -v "grep"`
echo $cnt >> ~/cntps.log
if [ $cnt < 1 ];
then
#do something.
else
exit 0
fi
if I run it this way $./cntps.sh, it echoes 2
if I run it this way $. ./cntps.sh, it echoes 0
if I run it with crontab, it echoes 3
Could somebody explain to me why is this happening?
And what is the proper way to avoid running multiple instances of a script?
I changed your command slightly to output ps to a log file so we can see what is going on.
cnt=`ps -ef| tee log | grep "cntps"|grep -v "grep" | wc -l`
This is what I saw:
32427 -bash
20430 /bin/bash ./cntps.sh
20431 /bin/bash ./cntps.sh
20432 ps -ef
20433 tee log
20434 grep cntps
20435 grep -v grep
20436 wc -l
As you can see, my terminal's shell (32427) spawns a new shell (20430) to run the script. The script then spawns another child shell (20431) for command substitution (`ps -ef | ...`).
So, the count of two is due to:
20430 /bin/bash ./cntps.sh
20431 /bin/bash ./cntps.sh
In any case, this is not a good way to ensure that only one process is running. See this SO question instead.
Firstly, I would recommend using pgrep rather than this method. Secondly, I presume your script is missing a wc -l to count the number of instances.
In answer to your counting problems:
if I run it this way $./cntps.sh, it echoes 2
This is because the backtick call (ps -e ...) spawns a subshell that is also named cntps.sh, so two entries match.
if I run it this way $. ./cntps.sh, it echoes 0
This happens because you are not running the script but sourcing it into the currently running shell, so no process named cntps exists.
if I run it with crontab, it echoes 3
Two from the invocation, one from the crontab invocation itself which spawns sh -c 'path/to/cntps.sh'
Please see this question for how to do a single instance shell script.
Use a "lock" file as a mutex.
if(exists("lock") == false)
{
touch lock file // create a file named "lock" in the current dir
execute_script_body // execute script commands
remove lock file // delete the file
}
else
{
echo "another instance is running!"
}
exit
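The pseudocode above can be sketched in bash. One deliberate substitution: a lock directory created with mkdir replaces the "test, then touch" pair, because mkdir either creates the directory or fails in a single step, whereas two instances could both pass a separate existence check. with_lock and the default /tmp/cntps.lock path are hypothetical.

```bash
# Lock location is an assumption; pick a path all instances agree on.
LOCKDIR=${LOCKDIR:-/tmp/cntps.lock}

# with_lock COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND only if the lock can be taken, and releases it afterwards.
with_lock() {
    local status
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "another instance is running!" >&2
        return 1
    fi
    "$@"                # execute the script body
    status=$?
    rmdir "$LOCKDIR"    # remove the lock
    return "$status"
}

# Example: with_lock ./do_the_real_work.sh   (do_the_real_work.sh is a placeholder)
```

If the script is killed before rmdir runs, the lock stays behind and has to be removed by hand; adding a trap for that case is left out of this sketch.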

Redirecting Output of Bash Child Scripts

I have a basic script that outputs various status messages. e.g.
~$ ./myscript.sh
0 of 100
1 of 100
2 of 100
...
I wanted to wrap this in a parent script, in order to run a sequence of child-scripts and send an email upon overall completion, e.g. topscript.sh
#!/bin/bash
START=$(date +%s)
/usr/local/bin/myscript.sh
/usr/local/bin/otherscript.sh
/usr/local/bin/anotherscript.sh
RET=$?
END=$(date +%s)
echo -e "Subject:Task Complete\nBegan on $START and finished at $END and exited with status $RET.\n" | sendmail -v group@mydomain.com
I'm running this like:
~$ topscript.sh >/var/log/topscript.log 2>&1
However, when I run tail -f /var/log/topscript.log to inspect the log I see nothing, even though running top shows myscript.sh is currently being executed, and therefore, presumably outputting status messages.
Why isn't the stdout/stderr from the child scripts being captured in the parent's log? How do I fix this?
EDIT: I'm also running these on a remote machine, connected via ssh using pseudo-tty allocation, e.g. ssh -t user@host. Could the pseudo-tty be interfering?
I just tried the following: I have three files t1.sh, t2.sh, and t3.sh, all with the following content:
#!/bin/bash
for((i=0;i<10;i++)) ; do
echo $i of 9
sleep 1
done
And a script called myscript.sh with the following content:
#!/bin/bash
./t1.sh
./t2.sh
./t3.sh
echo "All Done"
When I run ./myscript.sh > topscript.log 2>&1 and then in another terminal run tail -f topscript.log I see the lines being output just fine in the log file.
Perhaps the things being run in your subscripts use a large output buffer? I know when I've run python scripts before, it has a pretty big output buffer so you don't see any output for a while. Do you actually see the entire output in the email that gets sent out at the end of topscript.sh? Is it just that while the processes run you're not seeing the output?
Try:
unbuffer topscript.sh >/var/log/topscript.log 2>&1
Note that unbuffer is not always available as a standard binary on old-style Unix platforms, so you may need to find and install a package that provides it.
I hope this helps.

Resources