My script runs on the head node but fails when I use Slurm to send to another node - slurm

I have some python code on a remote machine. When I run the code on the head node, it runs perfectly. When I submit it through Slurm, the job begins, but after a few minutes it seems to stall (judging by the output), and after a while the job seems to die and disappear from squeue without any error message. The command I run is:
sbatch --wrap="python my_code.py" -J run1 -N 1 --cpus-per-task=8 -o run1.o
I know it's difficult to say what's happening without knowing the details of the code, but I'm wondering what could cause code to run fine on the head node of a remote machine, yet stall (after partially running) when the same code is sent to a compute node with sbatch?
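One way to gather more diagnostics (a sketch, not specific to this cluster; the job ID below is a placeholder, and sacct only works if accounting is enabled) is to capture stderr in its own file and, once the job has vanished from squeue, ask Slurm's accounting database how it ended:
# capture stderr separately so a traceback or an out-of-memory kill message is not lost
sbatch --wrap="python my_code.py" -J run1 -N 1 --cpus-per-task=8 -o run1.o -e run1.e
# after the job disappears, query its final state and exit code (12345 is a placeholder job ID)
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS,Elapsed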

Related

Bash script how to run a command remotely and then exit the remote terminal

I'm trying to execute the command:
ssh nvidia@ubuntu-ip-address "/opt/ads2/arm-linux64/bin/ads2 svcd&"
This works so far, except that it hangs in the remote terminal when "/opt/ads2/arm-linux64/bin/ads2 svcd&" is executed, unless I enter Ctrl+C. So I'm looking for a command that, after executing the remote command, exits the remote terminal and continues executing the local bash script.
thanks in advance
When you run a command in the background on a terminal, whether local or remote, most systems will warn you that you have running jobs if you attempt to log out. One further attempt to log out and your jobs get killed as you exit.
In order to avoid this you need to detach your running jobs from terminal.
If the job is already running, you can detach it with
disown -h <jobspec as reported by jobs>
If you want to run something in the background and then exit, leaving it running, you can use nohup:
nohup command &
This certainly works on traditional init systems; I'm not sure it behaves exactly the same on systems that use systemd.
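Applied to the original command, a sketch (same path as in the question) that also redirects the output streams, so ssh has nothing left to wait for and returns immediately:
ssh nvidia@ubuntu-ip-address 'nohup /opt/ads2/arm-linux64/bin/ads2 svcd > /dev/null 2>&1 &'
Without the redirections, ssh keeps the connection open as long as the remote process still holds the session's stdout and stderr.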

How to use the "watch" command with SSH

I have a script that monitors a specific server, giving me its disk usage, CPU usage, etc. I am using 2 Ubuntu VMs: I run the script on the server over SSH (ssh user@ip < script.sh from the first VM), and I want it to show values in real time, so I tried 2 approaches I found on here:
1. while loop with clear
The first approach is using a while loop with "clear" to make the script run multiple times, giving new values every time and clearing the previous output like so:
while true
do
    clear;
    # bunch of code
done
The problem here is that it doesn't clear the terminal, it just keeps printing the new results one after another.
2. watch
The second approach uses watch:
watch -n 1 Script.sh
This works fine on the local machine (to monitor the current machine where the script is), but I can't find a way to make it run via SSH. Something like
ssh user@ip 'watch -n 1 script.sh'
works in principle, but requires that the script be present on the server, which I want to avoid. Is there any way to run watch for the remote execution (via SSH) of a script that is present on the local machine?
For your second approach (using watch), what you can do instead is to run watch locally (from within the first VM) with an SSH command and piped-in script like this:
watch -n 1 'ssh user@ip < script.sh'
The drawback of this is that it will reconnect on each watch iteration (i.e., once a second), which some server configurations might not allow. Look into SSH connection sharing (the ControlMaster, ControlPath, and ControlPersist client options) to let SSH re-use the same connection for serial ssh runs.
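As a sketch, a minimal ~/.ssh/config entry that enables such connection sharing (the host alias, address, and socket path are placeholders):
Host monitored-server
    HostName <server-ip>
    User user
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
With this in place, each ssh run started by watch reuses the already-open connection instead of re-authenticating every second.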
But if what you want to do is to monitor servers, what I really recommend is to use a monitoring system like 'telegraf'.

Torque+MAUI PBS submitted job strange startup

I am using a Torque+MAUI cluster.
The cluster is currently using only about 10 of its 40 available nodes, yet a lot of jobs are queued and cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I got this abnormal output from qstat -n:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the sessionId is missing, and evidently the script does not run at all, i.e. the java program never shows any trace of having started.
After running strangely like this for ~5 minutes, the job is set back to Q(ueue) status and seemingly will not run again (I monitored this for about a week and it did not run even after becoming the top-most queued job).
I tried submitting the same job 14 times and monitored the assigned nodes with qstat -n: 7 copies ran successfully on various nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Anyone know a solution to this issue?
As a temporary workaround, how can I specify NOT to use those strange nodes in the PBS script?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that, I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) to figure out what is wrong with them. Once you have found and corrected the problem, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
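As a sketch of that admin-side sequence (the node name is taken from the qstat output above; the -N note is an optional annotation supported by TORQUE's pbsnodes):
pbsnodes -o z0-1 -N "stuck job startup, investigating"   # mark the suspect node offline
# ...inspect the pbs_mom logs and syslog on z0-1, fix the problem...
pbsnodes -c z0-1                                          # clear the offline flag afterwards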
For users: contact your administrator and, in the meantime, run the job using this workaround:
Use pbsnodes to check for free and healthy nodes
Modify the PBS directive to request specific free nodes: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+... (see the sketch after this list)
Submit the job using qsub
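For example, if pbsnodes reports z0-2 as free and healthy (the node name is purely illustrative), the resource request in the script above would become:
#PBS -l nodes=z0-2:ppn=32
Several specific nodes can be chained together with + exactly as in the directive above.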

jobs find nothing after nohup a script

I started a python script running in the background:
nohup python app.py &
then closed the terminal. A few days later, I wanted to check on this job, so I ran
jobs
but it lists no jobs, even though I'm sure the script app.py is still running.
jobs will only give you a list of processes running under the session group of your current bash shell. nohup'd processes run in their own session group. There are a number of simple commands you can run to check whether your nohup'd process is still running, the simplest being lsof | grep nohup (this command may take a few seconds to run).
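Another quick check, as a sketch (the script name is taken from the question), is to search the process table directly:
pgrep -af app.py           # list the PID and full command line of matching processes
ps aux | grep '[a]pp.py'   # the bracket trick stops grep from matching its own process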

Bash script on background: how to kill child processes

Well, I'm basically trying to make a bash script run a node script forever. I made the following bash script:
#!/bin/bash
while true ; do
    cd /myscope/
    unlink nohup.out
    node myscript.js
    sleep 6
done & echo $! > pid
I'm expecting that when it runs, it starts node with the given script, waits for node to exit, sleeps for 6 seconds, and restarts node. I'm also expecting it to run in the background and write its pid (the bash pid) to a file called "pid".
Everything explained above apparently works as expected, but I'm also expecting that when the pid of the bash script is killed, the node script would stop running. I don't know why that made sense in my mind, but in practice it doesn't work: the bash script is killed indeed, but the node script keeps running, and that is freaking me out.
I've tested it in the terminal by not sending the bash script to the background and pressing Ctrl+C: both scripts get killed.
I'm obviously misunderstanding something about the way background processes work. For god's sake, can anybody help me?
There are lots of tools that let you do what you're trying to do; just two off the top of my head:
https://github.com/nodejitsu/forever - A simple CLI tool for ensuring that a given script runs continuously (i.e. forever)
https://github.com/remy/nodemon - Monitor for any changes in your node.js application and automatically restart the server - perfect for development
Maybe the second one isn't exactly what you're looking for, but it's still worth a look.
If you can't or don't want to use those, then the problem is that when you kill the parent process the child is still there, so you should kill it too:
pkill -TERM -P $PID
where $PID is the parent PID.
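Tied back to the script above, a sketch that uses the pid file the script writes (assuming that file is still in the current directory):
PID=$(cat pid)
pkill -TERM -P "$PID"   # terminate the loop's current child (node or sleep)
kill -TERM "$PID"       # then terminate the loop itself
If the loop was started from an interactive shell with job control, it gets its own process group, so kill -TERM -- -"$PID" signals the whole group in one step and avoids any race between the two commands above.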
