How would you check if SLURM or MOAB/Torque is available in an environment?

The title kind of says it all. I'm looking for a command-line test to check whether either SLURM or MOAB/Torque is available for submitting jobs to.
My thought is to check whether the qstat command finishes with exit code zero, or whether squeue finishes with exit code zero. Would this be the best way of doing this?

One of the most lightweight ways to do that is to test for the presence of sbatch, for instance with
which sbatch
The which command exits with code 0 if the command is found in the PATH.
Make sure to test in order, as, for instance, a Slurm cluster could have a qsub command available to emulate PBS or Torque.
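That ordering can be captured in a small helper, sketched below. It uses `command -v`, the POSIX built-in alternative to which, which also exits 0 when the command is found:

```shell
# Probe for Slurm first, since a Slurm cluster may provide a qsub
# wrapper to emulate PBS/Torque.
detect_scheduler() {
    if command -v sbatch >/dev/null 2>&1; then
        echo slurm
    elif command -v qsub >/dev/null 2>&1; then
        echo torque
    else
        echo none
    fi
}

detect_scheduler
```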


Import bash variables into slurm script

I have seen similar questions, but not exactly the same as mine (e.g. Use Bash variable within SLURM sbatch script), because I am not talking about Slurm parameters.
I want to launch a slurm job for each of my sample files, so imagine I have 3 vcfs and I want to run a job for each of them:
I created a script to loop through a file in which I wrote sampleIds to run another script with each sample, which would perfectly work if I wanted to run it directly with bash:
while read -r line
do
    sampleID="$line"
    myscript.sh "$sampleID"
done < samples.txt   # file listing one sampleID per line
The problem is that I need to run the script with slurm, so is there any way to tell slurm which bash variable it should include?
I was trying this, but it is not working:
sbatch myscript.sh --export=$sampleID
Okay, I've solved it:
sbatch --export=sampleID=$sampleID myscript.sh
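Built into a loop, the fix might look like the sketch below. SUBMIT defaults to echo so it can be dry-run without a cluster (set SUBMIT=sbatch on the real system); samples.txt and myscript.sh are placeholder names:

```shell
# Placeholder list of sample IDs, one per line.
printf '%s\n' sample1 sample2 sample3 > samples.txt

# echo makes this a dry run; use SUBMIT=sbatch on the cluster.
SUBMIT=${SUBMIT:-echo}

while read -r sampleID; do
    "$SUBMIT" --export=sampleID="$sampleID" myscript.sh
done < samples.txt
```

With --export, Slurm injects the variable into the job's environment, so inside myscript.sh it is available as $sampleID.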

how to see the command file in slurm

If I submitted a job using slurm, how can I see the full command file that was submitted, usually a bash script that starts with:
#SBATCH --job-name=
how can I use a slurm command to see the content of this script file? (I know I can do scontrol show job <jobid> and see where the script is, but suppose it was deleted or changed, and I still want to see the original script that was submitted to slurm.)
Thanks.
On older versions of Slurm (17.02 and before), you can do scontrol show -dd job <jobid> to have Slurm display the content of the submission script it stored, and in newer versions of Slurm (17.11 and after), you would use scontrol write batch_script <jobid> -.

torque qsub making jobs dependent on other jobs

I'd like to start a bunch of jobs using qsub, and the final job should only run if all the others finished "without error". In my case "without error" means they exited with status=0. The man page for qsub says in the -W depend=afterok description that: This job may be scheduled for execution only after jobs jobid have terminated with no errors.
Unfortunately it does not seem to explain (or I can't find) what it means by "with no errors". It is likely that some of my scripts will print information to stderr, but I don't want that to be interpreted as an error.
Question 1: What does the qsub documentation mean by "with no errors"?
Question 2: How can I make a job dependent explicitly on all of a collection of jobs exiting with status 0?
With no errors = exited with a status of 0. If the jobs exit with a non-zero exit status, it is considered an error.
You can chain dependencies: qsub -W depend=afterok:job1:job2:job3
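The chaining above needs the real job IDs, which qsub prints on stdout when it submits. A sketch of the flow, with a stub qsub so it can be run off-cluster (remove the stub on a real Torque system; job1.sh etc. are placeholder names):

```shell
# Stub for illustration only: real qsub submits the script and
# prints the assigned job ID (e.g. "1234.server"). $RANDOM is bash-specific.
qsub() { echo "job-$RANDOM.server"; }

# Submit the prerequisite jobs and capture their IDs.
j1=$(qsub job1.sh)
j2=$(qsub job2.sh)
j3=$(qsub job3.sh)

# final.sh is scheduled only if all three exit with status 0.
qsub -W depend=afterok:$j1:$j2:$j3 final.sh
```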

Bash Script for Submitting Job on Cluster

I am trying to write a script so I can use the 'qsub' command to submit a job to the cluster.
Pretty much, once I get into the cluster, I go to the directory with my files and I do these steps:
export PATH=$PATH:$HOME/program/bin
Then,
program > run.log&
Is there any way to make this into a script so I am able to submit the job to the queue?
Thanks!
Putting the lines into a bash script and then running qsub myscript.sh should do it.
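For example, the two steps could go into a submission script like the sketch below. The #PBS directives and paths are assumptions to adapt, and program is the placeholder binary name; note that the trailing & from the interactive command is unnecessary, since the batch system already runs the job asynchronously:

```shell
# Write a minimal Torque/PBS submission script.
cat > submit.sh <<'EOF'
#!/bin/bash
#PBS -N myprogram
#PBS -l nodes=1:ppn=1
# qsub starts jobs in $HOME, so change to the submission directory.
cd "$PBS_O_WORKDIR"
export PATH=$PATH:$HOME/program/bin
program > run.log 2>&1
EOF
```

Then `qsub submit.sh` submits it to the queue.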

what does "bash: no job control in this shell" mean?

I think it's related to a parent process creating a new subprocess that does not have a tty. Can anyone explain the details under the hood, i.e. the related working model of bash, process creation, etc.?
It may be a very broad topic, so pointers to posts are also very appreciated. I've Googled for a while; all the results are about very specific cases and none is about the story behind the scenes. To provide more context, below is the shell script resulting in the 'bash: no job control in this shell' message.
#!/bin/bash
while true; do
    st=$(netstat -an | grep 7070 | grep LISTEN -o | uniq)
    if [ -z "$st" ]; then
        echo "need to start proxy @$(date)"
        bash -i -c "ssh -D 7070 -N user@my-ssh.example.com > /dev/null"
    else
        echo "proxy OK @$(date)"
    fi
    sleep 3
done
This line:
bash -i -c "ssh -D 7070 -N user@my-ssh.example.com > /dev/null"
is where "bash: no job control in this shell" comes from.
You may need to enable job control:
#! /bin/bash
set -m
Job control is a collection of features in the shell and the tty driver which allow the user to manage multiple jobs from a single interactive shell.
A job is a single command or a pipeline. If you run ls, that's a job. If you run ls|more, that's still just one job. If the command you run starts subprocesses of its own, then they will also belong to the same job unless they are intentionally detached.
Without job control, you have the ability to put a job in the background by adding & to the command line. And that's about all the control you have.
With job control, you can additionally:
Suspend a running foreground job with Ctrl-Z
Resume a suspended job in the foreground with fg
Resume a suspended job in the background with bg
Bring a running background job into the foreground with fg
The shell maintains a list of jobs which you can see by running the jobs command. Each one is assigned a job number (distinct from the PIDs of the process(es) that make up the job). You can use the job number, prefixed with %, as an argument to fg or bg to select a job to foreground or background. The %jobnumber notation is also acceptable to the shell's builtin kill command. This can be convenient because the job numbers are assigned starting from 1, so they're shorter than PIDs.
There are also shortcuts %+ for the most recently foregrounded job and %- for the previously foregrounded job, so you can switch back and forth rapidly between two jobs with Ctrl-Z followed by fg %- (suspend the current one, resume the other one) without having to remember the numbers. Or you can use the beginning of the command itself. If you have suspended an ffmpeg command, resuming it is as easy as fg %ff (assuming no other active jobs start with "ff"). And as one last shortcut, you don't have to type the fg. Just entering %- as a command foregrounds the previous job.
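Even in a non-interactive script the shell keeps a job table for background commands, which is what the %N notation indexes. A quick sketch (sleep is just a placeholder job):

```shell
# Start a background job; it becomes entry %1 in the job table.
sleep 0.2 &
# List the table: something like "[1]+ Running  sleep 0.2 &"
jobs
# Block until all background jobs finish.
wait
echo "all jobs done"
</ br>```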
"But why do we need this?" I can hear you asking. "I can just start another shell if I want to run another command." True, there are many ways of multitasking. On a normal day I have login shells running on tty1 through tty10 (yes there are more than 6, you just have to activate them), one of which will be running a screen session with 4 screens in it, another might have an ssh running on it in which there is another screen session running on the remote machine, plus my X session with 3 or 4 xterms. And I still use job control.
If I'm in the middle of vi or less or aptitude or any other interactive thing, and I need to run a couple of other quick commands to decide how to proceed, Ctrl-Z, run the commands, and fg is natural and quick. (In lots of cases an interactive program has a ! keybinding to run an external command for you; I don't think that's as good because you don't get the benefit of your shell's history, command line editor, and completion system.) I find it sad whenever I see someone launch a secondary xterm/screen/whatever to run one command, look at it for two seconds, and then exit.
Now about this script of yours. In general it does not appear to be competently written. The line in question:
bash -i -c "ssh -D 7070 -N user#my-ssh.example.com > /dev/null"
is confusing. I can't figure out why the ssh command is being passed down to a separate shell instead of just being executed straight from the main script, let alone why someone added -i to it. The -i option tells the shell to run interactively, which activates job control (among other things). But it isn't actually being used interactively. Whatever the purpose was behind the separate shell and the -i, the warning about job control was a side effect. I'm guessing it was a hack to get around some undesirable feature of ssh. That's the kind of thing that when you do it, you should comment it.
One possible cause is not having access to the tty.
Under the hood:
bash checks whether the session is interactive; if not, there is no job control.
if forced_interactive is set, the check that stderr is attached to a tty is skipped, and bash instead checks whether it can open /dev/tty for read-write access.
then it checks whether the new line discipline is used; if not, job control is disabled too.
if (and only if) we just set our process group to our pid, thereby becoming a process group leader, and the terminal is not in the same process group as our (new) process group, then set the terminal's process group to our (new) process group. If that fails, set our process group back to what it was originally (so we can still read from the terminal) and turn off job control.
if all of the above has failed, you see the message.
I partially quoted the comments from bash source code.
As per additional request of the question author:
You can find bash itself here: http://tiswww.case.edu/php/chet/bash/bashtop.html
If you can read C code, get the source tarball; inside it you will find job.c, which will explain more of the "under the hood" details.
I ran into a problem on my own embedded system and I got rid of the "no job control" error by running the getty process with "setsid", which according to its manpage starts a process with a new session id.
I faced this problem only because I had miscopied a previously executed command together with its % prefix into zsh, like % echo this instead of echo this. The error message was very unclear for such a simple typo.
