torque qsub: making jobs dependent on other jobs

I'd like to start a bunch of jobs using qsub, and the final job should only run if all the others finished "without error". In my case "without error" means they exited with status 0. The man page for qsub says, in the description of -W depend=afterok: "This job may be scheduled for execution only after jobs jobid have terminated with no errors."
Unfortunately it does not seem to explain (or I can't find) what it means by "with no errors". It is likely that some of my scripts will print information to stderr, but I don't want that to be interpreted as an error.
Question 1: What does the qsub documentation mean by "with no errors"?
Question 2: How can I make a job dependent explicitly on all of a collection of jobs exiting with status 0?

"With no errors" means exited with a status of 0. If a job exits with a non-zero status, it is considered an error.
You can chain dependencies: qsub -W depend=afterok:job1:job2:job3
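For instance, a sketch of how that chain is typically built in practice (the script names here are hypothetical; qsub prints the ID of each submitted job on standard output):
# submit the independent jobs and capture the job IDs that qsub prints
j1=$(qsub job1.pbs)
j2=$(qsub job2.pbs)
j3=$(qsub job3.pbs)
# the final job is held until all three have exited with status 0
qsub -W depend=afterok:$j1:$j2:$j3 final.pbs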

Related

How would you check if SLURM or MOAB/Torque is available in an environment?

The title kind of says it all. I'm looking for a command line test to check if either SLURM or MOAB/Torque is available for submitting jobs to.
My thought is to check whether the command qstat finishes with exit code zero, or whether squeue finishes with exit code zero. Would this be the best way of doing this?
One of the most lightweight ways to do that is to test for the presence of sbatch, for instance with
which sbatch
The which command exits with status 0 if the command is found in the PATH.
Make sure to test in order, as, for instance, a Slurm cluster could have a qsub command available to emulate PBS or Torque.
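A small sketch of that ordering (treat the exact commands tested as assumptions about your site; adjust to whatever your clusters actually provide):
# test for Slurm first, since a Slurm cluster may also ship a
# PBS/Torque-compatible qsub wrapper
if which sbatch > /dev/null 2>&1; then
    echo "Slurm detected"
elif which qsub > /dev/null 2>&1; then
    echo "PBS/Torque detected"
else
    echo "no supported scheduler found" >&2
    exit 1
fi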

Is the `after_script` always executed, even for cancelled jobs?

The documentation isn't clear on whether the after_script is executed for cancelled jobs:
after_script is used to define the command that will be run after for all jobs, including failed ones.
I'm doing potentially critical cleanup in the after_script and, while cancelled jobs should be rare, I'd like to know that my cleanup is guaranteed to happen.
No. I ran some tests, and here are the behaviours I observed:
after_script:
- echo "This is not executed when a job is cancelled."
- echo "A failing command, like this one, doesn't fail the job." && false
- echo "This is not executed because the previous command failed."
1. after_script is not executed when a job is cancelled
There's an open issue for this on gitlab.com, so if this is affecting you, head over there and make some noise.
2. If a command in the after_script fails, the rest aren't executed
This is quite easy to work around:
after_script:
- potentially failing command || true
- next command
Replace potentially failing command with your command and the next command will execute regardless of whether potentially failing command passed or failed.
One could argue that this behaviour is actually desired, as it gives some flexibility to the user, but it might be counterintuitive to some.

crash-stopping bash pipeline [duplicate]

This question already has answers here:
How do you catch error codes in a shell pipe?
I have a pipeline, say a|b where if a runs into a problem, I want to stop the whole pipeline.
a exiting with status 1 doesn't do this, since b often doesn't care about return codes.
e.g.
echo 1|grep 0|echo $? <-- this shows that grep did exit=1
but
echo 1|grep 0 | wc <--- wc is unfazed by grep's exit here
If I ran the pipeline as a subprocess of an owning process, any of the pipeline processes could kill the owning process. That seems a bit clumsy, but it would zap the whole pipeline.
Not possible with basic shell constructs, probably not possible in shell at all.
Your first example doesn't do what you think. echo doesn't use standard input, so putting it on the right side of a pipe is never a good idea. The $? that you're echoing is not the exit value of the grep 0. All commands in a pipeline run simultaneously. echo has already been started, with the existing value of $?, before the other commands in the pipeline have finished. It echoes the exit value of whatever you did before the pipeline.
# The first command is to set things up so that $? is 2 when the
# second command is parsed.
$ sh -c 'exit 2'
$ echo 1|grep 0|echo $?
2
Your second example is a little more interesting. It's correct to say that wc is unfazed by grep's exit status. All commands in the pipeline are children of the shell, so their exit statuses are reported to the shell. The wc process doesn't know anything about the grep process. The only communication between them is the data stream written to the pipe by grep and read from the pipe by wc.
There are ways to find all the exit statuses after the fact (the linked question in the comment by shx2 has examples) but a basic rule that you can't avoid is that the shell will always wait for all the commands to finish.
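In bash specifically, the PIPESTATUS array is one such after-the-fact mechanism; a minimal sketch (it reports the statuses once the pipeline has finished, it does not abort the pipeline early):
echo 1 | grep 0 | wc -l
# PIPESTATUS holds the exit status of every stage of the last pipeline
echo "${PIPESTATUS[@]}"    # prints: 0 1 0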
Early exits in a pipeline sometimes do have a cascade effect. If a command on the right side of a pipe exits without reading all the data from the pipe, the command on the left of that pipe will get a SIGPIPE signal the next time it tries to write, which by default terminates the process. (The two phrases to pay close attention to there are "the next time it tries to write" and "by default". If the writing process spends a long time doing other things between writes to the pipe, it won't die immediately. If it handles the SIGPIPE, it won't die at all.)
In the other direction, when a command on the left side of a pipe exits, the command on the right side of that pipe gets EOF, which does cause the exit to happen fairly soon when it's a simple command like wc that doesn't do much processing after reading its input.
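A quick illustration of both directions with standard utilities:
# right side exits first: head stops after one line, and yes is killed
# by SIGPIPE the next time it tries to write
yes | head -n 1
# left side exits first: seq closes the pipe, wc sees EOF and finishes
seq 5 | wc -l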
With direct use of pipe(), fork(), and wait3(), it would be possible to construct a pipeline, notice when one child exits badly, and kill the rest of them immediately. This requires a language more sophisticated than the shell.
I tried to come up with a way to do it in shell with a series of named pipes, but I don't see it. You can run all the processes as separate jobs and get their PIDs with $!, but the wait builtin isn't flexible enough to say "wait for any child in this set to exit, and tell me which one it was and what the exit status was".
If you're willing to mess with ps and/or /proc you can find out which processes have exited (they'll be zombies), but you can't distinguish successful exit from any other kind.
Write
set -e
set -o pipefail
at the beginning of your file.
-e will make the shell exit on an error, and -o pipefail will make the pipeline's exit status reflect a failure in any of its stages (it returns the status of the rightmost command that failed, instead of the status of the last command).
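A minimal sketch of what pipefail changes (bash syntax; set -e is left out here only so that the second echo still runs):
#!/bin/bash
false | true
echo "without pipefail: $?"    # prints 0, the status of the last command
set -o pipefail
false | true
echo "with pipefail: $?"       # prints 1, the rightmost non-zero status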

Concurrency with shell scripts in failure-prone environments

Good morning all,
I am trying to implement concurrency in a very specific environment, and keep getting stuck. Maybe you can help me.
This is the situation:
- I have N nodes that can read/write in a shared folder.
- I want to execute an application in one of them. This can be anything, like a shell script, an installed code, or whatever.
- To do so, I have to send the same command to all of them. The first one should start the execution, and the rest should see that somebody else is running the desired application and exit.
- The execution of the application can be killed at any time. This is important because it rules out relying on any cleanup step after the execution.
- If the application gets killed, the user may want to execute it again. He would then send the very same command.
My current approach is to create a shell script that wraps the command to be executed. This could also be implemented in C, but not Python or other languages, to avoid library dependencies.
#!/bin/sh
# (folder structure simplified for legibility)
mutex(){
    lockdir=".lock"
    firstTask=1 #false
    if mkdir "$lockdir" > /dev/null 2>&1
    then
        controlFile="controlFile"
        #if this is the first node, start coordinator
        if [ ! -f $controlFile ]; then
            firstTask=0 #true
            #tell the rest of nodes that I am in control
            echo "some info" > $controlFile
        fi
        # remove control file when script finishes
        trap 'rm $controlFile' EXIT
    fi
    return $firstTask
}
#The basic idea is that one task executes the desired command, stated as arguments to this script. The rest do nothing
if ! mutex ;
then
    exit 0
fi
#I am the first node and the only one reaching this, so I execute whatever was passed in
"$@"
If there are no failures, this wrapper works great. The problem is that, if the script is killed before the execution, the trap is not executed and the control file is not removed. Then, when we execute the wrapper again to restart the task, it won't work as every node will think that somebody else is running the application.
A possible solution would be to remove the control file just before the "$@" call, but that would lead to a race condition.
Any suggestion or idea?
Thanks for your help.
Edit: updated with the correct solution for future reference.
Your trap syntax looks wrong. According to POSIX, it should be:
trap [action condition ...]
e.g.:
trap 'rm $controlFile' HUP INT TERM
trap 'rm $controlFile' 1 2 15
Note that $controlFile will not be expanded until the trap is executed if you use single quotes.
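A minimal sketch of how the two ideas could be combined in the wrapper (reusing the controlFile name from the question; note that kill -9 sends SIGKILL, which can never be trapped, so the file would still be left behind in that case):
controlFile="controlFile"
# clean up on normal exit; on a catchable signal, exit so the EXIT trap fires
trap 'rm -f "$controlFile"' EXIT
trap 'exit 1' HUP INT TERM
echo "some info" > "$controlFile"
"$@"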

How to queue up a job

Is it possible to queue up a job that depends on a running job's output, so the new job waits until the running job terminates?
Hypothetical example: You should have run:
./configure && make
but you only ran:
./configure
and now you want to tell make to get on with it once configure (successfully) finishes, while you go do something useful like have a nap? The same scenario occurs with many other time-consuming jobs.
(The basic job control commands -- fg, bg, jobs, kill, &, ctrl-Z -- don't do this, as far as I know. The question arose on bash/Ubuntu, but I'd be interested in a general *nix solution, if it exists.)
I presume you're typing these commands at a shell prompt, and the ./configure command is still running.
./configure
# oops, forgot to type "make"
[ $? -eq 0 ] && make
The command [ $? -eq 0 ] will succeed if and only if the ./configure command succeeds.
(You could also use [ $? = 0 ], which does a string comparison rather than a numeric comparison.)
(If you're willing to assume that the ./configure command will succeed, you can just type make.)
Stealing and updating an idea from chepner's comment, another possibility is to suspend the job by typing Ctrl-Z, then put it in the background (bg), then:
wait %% && make
wait %% waits for "current" job, which is "the last job stopped while it was in the foreground or started in the background". This can be generalized to wait for any job by replacing %% by a different job specification.
You can simplify this to
wait && make
if you're sure you have no other background jobs (wait with no arguments waits for all background jobs to finish).
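Putting the whole recovery sequence together (a sketch; %% refers to the job you just stopped and resumed):
$ ./configure       # already running in the foreground
^Z                  # suspend it with Ctrl-Z
$ bg                # resume it in the background
$ wait %% && make   # block until configure finishes, run make only on success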
Referring to the previous process return code with $?:
test $? -eq 0 && make
I'm not sure I understand your needs, but I often use batch(1) (from the atd package on Debian) to compile, in a here document like this:
batch << EOJ
make > _make.log 2>&1
EOJ
Of course this only makes sense if your configure ran successfully and completely.
Then in some terminal I might follow the compilation with tail -f _make.log (provided I am in the right directory). You can go get a coffee, or lunch, or sleep a whole night (and even log out) during the compilation.

Resources