Automatically restart failed PBS jobs

I use PBS job arrays to submit a number of jobs. Sometimes a small number of them fail and do not run successfully. Is there a way to automatically detect the failed jobs and restart them?

pbs_server supports automatic_requeue_exit_code:
an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it completed. This allows the user to add checks that the job can run meaningfully; if not, the job script exits with the specified code and the job is requeued.
There is also a provision for requeuing jobs in the case where the prologue fails (see the prologue/epilogue script documentation).
There are probably more sophisticated ways of doing this, but they would fall outside the realm of built-in Torque options.
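
A minimal sketch, assuming the administrator picks 100 as the requeue exit code (an arbitrary, hypothetical value) and sets it with qmgr:

qmgr -c "set server automatic_requeue_exit_code = 100"

A job script can then perform a sanity check and exit with that code when it cannot run meaningfully (the paths below are placeholders):

#!/bin/bash
#PBS -N my_job
# if the input is not readable yet, ask pbs_server to requeue this job
if [ ! -r /scratch/input.dat ]; then
    exit 100
fi
./real_work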

Recover Slurm job submission script from an old job?

I accidentally removed a job submission script for a Slurm job in the terminal using the rm command. As far as I know, there is no (relatively easy) way of recovering that file anymore, and I hadn't saved it anywhere. I have used that job submission script many times before, so there are a lot of Slurm job submissions (all of them finished) that used it. Is it possible to recover the job script from an old finished job somehow?
If Slurm is configured with the ElasticSearch plugin, then you will find the submission script for all completed jobs in the ElasticSearch instance used in the setup.
Another option is to install sarchive, which archives job scripts as they are submitted.
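
As a sketch, assuming the ElasticSearch plugin writes job records to an index named slurm on localhost (both names are assumptions here), a finished job's record, including its script, could be retrieved with something like:

# job id, host, and index name are placeholders; adjust to your setup
curl -s 'http://localhost:9200/slurm/_search?q=jobid:12345&pretty'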

Oozie: kill a job after a timeout

Sorry, but I can't find the configuration option I need. I schedule Spark applications, and sometimes they do not succeed after 1 hour; in this case I want to automatically kill the task (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, it is used to delay the start of a workflow.
So is there some kind of 'living' timeout?
Oozie cannot kill a workflow that it triggered. However, you can ensure that only a single workflow runs at a time by setting concurrency = 1 in the coordinator.
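For instance, a coordinator definition along these lines (the name, frequency, dates, and app path are placeholders) keeps runs from overlapping:

<coordinator-app name="spark-coord" frequency="${coord:hours(1)}"
                 start="2020-01-01T00:00Z" end="2021-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <!-- at most one materialized workflow instance runs at a time -->
    <concurrency>1</concurrency>
  </controls>
  <action>
    <workflow>
      <app-path>${appPath}</app-path>
    </workflow>
  </action>
</coordinator-app>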
Also you can have a second Oozie workflow monitoring the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job failing or being blocked.

launching parallel bsub job in clearcase environment

ClearCase does not work in conjunction with LSF distributed multi-host parallel jobs if more than one host is specified.
Reason: ClearCase does not mount the file system on all hosts when dispatching multi-host simulations to the LSF system,
so the job is terminated because included files are not found, or output cannot be written because the file system does not exist on all hosts.
The ClearCase + LSF implementation has to guarantee by construction that the job is dispatched correctly in 100% of all cases, which is currently not the case.
Please help me with this issue.
The LSF/Clearcase integration uses the daemon.wrap program to set the view on the execution host and then launch the job inside the view. That wrapper doesn't support cross-host parallel jobs.
You'll have to work around the limitation in your job script. You can disable the daemon wrapper by making sure $CLEARCASE_ROOT is not set in your job submission environment. Then, in the execution environment, each process participating in the parallel job can call cleartool setview <options> <real job command> from the job script.
If you launch your job with blaunch then it might make things easier. Without blaunch, LSF will start a single process on the first execution host. With blaunch, LSF will launch one process per slot, and launch it on all of the allocated execution hosts. With blaunch, each process can then set the view and start the real job.
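A rough sketch of that approach (the view tag and command path are placeholders, and it assumes the view is accessible on every execution host):

#!/bin/sh
# job script, submitted with CLEARCASE_ROOT unset so the daemon wrapper is disabled;
# blaunch starts one instance of the command per allocated slot, across all hosts,
# and each instance sets the view before running the real job
blaunch cleartool setview -exec "/path/to/real_job_command" my_view_tag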
Good luck!

Gitlab pipeline jobs in the same stage are not running in parallel

We have 4 deploy jobs in the same stage that can be run concurrently. From the Gitlab docs:
The ordering of elements in stages defines the ordering of jobs' execution:
Jobs of the same stage are run in parallel.
Jobs of the next stage are run after the jobs from the previous stage complete successfully.
What happens, however, is that only one of the jobs runs at a time while the others stay pending. Is there anything else I need to do to get them to execute in parallel? I'm using a runner with a shell executor hosted on an Ubuntu 16.04 instance.
Your runner should be configured to enable concurrent jobs (see https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
concurrent = 4
Alternatively, you may want to set up several runners.
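For the first option, a minimal sketch of /etc/gitlab-runner/config.toml (the runner details are placeholders):

concurrent = 4   # global cap on jobs this runner process executes at once

[[runners]]
  name = "deploy-runner"
  url = "https://gitlab.example.com/"
  token = "RUNNER_TOKEN"
  executor = "shell"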
I also ran into this problem. I needed to run several jobs at the same time and tried everything I could find (from needs to parallel), but my jobs were still executed sequentially, with every job on standby. The solution turned out to be very simple: open the file /etc/gitlab-runner/config.toml and set concurrent to the number of parallel jobs you require.

How can I read the PBS launch script of a job that is running?

I am using Torque 4.2.5 to schedule my jobs, and I need a way to make a copy of the PBS launch script I used for jobs that are currently queued or running. The plan is to make a copy of that launch script in the output folder.
TORQUE has a job logging feature that can be configured to record job scripts used at launch time.
EDIT: if you have administrator privileges and want to read the stored file, you can inspect TORQUE_HOME/server_priv/jobid.SC.
TORQUE_HOME is usually /var/spool/torque, but it is configurable.
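
As a sketch, assuming the default TORQUE_HOME and a job id of 12345 (both placeholders), an administrator could copy the stored script into an output folder; note that depending on the TORQUE version the script may live under server_priv/jobs/ instead:

# run with administrator privileges; job id and destination are placeholders
cp /var/spool/torque/server_priv/12345.SC /path/to/output/12345_launch_script.sh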
