I'm running a snakemake pipeline on Slurm and am observing a strange error:
Failed to solve the job scheduling problem with pulp
Without Slurm, the pipeline works perfectly fine. However, when I try to run it on Slurm, the job scheduling misbehaves: the scheduler skips the first job (Job 0) and jumps directly to Job 1. Since Job 0 was never run, the input files for Job 1 are missing.
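For reference, a minimal sketch of the kind of invocation assumed here (the profile name, job count, and flags are placeholders, not taken from the actual setup):

    # Hypothetical invocation; "slurm" is a placeholder profile name.
    snakemake --profile slurm --jobs 10

    # For context: the "Failed to solve the job scheduling problem with pulp" message
    # comes from Snakemake's default ILP scheduler; --scheduler greedy selects the
    # alternative greedy scheduler instead.
    snakemake --profile slurm --jobs 10 --scheduler greedy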
Any help/direction would be much appreciated.
I'm seeing some of my jobs stuck in the "Running command..." stage even after all tasks have finished and "0" are shown as running.
What might be causing this?
Which logs should I be looking at to resolve this?
Thanks
I accidentally removed a job submission script for a Slurm job in the terminal using the rm command. As far as I know, there is no (relatively easy) way of recovering that file anymore, and I hadn't saved it anywhere else. I have used that job submission script many times before, so there are a lot of finished Slurm jobs that used it. Is it possible to recover the job script from an old finished job somehow?
If Slurm is configured with the Elasticsearch job completion plugin, you will find the submission script of every completed job in the Elasticsearch instance used in that setup.
Another option is to install sarchive, which archives job scripts as they are submitted (so it only helps for jobs submitted after it is set up).
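A rough sketch of pulling a job record back out of Elasticsearch, assuming the jobcomp/elasticsearch plugin writes to an index named slurm on localhost (host, index name, and field layout are assumptions that depend on your configuration):

    # Hypothetical query; adjust the host, index name, and job ID to your setup.
    # The free-text query matches the job ID across all indexed fields.
    curl -s 'http://localhost:9200/slurm/_search?q=123456' | jq '.hits.hits[]._source'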
I'm running a PBS job (Python) on the cluster using the qsub command. How can I restart the same job from the step where it failed?
Any help would be highly appreciated.
Most likely, you cannot.
Restarting a job from where it failed requires a checkpoint file.
For that, checkpointing support has to be explicitly configured in your HPC environment, and the job has to be submitted with additional command-line arguments.
See http://docs.adaptivecomputing.com/torque/3-0-5/2.6jobcheckpoint.php
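As an illustration of those additional arguments, a hedged sketch of a Torque submission with checkpointing requested (the exact -c attributes and the qhold/qrls-based restart depend on your Torque version and on the checkpoint mechanism, e.g. BLCR, that the admins have configured):

    # Hypothetical submission asking for periodic checkpoints every 30 minutes;
    # this only works if checkpointing is set up cluster-wide.
    qsub -c enabled,periodic,interval=30 my_job.sh

    # With BLCR-style checkpointing in place, qhold checkpoints and holds a job,
    # and qrls releases it to resume from the last checkpoint.
    qhold <job_id>
    qrls <job_id>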
We have 4 deploy jobs in the same stage that can run concurrently. From the GitLab docs:
The ordering of elements in stages defines the ordering of jobs' execution:
Jobs of the same stage are run in parallel.
Jobs of the next stage are run after the jobs from the previous stage complete successfully.
What happens, however, is that only one of the jobs runs at a time and the others stay pending. Is there something else I need to do to get them to execute in parallel? I'm using a runner with a shell executor hosted on an Ubuntu 16.04 instance.
Your runner should be configured to allow concurrent jobs (see https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
concurrent = 4
Alternatively, you may want to set up several runners.
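For context, a sketch of where that setting lives in /etc/gitlab-runner/config.toml (runner name, URL, token, and the per-runner limit are illustrative values):

    # Global cap on how many jobs this runner process executes at once.
    concurrent = 4

    [[runners]]
      name = "shell-runner"                  # illustrative name
      url = "https://gitlab.example.com/"    # your GitLab instance
      token = "..."                          # runner registration token
      executor = "shell"
      limit = 4                              # optional per-runner cap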
I also ran into this problem. I needed to run several jobs at the same time and tried everything I could find (from needs to parallel), but my jobs were still executed sequentially, each one waiting in pending. The solution turned out to be very simple: open /etc/gitlab-runner/config.toml and set concurrent to the number of parallel jobs you need.
I use PBS job arrays to submit a number of jobs. Sometimes a small number of them fail and do not run successfully. Is there a way to automatically detect the failed jobs and restart them?
pbs_server supports automatic_requeue_exit_code:
an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it as completed. This allows the user to add some additional checks that the job can run meaningfully, and if not, then the job script exits with the specified code to be requeued.
There is also a provision for requeuing jobs in the case where the prologue fails (see the prologue/epilogue script documentation).
There are probably more sophisticated ways of doing this, but they would fall outside the realm of built-in Torque options.
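A minimal sketch of how a job script can take advantage of this, assuming the admin has set automatic_requeue_exit_code to 99 (the actual value is whatever your site has configured; PBS_ARRAYID is Torque's array-index variable):

    #!/bin/bash
    #PBS -N array_job

    # 99 is a stand-in for the site-configured automatic_requeue_exit_code.
    REQUEUE_CODE=99

    # Pre-flight check: if the input for this array index is missing, exit with
    # the requeue code so pbs_server puts the job back in the queue instead of
    # marking it completed.
    INPUT="$PBS_O_WORKDIR/input.${PBS_ARRAYID}.dat"
    if [ ! -f "$INPUT" ]; then
        exit "$REQUEUE_CODE"
    fi

    # ... actual work goes here ...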