We split our regression tests into several parallel jobs, each of which produces a JSON artifact with data about the test results. I would like to create a new job that produces a single artifact with all regression results concatenated. The number of parallel jobs is not necessarily static. The most similar post I've seen is "Get artifacts from previous GIT jobs", but my jobs are parallel jobs that produce the same file name (although I suppose the output names could include the node number for simplicity).
I've read that a job will automatically download artifacts from previous jobs, but how can I actually load/use this data in a script?
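As a minimal sketch of the merging step, assuming each parallel job writes its results to a file whose name includes the node number and that the downloaded artifacts end up under one directory: the file pattern results_*.json and the function name merge_json_artifacts are hypothetical, not something the CI system defines.

```python
import json
from pathlib import Path


def merge_json_artifacts(artifact_dir, pattern="results_*.json"):
    """Collect every per-job JSON file found under artifact_dir into one list.

    The pattern is an assumption: it expects each parallel job to have
    written a uniquely named file, e.g. results_<node number>.json.
    """
    merged = []
    # sorted() keeps the order stable no matter how many jobs ran
    for path in sorted(Path(artifact_dir).rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            merged.append({"source": path.name, "results": json.load(handle)})
    return merged
```

The final job could then write json.dumps(merge_json_artifacts("artifacts")) to a single file and declare that file as its own artifact. Because the files are discovered by pattern, the number of parallel jobs does not have to be known in advance.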
Current Scenario: We have 5 jobs in RT executions and each job is running on 5 different agents. So we have 5 cucumber reports.
Expected Scenario: We need a consolidated Cucumber HTML Report of all 5 jobs.
How can this be resolved? We are working in Azure DevOps.
The Cucumber HTML report is a single-page HTML application. If you open it in an editor, you'll see that it contains an array of messages.
These messages can also be obtained by using the messages:target/messages.ndjson plugin. If you can obtain in a single job the messages from each parallel job you can then merge these and pass them to the html-formatter.
To merge the files you may have to filter out some messages. This definitely requires writing some code, and I don't think anyone has done it before, but it shouldn't be impossible.
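A rough sketch of such a merge, under stated assumptions: each messages.ndjson file holds one JSON envelope per line, and the run-level envelopes (meta, testRunStarted, testRunFinished) are the ones that should appear only once. Which messages actually need filtering is a guess on my part, and the function name merge_messages is hypothetical.

```python
import json


def merge_messages(paths, output_path):
    """Concatenate Cucumber message files, keeping run-level envelopes once.

    Assumption: only meta, testRunStarted and testRunFinished need
    de-duplicating. The first meta/testRunStarted and the last
    testRunFinished are kept; everything else is copied in order.
    """
    header, body, footer = {}, [], None
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                envelope = json.loads(line)
                if "meta" in envelope:
                    header.setdefault("meta", envelope)
                elif "testRunStarted" in envelope:
                    header.setdefault("testRunStarted", envelope)
                elif "testRunFinished" in envelope:
                    footer = envelope
                else:
                    body.append(envelope)
    merged = list(header.values()) + body
    if footer is not None:
        merged.append(footer)
    with open(output_path, "w", encoding="utf-8") as out:
        for envelope in merged:
            out.write(json.dumps(envelope) + "\n")
    return len(merged)
```

The merged file would then be handed to the html-formatter. This sketch does not rewrite message ids, so if ids can collide between jobs that would need handling as well.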
Let's say I have 6233 simulations to run. The commands are generated and stored in a file, one in each line. I would like to use Slurm to schedule and run these commands. However, the MaxArraySize limit is 2000. So I can't use one job array to schedule all of them.
One solution is given here, where we create four separate jobs and use arithmetic indexing into the file, with the last job having a smaller number of tasks to run (233).
Is it possible to do this using one sbatch script with one job ID?
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Following Damien's solution and examples given here, I ended up with the following line in my bash script:
The same can be done using Python (shown in the referenced page). The only difference is that the environment variables should be imported into the script.
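For the Python variant, a minimal sketch could look like the following. SLURM_ARRAY_TASK_ID is the variable Slurm sets for each array member; the function name select_command and the offset parameter (for the arithmetic indexing across several arrays mentioned in the question) are my own assumptions.

```python
import os


def select_command(command_file, offset=0, environ=os.environ):
    """Return the command on line offset + SLURM_ARRAY_TASK_ID of command_file.

    Lines are counted from 0, so this assumes the array was submitted
    with indices starting at 0. offset is a hypothetical parameter that
    shifts the index for the second, third, ... array.
    """
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    with open(command_file, encoding="utf-8") as handle:
        commands = handle.read().splitlines()
    return commands[offset + task_id]
```

The returned string can then be executed, for example with subprocess.run(command, shell=True, check=True).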
Is it possible to do this using one sbatch script with one job ID?
No. That solution will give you multiple job IDs.
I set ntasks=1 when using job arrays. Do larger ntasks help in such situations?
Yes, that is a factor that you can leverage.
Each job in the array can spawn multiple tasks (--ntasks=...). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $SLURM_PROCID, and the program must be started with srun. Each task in a job member of the array will run in parallel. How large the job can be will depend on the MaxJobsize limit defined on the cluster/partition/qos you have access to.
Another option is to chain the tasks inside each job of the array, with a Bash loop (for i in $(seq ...) ; do ...; done). In that case, the line number in the command file must be computed from $SLURM_ARRAY_TASK_ID and $i. Each task in a job member of the array will run serially. How large the job can be will depend on the MaxWall limit defined on the cluster/partition/qos you have access to.
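The index arithmetic is the same for both options, so here is a small sketch of it. It assumes array indices start at 0 and that every array job handles a fixed number of commands (per_job); the function names are hypothetical. The inner index would be $SLURM_PROCID in the parallel case and the loop variable $i in the serial case.

```python
import math


def array_size(total_commands, per_job):
    """Number of array jobs needed when each job handles per_job commands."""
    return math.ceil(total_commands / per_job)


def line_index(array_task_id, inner_index, per_job, total_commands):
    """0-based line of the command file for one task inside one array job.

    Returns None when the index falls past the end of the file, which
    can only happen in the last array job.
    """
    index = array_task_id * per_job + inner_index
    return index if index < total_commands else None
```

With the 6233 commands from the question and 4 commands per job, array_size(6233, 4) gives 1559 array jobs, which stays under a MaxArraySize of 2000. In the last job (array index 1558) only inner index 0 maps to a real line, so the script has to skip the others.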
This is a bit of a backwards approach to snakemake, whose main paradigm is "one job -> one output", but I need to run my script many times in parallel on the same input matrix on a Slurm cluster. How do I achieve that?
I tried specifying multiple threads, multiple nodes, each time indicating one cpu per task, but it never submits an array of many jobs, just an array of one job.
I don't think there is a nice way to submit an array job like that. In snakemake, you need to specify a unique output for each job. But you can have the same input. If you want 1000 runs of a job:
    # samples is assumed to be a list of sample names defined earlier
    ids = range(1000)

    rule all:
        input: expand('output_{sample}_{id}', sample=samples, id=ids)

    rule simulation:
        input: 'input_{sample}'
        output: 'output_{sample}_{id}'
        shell: 'echo {input} > {output}'
If that doesn't help, provide more information about the rule/job you are trying to run.
I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in keras and want to do 10 fold cross validation to check training stability of the model. Usually I create snakemake workflows and execute them on a slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job and resubmit once in a while to not block the GPUs. The goal of this would be to train the model iteratively with short running jobs.
Now to my questions:
Is there a way to resubmit a job a certain number of times/until a condition is met?
Is there another clever way to train a model iteratively without having to manually submit the job?
For this, you need to submit the job with the command
llsubmit job.sh
Submit the shell script or batch job file as many times as needed. Once a job finishes and its resources become available, the next copy of the same script (already submitted and waiting in the queue) starts automatically.
Here are a few suggestions:
Just train your network. It's up to the scheduler to try not to block the GPUs and running 10 short jobs vs 1 long job will probably lead to the same priority.
You can specify --restart-times to rerun a failed job up to a given number of times. The catch is that snakemake also removes the outputs of failed jobs. The workaround is to checkpoint your model to a temp file (not listed in the output directive of the rule) and exit your training with an error to signal to snakemake that it needs to run again. The inelegant part is that you have to set the restart count to a large value, or make sure your training code knows that it is running the final attempt and needs to save the actual output. You can access the attempt number as a resource; I'm not sure it is available in other directives. Also, any job that fails will be resubmitted, which is not a great option during development.
You can make your checkpoint files outputs. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint and so on. With a fancy enough input function this can be implemented in one rule where 1.checkpoint depends on nothing (or your training data perhaps).
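The input function for the last suggestion could be sketched as below. This is only the dependency logic in plain Python; the file naming (1.checkpoint, 2.checkpoint, ...) follows the description above, while the function name previous_checkpoint and the wildcard name step are assumptions.

```python
def previous_checkpoint(step):
    """Inputs needed to produce '<step>.checkpoint'.

    Step 1 depends on nothing here (it could return the training data
    instead); every later step depends on the checkpoint before it.
    """
    step = int(step)  # wildcard values arrive as strings
    if step <= 1:
        return []
    return [f"{step - 1}.checkpoint"]
```

In a Snakefile this would be used as something like input: lambda wildcards: previous_checkpoint(wildcards.step) in a rule whose output is '{step}.checkpoint', with final.checkpoint depending on 10.checkpoint. Each link in the chain then becomes its own short cluster job.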
In my gitlab-ci.yml file, I have defined 3 stages, and the 2nd and 3rd stages have 3 jobs each, resulting in the following structure:
The 1st and 2nd stages work as I intended. For the 3rd stage, however, what I'd actually like to have is something like this (the image is a mockup of course), i.e. "parallel sequences" of jobs if you will:
That is, I want "deploy-b" to start as soon as "build-b" is done, without waiting for the other build tasks to complete.
Is that possible with GitLab pipelines? (Apart from the obvious solution of defining just 2 stages, the second being "Build-and-Deploy", where I just "merge" the script steps of the current build-* and deploy-* jobs.)
This feature was added in the new GitLab release (v12.2)
For GitLab versions before 12.2, no: this is not possible by design, because a stage only starts once the previous one is done.