ClearCase does not work in conjunction with LSF distributed multi-host parallel jobs when more than one host is specified.
Reason: ClearCase does not mount the file system on all hosts when dispatching multi-host simulations to the LSF system, so the job is terminated because included files are not found, or output cannot be written, since the file system does not exist on all hosts.
The ClearCase + LSF integration has to guarantee by construction that the job is dispatched correctly in 100% of cases, which is currently not so.
Please help me with this issue.
The LSF/ClearCase integration uses the daemon.wrap program to set the view on the execution host and then launch the job inside the view. That wrapper doesn't support cross-host parallel jobs.
You'll have to work around the limitation in your job script. You can disable the daemon wrapper by making sure $CLEARCASE_ROOT is not set in your job submission environment. Then, in the execution environment, each process participating in the parallel job can itself call cleartool setview <options> <real job command>.
Launching your job with blaunch might make things easier. Without blaunch, LSF starts a single process on the first execution host. With blaunch, LSF launches one process per slot, across all of the allocated execution hosts, and each of those processes can then set the view and start the real job.
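As a rough illustration, a job script along these lines might do it. This is an untested sketch: the view tag my_view, the command my_parallel_sim, and the slot count are placeholders.

    #!/bin/sh
    #BSUB -n 8
    # Submit with $CLEARCASE_ROOT unset so the daemon wrapper stays out of
    # the way. blaunch starts one instance per allocated slot, on every
    # execution host listed in $LSB_DJOB_HOSTFILE; each instance enters
    # the view locally before running the real command.
    blaunch -u "$LSB_DJOB_HOSTFILE" \
        cleartool setview -exec "my_parallel_sim" my_view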
Good luck!
We have 4 deploy jobs in the same stage that can be run concurrently. From the GitLab docs:
The ordering of elements in stages defines the ordering of jobs' execution:
Jobs of the same stage are run in parallel.
Jobs of the next stage are run after the jobs from the previous stage complete successfully.
What happens, however, is that only one of the jobs runs at a time while the others stay pending. Is there anything else I need to do to get them to execute in parallel? I'm using a runner with a shell executor hosted on an Ubuntu 16.04 instance.
Your runner should be configured to enable concurrent jobs (see https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
concurrent = 4
Alternatively, you may want to set up several runners.
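For example, to allow four concurrent jobs with a shell executor, the relevant part of /etc/gitlab-runner/config.toml might look like the sketch below (the runner name is a placeholder; url and token are omitted). You may need to restart the gitlab-runner service for the change to take effect.

    concurrent = 4               # global limit on jobs run at the same time

    [[runners]]
      name = "my-shell-runner"   # placeholder
      executor = "shell"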
I also ran into this problem. I needed to run several tasks at the same time and used everything I could find (from needs to parallel); however, my jobs were still executed sequentially, with every job waiting in pending. The solution turned out to be very simple: open /etc/gitlab-runner/config.toml and set concurrent to the number of parallel jobs you need.
I use PBS job arrays to submit a number of jobs. Sometimes a small number of jobs get messed up and do not run successfully. Is there a way to automatically detect the failed jobs and restart them?
pbs_server supports automatic_requeue_exit_code:
an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it complete. This allows the user to add additional checks that the job can run meaningfully; if not, the job script exits with the specified code and is requeued.
There is also a provision for requeuing jobs in the case where the prologue fails (see the prologue/epilogue script documentation).
There are probably more sophisticated ways of doing this, but they would fall outside the realm of built-in Torque options.
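As an untested sketch, assuming the administrator has set the requeue code to 99 with qmgr -c 'set server automatic_requeue_exit_code = 99', an array job script could bail out early and be requeued like this (the pre-flight check and program name are placeholders):

    #!/bin/bash
    #PBS -t 1-100
    # Exit with the admin-defined requeue code when a sanity check fails,
    # so pbs_server requeues this array sub-job instead of completing it.
    if [ ! -r "input.$PBS_ARRAYID" ]; then   # placeholder pre-flight check
        exit 99
    fi
    ./run_simulation "input.$PBS_ARRAYID"    # placeholder real work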
I have a job running integration tests on slaves with the label "IntegrationTest".
When the job starts, I use Node.setLabelString(String) to change the slave's label from "IntegrationTest" to "_out_IntegrationTest", in order to block this slave from running the next round of integration tests, because we need to revert the slave to a clean environment before each run.
The problem is that the next Integration Test job in the queue can still take this slave, even though its label was set to "_out_IntegrationTest" by Node.setLabelString(String).
I suspect the label is modified, but the queue is not aware of the change.
When I change the slave's label to "_out_IntegrationTest" manually from the web UI, the Integration Test job will not run on that slave.
When the label is changed via Node.setLabelString(String), the job can still run on it.
Note:
Queue.maintain() is called after Node.setLabelString(String) is invoked.
I have a similar situation in my workplace, where I need to reboot a slave after running a job, so I have a separate job that reboots the slave it runs on. As the first build step of the main job, I schedule a parameterized build of the reboot job, using the parameter "build on the same node". The reboot job is configured with a very high priority (via the Priority Sorter Plugin), so it is guaranteed to be the next job run on that node. The remaining build steps of the main job then perform its required tasks.
When the main job finishes, the reboot job comes in and reboots the slave - this is where you could revert your slave to the clean environment.
I am using TORQUE 4.2.5 to schedule my jobs, and I need a way to make a copy of the PBS launch script I used for jobs that are currently queued or running. The plan is to place a copy of that launch script in the output folder.
TORQUE has a job logging feature that can be configured to record job scripts used at launch time.
EDIT: if you have administrator privileges and want to read the stored file, you can inspect TORQUE_HOME/server_priv/<jobid>.SC
TORQUE_HOME is usually /var/spool/torque but is configurable.
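As a minimal sketch, assuming administrator access on the pbs_server host and that the stored script lives at the path given above (the job id 12345 and the output folder are placeholders):

    # Copy the stored submit script of job 12345 into the output folder.
    TORQUE_HOME=${TORQUE_HOME:-/var/spool/torque}
    cp "$TORQUE_HOME/server_priv/12345.SC" /path/to/output/folder/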
What happens if I try to run a multithreaded job in 1 SGE slot? Would it fail to start multiple threads? Or would it still start those threads and potentially overload the SGE cluster node, because it would run more threads than there are slots?
I know I should use the -pe threaded <nrThreads> parameter, but I am running a program for which I am not sure how many threads it uses at every step.
It's been a while since I've used SGE, but at least back then, a job that launched more computational threads than it had been allocated would not be prevented from launching them; it would usually just steal CPU time from other jobs.
Perhaps current SGE versions are capable of using cpusets, which allow the administrator to limit the CPUs used by a job. At least the Slurm scheduler can do this.
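To be on the safe side, you can request slots through a parallel environment and cap the usual threading knobs to match. A minimal sketch, assuming your site defines a PE named "threaded" (the program name is a placeholder):

    #!/bin/bash
    #$ -pe threaded 4
    # SGE sets $NSLOTS to the number of granted slots; exporting it as
    # OMP_NUM_THREADS keeps OpenMP programs from oversubscribing the node.
    export OMP_NUM_THREADS=$NSLOTS
    ./my_multithreaded_program   # placeholder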