I have a job that runs integration tests on slaves labeled "IntegrationTest".
When the job starts, I use Node.setLabelString(String) to change the slave's label from "IntegrationTest" to "_out_IntegrationTest". This is meant to block the slave from running the next round of integration tests, because we need to revert it to a clean environment first.
The problem is that the next Integration Test job in the queue can still take this slave, even though its label was set to "_out_IntegrationTest" by Node.setLabelString(String).
I am not sure, but I suspect the label is modified and the queue is simply not aware of the change.
When I change the slave's label to "_out_IntegrationTest" manually in the web UI, the Integration Test job does not run on this slave.
When the label is changed via Node.setLabelString(String), the job can still run on it.
Note:
Queue.maintain() is called after Node.setLabelString(String) is invoked.
I have a similar situation at my workplace, where I need to reboot a slave after running a job. I have a separate job that reboots the slave it runs on. As the first build step in the main job, I schedule a parameterized build of the reboot job, using the "build on the same node" parameter. The reboot job is configured with a very high priority (via the Priority Sorter Plugin), so it is guaranteed to be the next job run on that node. The remaining build steps of the main job then perform its required tasks.
When the main job finishes, the reboot job comes in and reboots the slave - here you could perform the reversion of your slave to the clean environment.
ClearCase does not work with LSF distributed multi-host parallel jobs if more than one host is specified.
Reason: ClearCase does not mount the file system on all hosts when multi-host simulations are dispatched to the LSF system.
As a result, the job is terminated because included files are not found, or output cannot be written, since the file system does not exist on all hosts.
The ClearCase + LSF implementation has to guarantee by construction that the job is dispatched correctly in 100% of cases, which is currently not the case.
Please help me with this issue.
The LSF/ClearCase integration uses the daemon.wrap program to set the view on the execution host and then launch the job inside the view. That wrapper doesn't support cross-host parallel jobs.
You'll have to try to work around the limitation in your job script. You can disable the daemon wrapper by making sure $CLEARCASE_ROOT is not set in your job submission environment. Then, in the execution environment, each process participating in the parallel job can call cleartool setview <options> <real job command> from the job script.
If you launch your job with blaunch then it might make things easier. Without blaunch, LSF will start a single process on the first execution host. With blaunch, LSF will launch one process per slot, and launch it on all of the allocated execution hosts. With blaunch, each process can then set the view and start the real job.
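As a rough illustration of the per-process workaround, here is a minimal Python sketch that builds the command line each process could run to enter a view before starting the real job. The view tag, the job command, and the helper names (setview_argv, job_environment) are hypothetical; the sketch assumes cleartool setview accepts -exec followed by a single command string and then the view tag, so check that against your ClearCase version.

```python
import shlex


def setview_argv(view_tag, job_command):
    """Build argv for: cleartool setview -exec "<job command>" <view tag>.

    view_tag and job_command are placeholders for illustration only.
    """
    if not view_tag:
        raise ValueError("a view tag is required")
    # Assumption: the command to run is passed as one quoted string.
    command_string = " ".join(shlex.quote(part) for part in job_command)
    return ["cleartool", "setview", "-exec", command_string, view_tag]


def job_environment(environ):
    """Return a copy of the submission environment without CLEARCASE_ROOT,
    as suggested above for disabling the daemon wrapper."""
    return {k: v for k, v in environ.items() if k != "CLEARCASE_ROOT"}


if __name__ == "__main__":
    # Hypothetical view tag and job command.
    print(setview_argv("my_view", ["./run_simulation", "--input", "case 1.dat"]))
```

Each process started by blaunch could then execute the resulting argv (for example with subprocess.run) instead of the bare job command.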
Good luck!
We have 4 deploy jobs in the same stage that can be run concurrently. From the GitLab docs:
The ordering of elements in stages defines the ordering of jobs' execution:
Jobs of the same stage are run in parallel.
Jobs of the next stage are run after the jobs from the previous stage complete successfully.
What happens, however, is that only one of the jobs runs at a time and the others stay pending. Is there something else I need to do to get them to execute in parallel? I'm using a runner with a shell executor hosted on an Ubuntu 16.04 instance.
Your runner should be configured to allow concurrent jobs (see https://docs.gitlab.com/runner/configuration/advanced-configuration.html):
concurrent = 4
Alternatively, you may want to set up several runners.
I also ran into this problem. I needed to run several tasks at the same time and tried everything I could find (from needs to parallel), but my tasks still ran sequentially, and every task stayed pending. The solution turned out to be very simple: open /etc/gitlab-runner/config.toml and set concurrent to the number of parallel tasks you need.
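For anyone who wants to script that change, here is a minimal Python sketch that rewrites the global concurrent line in the text of a config.toml. The helper name set_concurrent and the sample config are made up for illustration, and the sketch works on plain text rather than parsing TOML, so treat it as a starting point only.

```python
import re

# Matches a top-level "concurrent = N" line (spaces or tabs around "=").
_CONCURRENT_RE = re.compile(r"^concurrent[ \t]*=[ \t]*\d+[ \t]*$", re.MULTILINE)


def set_concurrent(config_text, jobs):
    """Return config_text with the global `concurrent` setting set to `jobs`."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    line = f"concurrent = {jobs}"
    if _CONCURRENT_RE.search(config_text):
        # Replace only the first occurrence of the setting.
        return _CONCURRENT_RE.sub(line, config_text, count=1)
    # No existing setting: add it at the top, before any [[runners]] section.
    return line + "\n" + config_text


if __name__ == "__main__":
    # Hypothetical config contents for illustration.
    sample = (
        "concurrent = 1\n"
        "check_interval = 0\n"
        "\n"
        "[[runners]]\n"
        '  name = "shell-runner"\n'
        '  executor = "shell"\n'
    )
    print(set_concurrent(sample, 4))
```

To apply it, read /etc/gitlab-runner/config.toml, pass the text through set_concurrent, and write the result back with appropriate permissions.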
I use PBS job arrays to submit a number of jobs. Sometimes a small number of jobs fail and do not run successfully. Is there a way to automatically detect the failed jobs and restart them?
pbs_server supports automatic_requeue_exit_code:
an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it as completed. This allows the user to add some additional checks that the job can run meaningfully, and if not, then the job script exits with the specified code to be requeued.
There is also a provision for requeuing jobs in the case where the prologue fails (see the prologue/epilogue script documentation).
There are probably more sophisticated ways of doing this, but they would fall outside the realm of built-in Torque options.
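To make the requeue-exit-code idea concrete, here is a minimal Python sketch of a job script that runs a sanity check first and returns the requeue code if the check fails. The value 99, the helper names, and the scratch-file check are assumptions for illustration; the code must match whatever your admin configured as automatic_requeue_exit_code.

```python
import tempfile

# Hypothetical value: must match the automatic_requeue_exit_code
# configured on pbs_server by the admin.
REQUEUE_EXIT_CODE = 99


def run_job(precheck, work, requeue_code=REQUEUE_EXIT_CODE):
    """Return the exit code the job script should finish with.

    precheck: callable returning True if the job can run meaningfully.
    work: callable doing the real job; returns a normal exit code.
    """
    if not precheck():
        # Ask pbs_server to requeue instead of marking the job completed.
        return requeue_code
    return work()


def scratch_is_writable():
    """Example pre-check (an assumption): can we create a temporary file?"""
    try:
        with tempfile.TemporaryFile():
            return True
    except OSError:
        return False


if __name__ == "__main__":
    code = run_job(scratch_is_writable, lambda: 0)
    print("exit code:", code)
    # A real job script would end with: sys.exit(code)
```

The same pattern works for array jobs, since each sub-job decides independently whether to exit with the requeue code.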
For some reason the cluster sometimes seems to misbehave: I suddenly see a surge in the number of YARN jobs. We are using a Linux-based HDInsight Hadoop cluster, and we run Azure Data Factory jobs that execute Hive scripts against it. On average there are about 50 YARN apps running and 40-50 pending at any given time. No one uses this cluster for ad hoc queries.
Once every few days, however, we notice something odd. The number of YARN apps suddenly starts increasing, both running and pending, but especially pending. The running count goes above 100, and the pending count exceeds 400, sometimes even 500. We have a script that kills all YARN apps one by one, but it takes a long time and is not really a solution. In our experience, the only fix when this happens is to delete and recreate the cluster.
It may be that the cluster's response time is delayed for a while (the Hive component especially). But even if ADF keeps retrying a failing slice several times, is it possible that the cluster stores all the slice execution requests that ADF considers failed in a pool and tries to run them when it can? That is the only explanation I can think of. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job that is the actual Hive script you are trying to run. The YARN queue can get deadlocked if too many parent jobs run at one time and fill up the cluster capacity, so that no child job (the actual work) is able to spin up an Application Master and no work gets done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though it obviously was not.
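The check described above can be sketched in Python as follows. This is only an illustration: it assumes you already have a list of application records with name, queue, and state fields (for example, exported from the ResourceManager), and it assumes the parent jobs are recognizable by the name prefix TempletonControllerJob, which you should verify on your cluster.

```python
# Assumption: WebHCat parent (launcher) jobs show up with this name prefix.
TEMPLETON_PREFIX = "TempletonControllerJob"


def is_templeton_deadlock(apps, queue="default"):
    """Return True if every RUNNING app in `queue` is a Templeton parent job.

    apps: list of dicts with "name", "queue" and "state" keys (assumed shape).
    """
    running = [
        app for app in apps
        if app["queue"] == queue and app["state"] == "RUNNING"
    ]
    if not running:
        # Nothing is running, so the queue is not deadlocked in this sense.
        return False
    return all(app["name"].startswith(TEMPLETON_PREFIX) for app in running)


if __name__ == "__main__":
    # Hypothetical snapshot: only parent jobs running, child job still waiting.
    snapshot = [
        {"name": "TempletonControllerJob", "queue": "default", "state": "RUNNING"},
        {"name": "TempletonControllerJob", "queue": "default", "state": "RUNNING"},
        {"name": "HIVE-query-1", "queue": "default", "state": "ACCEPTED"},
    ]
    print(is_templeton_deadlock(snapshot))
```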
If you are already in a deadlock, you can try raising the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run and slowly drain the queue.
As a proper long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
You can create the queue via Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure
I have a question I would like help with.
My Spark setup: 1 master, 2 slaves.
I have a streaming job deployed from the master to the two slaves; one executor runs the task while the other stays pending.
Goal
My job often runs into OOM problems, so I want the other slave to take over and execute the job.
Problem
When one slave crashes, its status always stays LAUNCHING, so I have to re-run ./start-slave.sh to recover it. I don't think that is a smart way to solve it, so
I want to automatically relaunch a slave when it is crashed by a submitted job.