I have a job running on a Linux machine managed by Slurm.
Now that the job has been running for a few hours, I realize that I underestimated the time required for it to finish, so the value of the --time argument I specified is not enough. Is there a way to add time to an existing running job through Slurm?
Use the scontrol command to modify a job
scontrol update jobid=<job_id> TimeLimit=<new_timelimit>
Use the Slurm time format, e.g. for 8 days and 15 hours: TimeLimit=8-15:00:00
Requires admin privileges on some machines.
On most machines, regular users are only allowed to do this if the job is not running yet.
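For example, to check the current limit of a hypothetical job 1234567 and then extend it to 12 hours (the job ID and the new value are placeholders):

scontrol show job 1234567 | grep TimeLimit
scontrol update jobid=1234567 TimeLimit=12:00:00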
To build on the example provided above, you can also use "+" and "-" to increment / decrement the TimeLimit.
From the [scontrol man page](https://slurm.schedmd.com/scontrol.html):
either specify a new time limit value or precede the time and equal sign with a "+" or "-" to increment or decrement the current time limit (e.g. "TimeLimit+=30")
We regularly get requests like "I need 3 more hours for job XXXXX to finish!!!", which would translate to:
scontrol update job=XXXXX TimeLimit=+03:00:00
Related
Dear UNIX/PBS experts:
I am a user of a UNIX HPC system (CentOS Linux 7 (Core), Linux 3.10.0-693.5.2.el7.x86_64) and I do not have any root privileges.
Various jobs have been submitted to the HPC system and almost all resources are currently in use.
Jobs from other users may run for weeks while my submitted job would finish in less than a day.
My goal is to have my job run as soon as the first resources are freed, instead of waiting for all the other users' jobs to finish.
My submitted job has the ID 66005.pbs.
However, the last job running at this moment has the ID 55004.pbs.
When I check the status with qstat 55005,
I obtain: qstat: Unknown Job Id 55005.pbs
Thus my question is whether it is possible to change the ID of job 66005.pbs to 55005.pbs, and whether this would allow my job to run.
If yes, how can this be achieved?
If not, are there any other solutions/alternatives for making sure that my jobs run before those of other users in the queue?
Thank you very much for your help and any suggestions.
The good thing about the computer system is that it is not human. It would be unfair to run your job (which was clearly submitted after other users' jobs) before theirs, and because of that, no, it is not possible to change your job ID.
You can work with your admin to move the job to a higher priority queue instead.
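For reference, the move itself is typically done with PBS's qmove command, which an admin (or the job owner, where the destination queue permits it) would run roughly like this; the queue name express is a made-up example:

qmove express 66005.pbs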
Using Torque, a user can specify a slot limit when submitting a job array by using %, e.g. qsub job.sh -t 1-20%5 will create a job array with 20 jobs, but with only 5 running simultaneously.
Currently I work with PBS Professional, but unfortunately, as far as I can see, the % option is not supported. How can I achieve behavior similar to Torque's % as simply as possible?
I have submitted an array job as follows:
sbatch --array=1-100%5 ...
which will limit the number of simultaneously running tasks to 5. The job is now running, and I would like to change this number to 10 (i.e. I wish I'd run sbatch --array=1-100%10 ...).
The documentation on array jobs mentions that you can use scontrol to change options after the job has started. Unfortunately, it's not clear what this option's variable name is, and I don't think it is listed in the documentation of the sbatch command here.
Any pointers well received.
You can change the array throttling limit with the following command:
scontrol update ArrayTaskThrottle=<count> JobId=<jobID>
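For example, to raise the throttle of a hypothetical array job 1234567 to 10 and then confirm the change (the job ID is a placeholder):

scontrol update ArrayTaskThrottle=10 JobId=1234567
scontrol show job 1234567 | grep ArrayTaskThrottle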
I have a website that is live. I have a cron job that executes every 24 hours. The cron job fetches and analyzes data from a database table.
The problem is that the website gets very slow while the cron job is running, and goes back to normal after that. It gives me the error "Too many connections" during this time.
I set the maximum allowed connections to 500 in MySQL. The number of active connections I checked in MySQL was below that limit during that time.
I am unable to find any relevant help or even a clue to think in a particular direction.
Update:
I noticed one thing: the number of MySQL connections continuously increases during this time, although it stays below the maximum limit.
The nice command can change the priority of a process. You want to lower the priority of the background job so that it avoids running while the website is busy. E.g.
0 3 * * * nice -n 20 myjob arg arg
to execute myjob arg arg with lowered priority every day at 3am.
EDIT: Although, if the job is spending most of its time in database queries, this will not affect it much. MySQL has a LOW_PRIORITY flag for INSERT and UPDATE statements that does roughly the same thing for those queries.
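As a concrete sketch, the crontab entry could combine nice with ionice to also lower I/O priority, and the job's SQL could use the LOW_PRIORITY flag; the job path, database, and table names below are all made up:

# crontab: run the nightly job at 3am with minimum CPU priority and idle I/O priority
0 3 * * * nice -n 19 ionice -c 3 /path/to/myjob arg arg
# inside the job, a LOW_PRIORITY insert might look like:
mysql mydb -e "INSERT LOW_PRIORITY INTO daily_summary SELECT DATE(created_at), COUNT(*) FROM page_views GROUP BY DATE(created_at)"

Note that LOW_PRIORITY only has an effect on storage engines with table-level locking (e.g. MyISAM), and ionice only helps with I/O schedulers that honour I/O priorities.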
I am trying to run qsub jobs on an SGE (Sun Grid Engine) cluster that supports a maximum of 688 jobs. I would like to know if there is any way to find out the total number of jobs that are currently running on the cluster, so I can submit jobs based on the current cluster load.
I plan to do something like: sleep for 1 minute, check whether the number of jobs on the cluster is < 688, and then submit further jobs.
And just to clarify, my question pertains to knowing the total number of jobs submitted to the cluster, not just the jobs I have currently submitted.
Thanks in advance.
You can use qstat to list the jobs of all users; combined with awk and wc, this can be used to find out the total number of jobs on the cluster:
qstat -u "*" | awk '{if ($5 == "r" || $5 == "qw") print $0;}' | wc -l
The above command also takes into account jobs that are queued and waiting to be scheduled on a compute node.
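If you do want the sleep-and-check loop described in the question, a minimal sketch built around the command above could look like this (the 688 limit, the 60-second interval, and the jobs_to_submit.txt list of job scripts are all placeholders):

#!/bin/bash
MAX_JOBS=688
while read -r script; do
    # wait until the cluster-wide count of running + queued jobs drops below the limit
    while [ "$(qstat -u '*' | awk '$5 == "r" || $5 == "qw"' | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 60
    done
    qsub "$script"
done < jobs_to_submit.txt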
However, the cluster sysadmins could disallow users from checking on jobs that don't belong to them. You can verify whether you can see other users' jobs by running:
qstat -u "*"
If you know for a fact that another user is running a job and yet you can't see it with the above command, it's most likely that the sysadmins disabled that option.
Afterthought: from my understanding, you're just a regular cluster user, so why bother submitting jobs this way? Why not just submit all the jobs you want: if the cluster can't schedule them right away, it will simply put them in the qw state and schedule them whenever SGE deems most appropriate.
Depending on how the cluster is configured, using a job array (the -t option for qsub) could get around this limit.
I have similar limits set for the maximum number of jobs a single user can submit. This limit pertains to individual instances of qsub, not to a single job array submission with potentially many tasks (that limit is set via another configuration variable, max_aj_tasks).
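For example, a single array submission covering all 688 tasks could also be capped with -tc so that only a limited number run concurrently (the script name and the cap of 100 are placeholders, and -tc availability depends on the Grid Engine version):

qsub -t 1-688 -tc 100 job.sh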