I tried adding a job to sqe with qsub, but it seems to be stuck: the state is shown as 'dt'. What could be wrong? I cannot run any more jobs because of this. How can I remove the job from the queue?
Looks like a Grid Engine status. If that is the case, the 'd' means the job is in the process of being deleted, while the 't' means the job is being transferred to the node where it is supposed to run. This combination usually occurs only if the node crashed while the job was being transferred to it.
You should be able to delete it with qdel -f JOBID if you are the administrator or the cluster is configured to allow forced deletion by users. If not, ask your sysadmin/support to do it for you.
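A minimal sketch (JOBID stands for the numeric ID qsub printed; the ENABLE_FORCED_QDEL detail refers to the SGE qmaster configuration, so check your site's setup):

    # Check the job's state; the state column should show 'dt'
    qstat

    # Force-delete the stuck job; for a regular user this only works if the
    # qmaster configuration permits it (e.g. ENABLE_FORCED_QDEL in qmaster_params)
    qdel -f JOBID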
Say I want to run a job on the cluster: job1.m
Slurm handles the batch jobs, and I'm loading Mathematica to save the output file job1.csv.
I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters, and tell it to save data to job1_edited.csv. Then I re-submit job1.m.
Now I have two batch jobs in the queue.
What will happen to my output files? Will job1.csv be data from the original job1.m file? And will job1_edited.csv be data from the edited file? Or will job1.csv and job1_edited.csv be the same output?
:(
Thanks in advance!
I am assuming job1.m is a Mathematica job, run from inside a Bash submission script. In that case, job1.m is read when the job starts, so if it is modified after submission but before the job starts, the modified version will run. If it is modified after the job has started, the original version will run.
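For concreteness, a minimal sketch of such a submission script (assuming the Mathematica command-line front end is installed as math; the directives are illustrative):

    #!/bin/bash
    #SBATCH --job-name=job1
    #SBATCH --ntasks=1

    # job1.m is only read here, when the job actually starts running,
    # so edits made while the job sits in the queue are picked up
    math -script job1.m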
If job1.m is the submission script itself (so you run sbatch job1.m), that script is copied into a spool directory specific to the job, so even if it is modified after the job is submitted, the original version will still run.
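If you want to verify which version was spooled, recent Slurm versions can write out the job's copy of the batch script (JOBID is a placeholder; check that your scontrol supports this subcommand):

    # Write the spooled copy of the job's batch script to a file
    scontrol write batch_script JOBID spooled_job1.sh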
In any case, for reproducibility and traceability, it is better to use a workflow manager such as Fireworks or Bosco.
Sometimes the cron jobs run and sometimes they get missed. I have attached all the settings and results. Can anyone check and advise?
It's completely normal behaviour. Some jobs are skipped because they fall outside the scheduled time frame for the specified cron job. In your case, the reindex process is scheduled every minute. If there is a lot to index (many changes to products, categories, etc.), one minute is not enough to complete it. Also, there is only one process per cron group, in your case 'index'. 'Use Separate Process' in the cron configuration means that the index process will run as a separate process from the other cron groups.
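As a rough way to confirm this (assuming a standard Magento 2 install with the bin/magento CLI), you can time a single run of the index group:

    # Run only the 'index' cron group once and time it; if this regularly
    # takes longer than a minute, scheduled runs overlap and some get skipped
    time bin/magento cron:run --group index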
Dear UNIX/PBS experts:
I am a user of a UNIX HPC system (CentOS Linux 7 (Core), Linux 3.10.0-693.5.2.el7.x86_64) and I do not have any root privileges.
Various jobs have been submitted to the HPC system and almost all resources are in use.
Jobs from other users may run for weeks while my submitted job would finish in less than a day.
My goal is to have my job run as soon as the first resources are freed, instead of waiting for all the other users' jobs to finish.
My submitted job has the ID 66005.pbs.
However, the last job running at this moment has the ID 55004.pbs.
When I check the status of job 55005 with qstat 55005,
I get: qstat: Unknown Job Id 55005.pbs
Thus my question is whether it is possible to change the ID of job 66005.pbs to 55005.pbs, and whether this would allow my job to run.
If yes, how can this be achieved?
If not, are there any other solutions/alternatives for making sure that my jobs run before those of other users in the queue?
Thank you very much for your help and any suggestion.
The good thing about the computer system is that it is not human. It would be unfair to run your job (which was clearly submitted after other users') before theirs, so no, it is not possible to change your job ID. In any case, the job ID is just an identifier: the order in which jobs run is determined by the scheduler's policy (queue priority, fair-share, and so on), not by the ID.
You can work with your admin to move the job to a higher-priority queue instead.
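For illustration only (the 'express' queue name is hypothetical; whether such a queue exists, and who may use qmove, depends on the site's configuration):

    # Move job 66005 into a higher-priority queue (typically an admin operation)
    qmove express 66005.pbs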
We have been using agenda for some time in our node.js server and are confused about how agenda's job-locking mechanism works.
Sometimes we see that the 'lockedAt' field for a job in the database has a non-null value and it changes to null suddenly.
Sometimes, the 'lockedAt' value stays as null and jobs get stuck.
Sometimes, the 'lockedAt' value gets updated after 10 minutes, but the job is still stuck.
And if I restart the node.js server, all the locked jobs get unlocked and get executed properly.
Why is this locking mechanism needed? What does it do?
When you run a job, the DB entry automatically gets a lockedAt timestamp; this ensures that no other node instance will run the same job. It means "hey, I am working on this job, don't touch it".
When another node instance reads the job, it checks the current time against the lock timestamp; if less than 10 minutes have passed, it will not run the job.
Those "10 minutes" are the default lock lifetime, stored in agenda's global config or in per-job config options, both set in code. You can change it by setting defaultLockLifetime (a number of milliseconds) to a different value.
When we submit a job via sbatch, job IDs are assigned in incremental order. Based on my observation, this numbering at some point starts again from 1.
sbatch -N1 run.sh
Submitted batch job 20
// The goal is to change the submitted batch job's ID, if possible.
[Q1] For example, suppose there is a running job under Slurm. When we reboot the node, does the job continue running? And does its job ID get updated or stay as it was before?
[Q2] Is it possible to give the submitted job a unique ID that the cluster owner wants, or to change its ID afterwards?
Thank you for your valuable time and help.
If the node fails, the job is requeued, provided this is permitted by the JobRequeue parameter in slurm.conf. It will get the same job ID as the previously started run, since this is the only identifier in the database for managing jobs. (Users can override requeueing with the --no-requeue sbatch parameter.)
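A sketch of both sides of this (the JobRequeue parameter and the --no-requeue flag are as documented for Slurm; the script name and job ID are placeholders):

    # In slurm.conf (cluster-wide, set by the admin):
    #   JobRequeue=1    # allow jobs to be requeued after a node failure

    # Per-job opt-out at submission time:
    sbatch --no-requeue run.sh

    # A requeued job keeps its original ID; you can watch it with:
    squeue -j 20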
No, it's not possible to change job IDs.