PBS/Torque - Couldn't delete completed job status information

The command 'qstat -a' outputs many lines of information for completed jobs, all with status 'C'. It seems they will stay there forever. How can I clean up this unneeded job information, since those jobs are already 'completed'? Thanks!

This is controlled by the qmgr parameter keep_completed, which specifies the number of seconds after completion that a job continues to be visible. If you would like to delete a job immediately, without waiting this amount of time, you can execute
qdel -p <jobid>
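If you want to change how long completed jobs stay visible instead, keep_completed can be set through qmgr. A minimal sketch, assuming a TORQUE server where keep_completed is a server attribute and taking 60 seconds as an arbitrary example value:

# show the current server settings, including keep_completed if it is set
qmgr -c "print server"
# keep completed jobs visible in qstat for 60 seconds after they finish
qmgr -c "set server keep_completed = 60"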

Type qstat -r to get only the running jobs

Related

Why is the status of PBS job with dependency 'Hold', not 'Queue'?

I used following command to submit my dependent job.
qsub current_job_file -W depend=afterany:previous_job_id
Then I found out that my current job is in status 'H', and it won't automatically run after the previous job finishes. Is this how it is supposed to be, or did I make a mistake somewhere? How can I make it run automatically after the previous job finishes?
I also tried the following command. The result is the same.
qsub -W depend=afterany:previous_job_id current_job_file
That is how it is supposed to be. If your current job is dependent on another job and the dependency is "after", then it will be held until the other job finishes (or starts, depending on what kind of dependency you have used; in your case it is "any", so it will wait for the other job to finish). PBS then moves your current job to the "Q" (queued) state so the scheduler can consider it for running.
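As a quick illustration (just a sketch; previous_job_file and current_job_file are the placeholders from the question), you can capture the first job's id and watch the dependent job sit in 'H':

# submit the first job and capture the job id that qsub prints on stdout
PREV_ID=$(qsub previous_job_file)
# submit the dependent job; qstat will show it in state 'H' for now
qsub -W depend=afterany:$PREV_ID current_job_file
# after the first job ends, the dependent job moves to 'Q' and is scheduled normally
qstat -a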

Automatic qsub job completion status notification

I have a shell script that calls five other scripts. The first script creates 50 qsub jobs on the cluster. Individual job execution times vary from a couple of minutes to an hour. I need to know when all 50 jobs have finished, because after they complete I need to run the second script. How can I find out whether all the qsub jobs are complete? One possible solution is an infinite loop that checks the job status with qstat and the job IDs, but continuously polling the status is not a great solution. Is it possible for a qsub job to notify me by itself after it finishes, so that I don't have to monitor the job status so frequently?
qsub is capable of handling job dependencies, using -W depend=afterok:jobid.
e.g.
#!/bin/bash
# commands to run on the cluster
COMMANDS="script1.sh script2.sh script3.sh"
# initialize the JOBIDS variable
JOBIDS=""
# queue all commands
for CMD in $COMMANDS; do
    # queue the command and append its job id (qsub prints the id on stdout)
    JOBIDS="$JOBIDS:$(qsub $CMD)"
done
# queue post-processing, dependent on all submitted jobs;
# JOBIDS already starts with ':', so it is appended directly to 'afterok'
qsub -W depend=afterok$JOBIDS postprocessing.sh
exit 0
More examples can be found at http://beige.ucs.indiana.edu/I590/node45.html
I have never heard of a way to do that, and I would be really interested if someone comes up with a good answer.
In the meantime, I suggest that you use file tricks: either your script outputs a file at the end, or you check for the existence of the log files (assuming they are created only at the end).
# wait until the first job's log file appears
while [ ! -e ~/logs/myscript.log-1 ]; do
    sleep 30
done
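If you go the file-trick route with all 50 jobs, the same idea extends to counting log files. A rough sketch, assuming each job writes its own ~/logs/myscript.log-<N> file only when it finishes and that second_script.sh is the follow-up step (both names are illustrative):

# wait until all 50 per-job log files exist, then run the follow-up script
while [ "$(ls ~/logs/myscript.log-* 2>/dev/null | wc -l)" -lt 50 ]; do
    sleep 30
done
./second_script.sh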

Does a Cron job overwrite itself

I can't seem to get my script to run in parallel every minute via cron on Ubuntu 14.
I have created a cron job that executes every minute. The cron job runs a script that takes much longer than a minute to finish. When the minute expires, it seems the new cron execution overwrites the previous one. Is this correct? Any ideas welcome.
I need concurrent, independent running jobs. The cron job runs a script that queries a MySQL database. The idea is to poll the database and, if there is something to process, execute the script in its own process.
cron will not stop a previous execution of a process to start a new one. cron will simply kick off the new process even though the old process is still running.
If you need cron to terminate the previous process, you'll need to modify your script to handle that itself.
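If terminating the previous run is really what you want, one rough sketch is a pidfile check at the top of the script (the /tmp path and file name are only illustrative):

# kill the previous instance, if any, before starting this one
PIDFILE=/tmp/myscript.pid    # illustrative location
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    kill "$(cat "$PIDFILE")"
fi
echo $$ > "$PIDFILE"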
You need a locking mechanism to identify that the script is already running.
There are several ways of doing this but you need to be careful to use an atomic method.
I use lock directories, as creating a directory is guaranteed to be atomic:
# a lock directory works because mkdir either succeeds or fails atomically
LOCKDIR=/tmp/myproc.lock
if ! mkdir "$LOCKDIR" >/dev/null 2>&1
then
    echo "Processing already running - terminating" >&2
    exit 1
fi
# remove the lock when the script exits, however it exits
trap 'rm -rf "$LOCKDIR"' EXIT
This is a common occurrence. Try adding a check in your script to see if a lockfile already exists. If it does, exit. If not, continue.
Cron jobs do not overrun one another; they can, however, overlap. Unless your script explicitly kills any pre-existing process, it will not stop the previously running instance.
However, introducing lockfiles will save you from all of this confusion.
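A common way to get a lockfile without writing the check yourself is flock from util-linux, wrapped around the command in the crontab entry. A sketch, assuming flock is installed; the lock path and script path are placeholders:

# -n makes flock give up immediately if the previous run still holds the lock
* * * * * flock -n /tmp/myscript.lock /path/to/myscript.sh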

How to retrieve job information from the LSF archive

We execute our job application through the bsub command on Linux.
When the job completes, what is the command to retrieve the job information from the LSF archive? I know there is a command like bacct jobNo, but it does not retrieve the information.
Please help.
bacct retrieves summary information about sets of finished jobs for the purposes of accounting -- it gives you info like average turnaround time, resource usage etc.
I think what you might be looking for is bhist -l <jobid>, which will give you the historical information about that job's submission and execution (similar to bjobs -l but more detailed and works for jobs that finished long ago).
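For example (the job id is a placeholder; -n 0 should make bhist search all archived event log files rather than only the current one):

# detailed history for a single finished job
bhist -l 12345
# same, but searching every archived lsb.events file, for jobs that finished long ago
bhist -l -n 0 12345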

slurm job scheduler sacct show only pending and running jobs no prolog

I am quite new to Slurm. I am looking for a way to display ONLY currently running and pending jobs, with no prolog.
> sacct -s PD,R
JobID         JobName    Partition   Account      AllocCPUS  State      ExitCode
------------  ---------  ----------  -----------  ---------  ---------  --------
5049168       SRR600493  general     cluster_u+   1          RUNNING    0:0
5049168.0     prolog                 cluster_u+   1          COMPLETED  0:0
Why is it printing the prolog, and what is the prolog?
You should use squeue for that rather than sacct. squeue will list running and pending jobs, and it can display more information (requested resources, etc.) than sacct. And squeue will not show job steps (like 'prolog' here).
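For instance (a sketch; -u limits the output to your own jobs and -t filters by state):

# only your pending and running jobs, with no job steps
squeue -u $USER -t PENDING,RUNNING
# if you do want sacct, -X (--allocations) hides steps such as 'prolog'
sacct -X -s PD,R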
When you submit a job with Slurm, two things happen: first, Slurm allocates resources, and then, when you launch something on those resources, it creates a step.
So the two lines you are showing belong to the same job. The first line is the allocation and the second is the first step. Someone launched a step with a binary named prolog; that step has finished, but the allocation of the resources has not been released. The user probably ran salloc first and then srun.
If you think that nobody launched a binary named prolog, it may be that you have configured a prolog in Slurm to run at the first step of each job.
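Schematically, the two sacct lines correspond to something like this (an illustration only; ./prolog stands for whatever binary the user actually ran):

salloc -n 1       # creates the job allocation  -> the '5049168' line
srun ./prolog     # launches step 0             -> the '5049168.0  prolog' line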
