Make sacct not truncate the SLURM_ARRAY_TASK_ID - slurm

I need to find the task ID of a job in my job array that got a timeout, so I use sacct as follows:
sacct -u <UserID> -j <jobID> -s TIMEOUT
and I get this as output:
User      JobID        Start
--------- ------------ -----
<UserID>  <JobID>_90+  ....
My task ID is a four-digit number that sacct has truncated and displayed as 90+ instead. How can I get the full task ID?

The sacct command has a --format parameter that allows customising the columns shown, along with their size.
The following will show the same three columns as your example, with a 30-character wide column for jobid:
sacct -u <UserID> -j <jobID> -s TIMEOUT --format user,jobid%-30,start

Related

How to get count of failed and completed jobs in an array job of SLURM

I am running multiple array jobs using Slurm. For a given array job id, let's say 885881, I want to list the number of failed and completed jobs, something like this:
Input:
<some-command> -j 885881
Output: Let's say we have 200 jobs in the array.
count | status
120 | failed
80 | completed
Secondly, it would be great if I could get the unique list of reasons why tasks failed.
Input:
`<some-command> -j 885881`
Output:
count | reason
80 | OUT_OF_MEMORY
40 | TIMED_OUT
I believe the sacct command can somehow be used to get these results, but I am not sure how.
With a one-liner like this one, you can get both pieces of information at the same time:
$ sacct -n -X -j 885881 -o state%20 | sort | uniq -c
16 COMPLETED
99 FAILED
32 OUT_OF_MEMORY
1 PENDING
The sacct command digs into the accounting information. The -n -X parameters are used to simplify the output and reduce the number of unnecessary lines, and the -o parameter requests only the STATE column to be displayed. Then the output is fed into the sort and uniq commands which do the counting.
If you really need two separate commands, you can adapt the above one-liner easily. And you can make it a script or a Bash function for ease of use.
If you would like a more elaborate solution, you can have a look at smanage and atools.
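As a concrete sketch of the script idea above, the one-liner can be wrapped in Bash functions; the names job_state_counts and job_fail_reasons are invented here for illustration:

```shell
#!/bin/bash
# job_state_counts: hypothetical wrapper around the sacct one-liner above.
# Prints one line per job state with its count, most frequent first.
job_state_counts() {
    sacct -n -X -j "$1" -o state%20 | sort | uniq -c | sort -rn
}

# job_fail_reasons: same idea for the second question; counting the
# terminal states other than COMPLETED is the closest sacct gets to a
# per-task "failure reason".
job_fail_reasons() {
    sacct -n -X -j "$1" -o state%20 | grep -v COMPLETED | sort | uniq -c
}
```

Call them as, for example, `job_state_counts 885881`.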

SLURM - sort squeue by job-names

Is there a way to get the output of squeue ordered by the job-name? I know I can sort, e.g. by job-id via
squeue --sort=+i
but I don't see how to sort by the job-name (given by #SBATCH --job-name XXX).
As %j is the type specification for the job name in --format strings, sorting by job name would be
squeue --sort=+j
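Note that --sort accepts a comma-separated list of keys, so specifiers can be combined; for example, sorting by job name with ties broken by job id. The wrapper function name sorted_queue below is made up for illustration:

```shell
# sorted_queue: hypothetical helper; +j sorts by job name ascending,
# then +i breaks ties by ascending job id.
sorted_queue() {
    squeue --sort=+j,+i "$@"
}
```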

How can I know the node name for my running job on slurm

Is there any command that will return the node name and the details for my running job on a cluster that uses the Slurm scheduler?
squeue should provide this information directly via the -o option. To get the allocated nodes, the corresponding format specifier is %N, thus for example to retrieve this information for a job with id 1000:
squeue -h -o "%N" -j1000
Or to request this for all jobs of a particular user:
squeue -h -o "%A %N" -u user_name
Here, %A returns the corresponding job id. The -h option removes the header in the output...

How to find from where a job is submitted in SLURM?

I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like this:
[myUserName#rclogin06 ~]$ sacct -u myUserName
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12577766 run.sh general ourQueue_+ 4 RUNNING 0:0
12659777 run.sh general ourQueue_+ 8 RUNNING 0:0
12675983 run.sh general ourQueue_+ 16 RUNNING 0:0
How can I know from which directory a job is submitted so that I can differentiate the jobs?
You can use the scontrol command to see the job details:
$ scontrol show job <jobid>
For example, for a running job on our SLURM cluster:
$ scontrol show job 1665191
JobId=1665191 Name=tasktest
...
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/work/.../slurm_test/task.submit
WorkDir=/lustre/work/.../slurm_test
You are looking for the last line, WorkDir.
Recent versions of Slurm also offer that information through squeue:
squeue --format "%Z"
which displays, according to the man page,
%Z The job’s working directory.
In order to list the work directory of past jobs that are no longer accessible via squeue or scontrol, you can use sacct:
sacct -S 2020-08-10 -u myUserName --format "jobid,jobname%20,workdir%70"
This lists the job id, job name, and work directory of all jobs of user myUserName since August 10th, 2020.
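Putting the two squeue specifiers together, a small helper (the name where_submitted is invented here) can list each of your running jobs next to its submission directory:

```shell
# where_submitted: hypothetical helper; %i is the job id and %Z the
# working directory, so identically named jobs can be told apart.
# Takes a user name as argument, defaulting to the current user.
where_submitted() {
    squeue -h -u "${1:-$USER}" -o "%i %Z"
}
```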

How to write job file for cluster computer?

I am very new to high-performance computers. On my laptop, I can run a program by typing a command like "./prog". But on the HPC system, I am required to write a job file before running any job. I was given a simple job file to get started with, but it really confuses me:
#!/bin/bash
# BSUB -q
#BSUB -o outfile -R "mem>10"
myjob arg1 arg2
#BSUB -J myjob
Does anyone have experience dealing with this type of HPC? Thanks a lot.
Your HPC system uses the Lava queueing system, an open-source scheduler derived from Platform LSF, if I remember correctly.
I hope the following can answer your question:
Frequently used Lava commands at a glance
bhosts -w
Show the status of each node, one line per node
bhosts -l
Show the detailed status of each node
bqueues [-l -w -r]
Show the status of each queue
bparams [-l -h -V]
Show configurable system parameters
lsid
Display the current Lava version number
lsinfo
Display load-sharing information
lshosts
Display hosts and their static resource information
lsload
Display load information for hosts
bjobs -u all
Display job information for all users
bjobs <job-id>
Display job information for a given job ID
bjobs -a
Display all jobs
bjobs -r/-p/-s
Show jobs that are running/pending/suspended
bjobs -l
Show more details
bhist
Show job history
Submitting and controlling jobs
bsub my_job
Submit a job
bsub < myscript
Submit a job, reading the #BSUB directives from myscript
bsub -n 4 myjob
Submit a job requesting 4 slots
bkill 1234
Kill job 1234
bstop 1234
Suspend job 1234
bresume 1234
Resume suspended job 1234
More information can be found at:
http://ccls.lab.sfsu.edu/bin/view/Cluster/LavaSchedulerInformation
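For completeness, a cleaned-up version of the job file from the question might look like the sketch below. The queue name normal and the memory requirement are placeholders to adapt to your site; the key points are that the directives go before the first command and that #BSUB has no space between the # and BSUB:

```shell
#!/bin/bash
# All #BSUB directives must precede the first command.
#BSUB -J myjob          # job name
#BSUB -q normal         # queue to submit to (site-specific placeholder)
#BSUB -o outfile        # file to receive standard output
#BSUB -R "mem>10"       # resource requirement (placeholder value)
myjob arg1 arg2
```

Submit it with `bsub < myscript` so that the directives inside the script are read.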
