SLURM: How to view completed jobs full name? - slurm

sacct -n returns all job names trimmed, for example "QmefdYEri+".
[Q] How could I view the complete name of the job, instead of its trimmed version?
--
$ sacct -n
1194 run.sh debug root 1 COMPLETED 0:0
1194.batch batch root 1 COMPLETED 0:0
1195 run_alper+ debug root 1 COMPLETED 0:0
1195.batch batch root 1 COMPLETED 0:0
1196 QmefdYEri+ debug root 1 COMPLETED 0:0
1196.batch batch root 1 COMPLETED 0:0

I use the scontrol command when I am interested in one particular job ID, as shown below.
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
UserId=rstober(1001) GroupId=rstober(1001)
Priority=4294901717 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
StartTime=2013-01-26T12:55:02 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=atom-head1:3526
ReqNodeList=(null) ExcNodeList=(null)
NodeList=atom01
BatchHost=atom01
NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rstober/slurm/local/slurm-job.sh
WorkDir=/home/rstober/slurm/local
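If all you need from that output is the job name, a quick shell sketch (the field appears as Name= or JobName= depending on the Slurm version):
$ scontrol show job 106 | tr ' ' '\n' | grep 'Name='
Name=slurm-job.sh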
If you want to use sacct, you can modify the number of characters that are displayed for any given field as explained in the documentation:
-o, --format
    Comma separated list of fields (use "--helpformat" for a list of available fields).
    NOTE: When using the format option for listing various fields you can put a %NUMBER
    afterwards to specify how many characters should be printed. E.g. format=name%30
    will print 30 characters of field name, right justified. A %-30 will print 30
    characters left justified.
Therefore, you can do something like this:
sacct --format="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
if you want the JobName column to be 30 characters wide.
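If you are collecting the names in a script, sacct's parsable output avoids truncation altogether (a small sketch; -X restricts the output to the job allocation lines and --parsable2 prints the full-length fields separated by |):
$ sacct -n -X --parsable2 --format=JobID,JobName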

Related

OpenCV Training Custom Haar Cascade

I am trying to train a Haar cascade on my face. I have everything set up, including the positives, the negatives, the vec file, etc., but when I run opencv_traincascade it gives me a terminate called after throwing an instance of 'std::bad_alloc' error. Then I added -nonsym -mem 512 to my arguments and it gave me this error: terminate called after throwing an instance of 'std::logic_error'.
Here is the command I am running:
opencv_traincascade -data classifier -vec samples.vec -bg negatives.txt\
> -numStages 20 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -numPos 1000\
> -numNeg 600 -w 80 -h 40 -mode ALL -precalcValBufSize 1024\
> -precalcIdxBufSize 1024\
> -nonsym\
> -mem 512\
Any help would be greatly appreciated!
You have to get rid of -nonsym -mem 512 and use -mode ALL instead. So now the command looks like this:
opencv_traincascade -data classifier -vec samples.vec -bg negatives.txt\
> -numStages 20 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -numPos 1000\
> -numNeg 600 -w 80 -h 40 -mode ALL -precalcValBufSize 1024\
> -precalcIdxBufSize 1024\
> -mode ALL
The -nonsym and -mem 512 options don't actually exist.

Cancel jobs running on the same partition on SLURM

With the command
$> squeue -u mnyber004
I can see all the jobs submitted under my cluster account (Slurm):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16884 ada CPUeq6 mnyber00 R 1-01:26:17 1 srvcnthpc105
16882 ada CPUeq4 mnyber00 R 1-01:26:20 1 srvcnthpc104
16878 ada CPUeq2 mnyber00 R 1-01:26:31 1 srvcnthpc104
20126 ada CPUeq1 mnyber00 R 22:32:28 1 srvcnthpc103
22004 curie WRI_0015 mnyber00 R 16:11 1 srvcnthpc603
22002 curie WRI_0014 mnyber00 R 16:13 1 srvcnthpc603
22000 curie WRI_0013 mnyber00 R 16:14 1 srvcnthpc603
How can I cancel all the jobs running on the partition ada?
In your case, scancel offers the appropriate filters, so you can simply run
scancel -u mnyber004 -p ada
Had that not been the case, a frequent idiom is to use the more powerful filtering options of squeue together with --format to build the proper command and then feed it to sh:
squeue -u mnyber004 -p ada --format "scancel %i" | sh
You can play it safer by first saving to a file, reviewing it, and then sourcing the file:
squeue -u mnyber004 -p ada --format "scancel %i" > /tmp/remove.sh
source /tmp/remove.sh
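The scancel filters can also be combined with a job state, which helps if, say, you only want to kill the jobs already running on ada and keep the pending ones queued (same idea as above; check your scancel man page for the filters it supports):
scancel -u mnyber004 -p ada --state=RUNNING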

Specify number of CPUs for a job on SLURM

I would like to run multiple jobs on a single node of my cluster. However, when I submit a job, it takes all available CPUs, so the remaining jobs are queued. As an example, I made a script that requests few resources and submits two jobs that are supposed to run at the same time.
#! /bin/bash
variable=$(seq 0 1 1)
for l in ${variable}
do
run_thread="./run_thread.sh"
cat << EOF > ${run_thread}
#! /bin/bash
#SBATCH -p normal
#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 1
#SBATCH --threads-per-core 1
#SBATCH --mem=10G
sleep 120
EOF
sbatch ${run_thread}
done
However, one job is running and the other is pending:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
57 normal run_thre user PD 0:00 1 (Resources)
56 normal run_thre user R 0:02 1 node00
The cluster has only one node, with 4 sockets of 12 cores each and 2 threads per core. The output of the command scontrol show job <jobid> is the following:
JobId=56 JobName=run_thread.sh
UserId=user(1002) GroupId=user(1002) MCS_label=N/A
Priority=4294901755 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-03-24T15:34:46 EligibleTime=2018-03-24T15:34:46
StartTime=2018-03-24T15:34:46 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=node00:13047
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node00
BatchHost=node00
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=48,mem=10G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./run_thread.sh
WorkDir=/home/user
StdErr=/home/user/slurm-56.out
StdIn=/dev/null
StdOut=/home/user/slurm-56.out
Power=
And the output of scontrol show partition is:
PartitionName=normal
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node00
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
There is something I don't get with the SLURM system. How can I use only 1 CPU per job and run 48 jobs on the node at the same time?
Slurm is probably configured with
SelectType=select/linear
which means that Slurm allocates full nodes to jobs and does not allow node sharing among jobs.
You can check with
scontrol show config | grep SelectType
Set a value of select/cons_res to allow node sharing.
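For reference, a minimal slurm.conf sketch of that change (this assumes you administer the cluster; CR_Core_Memory is just one common choice of SelectTypeParameters, and the Slurm daemons need to be reconfigured or restarted afterwards):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory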

slurm completes job without executing

I'm fairly new to Slurm. I couldn't find my problem in any forum, so I guess either it's very simple or very unusual (or I don't know how to search).
The script I'm submitting is
#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of cores
#SBATCH -o ./slurm.%N.%j.out # STDOUT
#SBATCH -e ./slurm.%N.%j.err # STDERR
#SBATCH -t 300
#SBATCH --mem=5000
./kzsqrt 10.0
When I use
$ squeue -u rmelo
the queue is empty. If I try
$ scontrol show jobid -dd 157
the result is
JobId=157 Name=script_10.0.sh
UserId=rmelo(508) GroupId=rmelo(509)
Priority=4294901747 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:1
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=05:00:00 TimeMin=N/A
SubmitTime=2017-05-07T16:00:45 EligibleTime=2017-05-07T16:00:45
StartTime=2017-05-07T16:00:45 EndTime=2017-05-07T16:00:45
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=headnode:20528
ReqNodeList=(null) ExcNodeList=(null)
NodeList=service1
BatchHost=service1
NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=service1 CPU_IDs=0-11 Mem=0
MinCPUsNode=1 MinMemoryNode=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rmelo/modeloantigo/script_10.0.sh
WorkDir=/home/rmelo/modeloantigo
So my job is finishing instantly, without doing anything. It doesn't even create the output file specified with #SBATCH -o. I've tried simple commands like echo or sleep instead of the program I intend to run, with the same result.
Any help or pointers to resources to learn from are appreciated.

how to extract numbers in the same location from many log files

I have a file test1.log:
04/15/2016 02:22:46 PM - kneaddata.knead_data - INFO: Running kneaddata v0.5.1
04/15/2016 02:22:46 PM - kneaddata.utilities - INFO: Decompressing gzipped file ...
Input Reads: 69766650 Surviving: 55798391 (79.98%) Dropped: 13968259 (20.02%)
TrimmomaticSE: Completed successfully
04/15/2016 02:32:04 PM - kneaddata.utilities - DEBUG: Checking output file from Trimmomatic : /home/liaoming/kneaddata_v0.5.1/WGC066610D/WGC066610D_kneaddata.trimmed.fastq
04/15/2016 05:32:31 PM - kneaddata.utilities - DEBUG: 55798391 reads; of these:
55798391 (100.00%) were unpaired; of these:
55775635 (99.96%) aligned 0 times
17313 (0.03%) aligned exactly 1 time
5443 (0.01%) aligned >1 times
0.04% overall alignment rate
and the other files are in the same format but with different contents, like test2.log, test3.log, up to test60.log.
I would like to extract two numbers from these files. For example, for test1.log the two numbers would be 55798391 and 55775635.
So the final generated file counts.txt would be something like this:
test1 55798391 55775635
test2 51000000 40000000
.....
test60 5000000 30000000
awk to the rescue!
$ awk 'FNR==9{f=$1} FNR==10{print FILENAME,f,$1}' test{1..60}.log
if the files are not in the same directory, either call awk within a loop or create the file list and pipe it to xargs awk:
$ for i in {1..60}; do awk ... test$i/test$i.log; done
$ for i in {1..60}; do echo test$i/test$i.log; done | xargs awk ...
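Also, the hard-coded FNR values assume the two counts always sit on lines 9 and 10 of each log; if the layout varies between files, matching on the text of the lines is a bit more robust (a sketch based on the excerpt above; it also strips the .log suffix so the output matches the desired counts.txt format):
$ awk '/were unpaired/{u=$1} /aligned 0 times/{n=FILENAME; sub(/\.log$/,"",n); print n, u, $1}' test{1..60}.log > counts.txt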
