slurm completes job without executing

I'm fairly new to slurm. I couldn't find my problem in any forum, so I guess either it's very simple or very unusual (or I don't know how to search).
The script I'm submitting is
#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of cores
#SBATCH -o ./slurm.%N.%j.out # STDOUT
#SBATCH -e ./slurm.%N.%j.err # STDERR
#SBATCH -t 300
#SBATCH --mem=5000
./kzsqrt 10.0
When I use
$ squeue -u rmelo
the queue is empty. If I try
$ scontrol show jobid -dd 157
the result is
JobId=157 Name=script_10.0.sh
UserId=rmelo(508) GroupId=rmelo(509)
Priority=4294901747 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:1
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=05:00:00 TimeMin=N/A
SubmitTime=2017-05-07T16:00:45 EligibleTime=2017-05-07T16:00:45
StartTime=2017-05-07T16:00:45 EndTime=2017-05-07T16:00:45
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=headnode:20528
ReqNodeList=(null) ExcNodeList=(null)
NodeList=service1
BatchHost=service1
NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=service1 CPU_IDs=0-11 Mem=0
MinCPUsNode=1 MinMemoryNode=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rmelo/modeloantigo/script_10.0.sh
WorkDir=/home/rmelo/modeloantigo
So my job is finishing instantly, without doing anything. It doesn't even create the output file specified with #SBATCH -o. I've tried simple commands instead of the program I intend to run, like echo or sleep, with the same result.
Any help or source to learn is appreciated.

Related

Cancel jobs running on the same partition on SLURM

with the commands
$>squeue -u mnyber004
I can visualize all the submitted jobs on my cluster account (slurm)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16884 ada CPUeq6 mnyber00 R 1-01:26:17 1 srvcnthpc105
16882 ada CPUeq4 mnyber00 R 1-01:26:20 1 srvcnthpc104
16878 ada CPUeq2 mnyber00 R 1-01:26:31 1 srvcnthpc104
20126 ada CPUeq1 mnyber00 R 22:32:28 1 srvcnthpc103
22004 curie WRI_0015 mnyber00 R 16:11 1 srvcnthpc603
22002 curie WRI_0014 mnyber00 R 16:13 1 srvcnthpc603
22000 curie WRI_0013 mnyber00 R 16:14 1 srvcnthpc603
How to cancel all the jobs running on the partition ada?
In your case, scancel offers the appropriate filters, so you can simply run
scancel -u mnyber004 -p ada
Were that not the case, a frequent idiom is to use the more powerful filtering options of squeue together with its --format option to build the proper commands and then feed them to sh:
squeue -u mnyber004 -p ada --format "scancel %i" | sh
You can play it safer by first saving the commands to a file and then sourcing that file:
squeue -u mnyber004 -p ada --format "scancel %i" > /tmp/remove.sh
source /tmp/remove.sh
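As an optional sanity check, you can inspect the generated file before executing it, for example:
cat /tmp/remove.sh
and only source it once you are happy that it lists the intended jobs.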

Specify number of CPUs for a job on SLURM

I would like to run multiple jobs on a single node of my cluster. However, when I submit a job, it takes all the available CPUs, so the remaining jobs are queued. As an example, I made a script that requests few resources and submits two jobs that are supposed to run at the same time.
#! /bin/bash
variable=$(seq 0 1 1)
for l in ${variable}
do
run_thread="./run_thread.sh"
cat << EOF > ${run_thread}
#! /bin/bash
#SBATCH -p normal
#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 1
#SBATCH --threads-per-core 1
#SBATCH --mem=10G
sleep 120
EOF
sbatch ${run_thread}
done
However, one job is running and the other one is pending:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
57 normal run_thre user PD 0:00 1 (Resources)
56 normal run_thre user R 0:02 1 node00
The cluster only has one node, with 4 sockets with 12 cores and 2 threads each. The output of the command scontrol show jobid #job is the following:
JobId=56 JobName=run_thread.sh
UserId=user(1002) GroupId=user(1002) MCS_label=N/A
Priority=4294901755 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-03-24T15:34:46 EligibleTime=2018-03-24T15:34:46
StartTime=2018-03-24T15:34:46 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=node00:13047
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node00
BatchHost=node00
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=48,mem=10G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./run_thread.sh
WorkDir=/home/user
StdErr=/home/user/slurm-56.out
StdIn=/dev/null
StdOut=/home/user/slurm-56.out
Power=
And the output of scontrol show partition is:
PartitionName=normal
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node00
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
There is something I don't get with the SLURM system. How can I use only 1 CPU per job and run 48 jobs on the node at the same time?
Slurm is probably configured with
SelectType=select/linear
which means that slurm allocates full nodes to jobs and does not allow node sharing among jobs.
You can check with
scontrol show config | grep SelectType
Set it to select/cons_res to allow node sharing.
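For reference, a minimal slurm.conf sketch of that change (an illustration only, assuming you also want memory tracked as a consumable resource; a restart of the Slurm daemons is typically required after changing SelectType):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory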

SLURM: How to view completed jobs full name?

sacct -n returns every job's name trimmed, for example QmefdYEri+.
[Q] How could I view the complete name of the job, instead of its trimmed version?
--
$ sacct -n
1194 run.sh debug root 1 COMPLETED 0:0
1194.batch batch root 1 COMPLETED 0:0
1195 run_alper+ debug root 1 COMPLETED 0:0
1195.batch batch root 1 COMPLETED 0:0
1196 QmefdYEri+ debug root 1 COMPLETED 0:0
1196.batch batch root 1 COMPLETED 0:0
I use the scontrol command when I am interested in one particular jobid as shown below (output of the command taken from here).
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
UserId=rstober(1001) GroupId=rstober(1001)
Priority=4294901717 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
StartTime=2013-01-26T12:55:02 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=atom-head1:3526
ReqNodeList=(null) ExcNodeList=(null)
NodeList=atom01
BatchHost=atom01
NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rstober/slurm/local/slurm-job.sh
WorkDir=/home/rstober/slurm/local
If you want to use sacct, you can modify the number of characters that are displayed for any given field as explained in the documentation:
-o, --format: Comma-separated list of fields (use "--helpformat" for a list of available fields). NOTE: When using the format option for listing various fields, you can put a %NUMBER afterwards to specify how many characters should be printed. E.g. format=name%30 will print 30 characters of the field name, right justified. A %-30 will print 30 characters left justified.
Therefore, you can do something like this:
sacct --format="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
if you want the JobName field to be 30 characters wide.
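If you use that format regularly, sacct also honours the SACCT_FORMAT environment variable, so you can set it once per session (a small convenience, assuming a bash-like shell):
export SACCT_FORMAT="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
sacct -n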

Segmentation fault while using bash script to generate mobility file

I am using a bash script to generate mobility files (setdest) in ns2 for various seeds, but I am running into this troublesome segmentation fault. Any help would be appreciated. The setdest.cc has been modified, so it's not the standard ns2 file.
I will walk you through the problem.
This code in a shell script returns the segmentation fault.
#! /bin/sh
setdest="/root/ns-allinone-2.1b9a/ns-2.1b9a/indep-utils/cmu-scen-gen/setdest/setdest_mesh_seed_mod"
let nn="70"    # Number of nodes in the simulation
let time="900" # Simulation time
let x="1000"   # Horizontal dimensions
let y="1000"   # Vertical dimensions
for speed in 5
do
    for pause in 10
    do
        for seed in 1 5
        do
            echo -e "\n"
            echo Seed = $seed Speed = $speed Pause Time = $pause
            chmod 700 $setdest
            $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 > scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
        done
    done
done
error is
scengen_mesh: line 21: 14144 Segmentation fault $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 >scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
line 21 is the last line of the shell script (done)
The strange thing is that if I run the same setdest command in the terminal, there is no problem, like:
$setdest -n 70 -p 10 -s 5 -t 900 -x 1000 -y 1000 -l 1 -m 50
I have figured out exactly where the problem is: it's with the argument -l. If I remove that argument in the shell script, there is no problem. Now I will walk you through the modified setdest.cc where this argument comes from.
This modified setdest file uses a text file, initpos, to read the XY coordinates of static nodes for a wireless mesh topology. The relevant lines of code are:
FILE *fp_loc;
int locinit;
fp_loc = fopen("initpos", "r");
while ((ch = getopt(argc, argv, "r:m:l:n:p:s:t:x:y:i:o")) != EOF) {
    switch (ch) {
    case 'l':
        locinit = atoi(optarg);
        break;
    default:
        usage(argv);
        exit(1);
if (locinit)
    fscanf(fp_loc, "%lf %lf", &position.X, &position.Y);
if (position.X == -1 && position.Y == -1) {
    position.X = uniform() * MAXX;
    position.Y = uniform() * MAXY;
}
What I don't get is...
In the shell script:
- option -l, if supplied with 0, returns no error,
- but if supplied with any other value (I used 1 mostly), it returns this segmentation fault.
In the terminal:
- no segmentation fault with any value, 0 or 1.
It surely has something to do with the shell script. I am amazed at what is going wrong and where!
Your help will be highly appreciated.
Cheers

creating new queue using torque/PBS "access from host not allowed"

I have carried out the following commands.
qmgr -c "create queue fastq queue_type=execution"
qmgr -c "set queue fastq started=true"
qmgr -c "set queue fastq enabled=true"
qmgr -c "set queue fastq acl_hosts=compute-0-30"
qmgr -c "set queue fastq acl_host_enable=true"
qmgr -c "set queue fastq acl_users=username"
qmgr -c "set queue fastq acl_user_enable=true"
But when I have the following header for my PBS script,
#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -N job
#PBS -u username
#PBS -q fastq
#PBS -be
mpirun script
I get the following error:
host.edu > qsub runscript
qsub: Access from host not allowed, or unknown host MSG=host ACL rejected the submitting host: user username#email.com, queue fastq, host host.edu
If you are submitting the job from a computer that is not the Torque server, you will also need to set the submit_hosts server parameter.
See Appendix B in the documentation:
http://docs.adaptivecomputing.com/torque/help.htm#topics/12-appendices/serverParameters.htm#submist_hosts
Take special care to use the full hostname of the machine (the value returned by uname -n). You should always use the fully qualified hostname, as Torque can be quite picky.
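For example, a small sketch, assuming the submitting machine's fully qualified hostname is host.edu as in the error message above:
qmgr -c 'set server submit_hosts = host.edu'
Optionally, allow_node_submit can be set to true in the same way if you also want to allow submission from compute nodes.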
I would also recommend that you take a look at torque.setup in the root directory of the tarball; it will seed a basic configuration.
qmgr -c 'p s'
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = foo.edu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 19193
set server moab_array_compatible = True
