I'm fairly new to Slurm. I couldn't find my problem in any forum, so I guess either it's very simple or very unusual (or I don't know how to search).
The script I'm submitting is
#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of cores
#SBATCH -o ./slurm.%N.%j.out # STDOUT
#SBATCH -e ./slurm.%N.%j.err # STDERR
#SBATCH -t 300
#SBATCH --mem=5000
./kzsqrt 10.0
When I use
$ squeue -u rmelo
the queue is empty. If I try
$ scontrol show jobid -dd 157
the result is
JobId=157 Name=script_10.0.sh
UserId=rmelo(508) GroupId=rmelo(509)
Priority=4294901747 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:1
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=05:00:00 TimeMin=N/A
SubmitTime=2017-05-07T16:00:45 EligibleTime=2017-05-07T16:00:45
StartTime=2017-05-07T16:00:45 EndTime=2017-05-07T16:00:45
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=headnode:20528
ReqNodeList=(null) ExcNodeList=(null)
NodeList=service1
BatchHost=service1
NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=service1 CPU_IDs=0-11 Mem=0
MinCPUsNode=1 MinMemoryNode=5000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rmelo/modeloantigo/script_10.0.sh
WorkDir=/home/rmelo/modeloantigo
So my job is finishing instantly, without doing anything. It doesn't even create the output file specified with #SBATCH -o. I've tried simple commands instead of the program I intend to run, like echo or sleep, with the same result.
Any help or source to learn is appreciated.
Related
With the command
$ squeue -u mnyber004
I can see all the jobs submitted under my cluster account (Slurm):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16884 ada CPUeq6 mnyber00 R 1-01:26:17 1 srvcnthpc105
16882 ada CPUeq4 mnyber00 R 1-01:26:20 1 srvcnthpc104
16878 ada CPUeq2 mnyber00 R 1-01:26:31 1 srvcnthpc104
20126 ada CPUeq1 mnyber00 R 22:32:28 1 srvcnthpc103
22004 curie WRI_0015 mnyber00 R 16:11 1 srvcnthpc603
22002 curie WRI_0014 mnyber00 R 16:13 1 srvcnthpc603
22000 curie WRI_0013 mnyber00 R 16:14 1 srvcnthpc603
How to cancel all the jobs running on the partition ada?
In your case, scancel offers the appropriate filters, so you can simply run
scancel -u mnyber004 -p ada
If that were not the case, a frequent idiom is to use the more powerful filtering properties of squeue together with the --format option to build the proper command and then feed it to sh:
squeue -u mnyber004 -p ada --format "scancel %i" | sh
You can play it safer by first saving to a file, reviewing it, and then sourcing the file:
squeue -u mnyber004 -p ada --format "scancel %i" > /tmp/remove.sh
source /tmp/remove.sh
Note that %i (the job id) is the right field here; %j would print the job name, which scancel does not take as a positional argument.
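One caveat worth adding to the pipe-to-sh idiom: without squeue's -h (--noheader) option, the header line is emitted too and would be fed to sh as a command. A sketch, using printf to stand in for squeue output since no cluster is assumed here:

```shell
#!/bin/sh
# On a real cluster (mnyber004/ada are this example's values):
#   squeue -h -u mnyber004 -p ada --format "scancel %i" > /tmp/remove.sh
# -h suppresses the header line so it is not executed as a command.
# Below, printf simulates that squeue output so the idiom can be shown
# without a cluster:
printf 'scancel 16884\nscancel 16882\n' > /tmp/remove.sh
cat /tmp/remove.sh    # review the file first, then run: sh /tmp/remove.sh
```

Reviewing the generated file before running it guards against cancelling more jobs than intended.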
I would like to run multiple jobs on a single node of my cluster. However, when I submit a job it takes all available CPUs, so the remaining jobs are queued. As an example, I made a script that requests few resources and submits two jobs that are supposed to run at the same time.
#! /bin/bash
variable=$(seq 0 1 1)
for l in ${variable}
do
run_thread="./run_thread.sh"
cat << EOF > ${run_thread}
#! /bin/bash
#SBATCH -p normal
#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 1
#SBATCH --threads-per-core 1
#SBATCH --mem=10G
sleep 120
EOF
sbatch ${run_thread}
done
However, one job is running and the other is pending:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
57 normal run_thre user PD 0:00 1 (Resources)
56 normal run_thre user R 0:02 1 node00
The cluster only has one node, with 4 sockets of 12 cores and 2 threads each. The output of the command scontrol show jobid #job is the following:
JobId=56 JobName=run_thread.sh
UserId=user(1002) GroupId=user(1002) MCS_label=N/A
Priority=4294901755 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-03-24T15:34:46 EligibleTime=2018-03-24T15:34:46
StartTime=2018-03-24T15:34:46 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=node00:13047
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node00
BatchHost=node00
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=48,mem=10G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./run_thread.sh
WorkDir=/home/user
StdErr=/home/user/slurm-56.out
StdIn=/dev/null
StdOut=/home/user/slurm-56.out
Power=
And the output of scontrol show partition is:
PartitionName=normal
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node00
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
There is something I don't get with the SLURM system. How can I use only 1 CPU per job and run 48 jobs on the node at the same time?
Slurm is probably configured with
SelectType=select/linear
which means that slurm allocates full nodes to jobs and does not allow node sharing among jobs.
You can check with
scontrol show config | grep SelectType
Set SelectType to select/cons_res to allow node sharing.
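A minimal slurm.conf sketch of that change (the SelectTypeParameters value is an assumption; pick the one matching how you want cores and memory tracked). Changing SelectType typically requires restarting the Slurm daemons, not just scontrol reconfigure:

```
# slurm.conf fragment (illustrative)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # treat cores and memory as consumable
```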
sacct -n returns all job names trimmed, for example "QmefdYEri+".
[Q] How can I view the complete name of a job instead of its trimmed version?
--
$ sacct -n
1194 run.sh debug root 1 COMPLETED 0:0
1194.batch batch root 1 COMPLETED 0:0
1195 run_alper+ debug root 1 COMPLETED 0:0
1195.batch batch root 1 COMPLETED 0:0
1196 QmefdYEri+ debug root 1 COMPLETED 0:0
1196.batch batch root 1 COMPLETED 0:0
I use the scontrol command when I am interested in one particular jobid as shown below (output of the command taken from here).
$ scontrol show job 106
JobId=106 Name=slurm-job.sh
UserId=rstober(1001) GroupId=rstober(1001)
Priority=4294901717 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
StartTime=2013-01-26T12:55:02 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=atom-head1:3526
ReqNodeList=(null) ExcNodeList=(null)
NodeList=atom01
BatchHost=atom01
NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/rstober/slurm/local/slurm-job.sh
WorkDir=/home/rstober/slurm/local
If you want to use sacct, you can modify the number of characters that are displayed for any given field as explained in the documentation:
-o, --format Comma separated list of fields. (use "--helpformat" for a list of available fields). NOTE: When using the format option for
listing various fields you can put a %NUMBER afterwards to specify how
many characters should be printed.
e.g. format=name%30 will print 30 characters of field name right
justified. A %-30 will print 30 characters left justified.
Therefore, you can do something like this:
sacct --format="JobID,JobName%30,Partition,Account,AllocCPUS,State,ExitCode"
if you want the JobName column to be 30 characters wide.
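If truncation is the only concern, sacct's --parsable2 option prints pipe-delimited, untruncated fields, which is convenient for scripting. A sketch (the long job name below is a made-up stand-in, and printf simulates sacct output since no cluster is assumed):

```shell
#!/bin/sh
# On a real cluster:
#   sacct -n --parsable2 --format=JobID,JobName,Partition,State
# --parsable2 separates fields with "|" and never truncates them.
# Simulated sacct line; awk pulls out the full, untruncated JobName:
printf '1196|QmefdYEriFullExampleName|debug|COMPLETED\n' \
  | awk -F'|' '{print $2}'
```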
I am using a bash script to generate mobility files (setdest) in ns2 for various seeds, but I am running into a troublesome segmentation fault. Any help would be appreciated. The setdest.cc has been modified, so it's not the standard ns2 file.
I will walk you through the problem.
This code in a shell script returns the segmentation fault.
#! /bin/sh
setdest="/root/ns-allinone-2.1b9a/ns-2.1b9a/indep-utils/cmu-scen-gen/setdest/setdest_mesh_seed_mod"
let nn="70" #Number of nodes in the simulation
let time="900" #Simulation time
let x="1000" #Horizontal dimensions
let y="1000" #Vertical dimensions
for speed in 5
do
for pause in 10
do
for seed in 1 5
do
echo -e "\n"
echo Seed = $seed Speed = $speed Pause Time = $pause
chmod 700 $setdest
$setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 > scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
done
done
done
error is
scengen_mesh: line 21: 14144 Segmentation fault $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 >scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
line 21 is the last line of the shell script (done)
The strange thing is: if I run the same setdest command in the terminal, there is no problem, like
$setdest -n 70 -p 10 -s 5 -t 900 -x 1000 -y 1000 -l 1 -m 50
I have worked out where the problem is exactly: it's with the argument -l. If I remove that argument in the shell script, there is no problem. Now I will walk you through the modified setdest.cc where this argument comes from.
This modified setdest uses a text file initpos to read the XY coordinates of static nodes for a wireless mesh topology. The relevant lines of code are
FILE *fp_loc;
int locinit;
fp_loc = fopen("initpos","r");
while ((ch = getopt(argc, argv, "r:m:l:n:p:s:t:x:y:i:o")) != EOF) {
    switch (ch) {
    case 'l':
        locinit = atoi(optarg);
        break;
    default:
        usage(argv);
        exit(1);
    }
}
/* ... later, when placing a node ... */
if (locinit)
    fscanf(fp_loc, "%lf %lf", &position.X, &position.Y);
if (position.X == -1 && position.Y == -1) {
    position.X = uniform() * MAXX;
    position.Y = uniform() * MAXY;
}
What I don't get is:
In the shell script:
- option -l with value 0 returns no error,
- but with any other value (I used 1 mostly) it returns this segmentation fault.
In the terminal:
- no segmentation fault with any value, 0 or 1.
It surely has something to do with the shell script. I am amazed at what is going wrong where!
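One hypothesis worth testing (an assumption on my part, not something the question establishes): fopen("initpos", "r") uses a relative path, so the file is only found when setdest runs from the directory containing initpos. The returned file pointer is never checked, and with -l set to a non-zero value a NULL pointer would reach fscanf and crash. In the terminal you are presumably inside that directory, while the script runs from elsewhere. Adding a cd to the setdest directory in the script would test this; the sketch below demonstrates the relative-path dependence with a stand-in directory:

```shell
#!/bin/sh
# The real script could add, before the loops:
#   cd /root/ns-allinone-2.1b9a/ns-2.1b9a/indep-utils/cmu-scen-gen/setdest || exit 1
# Stand-in demonstration: a relative open of "initpos" works only after cd:
demo=$(mktemp -d)
printf '100.0 200.0\n' > "$demo/initpos"
cd "$demo" || exit 1
cat initpos    # found only because the current directory now contains it
```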
Your help will be highly appreciated.
Cheers
I have carried out the following commands.
qmgr -c "create queue fastq queue_type=execution"
qmgr -c "set queue fastq started=true"
qmgr -c "set queue fastq enabled=true"
qmgr -c "set queue fastq acl_hosts=compute-0-30"
qmgr -c "set queue fastq acl_host_enable=true"
qmgr -c "set queue fastq acl_users=username"
qmgr -c "set queue fastq acl_user_enable=true"
But when I have the following header for my PBS script,
#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -N job
#PBS -u username
#PBS -q fastq
#PBS -m be
mpirun script
I get the following error:
host.edu > qsub runscript
qsub: Access from host not allowed, or unknown host MSG=host ACL rejected the submitting host: user username@email.com, queue fastq, host host.edu
If you are submitting the job from a computer that is not the Torque server, you will also need to set the submit_hosts parameter.
See Appendix B in the documentation:
http://docs.adaptivecomputing.com/torque/help.htm#topics/12-appendices/serverParameters.htm#submist_hosts
Take special care to use the full hostname of the machine (the value returned by uname -n).
You should always use the fully qualified hostname, as Torque can be quite picky.
I would also recommend that you take a look at torque.setup in the root directory of the tarball. It will seed a basic configuration.
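A sketch of that change with qmgr (host.edu is the hostname from the error message; other-host.edu is a placeholder; use the values uname -n reports on your submitting machines):

```
qmgr -c 'set server submit_hosts = host.edu'
# additional submit hosts can be appended:
qmgr -c 'set server submit_hosts += other-host.edu'
```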
qmgr -c 'p s'
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = foo.edu
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 19193
set server moab_array_compatible = True