Cancel jobs running on the same partition on SLURM

With the command
$> squeue -u mnyber004
I can list all the jobs submitted under my cluster account (Slurm):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16884 ada CPUeq6 mnyber00 R 1-01:26:17 1 srvcnthpc105
16882 ada CPUeq4 mnyber00 R 1-01:26:20 1 srvcnthpc104
16878 ada CPUeq2 mnyber00 R 1-01:26:31 1 srvcnthpc104
20126 ada CPUeq1 mnyber00 R 22:32:28 1 srvcnthpc103
22004 curie WRI_0015 mnyber00 R 16:11 1 srvcnthpc603
22002 curie WRI_0014 mnyber00 R 16:13 1 srvcnthpc603
22000 curie WRI_0013 mnyber00 R 16:14 1 srvcnthpc603
How can I cancel all the jobs running on the partition ada?

In your case, scancel offers the appropriate filters, so you can simply run
scancel -u mnyber004 -p ada
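If you only want to hit jobs that are currently in the RUNNING state on that partition, scancel also accepts a state filter; a minimal sketch:
scancel -u mnyber004 -p ada -t RUNNING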
Should it not have been the case, a frequent idiom is to use the more powerful filtering properties of squeue and the --format option to build the proper command and then feed it to sh:
squeue -u mnyber004 -p ada --format "scancel %i" | sh
You can play it safer by first saving to a file and then sourcing the file.
squeue -u mnyber004 -p ada --format "scancel %i" > /tmp/remove.sh
source /tmp/remove.sh
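A variant of the same idiom uses xargs; in this sketch the -h flag suppresses the squeue header so that only job IDs reach scancel:
squeue -h -u mnyber004 -p ada -o "%i" | xargs -r scancel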

Related

Weka command line attributes arguments

On the command line, I'm able to get this rolling with no problem:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W
"weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future
But once I try to add the skip list, I start to get errors saying that a date is not in the skip list even though the date is in fact on it:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W "weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future -skip "2014-06-07@yyyy-MM-dd, 2014-06-12"
Does anybody know how to get this working? Weka is low on documentation as far as I know.
Thanks in advance!
Forget it, I got it. The problem was that the 's' must be a capital letter:
-Skip
instead of
-skip.
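For reference, the working invocation should then look like the following sketch (same command as above, with only the flag capitalized; the skip-list value is kept as in the question):
java weka.Run weka.classifiers.timeseries.WekaForecaster -W "weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future -Skip "2014-06-07@yyyy-MM-dd, 2014-06-12"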

How to use qsub without root login

I am using a borrowed SGE system, but it doesn't work when I submit with qsub. Here is my script:
#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -N pix2pix
#PBS -o pix2pix.out
#PBS -e pix2pix.err
#PBS -q GPU_HIGH
#PBS -l walltime=72000:00:00
#cd $PBS_O_WORKDIR
cd /public/home/chensu/others/Guicai/pix2pix-tensorflow
source activate tensorflow
python /public/home/chensu/others/Guicai/pix2pix-tensorflow/pix2pix.py --mode train --output_dir /public/home/chensu/others/Guicai/pix2pix-tensorflow/face2face-model --max_epochs 2000 --input_dir /public/home/chensu/others/Guicai/pix2pix-tensorflow/photos/combined/train --which_direction AtoB --batch_size 4
There are many free queues. Here is the output of
qstat -q
server: admin1
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
FAT_HIGH -- -- -- -- 0 0 0 E R
high -- -- -- -- 10 0 10 E R
MIC_HIGH -- -- -- -- 0 0 0 E R
GPU_HIGH -- -- -- -- 0 0 0 E R
low -- -- -- -- 4 0 20 E R
batch -- -- -- -- 0 0 20 E R
middle -- -- -- -- 0 0 10 E R
----- -----
14 0
When I submit my test.pbs with qsub test.pbs, qstat shows:
MGMT1:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
20414.MGMT1 chensu GPU_HIGH pix2pix -- 1 4 -- 72000:00: C --
Also there are no log files, so I don't know what happened.
Any suggestions will be appreciated.

Specify number of CPUs for a job on SLURM

I would like to run multiple jobs on a single node on my cluster. However, when I submit a job, it takes all available CPUs, so the remaining jobs are queued. As an example, I made a script that requests few resources and submits two jobs that are supposed to run at the same time.
#! /bin/bash
variable=$(seq 0 1 1)
for l in ${variable}
do
run_thread="./run_thread.sh"
cat << EOF > ${run_thread}
#! /bin/bash
#SBATCH -p normal
#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 1
#SBATCH --threads-per-core 1
#SBATCH --mem=10G
sleep 120
EOF
sbatch ${run_thread}
done
However, one job is running and the other one is pending:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
57 normal run_thre user PD 0:00 1 (Resources)
56 normal run_thre user R 0:02 1 node00
The cluster has only one node, with 4 sockets of 12 cores and 2 threads each. The output of the command scontrol show job <jobid> is the following:
JobId=56 JobName=run_thread.sh
UserId=user(1002) GroupId=user(1002) MCS_label=N/A
Priority=4294901755 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-03-24T15:34:46 EligibleTime=2018-03-24T15:34:46
StartTime=2018-03-24T15:34:46 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=node00:13047
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node00
BatchHost=node00
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=48,mem=10G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=./run_thread.sh
WorkDir=/home/user
StdErr=/home/user/slurm-56.out
StdIn=/dev/null
StdOut=/home/user/slurm-56.out
Power=
And the output of scontrol show partition is:
PartitionName=normal
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node00
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
There is something I don't get with the SLURM system. How can I use only 1 CPU per job and run 48 jobs on the node at the same time?
Slurm is probably configured with
SelectType=select/linear
which means that Slurm allocates full nodes to jobs and does not allow node sharing among jobs.
You can check with
scontrol show config | grep SelectType
Set a value of select/cons_res to allow node sharing.
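If you administer the cluster, the change goes in slurm.conf; a minimal sketch, assuming cores are the consumable resource (SelectTypeParameters=CR_Core is an assumption here; CR_CPU or CR_Core_Memory may fit your site better):
# slurm.conf: let several jobs share a node by scheduling individual cores
SelectType=select/cons_res
SelectTypeParameters=CR_Core
Changing SelectType typically requires restarting slurmctld (and the slurmd daemons) for the new scheduling mode to take effect.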

pgrep in Linux only identifies 15 bytes of the process name

On my Linux system, when I run ps -ef | grep Speed I get the following:
myid 143410 49092 0 10:21 pts/12 00:00:00 ./OutSpeedyOrderConnection
myid 145492 49053 0 10:35 pts/11 00:00:00 ./SpeedyOrderConnection
That means the PIDs of these two processes are 143410 and 145492.
Then when I run pgrep -l Speed I get the following:
143410 OutSpeedyOrderC
145492 SpeedyOrderConn
and when I run pgrep OutSpeedyOrderC I get:
143410
but pgrep OutSpeedyOrderCo returns nothing!
It looks like pgrep only matches the first 15 bytes of the process name.
Is there anything I can do to get the right answer when I run
pgrep OutSpeedyOrderConnection ?
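A possible workaround, as a sketch: by default pgrep matches against the kernel's comm field, which is truncated to 15 characters, while pgrep -f matches against the full command line instead:
pgrep -f OutSpeedyOrderConnection
pgrep -lf OutSpeedyOrderConnection   # also print the matching command line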

perf: How to check processes running on a particular CPU

Is there any option in perf to look at the processes running on a particular CPU/core, and what percentage of that core is taken by each process?
Reference links would be helpful.
perf is intended for profiling, which is not a good fit for your case. You may instead try sampling /proc/sched_debug (if it is compiled into your kernel). For example, you can check which process is currently running on each CPU:
egrep '^R|cpu#' /proc/sched_debug
cpu#0, 917.276 MHz
R egrep 2614 37730.177313 ...
cpu#1, 917.276 MHz
R bash 2023 218715.010833 ...
Using its PID as a key, you can check how much CPU time (in milliseconds) it has consumed:
grep se.sum_exec_runtime /proc/2023/sched
se.sum_exec_runtime : 279346.058986
However, as @BrenoLeitão mentioned, SystemTap is better suited to this task. Here is a script for it:
global cputimes;
global cmdline;
global oncpu;
global NS_PER_SEC = 1000000000;

probe scheduler.cpu_on {
    oncpu[pid()] = local_clock_ns();
}
probe scheduler.cpu_off {
    if (oncpu[pid()] == 0)
        next;
    cmdline[pid()] = cmdline_str();
    cputimes[pid(), cpu()] <<< local_clock_ns() - oncpu[pid()];
    delete oncpu[pid()];
}
probe timer.s(1) {
    printf("%6s %3s %6s %s\n", "PID", "CPU", "PCT", "CMDLINE");
    foreach ([pid+, cpu] in cputimes) {
        cpupct = @sum(cputimes[pid, cpu]) * 10000 / NS_PER_SEC;
        printf("%6d %3d %3d.%02d %s\n", pid, cpu,
               cpupct / 100, cpupct % 100, cmdline[pid]);
    }
    delete cputimes;
}
It traces the moments when a process starts running on a CPU and when it stops running there (due to migration or sleeping) by attaching to the scheduler.cpu_on and scheduler.cpu_off probes. The second probe calculates the time difference between these events and saves it to the cputimes aggregation, along with the process's command line.
timer.s(1) fires once per second -- it walks over the aggregation and calculates the percentages. Here is sample output on CentOS 7 with bash running an infinite loop:
0 0 100.16
30 1 0.00
51 0 0.00
380 0 0.02 /usr/bin/python -Es /usr/sbin/tuned -l -P
2016 0 0.08 sshd: root@pts/0 "" "" "" ""
2023 1 100.11 -bash
2630 0 0.04 /usr/libexec/systemtap/stapio -R stap_3020c9e7ba76838179be68cd2390a10c_2630 -F3
I understand that perf is not the proper way to do it, although you can limit perf to particular CPUs, using perf record -C <cpulist> or even perf stat -C <cpulist>.
The closest you are going to get is the context-switch event, but this is not going to give you the application names at all.
I think you are going to need something more powerful, such as SystemTap.
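That said, a rough per-CPU breakdown by command can be pulled out of perf itself; a minimal sketch, assuming CPU 0 and a 10-second window (note the percentages reflect sampled cycles, not wall-clock time):
perf record -a -C 0 -- sleep 10    # sample only CPU 0, system-wide, for 10 seconds
perf report --sort comm            # break the samples down by command name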
