slurm high priority to jobs with lower resources - slurm

How can Slurm assign higher priority to jobs that request fewer resources?
I assume this requires Slurm TRES, but with the settings below I can only get higher priority for jobs that request more resources.
PriorityType=priority/multifactor
PriorityFlags=SMALL_RELATIVE_TO_TIME
AccountingStorageTRES=cpu,mem
PriorityWeightTRES=cpu=100,mem=1000
Does someone have example Slurm settings for this?
lower CPU -> higher priority
lower memory -> higher priority

You're looking for the PriorityFavorSmall option in the slurm.conf. Take a look at the Priority/Multifactor page. What you need is something like:
PriorityWeightJobSize=1000 #This value depends on the other weights. Choose something suitable for your config.
PriorityFavorSmall=YES
PriorityFlags=SMALL_RELATIVE_TO_TIME #If you want to take the walltime into account
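Putting that together with the question's settings, a minimal slurm.conf sketch might look like the following (the weight values are illustrative and need balancing against your other priority weights such as PriorityWeightAge and PriorityWeightFairshare):
PriorityType=priority/multifactor
PriorityFavorSmall=YES                  #Job-size factor now favors small jobs
PriorityWeightJobSize=1000              #Illustrative weight; tune against your other weights
PriorityFlags=SMALL_RELATIVE_TO_TIME    #Optional: job size relative to requested walltime
Note that PriorityWeightTRES=cpu=100,mem=1000 from the question pulls in the opposite direction: as far as I know, PriorityFavorSmall only affects the job-size factor, while the TRES factor grows with the amount of each TRES requested, which matches the behaviour you observed. You would therefore drop or shrink those TRES weights rather than rely on them for the small-job boost.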

Related

Local and Global size influence on program execution - OpenCL

After reading a lot of definitions of global work size and local work size, I still don't really understand what they are and how they work.
I think that the global work size determines how many times the kernel function will be called, but what about the local work size?
I thought that the local work size determines how many threads run in parallel at the same time, but am I really correct?
Is the local size the number of threads executing one kernel program per one global size value? I mean, when we have global size = 1 and local size = 1, the kernel function will be called once and only one thread will work on it.
But when we have global size = 4096 and local size (if allowed that high) of 1024, do we have 4096 calls of the kernel function, each with 1024 threads working on it at the same time? Am I correct?
Here is some example code I found:
And my other question is: how does changing the local size influence that code?
As I see it, it clearly works on global IDs, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?
And when we have a for loop in that algorithm, does that change anything regarding the influence of local size? Do we need to use local IDs to see any difference when changing the local size?
I tested this on a few of my programs, and even when I used only global IDs, changing the local work size gave me significantly shorter execution times.
So how does it work? I don't get it.
Thank you in advance!
I thought that local work size determines how many threads run in parallel at the same time, but am I really correct?
Correct, but it is per compute unit, not per whole device. If there are more compute units than work-groups (local thread groups), the device is not fully used. When there are more work-groups than compute units, but not an exact multiple, some compute units sit waiting for the others at the end. When the two are equal (or an exact multiple), then the "how many times" part matters for fully occupying all ALUs.
For example, an 8-core CPU could expose 8 compute units (maybe 8 more with hardware multithreading), while a GPU at a similar price can have 20 to 64 compute units. Even within a single compute unit, many groups of threads can be "in flight"; this is not explicitly tuned but depends on the resource usage per thread and per compute unit, and possibly per GPU.
how does changing the local size influence that code? As I see it, it clearly works on global IDs, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?
Vectorizable/parallelizable kernel code can take advantage of distributing threads across the ALUs and SIMD lanes of a CPU core, or the wider SIMD units of a GPU compute unit. On a CPU, 8 scalar instructions could be issued at the same time; on a GPU it could be as many as thousands. So when you decrease the local size to 1, you limit the width of parallel thread issue to a single ALU, which cripples performance on many architectures. When you make the local size too big, the resources left per thread shrink and performance takes a hit. If you have no idea what to pick, the OpenCL API can tune the local size for you if you pass NULL for that parameter (local_work_size in clEnqueueNDRangeKernel).
And when we have a for loop in that algorithm, does that change anything regarding the influence of local size? Do we need to use local IDs to see any difference when changing the local size?
For old, statically scheduled architectures, loop unrolling is advised, with an unroll step equal to the basic SIMD width. No, the local ID is just a query of a thread's index within its work-group, so there is no need to query it if you don't use it.
I tested this on a few of my programs, and even when I used only global IDs, changing the local work size gave me significantly shorter execution times. So how does it work?
If the kernel needs an extreme amount of resources per thread, you could consider 1 thread per work-group. If the kernel needs no resources except immediate values, you should use the maximum local size. Resource allocation per thread (driven by the kernel code) is important. Newer architectures have load balancing, so in the future it may not matter, provided you let the API choose the optimum value.
To keep all ALUs busy, the scheduler issues many threads per core: while one thread is waiting on a memory operation, another thread can do an ALU operation at the same time. This is good when resource usage per thread is small. If a single thread uses 50% of a compute unit's resources, only 2 threads can be in flight. Threads share shareable resources such as the L1 cache, local memory, and the register file.
Code such as c[i]=a[i]+b[i] on scalar floats is vectorizable. You can get better performance using float8, float16, and similar vector types, if the compiler is not already doing it in the background. That way fewer threads are needed to accomplish all the work, and accesses to memory are also faster. You can also add a loop inside the kernel to decrease the number of threads even further, which is good for a CPU since less thread dispatching is needed between two data blocks. For a GPU it may not matter.
Trivial example for a CPU:
4 cores, local size = 10, global size = 100 (so 10 thread groups of 10 threads each)
Cores 1 and 2 get 3 thread groups each; cores 3 and 4 get only 2 each.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
While instruction pipelining doesn't have many bubbles for cores 1 and 2, bubbles start appearing after some time for cores 3 and 4, so those cores can be used for other jobs such as a second kernel running in parallel, the operating system, or some array copying. When you load all cores equally, for example with 120 threads, they finish more work per second, but then the CPU cannot do array copies if the kernels are already using the memory (unless the OS preempts them for other threads).

Understanding KVM CPU scheduler algorithm

I am trying to understand CPU scheduling algorithm in KVM, but I haven't found the appropriate documentation for it.
For example, in Xen, when more than 1 vCPU is assigned to a single physical CPU (i.e., overcommitting), Xen's default Credit Scheduler decides the order in which the vCPUs get access to that single pCPU. There are then a number of parameters that adjust the default behaviour: you can change the default scheduling quantum (30 ms), you can assign different weights to VMs to give them more/less CPU time, set work-conserving mode, etc.
However, I am not clear about the degree of control that you get in KVM. This documentation explains how to pin vCPUs to pCPUs (which works fine), but I would like to know which scheduling algorithm is used by KVM and whether we have any way to tweak it, for example to give more priority (CPU time) to some VMs, or to adjust for I/O-intensive vs compute-intensive tasks.
Thanks!
KVM is a kernel-based virtualization infrastructure, so it uses the Linux kernel's native CPU scheduler, which is CFS by default.
(Scheduler diagram omitted; image source: ResearchGate.)
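Since each vCPU is just an ordinary Linux thread scheduled by CFS, the usual CFS mechanisms (nice values, cgroup CPU shares and quotas) apply to VMs as well. If the guests are managed through libvirt, virsh schedinfo exposes those knobs; a rough sketch, assuming libvirt-managed domains (the domain names and values are placeholders):
# Give the domain "db-vm" twice the default CPU weight under contention
virsh schedinfo db-vm --set cpu_shares=2048
# Cap the domain "batch-vm" at roughly 50% of one core (values in microseconds)
virsh schedinfo batch-vm --set vcpu_period=100000
virsh schedinfo batch-vm --set vcpu_quota=50000
cpu_shares is work-conserving (it only matters when pCPUs are contended), while vcpu_quota/vcpu_period impose a hard cap, roughly comparable to Xen's weight and cap parameters.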

How to fix the number of CPUs used by MemSQL?

I am running MemSQL (1 aggregator and 5 leaf nodes) on a single box which has 2 TB of RAM.
However, this is a shared system and there are other processes running on it. When I deploy the cluster and run a few queries, the CPU utilization goes really high and it looks like it uses all the cores. Is there a way I can prevent this by specifying the number of cores to use?
I checked the documentation and there is a parameter called maximum_memory which is set by default to 90% of the host memory. Is this the parameter that needs to be changed?
There is no MemSQL configuration option to limit the number of cores. The CPU utilization decrease you observed from reducing maximum_memory is indicative of the system using fewer machine resources overall (you reduced the memory available to the system by 80%).
If you want to limit the number of CPUs being used by MemSQL, use taskset.
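A minimal sketch of that, assuming the node processes are called memsqld (the PID and CPU range are placeholders for your deployment):
# Find the node PIDs to pin
pgrep memsqld
# Pin an already-running node (PID 12345 is a placeholder) to cores 0-15
taskset -cp 0-15 12345
Repeat the taskset call for each leaf/aggregator process you want to restrict; the affinity is inherited by threads created afterwards, and taskset -a (where available) can also re-pin a process's existing threads.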

Netlogo HPC CPU Percentage Use Increase

I submit jobs using headless NetLogo to an HPC server with the following script:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is a snapshot of the cluster queue using:
qstat -g c
(screenshot omitted)
I wish to know whether I can increase the CQLOAD for my simulations, and also what it signifies. I couldn't find a clear explanation online.
CPU usage check:
qhost -u abhishekb
When I run BehaviorSpace on my PC through the GUI, assigning high priority to the task makes it use nearly 99% of the CPU, which makes it run faster. I wish to accomplish the same here.
A typical HPC environment is designed to run only one MPI process (or OpenMP thread) per CPU core, which therefore has access to 100% of the CPU time, and this cannot be increased further. In contrast, on a classical desktop/server machine, a number of processes compete for CPU time, and it is indeed possible to increase the performance of one of them by setting appropriate priorities with nice.
It appears that CQLOAD is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, even the load average per core for your runs only reflects the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core would mean that the code spends 70% of its time doing calculations, while the remaining 30% is probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line, the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally though, people are more concerned about the parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with the CPU percentage use.
1. CPU percentage load
I agree with @rth's answer regarding trying to use Linux job priority / renice to increase the CPU percentage: it's almost certain not to work, and (as you've found) you're unlikely to be able to do it anyway, as you won't have superuser privileges on the nodes (it's pretty unlikely you can even log into the worker nodes - probably only the head node).
The CPU usage of your model as it runs is mainly a function of your code structure - if it runs at 100% CPU locally, it will probably run like that on the node during the time it's running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun's Grid Engine. The man pages are here (you can access them locally too, in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each host's different significance, the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied, only data about selected queues is considered in this formula. If the load value is not available at any of the hosts, '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors in the queue are. Your output screenshot above shows 0.84, so the average load on the (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state that colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with using fewer nodes (unless your results are very slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out how many nodes you need (roughly) by timing how long 1 model run takes on your computer and then use
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
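For example (with made-up numbers): if one model run takes 10 minutes on your computer, the experiment contains 288 runs, and you want the whole experiment done in about 2 hours (120 minutes), then N = (10 * 288) / 120 = 24, which is what the #$ -pe mpi 24 line above requests.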
So you want to make your program run faster on Linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program, or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive), where lower values give the process a higher priority. For security reasons, you can only decrease a process's niceness if you are the superuser (root).
So if you want to make a process run with higher priority then from within bash do
[abhishekb@hpc ~]$ start_process &
[abhishekb@hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command and replace the %+ with the process id of the process you want to increase the priority for.

How can I dynamically allocate CPU resources to processes in Linux?

In Linux (our system is CentOS 5), is it possible to allocate CPU resources to processes? For example, I have one Tomcat application, and I want all the processes and threads invoked by Tomcat to get p% of the total CPU cycles, no matter how many other applications are running. And I want to tune the p% dynamically (e.g., in this time slot Tomcat gets 40% of the CPU cycles, and in the next time slot it gets 70%).
If the above is possible, is it possible to do it conservatively? I mean, even though Tomcat has 40% of the CPU cycles, if its current workload only consumes 10%, other applications can use the remaining 30%.
Thanks.
If you can use RHEL6/CentOS6 (or upgrade your kernel), you can use cgroups to do what you want:
http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
https://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
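As a rough sketch of the cgroup (v1) approach on such a system, assuming the cpu controller is mounted at /cgroup/cpu and <tomcat_pid> is a placeholder for Tomcat's JVM PID:
# Create a cpu cgroup for Tomcat
mkdir /cgroup/cpu/tomcat
# Relative weight (default is 1024); only matters when CPUs are contended
echo 1024 > /cgroup/cpu/tomcat/cpu.shares
# Optional hard cap: at most 40% of one CPU (quota/period in microseconds;
# requires a kernel with CFS bandwidth control)
echo 100000 > /cgroup/cpu/tomcat/cpu.cfs_period_us
echo 40000  > /cgroup/cpu/tomcat/cpu.cfs_quota_us
# Move the Tomcat JVM into the group (<tomcat_pid> is a placeholder; use
# cgroup.procs where available to move all of the JVM's threads at once)
echo <tomcat_pid> > /cgroup/cpu/tomcat/tasks
cpu.shares covers the "conservatively" part of the question (an idle Tomcat leaves the CPU to other processes), while the quota is a hard cap. You can echo new values into these files at any time to change the split dynamically.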
Are you familiar with the tool nice and niceness levels?
Rather than trying to dictate exact percentages, you might want to look into niceness levels and how to set them in CentOS. Your applications will run as expected, with the higher-priority processes able to claim more resources, while the lower-priority processes will not be starved of resources when a higher-priority process is idle.
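A minimal sketch of that approach, assuming the Tomcat JVM can be found with pgrep (the niceness values are arbitrary examples):
# Deprioritize Tomcat during a busy time slot
renice -n 10 -p $(pgrep -d ' ' -f tomcat)
# Restore it later (lowering niceness, i.e. raising priority, needs root)
sudo renice -n 0 -p $(pgrep -d ' ' -f tomcat)
Note that niceness only shifts relative priority under contention; it cannot guarantee a fixed percentage the way cgroups can.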
If you really, really wanted to do this (and esnyder's suggestion of prioritising using nice levels is almost certainly a better solution for whatever you're really trying to achieve), AND you're happy to do it at the granularity of 1/number-of-CPUs (e.g. on an 8-core system, specify utilisation as a multiple of 12.5% of the total CPU resource), then you could use sched_setaffinity to set the CPU affinity mask for the process you want to control (you can do this from another process). (Actually, I think you'd need to identify all of that process's threads and invoke it on each of them.)
Alternatively, cpusets might be of interest but I've no idea what it takes to enable them or how dynamic they can be.
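From the shell, taskset is a convenient front end to the same affinity call; a rough sketch (the PID and CPU list are placeholders), noting that recent util-linux versions have an -a flag that applies the mask to all of a process's existing threads:
# Restrict PID 12345 and all of its threads to cores 0-1 (i.e. 25% of an 8-core box)
taskset -a -c -p 0,1 12345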
