How to set maximum allowed CPUs per job in Slurm?

How can I set the maximum number of CPUs each job can ask for in Slurm?
We're running a GPU cluster and want a sensible number of CPUs to always be available for GPU jobs. This is kind of fine as long as the job asks for GPUs, because there's a GPU <-> CPU mapping in gres.conf. But this doesn't stop a job that doesn't ask for any GPUs from acquiring all CPUs in the system.

To set the maximum number of CPUs a single job can use, at the cluster level, you can run the following command:
sacctmgr modify cluster <cluster_name> set MaxTRESPerJob=cpu=<number of CPUs>
Note that you must have SelectType=select/cons_tres in your configuration file for this to work.
Alternatively, the same restriction can be applied per partition, per QOS, per account, and so on.
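For instance, the same cap could be expressed as a QOS attached to a partition. A minimal sketch, assuming a hypothetical QOS name "cpu16" and partition "gpu", and that accounting enforcement is enabled:

# create a QOS that caps any single job at 16 CPUs (name and value are examples)
sacctmgr add qos cpu16
sacctmgr modify qos cpu16 set MaxTRESPerJob=cpu=16
# attach it as the partition QOS in slurm.conf:
#   PartitionName=gpu Nodes=... QOS=cpu16
# limits are only enforced if slurm.conf contains something like:
#   AccountingStorageEnforce=limits,qos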

Related

Limit cores per Apache Spark job

I have a dataset on which I'd like to run multiple jobs in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at like 50 cores or so the gain of adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is to set the number of partitions to 50, so each job could only spawn 50 tasks(?). But apparently there are some performance benefits to having more partitions than available cores, to get better overall utilization.
But other than that I didn't spot anything useful in the docs to limit the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors).
Is there any good way to do this?
Perhaps asking the Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them, perhaps that gives you some more flexibility on top of fair scheduling.
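For reference, fair scheduling is turned on with spark.scheduler.mode, and pools come from an allocation file referenced by spark.scheduler.allocation.file. A minimal sketch of a submit command; app.jar, the file path and the pool name are placeholders, not from the question:

spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml \
  app.jar
# inside the application, each job-submitting thread can then opt into a pool with
#   sc.setLocalProperty("spark.scheduler.pool", "poolA")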
Seems like conflicting requirements with no silver bullet.
1. parallelize as much as possible.
2. limit any one job from hogging resources IF (and only if) another job is running as well.
So:
if you increase number of partitions then you'll address #1 but not #2.
if you specify spark.cores.max then you'll address #2 but not #1.
if you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase the number of partitions, then the only thing you're risking is that a long-running big job will delay the completion of some smaller jobs; overall it'll take the same amount of time to run two jobs on given hardware in any order, as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs
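To make the bottom line concrete, here is a hedged submit-command sketch following that advice; the numbers are illustrative only and app.jar is a placeholder, and the commented-out line is the knob this answer advises against:

# raise partition counts if cores sit idle; use fair scheduling between jobs
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.default.parallelism=400 \
  --conf spark.sql.shuffle.partitions=400 \
  app.jar
# discouraged above: --conf spark.cores.max=50  (caps total cores for the whole application)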

Can Spark executor be enabled for multithreading more than CPU cores?

I understand if executor-cores is set to more than 1, then the executor will run in parallel. However, from my experience, the number of parallel processes in the executor is always equal to the number of CPUs in the executor.
For example, suppose I have a machine with 48 cores and set executor-cores to 4, and then there will be 12 executors.
What we need is to run 8 threads or more for each executor (so 2 or more threads per CPU). The reason is that the task is quite lightweight and CPU usage is quite low, around 10%, so we want to boost CPU usage through multiple threads per CPU.
So asking if we could possibly achieve this in the Spark configuration. Thanks a lot!
Spark executors process tasks, which are derived from the execution plan/code and the partitions of the dataframe. Each core on an executor always processes only one task at a time, so an executor can run at most as many concurrent tasks as it has cores. Running more concurrent tasks per executor than it has cores, as you are asking for, is not possible.
You should look for code changes instead: minimize the number of shuffles (no inner joins; use windows instead) and check for skew in your data leading to non-uniformly distributed partition sizes (dataframe partitions, not storage partitions).
WARNING:
If, however, you are alone on your cluster and do not want to change your code, you can change the YARN settings for the server so that it advertises more than 48 cores, even though there are only 48. This can lead to severe instability of the system, since executors will now share CPUs (and your OS also needs CPU time).
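For illustration only, given the warning above: the overcommit would be done by advertising more vcores than physically exist via yarn.nodemanager.resource.cpu-vcores in yarn-site.xml on each NodeManager (restart required). The value 96 on a 48-core node is just an example, not a recommendation:

# yarn-site.xml property, shown here in key=value form
yarn.nodemanager.resource.cpu-vcores=96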
This answer is meant as a complement to @Telijas' answer, because in general I agree with it. It's just to give that tiny bit of extra information.
There are some configuration parameters with which you can set the number of threads for certain parts of Spark. There is, for example, a section in the Spark docs that discusses some of them (for all of this I'm looking at the latest Spark version at the time of writing this post: version 3.3.1):
Depending on jobs and cluster configurations, we can set number of threads in several places in Spark to utilize available resources efficiently to get better performance. Prior to Spark 3.0, these thread configurations apply to all roles of Spark, such as driver, executor, worker and master. From Spark 3.0, we can configure threads in finer granularity starting from driver and executor. Take RPC module as example in below table. For other modules, like shuffle, just replace “rpc” with “shuffle” in the property names except spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
Property name (default, meaning):
spark.{driver|executor}.rpc.io.serverThreads (default: falls back on spark.rpc.io.serverThreads): number of threads used in the server thread pool
spark.{driver|executor}.rpc.io.clientThreads (default: falls back on spark.rpc.io.clientThreads): number of threads used in the client thread pool
spark.{driver|executor}.rpc.netty.dispatcher.numThreads (default: falls back on spark.rpc.netty.dispatcher.numThreads): number of threads used in the RPC message dispatcher thread pool
Here follows a (non-exhaustive, in no particular order, just from looking through the source code) list of some other thread-count-related configuration parameters:
spark.sql.streaming.fileSource.cleaner.numThreads
spark.storage.decommission.shuffleBlocks.maxThreads
spark.shuffle.mapOutput.dispatcher.numThreads
spark.shuffle.push.numPushThreads
spark.shuffle.push.merge.finalizeThreads
spark.rpc.connect.threads
spark.rpc.io.threads
spark.rpc.netty.dispatcher.numThreads (will be overridden by the driver/executor-specific ones from the table above)
spark.resultGetter.threads
spark.files.io.threads
I didn't add the meaning of these parameters to this answer because that's a different question and quite "Googleable". This is just meant as an extra bit of info.
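If you do want to experiment with these, they are set like any other Spark configuration property. A minimal sketch; the values are illustrative and app.jar is a placeholder:

spark-submit \
  --conf spark.executor.rpc.netty.dispatcher.numThreads=8 \
  --conf spark.rpc.io.threads=8 \
  app.jar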

how to specify max memory per core for a slurm job

I want to specify the maximum amount of memory per core for a batch job in Slurm.
I can see two sbatch memory options:
--mem=MB maximum amount of real memory per node required by the job.
--mem-per-cpu=mem amount of real memory per allocated CPU required by the job.
Either of those options suits my needs.
Any suggestions on how to achieve this goal?
You can set --mem to the value of MaxMemPerNode to request the maximum memory allowed per node for the job. If configured in the cluster, you can see that value with scontrol show config.
As a special case, setting --mem=0 will also give the job access to all of the memory on each node. (This is not ideal in a heterogeneous cluster, since the lowest memory value among the allocated nodes will be used for all of them.)
Similarly, if configured in the cluster, --mem-per-cpu can be set to the value of MaxMemPerCPU to request the maximum allowed memory per CPU.
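A small sketch putting that together; the values 8 and 4G and the program name are placeholders:

scontrol show config | grep -i MaxMemPer   # shows MaxMemPerNode / MaxMemPerCPU if configured

and in a batch script:

#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
srun ./my_program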

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible) but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters have the freedom to specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image of YARN and Spark has the following defaults:
Spark dynamic allocation enabled
Yarn Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the ratio container-vcores is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by means of setting spark.dynamicAllocation.enabled = false and spark.executor.instances = <num> as properties in gcloud submit)
So I changed it to DominantResourceCalculator and now it takes care of the requested cores but I'm no longer able to specify the number of executors, regardless of disabling the Spark dynamic allocation or not.
It might also be of interest to know that the default YARN queue is limited to 70% of capacity by configuration (in capacity-scheduler.xml) and that there is also another non-default queue configured (but not used yet). My understanding is that both the Capacity and Fair schedulers do not limit resource allocation in the case of uncontended job submission as long as the max capacity is kept at 100. In any case, for the sake of clarity, these are the properties set up during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
The job submission is done by means of gcloud tool and the queue used is the default.
E.g., the following properties are set when executing gcloud dataproc submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in an assignment that does not honor the requested number of executors (the screenshot showing the resulting allocation is not reproduced here).
Is there a way to configure YARN so that it accepts both?
EDITED to specify queue setup
You may try setting a higher value, such as 2, for yarn.scheduler.capacity.root.online.user-limit-factor in place of the current value of 1. This setting lets a single user consume up to that multiple of the queue's configured capacity, and since you have set the queue's maximum capacity to 100%, there is room for that doubling.
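A hedged sketch of what that could look like at cluster creation time; my-cluster is a placeholder and only the changed property is shown:

gcloud dataproc clusters create my-cluster \
  --properties capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2
# on an existing cluster you could instead edit capacity-scheduler.xml on the master
# and reload the queues with: yarn rmadmin -refreshQueues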

How to let slurm limit memory per node

Slurm manages a cluster with 8-core/64 GB RAM and 16-core/128 GB RAM nodes.
There is a low-priority "long" partition and a high-priority "short" partition.
Jobs running in the long partition can be suspended by jobs in the short partition, in which case pages from the suspended job get mostly pushed to swap. (Swap usage is intended for this purpose only, not for active jobs.)
How can I configure in Slurm the total amount of RAM+swap available for jobs on each node?
There is the MaxMemPerNode parameter, but that is a partition property and thus cannot accommodate different values for different nodes in the partition.
There is the MaxMemPerCPU parameter, but that prevents low-memory jobs from sharing unused memory with big-memory jobs.
You need to specify the memory of each node using the RealMemory parameter in the node definition (see the slurm.conf manpage).
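A minimal slurm.conf sketch for the two node types in the question; the node names and counts are placeholders and the RealMemory values (in megabytes) are approximate:

NodeName=small[01-10] CPUs=8  RealMemory=64000
NodeName=big[01-10]   CPUs=16 RealMemory=128000
# for Slurm to enforce memory as a consumable resource, SelectTypeParameters must
# include memory, e.g. SelectType=select/cons_tres with SelectTypeParameters=CR_Core_Memory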
