Difference between Dask node and compute node for Slurm configuration

First off, apologies if I use confusing or incorrect terminology, I am still learning.
I am trying to set up configuration for a Slurm-enabled adaptive cluster.
The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from the website:
Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB
compute
This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs. Some information about the compute node:
# of CPU Cores: 64
# of Threads: 128
Here is some output from scontrol show partition:
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=960 MaxMemPerCPU=3840
Here is what I have so far:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,                        # worker processes per Slurm job
    cores=64,                            # total threads per Slurm job
    memory=f"{8000 * 64 * 0.90} MB",     # 90% of the per-node maximum
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=0,
    # job_extra=["--ntasks-per-node=50",],
)
Some things to mention:
In the first table above, "nodes" refers to compute server nodes, not Dask nodes (which I think should rather be called Dask workers? If someone could clear up that term for me I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a "reduction" factor of 0.90, just to be on the safe side.
I have 64 CPUs, which I believe should translate to 64 "cores" in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per process, but I have no idea how to get a feeling for sensible settings here.
I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to wait for a new one. This might leave the server partly idle, but it might still be more effective than waiting in the Slurm batch queue.
If I now print the job script as configured above, I get:
print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00
/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://
So, questions:
By my mental math, 8000*64*0.9 = 460.8 GB, not 430G. What is happening here?
I don't really understand the nthreads, nprocs, and memory-limit settings of the dask_worker…?
Can someone give me a good distinction between Dask nodes, Dask workers, and compute nodes as seen in Slurm?

I am in a similar situation to yours, where I am trying to understand how dask-distributed works with Slurm.
The dask-distributed docs report that Slurm uses KB or GB, but it actually means KiB or GiB, so Dask converts your value to GiB.
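As a rough sketch of that conversion, using the 8000 * 64 * 0.90 MB request from your SLURMCluster call and assuming the "MB" is parsed as decimal megabytes:

requested_bytes = 8000 * 64 * 0.90 * 1e6   # the 460800 MB asked for in the question
in_gib = requested_bytes / 2**30           # the same amount expressed in binary gibibytes
print(f"{in_gib:.2f} GiB")                 # ~429.15, which matches the 430G in the generated job script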
What I've found is that nprocs=processes, nthreads=cores/processes, and memory-limit=allocated-memory/processes. The job will then launch a dask_worker with nprocs processes of nthreads threads each (these will be the workers of your SLURMCluster).
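For the numbers in your question that works out roughly as follows (a sketch, using the processes=32, cores=64 and ~460800 MB values from your SLURMCluster call):

cores, processes = 64, 32
memory_bytes = 8000 * 64 * 0.90 * 1e6        # the "460800 MB" request
nprocs = processes                           # 32 worker processes per Slurm job
nthreads = cores // processes                # 2 threads per worker process
memory_limit = memory_bytes / processes      # ~14.4e9 bytes per worker
print(nprocs, nthreads, memory_limit / 2**30)   # 32 2 ~13.41 GiB, matching the job script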
This is not clear to me either, so I don't have a good answer. I think that since Slurm nodes have several CPUs, the scheduler of your SLURMCluster manages Dask workers based on the allocated CPUs. (I didn't find anything about "Dask nodes" in the docs, though.)
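One way to see the distinction in practice (a minimal sketch, assuming the cluster object from your question): each Slurm job is one batch allocation on a compute node, and each of the 32 processes it starts shows up as a separate Dask worker in the scheduler.

from dask.distributed import Client

client = Client(cluster)        # connect to the SLURMCluster's scheduler
cluster.scale(jobs=2)           # request two Slurm jobs; each should start 32 worker processes here
# every entry below is one Dask worker (a Python process), not one Slurm compute node
print(len(client.scheduler_info()["workers"]))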

Related

Can I set a variable amount of memory for SLURM jobs?

I am on a school server, where we have a login node, and several GPU nodes that can be accessed with SLURM. Specifically, I am running several jobs on a machine with 512 GB RAM, 64 CPUS, and 8 GPUs.
I want to train a large deep-learning model with a ton of text. After some trial and error, it seems I need 100 GB of RAM to load all the text data (12 GB batch file on disk) successfully, before training starts. However, while training, it only uses about 30 GB of RAM. I want to run several versions of this model, so I can only do up to 5 before running out of RAM to allocate.
Is there a way to have SLURM use a variable amount of RAM? I don't want to hog too many resources unnecessarily. These are the SBATCH directives I am currently using:
#SBATCH --mem-per-cpu=20G
#SBATCH --cpus-per-task=5
#SBATCH --nodelist=gpu-large
#SBATCH --gpus=1
I currently have 3 jobs running, and this is what I see from free:
             total        used        free      shared  buff/cache   available
Mem:         503Gi        94Gi       246Gi       1.6Gi       161Gi       403Gi
Swap:        4.0Gi        96Mi       3.9Gi
So my jobs are indeed only using 30 GB each. Help would be greatly appreciated!
Disclaimer: the following is from memory; I have no access to a Slurm cluster at the moment to check.
You should be able to update the memory requirement of the job after it has started. As far as I remember, as a simple user (as opposed to a Slurm admin or root), you can only lower the limits, not increase them (to prevent hijacking priority or resources). Say your job has a memory-intensive step0 followed by a less intensive step1; you could try something like the following:
# Memory intensive part
step0
# Lower resource reservation
scontrol update JobId=$SLURM_JOBID MinMemoryCPU=6G
# Less intensive step
step1
Some general info there: https://slurm.schedmd.com/scontrol.html

SLURM error: sbatch: error: Batch job submission failed: Requested node configuration is not available

I'm trying to use a cluster to run an MPI code. The cluster hardware consists of 30 nodes, each with the following specs:
16 cores on 2 sockets (Intel Xeon E5-2650 v2) - (32 cores with multithreading enabled)
64 GByte 1866 MT/s main memory
named: aria
The Slurm directives in the batch file are as follows:
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --cpus-per-task=1 # Number of cores per MPI rank
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=32 # How many tasks on each node
#SBATCH --ntasks-per-socket=16 # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=100mb # Memory per core
When I submit the job, a return message comes out with the following content: sbatch: error: Batch job submission failed: Requested node configuration is not available
This is a little bit confusing. I'm submitting one task per CPU and dividing the tasks equally between nodes and sockets, so can anyone please advise on the problem with the aforementioned configuration? And one more thing: what is the optimum configuration given the hardware specs?
Thanks in advance
Look at exactly what the nodes offer with the sinfo -Nl command.
It could be that:
hyper threading is not enabled (which is often the case on HPC clusters)
or one core is reserved for Slurm and the Operating System
or hyper threading is enabled but Slurm is configured to schedule physical cores
As for the optimal job configuration, it depends on how 'optimal' is defined. For optimal time to solution, it is often better to let Slurm decide how to organise the ranks on the nodes, because it will then be able to start your job sooner.
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --mem-per-cpu=100mb # Memory per core
For optimal job performance (in case of benchmarks, cost analysis, etc.) you will need to take switches into account as well (although with 30 nodes you probably have only one switch).
#SBATCH --ntasks=64 # Number of MPI ranks
#SBATCH --exclusive
#SBATCH --switches=1
#SBATCH --mem-per-cpu=100mb # Memory per core
Using --exclusive will make sure your job will not be bothered by other jobs.

How the number of partitions and the number of concurrent tasks in Spark are calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
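As a small sketch of that behaviour (hypothetical RDD and values; spark.default.parallelism only sets the default partition count, it does not raise the concurrency ceiling set by the total executor cores):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("parallelism-demo").set("spark.default.parallelism", "128")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1000000)).repartition(200)   # 200 partitions -> 200 tasks per stage
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.count())   # with 4 nodes x 16 cores, only ~64 of those tasks execute at any instant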
You can see more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?

Partitioning the RDD for Spark Jobs

When I submit a Spark job to a YARN cluster, I see in the Spark UI that I get 4 stages of jobs, but the memory used is very low on all nodes and it says 0 out of 4 GB used. I guess that might be because I left it at the default partitioning.
File sizes range between 1 MB and 100 MB in S3. There are around 2700 files with a total size of 26 GB, and exactly 2700 tasks were running in stage 2.
Is it worth repartitioning to around 640 partitions, and would it improve the performance? Or
does it matter if the partitioning is more granular than actually required? Or
do my submit parameters need to be addressed?
Cluster details,
Cluster with 10 nodes
Overall memory 500 GB
Overall vCores 64
--executor-memory 16g
--num-executors 16
--executor-cores 1
Actually it runs on 17 cores out of 64. I don't want to increase the number of cores since others might use the cluster.
You partition, and repartition, for the following reasons:
To make sure we have enough work to distribute to the distinct cores in our cluster (nodes * cores_per_node). Obviously we need to tune the number of executors, cores per executor, and memory per executor to make that happen as intended.
To make sure we distribute work evenly: the smaller the partitions, the smaller the chance that one core has much more work to do than all the other cores. A skewed distribution can have a huge effect on total elapsed time if the partitions are too big.
To keep partitions at manageable sizes: not too big and not too small, so we don't overtax the GC. Bigger partitions can also run into issues when the processing cost is non-linear in partition size.
Too-small partitions will create too much process overhead.
As you might have noticed, there is a goldilocks zone. Testing will help you determine the ideal partition size (see the rough sketch after this list).
Note that it is OK to have many more partitions than cores. Queuing partitions to be assigned a task is something I design for.
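A back-of-the-envelope starting point, using the numbers from your question (64 vCores, ~26 GB of input) and an assumed ~128 MB target per partition, to be refined by testing:

total_cores = 64                    # vCores available in the cluster
data_size_mb = 26 * 1024            # ~26 GB of input files
target_partition_mb = 128           # assumed target size per partition

by_cores = 2 * total_cores                       # at least a couple of partitions per core
by_size = data_size_mb // target_partition_mb    # enough partitions to keep each around 128 MB
num_partitions = max(by_cores, by_size)
print(num_partitions)               # ~208 here; having more partitions than cores is fine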
Also make sure you configure your Spark job properly:
Make sure you do not have too many executors. One or very few executors per node is more than enough. Fewer executors have less overhead, since tasks within an executor share memory space and are handled by threads instead of processes. There is a huge amount of overhead to starting up a process, whereas threads are pretty lightweight.
Tasks need to talk to each other. If they are in the same executor, they can do that in-memory. If they are in different executors (processes), then that happens over a socket (overhead). If that is over multiple nodes, that happens over a traditional network connection (more overhead).
Assign enough memory to your executors. When using Yarn as the scheduler, it will fit the executors by default by their memory, not by the CPU you declare to use.
I do not know what your situation is (you made the node names invisible), but if you only have a single node with 15 cores, then 16 executors do not make sense. Instead, set it up with One executor, and 16 cores per executor.
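For example, on a 10-node / 64-vCore cluster like the one described, a configuration along these lines keeps executors few and fat rather than many and thin (illustrative values only, assuming Spark 2.x or later, to be tuned against what YARN actually exposes per node):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("repartition-example")
    .config("spark.executor.instances", "8")   # roughly one executor per worker node
    .config("spark.executor.cores", "6")       # several task slots per executor
    .config("spark.executor.memory", "40g")    # leave headroom for YARN overhead
    .getOrCreate()
)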

How many cores each Hadoop map task use?

I'm running a Hadoop task on a YARN cluster with a maximum of 8 tasks and 16 cores.
When I run the job I see 8 tasks running on a node, yet all 16 cores are being used.
Is a map task multi-threaded?
Does a map task use more than 1 core?
Can I know which cores each map task used?
Thanks,
Assaf
You can configure the number of cores per map, as well as the maximum number of usable cores - see here.
The question sounds a bit confused, so here are some more details which may be relevant:
A task might do more than just run a map, and, if you're running Hadoop, you might be using the cores for something else in the system (i.e., maybe some other process is using the cores).
A mapping task might use more than one mapper to do its job - that's part of the point of using Hadoop and an MR architecture - your work will get auto-magically distributed and split for you.
Also, beware: your number of tasks doesn't directly relate to the number of mappers, cores or other resources in use; if what you're looking to do is limit CPU usage, or control resource allocation in any other way, change the properties of your containers (see the sketch below).
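For instance, the per-container CPU request for map and reduce tasks can be passed as standard MapReduce properties when launching the job. A sketch of how that might look; my-job.jar and MyDriver are placeholders, and the driver needs to use ToolRunner for the -D options to be picked up:

import subprocess

subprocess.run(
    [
        "hadoop", "jar", "my-job.jar", "MyDriver",
        "-D", "mapreduce.map.cpu.vcores=2",      # vcores requested for each map container
        "-D", "mapreduce.reduce.cpu.vcores=2",   # vcores requested for each reduce container
        "input/", "output/",
    ],
    check=True,
)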
For a more detailed discussion of resource allocation (esp. when compared to MR1) see here.
