Slurm manages a cluster with 8-core/64 GB RAM and 16-core/128 GB RAM nodes.
There is a low-priority "long" partition and a high-priority "short" partition.
Jobs running in the long partition can be suspended by jobs in the short partition, in which case pages from the suspended job get mostly pushed to swap. (Swap usage is intended for this purpose only, not for active jobs.)
How can I configure in Slurm the total amount of RAM+swap available on each node for jobs?
There is the MaxMemPerNode parameter, but that is a partition property and thus cannot accommodate different values for different nodes in the partition.
There is the MaxMemPerCPU parameter, but that prevents low-memory jobs from sharing unused memory with big-memory jobs.
You need to specify the memory of each node using the RealMemory parameter in the node definition (see the slurm.conf manpage).
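For illustration, a minimal slurm.conf sketch along these lines, where RealMemory on each node is assumed to be set to RAM plus the swap intended for suspended jobs (the node names, sizes, and preemption settings below are hypothetical, memory values in MB):

    # Sketch only: hypothetical node and partition definitions
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory             # track memory as a schedulable resource
    PreemptType=preempt/partition_prio              # let the high-priority partition suspend the low one
    NodeName=node[01-04] CPUs=8  RealMemory=96000   # 64 GB RAM + 32 GB swap exposed to jobs
    NodeName=node[05-08] CPUs=16 RealMemory=192000  # 128 GB RAM + 64 GB swap exposed to jobs
    PartitionName=long  Nodes=node[01-08] PriorityTier=1 PreemptMode=SUSPEND
    PartitionName=short Nodes=node[01-08] PriorityTier=2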
I am trying to understand the relationship between the different components and elements in the Spark architecture but am unable to get a grip on it. Can someone please validate my assumptions and correct me where I am wrong?
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers.
Q - Can a node have multiple drivers (if I have multiple applications)?
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
Q. What metric determines the number of executors per worker?
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
Q. What is the relationship between a core and an executor?
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
For e.g., if I have a cluster with 2 nodes, 10 executors (5 executors in each node) and a dataframe with 20 partitions, I'm assuming I would have 2 partitions in each executor or is there a chance that partition distribution could be skewed? What would I need to do to ensure that all my partitions that have a certain partitioning key get co-located within the same worker so there is minimum network transfer when these partitions have to work together to, say, perform an aggregation or a join?
Q - What happens when a repartition() is performed. For e.g., if I have 20 partitions across 10 executors (say, 2 partitions in each) and I repartition(2). I will now have only 2 portions of data which I assume would be resting in a couple of executors. What happens to the remaining executors?
Assumption - A task is the lowest unit of work that performs the actual work. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
Q - Are these tasks performed by individual executors?
Q - If I have less executors (say, 10) than the partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? is the degree of parallelism constrained by the number of executors?
Thanks in advance!
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers. (This is correct as a starting point.)
Q - Can a node have multiple drivers (if I have multiple applications)?
Yes, because the driver is just a process that gets created for the program you have written, and you can have multiple processes running on the same node.
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
Your understanding here seems wrong, because a worker is actually a node or machine. Whether you say "worker" or "worker node", both are the same.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
An executor is a process inside the worker node, and a single worker node can have multiple executors.
Q. What metric determines the number of executors per worker?
The configuration (number of cores and memory) of your worker node decides the maximum number of executors it can run.
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
It is associated with the executor process. A Spark executor is a single JVM instance on a node that serves a single Spark application.
Q. What is the relationship between a core and an executor?
The cores property controls the number of concurrent tasks an executor can run. For example, if you request 2 executors, each with 2 cores, then you can run 4 concurrent tasks at the same time during your job execution.
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
Generally, Spark performs all its computation in memory. RAM is allocated at the executor level, whereas disk is allocated at the worker-node level only. Spark spills data to disk only when it does not fit in memory.
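As a rough sketch of how these per-executor resources are typically requested (the instance count, core count, and memory size here are assumptions chosen purely for illustration):

    from pyspark.sql import SparkSession

    # Hypothetical sizing: 5 executors, each a separate JVM with 4 cores
    # (= 4 concurrent tasks) and 20 GB of heap; disk is not reserved per
    # executor, it is only spilled to when data does not fit in memory.
    spark = (
        SparkSession.builder
        .appName("executor-sizing-sketch")
        .config("spark.executor.instances", "5")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "20g")
        .getOrCreate()
    )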
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
These partitions could be anywhere and, in most cases, will not be equally distributed. It can happen that some executors do not have a single partition while other executors have more than 2 partitions.
In order to have co-located partitions, or partitions that share the same keys, you have to repartition the data on a specific column of your dataframe. Spark then partitions your data based on the values of that column and makes sure that rows with the same column values end up in the same partition.
When you repartition the data into 2 partitions, Spark shuffles the data between all the executors and then breaks it into 2 partitions. That data could end up on any of the executors, and the other executors would be empty or idle in that case.
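A small PySpark sketch of the two repartitioning cases just discussed, using a made-up dataframe and column name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

    # Hypothetical dataframe with a key column used for co-location.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (1, "c"), (3, "d")], ["customer_id", "value"]
    )

    # Hash-partition on the key so rows with the same key land in the same partition.
    df_by_key = df.repartition("customer_id")

    # Collapse to 2 partitions; after the shuffle, executors that receive no
    # partition simply sit idle for the following stage.
    df_two = df.repartition(2)

    print(df_by_key.rdd.getNumPartitions(), df_two.rdd.getNumPartitions())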
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks dependent on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
You would have 20 tasks for that specific stage, but it won't stay the same for all stages, since a new stage gets created whenever a data shuffle needs to happen. If there is no shuffle in the code you have written, Spark just creates a single stage with 20 tasks.
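As a tiny illustration of that stage boundary (column names made up), a groupBy introduces a shuffle and therefore a second stage:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stage-sketch").getOrCreate()
    df = spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "value"])

    # filter is a narrow transformation and runs in the same stage as the scan;
    # groupBy forces a shuffle, so the count runs in a second stage whose task
    # count follows the shuffle partition setting.
    df.filter(df.value > 15).groupBy("key").count().collect()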
Q - Are these tasks performed by individual executors? Yes
Q - If I have less executors (say, 10) than the partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? is the degree of parallelism constrained by the number of executors? Yes
I want to specify the maximum amount of memory per core for a batch job in Slurm.
I can see two sbatch memory options:
--mem=MB maximum amount of real memory per node required by the job.
--mem-per-cpu=mem amount of real memory per allocated CPU required by the job.
Either of those options suits my needs.
Any suggestions on how to achieve this goal?
You can use --mem=MaxMemPerNode to request the maximum allowed memory for the job on that node. If configured in the cluster, you can see the value of MaxMemPerNode using scontrol show config.
As a special case, setting --mem=0 will also give the job access to all of the memory on each node. (This is not ideal in a heterogeneous cluster, since the lowest memory value among the allocated nodes will be applied to all of them.)
If configured in the cluster, --mem-per-cpu=MaxMemPerCPU can be used to request the maximum allowed memory per CPU.
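As an illustration, a minimal batch script using these options; the task count and per-CPU value are hypothetical, and you can check what the cluster actually allows with scontrol show config | grep MaxMem:

    #!/bin/bash
    #SBATCH --ntasks=8
    #SBATCH --mem-per-cpu=4000   # MB per allocated CPU (hypothetical value)
    ## alternatively, request all memory on each allocated node:
    ##SBATCH --mem=0
    srun ./my_program            # placeholder executable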
If I am loading one table from Cassandra using Spark's dataframe.load(), where will my data get loaded? Is it in Spark memory, or in data node blocks, if I am using the YARN resource manager?
It will try to store the data in memory, as partitions, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if there is not enough memory on the Worker Nodes.
Processing occurs according to the number of Cores / Executors. E.g. if you have, say, 20 Executors with 1 Core each, your processing concurrency is 20, and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Nodes is a better term here than Data Nodes, unless you have HDFS and local processing, in which case the Worker Node is the same machine as the Data Node. Although you could argue: what's in a name?
Of course, an Action will need to have been initiated.
And repartition, join, or union later in the data pipeline affect things, but that goes without saying.
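For concreteness, a sketch of the kind of load being discussed, assuming the Spark Cassandra Connector is on the classpath and using made-up keyspace/table names; nothing is materialized until an action runs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cassandra-load-sketch").getOrCreate()

    # Lazily defines the read; partitions are only pulled into executor
    # memory (and spilled to local disk if needed) once an action runs.
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")  # hypothetical names
        .load()
    )

    df.count()  # the action that actually loads and processes the partitions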
I am wondering:
What kind of data does Spark store using spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead?
And in which cases should I boost the value of spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead?
In YARN terminology, executors and application masters run inside “containers”. Spark offers YARN-specific properties so you can run your application:
spark.yarn.executor.memoryOverhead is the amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead is the amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode, with the same meaning as the executor's memoryOverhead.
So it's not about storing data, it's just the resources needed for YARN to run properly.
In some cases, e.g. if you enable dynamic allocation, you might want to set these properties explicitly, along with the maximum number of executors (spark.dynamicAllocation.maxExecutors) that can be created during the process; otherwise you can easily overwhelm YARN by asking for thousands of executors and thus lose the already running executors.
spark.dynamicAllocation.maxExecutors, which sets the upper bound for the number of executors if dynamic allocation is enabled, is set to infinity by default. [ref. http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation]
According to the code documentation: [ref. https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L43]
Increasing the target number of executors happens in response to backlogged tasks waiting to be scheduled. If the scheduler queue is not drained in N seconds, then new executors are added. If the queue persists for another M seconds, then more executors are added and so on. The number added in each round increases exponentially from the previous round until an upper bound has been reached. The upper bound is based both on a configured property and on the current number of running and pending tasks, as described above.
This can lead to an exponential increase in the number of executors in some cases, which can break the YARN resource manager. In my case:
16/03/31 07:15:44 INFO ExecutorAllocationManager: Requesting 8000 new executors because tasks are backlogged (new desired total will be 40000)
This doesn't cover all the use cases for these properties, but it gives a general idea of how they are used.
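A sketch of how these properties might be set together; the sizes and the executor cap are illustrative assumptions only (and note that newer Spark versions rename the overhead keys to spark.executor.memoryOverhead / spark.driver.memoryOverhead):

    from pyspark.sql import SparkSession

    # Illustrative values only: ~7-8% overhead on a 10g executor, plus an
    # explicit cap on dynamic allocation so YARN is not flooded with requests.
    spark = (
        SparkSession.builder
        .appName("memory-overhead-sketch")
        .config("spark.executor.memory", "10g")
        .config("spark.yarn.executor.memoryOverhead", "768")   # MB of off-heap per executor
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.maxExecutors", "100")
        .config("spark.shuffle.service.enabled", "true")       # needed for dynamic allocation on YARN
        .getOrCreate()
    )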
I am talking about the standalone mode of Spark.
Let's say SPARK_WORKER_CORES=200 and there are only 4 cores available on the node where I am trying to start the worker. Will the worker get 4 cores and continue, or will the worker not start at all?
A similar case: what if I set SPARK_WORKER_MEMORY=32g and there is only 2g of memory actually available on that node?
"Cores" in Spark is sort of a misnomer. "Cores" actually corresponds to the number of threads created to process data. So, you could have an arbitrarily large number of cores without an explicit failure. That being said, overcommitting by 50x will likely lead to incredibly poor performance due to context switching and overhead costs. This means that for both workers and executors you can arbitrarily increase this number. In practice in Spark Standalone, I've generally seen this overcommitted no more than 2-3x the number of logical cores.
When it comes to specifying worker memory, once again, you can in theory increase it to an arbitrarily large number. This is because, for a worker, the memory amount specifies how much it is allowed to allocate for executors, but it doesn't explicitly allocate that amount when you start the worker. Therefore you can make this value much larger than physical memory.
The caveat here is that when you start up an executor, if you set the executor memory to be greater than the amount of physical memory, your executors will fail to start. This is because executor memory directly corresponds to the -Xmx setting of the java process for that executor.
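For reference, these settings live in conf/spark-env.sh on the worker; the values below are only meant to illustrate the overcommit behaviour described above:

    # conf/spark-env.sh (illustrative values)
    export SPARK_WORKER_CORES=8     # really scheduler slots/threads; may exceed physical cores without failure
    export SPARK_WORKER_MEMORY=32g  # ceiling the worker may hand out to executors; not reserved up front
    # Executors themselves fail at launch if spark.executor.memory (-Xmx) exceeds available physical memory.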