How can I configure YARN to allocate a minimum number of containers? - apache-spark

I am running a number of Spark jobs in parallel on a YARN cluster. I am finding that YARN is starting up a number of these jobs in parallel, but only allocating one container for the driver and no executors. This means that these Spark jobs are effectively sitting idle waiting for an executor to join, when this processing power could be better utilised by allocating executors to other jobs.
I would like to configure YARN to allocate a minimum of two containers (one driver + one executor) to a job, and if that's not available to keep it in the queue. How can I configure YARN in this way?
(I am running on an AWS EMR cluster with nearly all of the default settings.)

If your YARN uses FairScheduler, you can limit the number of applications running concurrently, and what percentage of a pool can be used by AMs (leaving the rest to the executors):
maxRunningApps: limit the number of apps from the queue to run at once
maxAMShare: limit the fraction of the queue's fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
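For example, a minimal fair-scheduler.xml sketch (the queue name and the numbers are placeholders, not taken from the question):

<?xml version="1.0"?>
<allocations>
  <queue name="sparkjobs">
    <!-- Run at most 4 apps from this queue at once; the rest wait in the queue. -->
    <maxRunningApps>4</maxRunningApps>
    <!-- Let ApplicationMasters use at most 30% of the queue's fair share,
         leaving the remaining capacity for executors. -->
    <maxAMShare>0.3</maxAMShare>
  </queue>
</allocations>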

Related

Can two executors / drivers from different Spark applications run on same node in cluster mode?

I read an article on Medium which claims that the number of executors + 1 (for the driver) should be a multiple of 3, to efficiently utilize the cores on a machine (16 cores in this case, i.e. 5 per executor, with 1 reserved for the OS and node manager).
I am unable to validate this statement by experimenting on the cluster, for practical reasons. Has anybody tried this, or does anyone have a reference to code/documentation stating whether or not YARN nodes will share cluster resources between Spark applications?
It's a big question, but in short, based on the title and the mention of YARN in the text:
You get the resources from YARN that you requested via spark-submit.
A Node has many Executors.
An Executor cannot be shared by two applications at the same time, but an Executor can be relinquished after a Stage has completed if YARN Dynamic Resource Allocation is in effect.
As a Node has many Executors, many Spark apps can run their Tasks concurrently on the same Node/Worker, provided they were granted those resources.
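As a hedged illustration of the first point, an application's resource request is fixed at submit time; the class and jar names below are placeholders:

# Each executor becomes one YARN container; in cluster mode the driver
# runs inside the ApplicationMaster container, which accounts for one more.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.MyApp \
  my-app.jar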

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory: 24576, vCores: 6
I have two Spark streaming jobs to submit, one after another.
First, I tried submitting with the default configuration (spark.dynamicAllocation.enabled=true).
In about 30% of cases, the first job grabbed almost all of the available memory and the second was queued, waiting for resources for ages. (This is a streaming job which used only a small portion of the resources in each batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job.
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (killing the previous jobs and submitting the same jobs again) without dynamic allocation and with the same configuration, I got a totally different result:
5 containers with 59g memory allocation for both jobs, and 71g reserved memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but its memory differs.)
If dynamicAllocation=false, why doesn't YARN create containers matching my exact requirements, i.e. 6 containers (Spark executors) for both jobs? And why do two attempts with the same configuration lead to different results?
If dynamicAllocation=true, how can a Spark job that consumes little memory take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
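As a hedged sketch only: SPARK-12133 added streaming-specific properties (spark.streaming.dynamicAllocation.*) that are not part of the official configuration docs, so treat the exact names and values below as assumptions to verify against your Spark version. Note that the regular spark.dynamicAllocation.enabled must be off for this mechanism to apply.

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.streaming.dynamicAllocation.minExecutors=2 \
  --conf spark.streaming.dynamicAllocation.maxExecutors=6 \
  my_dstream_job.py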
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
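A hedged spark-submit sketch of that sizing, mirroring the configuration from the question with the instance count lowered to 5 (the script name is a placeholder):

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=5 \
  --conf spark.executor.memory=12g \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=8g \
  streaming_job.py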
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
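A minimal Structured Streaming sketch of that idea, using throwaway rate/console sources and sinks purely for illustration. Since each start() call returns immediately, separate driver threads are only needed if your own setup code blocks:

import org.apache.spark.sql.SparkSession

object TwoStreamsOneApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-streams-one-app") // placeholder name
      .getOrCreate()

    // Both queries run inside one YARN application and share its executors.
    val q1 = spark.readStream.format("rate").load()
      .writeStream.format("console").start()
    val q2 = spark.readStream.format("rate").load()
      .writeStream.format("console").start()

    // Block the driver until either query terminates.
    spark.streams.awaitAnyTermination()
  }
}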
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. Firstly, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. Now when you throw dynamic allocation into the mix, things get a little confusing/complex. In your case below:
7 Running Containers with 84g Memory allocation for the first job.
--You got 7 containers because you requested 6 executors: one container for each executor, plus one extra container for the Application Master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
--Since the second job was submitted after some time, YARN allocated the remaining resources...2 containers, one for each executor and the extra one for your Application Master.
Your containers will never match the executors you requested and will always be one more than the number of executors you requested, because you need one container to run your Application Master.
Hope that answers part of your question.

How SPARK_WORKER_CORES setting impacts concurrency in Spark Standalone

I am using a Spark 2.2.0 cluster configured in Standalone mode. The cluster has 2 octa-core machines. This cluster is exclusively for Spark jobs and no other process uses them. I have around 8 Spark Streaming apps which run on this cluster. I explicitly set SPARK_WORKER_CORES (in spark-env.sh) to 8 and allocate one core to each app using the total-executor-cores setting. This config reduces the capability to work in parallel on multiple tasks: if a stage works on a partitioned RDD with 200 partitions, only one task executes at a time. What I wanted Spark to do was to start a separate thread for each job and process them in parallel, but I couldn't find a separate Spark setting to control the number of threads. So, I decided to play around and bloated the number of cores (i.e. SPARK_WORKER_CORES in spark-env.sh) to 1000 on each machine, and then gave 100 cores to each Spark application. I found that Spark started processing 100 partitions in parallel this time, indicating that 100 threads were being used. I am not sure if this is the correct method of impacting the number of threads used by a Spark job.
You have mixed up two things:
Cluster manager properties - SPARK_WORKER_CORES - the total number of cores that a worker can offer. Use it to control the fraction of resources that should be used by Spark in total.
Application properties - --total-executor-cores / spark.cores.max - the number of cores that the application requests from the cluster manager. Use it to control in-app parallelism.
Only the second one is directly responsible for app parallelism, as long as the first one is not the limiting factor.
Also, CORE in Spark is a synonym for thread. If you:
allocate one core to each app using total-executor-cores setting.
then you specifically assign a single data processing thread.
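A hedged sketch of the two knobs side by side (the master host name and the numbers are placeholders):

# Cluster-manager side, in conf/spark-env.sh on each worker machine:
# the total number of threads ("cores") this worker may offer to all apps.
SPARK_WORKER_CORES=8

# Application side, at submit time: the parallelism this one app asks for.
spark-submit \
  --master spark://master-host:7077 \
  --total-executor-cores 4 \
  --executor-memory 2g \
  my_app.py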

What are the benefits of running multiple Spark instances per node (physical machine)?

Is there any advantage to starting more than one spark instance (master or worker) on a particular machine/node?
The Spark standalone documentation doesn't explicitly say anything about starting a cluster or multiple workers on the same node, and it does seem to implicitly assume that one worker equals one node.
Their hardware provisioning page says:
Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
So aside from working with large amounts of memory or testing cluster configuration, is there any benefit to running more than one worker per node?
I think the obvious benefit is to improve the resource utilization of the hardware per box without losing performance. In terms of parallelism, one big executor with multiple cores seems to perform the same as multiple executors with fewer cores.

What is the relationship between workers, worker instances, and executors?

In Spark Standalone mode, there are master and worker nodes.
Here are a few questions:
Do 2 worker instances mean one worker node with 2 worker processes?
Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Is there a flow chart explaining how Spark works at runtime, for example for a word count?
Extending the other great answers, I would like to describe it with a few images.
In Spark Standalone mode, there are master and worker nodes.
If we represent both the master and the workers (each worker can have multiple executors if CPU and memory are available) in one place for standalone mode:
If you are curious about how Spark works with YARN, check this post: Spark on YARN.
1. Do two worker instances mean one worker node with two worker processes?
In general, we call a worker instance a slave, as it is a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:
1 Node = 1 Worker process
2. Does every worker instance hold an executor for the specific application (which manages storage and tasks), or does one worker node hold one executor?
Yes, a worker node can hold multiple executors (processes) if it has sufficient CPU, memory, and storage.
Check the Worker node in the given image.
BTW, the number of executors on a worker node at a given point in time depends entirely on the workload on the cluster and on how many executors the node is capable of running.
3. Is there a flow chart explaining how Spark works at runtime?
If we look at the execution from a Spark perspective over any resource manager, for a program which joins two RDDs, does some reduce operation, and then a filter:
Hope it helps.
I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes.
Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more than one worker per machine. So two worker nodes typically mean two machines, each a Spark worker.
Workers hold many executors, for many applications. One application has executors on many workers.
Your third question is not clear.
I know this is an old question and Sean's answer was excellent. My writeup is about SPARK_WORKER_INSTANCES in MrQuestion's comment. If you use Mesos or YARN as your cluster manager, you are able to run multiple executors on the same machine with one worker, so there is really no need to run multiple workers per machine. However, if you use the standalone cluster manager, it currently still only allows one executor per worker process on each physical machine. Thus, in case you have a super large machine and would like to run multiple executors on it, you have to start more than one worker process. That's what SPARK_WORKER_INSTANCES in spark-env.sh is for. The default value is 1. If you do use this setting, make sure you set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
This standalone cluster manager limitation should go away soon. According to this SPARK-1706, this issue will be fixed and released in Spark 1.4.
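A hedged conf/spark-env.sh sketch of that setup for a hypothetical 64 GB, 16-core machine (the numbers are placeholders, not a recommendation):

# Two worker processes per machine instead of one big one.
SPARK_WORKER_INSTANCES=2
# Without this, each worker would try to advertise all 16 cores.
SPARK_WORKER_CORES=8
# Memory offered by each worker; leave headroom for the OS and daemons.
SPARK_WORKER_MEMORY=28g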
As Lan was saying, the use of multiple worker instances is only relevant in standalone mode. There are two reasons why you would want multiple instances: (1) garbage collector pauses can hurt throughput for large JVMs, and (2) heap sizes of >32 GB can't use CompressedOops.
Read more about how to set up multiple worker instances.
