Spark job in Dataproc dynamic vs static allocation - apache-spark

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory:24576, vCores:6
I have two Spark Streaming jobs to submit, one after the other.
First, I tried to submit with the default configuration, spark.dynamicAllocation.enabled=true.
In about 30% of cases, the first job grabbed almost all available memory and the second was queued, waiting for resources for ages. (This is a streaming job that used only a small portion of its resources each batch.)
My second attempt was to turn off dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
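For reference, the same properties could also be set programmatically before the context is created. A minimal Scala sketch (the app name is a placeholder; spark.driver.memory is omitted because it normally has to be supplied at submit time, since the driver JVM is already running by the time this code executes):

import org.apache.spark.sql.SparkSession

// Sketch of the static-allocation settings above, applied via the builder.
val spark = SparkSession.builder()
  .appName("streaming-job-1")                          // placeholder name
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.memory", "12g")
  .config("spark.executor.cores", "3")
  .config("spark.executor.instances", "6")
  .getOrCreate()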
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In the Spark UI there are 6 executors and a driver for the first job, and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs again) without dynamic allocation and with the same configurations, I got a totally different result:
5 containers with 59g memory allocation for both jobs, and 71g reserved memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1) If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but it has a different amount of memory.)
2) If dynamicAllocation=false, why doesn't YARN create containers to my exact requirements: 6 containers (Spark executors) for both jobs? Why do two attempts with the same configuration lead to different results?
3) If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all the YARN resources?
Thanks

Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
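If you are on DStreams and want to try it, it is controlled by its own set of (largely undocumented) properties. A minimal sketch, assuming the property names introduced by that JIRA match your Spark version; as far as I can tell, core dynamic allocation has to be off for the streaming variant to kick in:

import org.apache.spark.SparkConf

// Sketch: DStream-specific dynamic allocation (SPARK-12133).
// Verify these property names against your Spark version.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")            // core dynamic allocation off
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "1") // illustrative bounds
  .set("spark.streaming.dynamicAllocation.maxExecutors", "6")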
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g each given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master plus 6 executors is 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
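In other words, keep everything else from your second attempt and just lower the instance count for each submission; a sketch:

import org.apache.spark.SparkConf

// Sketch: 1 app master + 5 executors = 6 containers per application,
// which leaves headroom for both applications to schedule all their executors.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "5")   // was 6
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "3")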
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
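A minimal sketch of that approach with Structured Streaming; the sources, sinks, and paths below are made-up placeholders (and the Kafka source assumes the spark-sql-kafka connector is on the classpath). The point is that both queries share one SparkSession, and therefore one pool of executors:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("both-streams").getOrCreate()

// First streaming query (placeholder source/sink).
val q1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic1")
  .load()
  .writeStream.format("parquet")
  .option("path", "gs://bucket/out1")
  .option("checkpointLocation", "gs://bucket/chk1")
  .start()

// Second streaming query, running concurrently in the same application.
val q2 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic2")
  .load()
  .writeStream.format("parquet")
  .option("path", "gs://bucket/out2")
  .option("checkpointLocation", "gs://bucket/chk2")
  .start()

// Block the driver until one of the queries stops or fails.
spark.streams.awaitAnyTermination()

Note that start() returns immediately, so separate driver threads are only needed if you wrap each query in its own blocking call such as awaitTermination().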
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

I have noticed similar behavior in my experience as well, and here is what I observed. First, the resources YARN allocates depend on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between them. Once you throw dynamic allocation into the mix, things get a little more confusing/complex. Now, in your case below:
7 Running Containers with 84g Memory allocation for the first job.
-- You got 7 containers because you requested 6 executors: one container for each executor, and the extra container for the application master.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
-- Since the second job was submitted some time later, YARN allocated the remaining resources: 2 containers, one for each executor, plus the extra one for your application master.
The number of containers will never match the number of executors you requested; it will always be one more, because one container is needed to run your application master.
Hope that answers part of your question.

Related

Can two executors / drivers from different Spark applications run on same node in cluster mode?

I read an article on Medium which claims that the number of executors + 1 (for the driver) should be a multiple of 3 in order to efficiently utilize the cores of a machine (16 cores in this case, i.e. 5 per executor, with 1 reserved for the OS and the node manager).
I am unable to validate this statement by experimenting on the cluster for practical reasons. Did anybody try this, or have a reference to code/documentation stating whether YARN nodes will or will not share cluster resources with another Spark application?
It's a big question, but in short, going by the title and the mention of YARN in the text:
You get the resources from YARN that you requested via Spark (spark-submit).
A Node has many Executors.
An Executor cannot be shared by two applications at the same time, but it can be relinquished, if YARN Dynamic Resource Allocation is in effect, once a Stage has completed.
Since a Node has many Executors, many Spark apps can run their Tasks concurrently on the same Node / Worker, provided they were granted those resources.

How can I configure YARN to allocate a minimum number of containers?

I am running a number of Spark jobs in parallel on a YARN cluster. I am finding that YARN is starting up a number of these jobs in parallel, but only allocating one container for the driver and no executors. This means that these Spark jobs are effectively sitting idle waiting for an executor to join, when this processing power could be better utilised by allocating executors to other jobs.
I would like to configure YARN to allocate a minimum of two containers (one driver + one executor) to a job, and if that's not available to keep it in the queue. How can I configure YARN in this way?
(I am running on an AWS EMR cluster with nearly all of the default settings.)
If your YARN uses FairScheduler, you can limit the number of applications running concurrently, and what percentage of a pool can be used by AMs (leaving the rest to the executors):
maxRunningApps: limit the number of apps from the queue to run at once.
maxAMShare: limit the fraction of the queue's fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature, and the amShare will not be checked. The default value is 0.5f.

apache spark executors and data locality

The Spark documentation says:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
If I understand this right, with static allocation the executors are acquired by the Spark application on all the nodes in the cluster when the SparkContext is created (in cluster mode). I have a couple of questions:
If executors are acquired on all nodes and will stay allocated to this application for the duration of the whole application, isn't there a chance that a lot of nodes remain idle?
What is the advantage of acquiring resources when the SparkContext is created and not in the DAGScheduler? I mean, the application could be arbitrarily long and it is just holding the resources.
So when the DAGScheduler tries to get the preferred locations and the executors on those nodes are running tasks, would it relinquish the executors on other nodes?
I have checked a related question, Does Spark on yarn deal with Data locality while launching executors, but I'm not sure there is a conclusive answer.
If executors are acquired on all nodes and will stay allocated to this application for the duration of the whole application, isn't there a chance that a lot of nodes remain idle?
Yes, there is a chance. If you have data skew, this will happen. The challenge is to tune the executors and executor cores so that you get maximum utilization. Spark also provides dynamic resource allocation, which removes idle executors.
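The relevant knobs, as a sketch (the bounds and timeout are illustrative, not recommendations; on YARN the external shuffle service is typically required for this to work):

import org.apache.spark.SparkConf

// Sketch: dynamic allocation releases executors that have been idle too long.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")              // typically required on YARN
  .set("spark.dynamicAllocation.minExecutors", "1")          // illustrative
  .set("spark.dynamicAllocation.maxExecutors", "10")         // illustrative
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // 60s is the default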
What is the advantage of acquiring resources when the SparkContext is created and not in the DAGScheduler? I mean, the application could be arbitrarily long and it is just holding the resources.
Spark tries to keep data in memory while doing transformations, unlike the MapReduce model, which writes to disk after every map operation. Spark can keep data in memory only if it can ensure the code is executed on the same machine. This is the reason for allocating resources beforehand.
So when the DAGScheduler tries to get the preferred locations and the executors on those nodes are running tasks, would it relinquish the executors on other nodes?
Spark can't start a task on an executor unless the executor is free. The Spark application master negotiates with YARN to get the preferred locations; it may or may not get them. If it doesn't, it will start the task on a different executor.
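At the task level, this fallback is governed by the locality wait: the scheduler briefly holds a task hoping for a node-local slot, then degrades to rack-local and finally to any executor. A sketch of the knob involved (3s is the usual default):

import org.apache.spark.SparkConf

// Sketch: how long the scheduler waits for a more data-local executor
// before launching the task at the next (less local) locality level.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")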

Spark on Mesos is much slower than local

I'm running a Spark Streaming process on a host with 16 CPUs and 64 GB of RAM, using Mesos.
When I run it with Mesos as the cluster manager (by setting --master mesos://leader.mesos:5050), it runs much slower than in local mode (--master local[4]).
I can't find the reason for that and I have no clue. One thing I've noticed is that there is one specific task that takes significantly more time on Mesos than locally.
The weird thing (maybe that should be the question's title) is that the task itself takes 6s while its stage (it has only one stage) takes less than a second. See the attached pictures (Mesos (1) and (2)). How come? Isn't a job equal to the sum of its parts?
[Spark UI screenshots: Local; Mesos (1) and (2)]
Another note: I did manage to run this exact same Spark Streaming process on another Mesos cluster, and it runs in a sensible amount of time, pretty much like in the local mode described above. The only differences I can think of are that this cluster has more than one host, and that Spark is running with 2 executors rather than 1 (I couldn't find a way to run more than 1 executor on the same host on Mesos). Could this be the reason?
Any clues would be much appreciated.
Spark can run over Mesos in two modes: coarse-grained (default) and fine-grained (see documentation).
In coarse-grained mode, Spark launches exactly one executor on each machine it is assigned by Mesos and runs its own smaller tasks inside that single Mesos task. This has the benefit of lower startup overhead (in your case you don't want to change this mode).
Could you be more specific about your streaming job? Is it CPU-, disk-, or network-bound? You can easily compare performance if you run some of the Spark examples.
If your task is CPU intensive you might consider setting spark.mesos.extra.cores. By default Spark tries to acquire all cores that are being offered by Mesos. So, if there's no other task running on that cluster it shouldn't be a problem.
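For example, a sketch (the value is arbitrary; in coarse-grained mode spark.mesos.extra.cores makes each executor advertise extra cores so the driver schedules more tasks on it, it does not allocate more physical CPUs):

import org.apache.spark.SparkConf

// Sketch: coarse-grained Mesos with extra advertised cores per executor.
val conf = new SparkConf()
  .setMaster("mesos://leader.mesos:5050")   // master URL from the question above
  .set("spark.mesos.extra.cores", "4")      // arbitrary example value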

Is it possible to run multiple Spark applications on a mesos cluster?

I have a Mesos cluster with 1 master and 3 slaves (2 cores and 4GB of RAM each) that already has a Spark application up and running. I wanted to run another application on the same cluster, as the CPU and memory utilization isn't high. Nevertheless, when I try to run the new application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new process is not getting any CPU as the old one occupies all 6.
I have tried enabling dynamic allocation, making the Spark app fine-grained, and assigning numerous combinations of executor cores and numbers of executors. What am I missing here? Is it possible to run a Mesos cluster with multiple Spark frameworks at all?
You can try setting spark.cores.max to limit the number of CPUs used by each Spark driver, which will free up some resources.
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling
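A sketch of what that might look like for the already-running application; the cap of 4 is arbitrary, the point is just to leave some of the cluster's 6 cores free for a second framework:

import org.apache.spark.SparkConf

// Sketch: cap this application at 4 of the cluster's 6 cores so Mesos
// still has offers left for a second Spark framework.
val conf = new SparkConf()
  .set("spark.cores.max", "4")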
