Apache Spark executors and data locality

The Spark literature says:
"Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads."
If I understand this right, with static allocation the executors are acquired by the Spark application on all nodes in the cluster when the SparkContext is created (in cluster mode). I have a couple of questions:
1. If executors are acquired on all nodes and stay allocated to this application for the duration of the whole application, isn't there a chance that a lot of nodes remain idle?
2. What is the advantage of acquiring resources when the SparkContext is created and not in the DAGScheduler? I mean, the application could be arbitrarily long and it is just holding the resources.
3. So when the DAGScheduler tries to get the preferred locations and the executors on those nodes are running the tasks, would it relinquish the executors on other nodes?
I have checked a related question, "Does Spark on yarn deal with Data locality while launching executors", but I'm not sure it has a conclusive answer.

If executors are acquired on all nodes and will stay allocated to this application during the duration of the whole application, isn't there a chance a lot of nodes remain idle?
Yes, there is a chance. This happens if you have data skew. The challenge is to tune the number of executors and the cores per executor so that you get maximum utilization. Spark also provides dynamic resource allocation, which removes idle executors.
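As a rough illustration of the dynamic allocation mentioned above, the sketch below builds the relevant Spark properties and turns them into spark-submit arguments. The helper names (dynamic_allocation_conf, to_submit_args) and the executor bounds are made up for illustration. The property keys are standard Spark settings, but whether you need the external shuffle service depends on your cluster and Spark version.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout_s=60):
    """Build Spark properties that turn on dynamic allocation.

    The bounds and timeout are illustrative values, not recommendations.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors that stay idle longer than this are released.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
        # Assumption: the cluster runs the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Hypothetical bounds: between 1 and 8 executors.
    print(" ".join(to_submit_args(dynamic_allocation_conf(1, 8))))
```

The printed arguments can be appended to a spark-submit command line; tune the bounds to your own workload.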
What is the advantage of acquiring resources when Spark context is created and not in the DAGScheduler? I mean the application could be arbitrarily long and it is just holding the resources.
Spark tries to keep data in memory across transformations, unlike the MapReduce model, which writes to disk after every map operation. Spark can keep the data in memory only if it can ensure that the code runs on the same machine. That is the reason for allocating resources beforehand.
So when the DAGScheduler tries to get the preferred locations and the executors in those nodes are running the tasks, would it relinquish the executors on other nodes?
Spark can't start a task on an executor unless the executor is free. The Spark application master negotiates with YARN to get the preferred location. It may or may not get it. If it doesn't, it starts the task on a different executor.

Related

Spark job in Dataproc dynamic vs static allocation

I have a Dataproc cluster:
master - 6 cores | 32g
worker{0-7} - 6 cores | 32g
Maximum allocation: memory:24576, vCores:6
I have two Spark Streaming jobs to submit, one after another.
At first, I tried submitting with the default configuration, spark.dynamicAllocation.enabled=true.
In 30% of cases, I saw that the first job grabbed almost all available memory and the second was queued and waited for resources for ages. (This is a streaming job that takes a small portion of resources in every batch.)
My second attempt was to turn off dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
Surprisingly, in the YARN UI I saw:
7 Running Containers with 84g Memory allocation for the first job.
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
In the Spark UI there are 6 executors and a driver for the first job and 2 executors and a driver for the second job.
After retrying (deleting the previous jobs and submitting the same jobs) without dynamic allocation and with the same configuration, I got a totally different result:
5 containers with 59g Memory allocation for both jobs and 71g Reserved Memory for the second job. In the Spark UI I see 4 executors and a driver in both cases.
I have a couple of questions:
1. If dynamicAllocation=false, why is the number of YARN containers different from the number of executors? (At first I thought the additional YARN container was the driver, but it differs in memory.)
2. If dynamicAllocation=false, why doesn't YARN create containers according to my exact requirements: 6 containers (Spark executors) for both jobs? Why do two different attempts with the same configuration lead to different results?
3. If dynamicAllocation=true, how is it possible that a Spark job with low memory consumption takes control of all YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor is idle for 1 minute (configurable, of course). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
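To make the mismatch concrete, here is a small sketch of the idle-timeout reasoning above. It is a deliberately simplified model, not Spark's actual scheduler logic: it assumes an executor is busy for the whole processing time of a micro-batch and idle until the next batch starts. The function names and the example numbers are hypothetical.

```python
def idle_fraction(batch_interval_s, processing_time_s):
    """Fraction of each micro-batch interval that an executor sits idle."""
    return 1 - processing_time_s / batch_interval_s


def is_released_between_batches(batch_interval_s, processing_time_s,
                                idle_timeout_s=60):
    """True if the idle gap between batches outlasts the idle timeout.

    Simplified model: the executor is released only when it has been idle
    for longer than idle_timeout_s (1 minute, as described in the answer).
    """
    idle_gap_s = batch_interval_s - processing_time_s
    return idle_gap_s > idle_timeout_s


if __name__ == "__main__":
    # Hypothetical streaming job: 10 s micro-batches, 2 s of work per batch.
    print(idle_fraction(10, 2))
    print(is_released_between_batches(10, 2))
```

In this model, whenever the gap between batches is shorter than the idle timeout, the executor is never released even though it is idle most of the time, which is the behavior described above.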
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
1 & 2) @Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
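The slot arithmetic above can be written out as a short sketch. It follows the answer's simplification: every container, including the app master, is counted as one 12g slot, and the extra memory overhead that YARN adds on top of spark.executor.memory is ignored, so real clusters may fit fewer containers. The function names are hypothetical.

```python
def yarn_slots(workers, yarn_memory_per_node_gb, container_gb):
    """Number of containers of a given size that fit across all workers."""
    return workers * (yarn_memory_per_node_gb // container_gb)


def containers_per_app(executor_instances):
    """One container per executor plus one for the application master."""
    return executor_instances + 1


def apps_fit(num_apps, executor_instances, slots):
    """True if all applications can hold all their containers at once."""
    return num_apps * containers_per_app(executor_instances) <= slots


if __name__ == "__main__":
    slots = yarn_slots(workers=8, yarn_memory_per_node_gb=24, container_gb=12)
    print(slots)                      # 16
    print(containers_per_app(6))      # 7
    print(apps_fit(2, 6, slots))      # True
```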
We configure the app master to have less memory, which is why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. First, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. When you throw dynamic allocation into the mix, things get a little more confusing. Now, in your case:
7 Running Containers with 84g Memory allocation for the first job.
--You got 7 containers because you requested 6 executors, one container for each executor and the extra one container is for the application Master
3 Running Containers with 36g Memory allocation and 72g Reserved Memory for the second job
--Since the second job was submitted after some time, Yarn allocated the remaining resources...2 containers, one for each executor and the extra one for your application master.
Your containers will never match the executors you requested and will always be one more than the number of executors you requested because you need one container to run your application master.
Hope that answers part of your question.

Doubts related to Spark resource usage

I am running a Spark Streaming application and caching the RDDs for history look-back. My batch duration is one minute and the average processing time is 14 seconds, so the executors are not computing for the whole batch duration. Are the executors still held because I am caching the RDDs in memory? And if they are held, should this be considered a waste of resources?
It depends on what you want to achieve.
In Spark 2.0, dynamic allocation is available for Spark Streaming without the earlier bugs.
What was the problem? With a heavy workload, if you didn't keep at least one executor for receiving data, you could lose data. This is solved in Spark 2.0, and releasing resources now works.
What is the advantage of keeping your data in cache when a huge amount of data comes in? You can process your data without a shuffle, which can improve your response time.
But if your batch runs once a minute and takes just 14 seconds on average to process, I suggest releasing your cached data and your workers to free up space for other tasks.
If you do not have enough resources for your tasks, the tasks are queued and handled as soon as resources become available.
What is the risk? If you release the workers, it could be hard to get the resources back if you don't have preemption enabled in YARN. This can be a waste of resources, depending on your cluster.
What I would do is create some queues to handle your jobs. Set up a high-priority queue and put your streaming job there, put the other jobs in other queues, turn on dynamic allocation, and release the cache. If your application needs more resources, YARN will handle it.

Limit Spark application from grabbing all the resources in a YARN cluster

We (an engineering team) are running an EMR cluster with YARN and Spark. What typically happens is that when one user submits a heavy, memory-intensive job, it grabs all the available YARN memory, and all subsequent users' jobs have to wait for that memory to clear. (I know that autoscaling will solve this problem to a certain extent and we are looking into that, but we would like to avoid a single user occupying all the memory even when the cluster is autoscaled to its full limits.)
Is there a way to configure YARN such that any application (Spark or otherwise) may not occupy more than, say 75% of available memory?
Thanks
According to the documentation, you can manage the amount of memory allocated to an executor using the parameter: spark.executor.memory

Is it possible to run multiple Spark applications on a mesos cluster?

I have a Mesos cluster with 1 Master and 3 slaves (with 2 cores and 4GB RAM each) that has a Spark application already up and running. I wanted to run another application on the same cluster, as the CPU and Memory utilization isn't high. Regardless, when I try to run the new Application, I get the error:
16/02/25 13:40:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I guess the new process is not getting any CPU as the old one occupies all 6.
I have tried enabling dynamic allocation, making the Spark app fine-grained, and assigning numerous combinations of executor cores and number of executors. What am I missing here? Is it possible to run a Mesos cluster with multiple Spark frameworks at all?
You can try setting spark.cores.max to limit the number of CPUs used by each Spark driver, which will free up some resources.
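As a minimal sketch of how one might pick a value for spark.cores.max, the snippet below splits the cluster's cores evenly among the applications that should run side by side. The even split and the helper names are assumptions for illustration; the cluster size (3 slaves with 2 cores each) comes from the question.

```python
def cores_max_per_app(total_cores, num_apps):
    """Evenly divide cluster cores among concurrently running applications."""
    if num_apps < 1:
        raise ValueError("num_apps must be at least 1")
    if total_cores < num_apps:
        raise ValueError("not enough cores for one core per application")
    return total_cores // num_apps


def cores_max_conf(total_cores, num_apps):
    """Render the property as a 'key=value' string for spark-submit --conf."""
    return f"spark.cores.max={cores_max_per_app(total_cores, num_apps)}"


if __name__ == "__main__":
    # 3 slaves x 2 cores, shared by 2 applications.
    print(cores_max_conf(3 * 2, 2))  # spark.cores.max=3
```

Each application would then be submitted with the resulting value, leaving the remaining cores free for the other one.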
Docs: https://spark.apache.org/docs/latest/configuration.html#scheduling

What is the relationship between workers, worker instances, and executors?

In Spark Standalone mode, there are master and worker nodes.
Here are a few questions:
1. Do 2 worker instances mean one worker node with 2 worker processes?
2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
3. Is there a flow chart explaining how Spark works at runtime, for example for word count?
Extending the other great answers, I would like to describe this with a few images.
In Spark Standalone mode, there are a master node and worker nodes.
The master and the workers for standalone mode can be represented in one place (each worker can have multiple executors if CPU and memory are available).
If you are curious about how Spark works with YARN, check the post Spark on YARN.
1. Do two worker instances mean one worker node with two worker processes?
In general, we call a worker instance a slave, as it is a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:
1 Node = 1 Worker process
2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Yes, a worker node can hold multiple executors (processes) if it has sufficient CPU, memory, and storage.
Check the worker node in the given image.
BTW, the number of executors on a worker node at a given point in time depends entirely on the workload on the cluster and on how many executors the node is capable of running.
3. Is there a flow chart explaining how Spark works at runtime?
Consider the execution from the Spark perspective, over any resource manager, for a program that joins two RDDs, does some reduce operation, and then filters.
HIH
I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes.
Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more than one worker per machine. So two worker nodes typically means two machines, each a Spark worker.
Workers hold many executors, for many applications. One application has executors on many workers.
Your third question is not clear.
I know this is an old question and Sean's answer was excellent. My writeup is about the SPARK_WORKER_INSTANCES setting in MrQuestion's comment. If you use Mesos or YARN as your cluster manager, you can run multiple executors on the same machine with one worker, so there is really no need to run multiple workers per machine. However, if you use the standalone cluster manager, it currently still allows only one executor per worker process on each physical machine. So if you have a very large machine and would like to run multiple executors on it, you have to start more than one worker process. That's what SPARK_WORKER_INSTANCES in spark-env.sh is for. The default value is 1. If you do use this setting, make sure you set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
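A minimal sketch of that advice follows, assuming you simply want to split a machine's cores and memory evenly among its worker processes. The function name, the even split, and the machine size in the example are hypothetical; SPARK_WORKER_INSTANCES, SPARK_WORKER_CORES, and SPARK_WORKER_MEMORY are the spark-env.sh variables for standalone mode.

```python
def worker_env_lines(machine_cores, machine_memory_gb, worker_instances):
    """Generate spark-env.sh lines that split one machine among N workers.

    Cores and memory are divided evenly so that the workers together do not
    claim more than the machine has.
    """
    if worker_instances < 1:
        raise ValueError("worker_instances must be at least 1")
    cores_per_worker = machine_cores // worker_instances
    memory_per_worker_gb = machine_memory_gb // worker_instances
    if cores_per_worker < 1 or memory_per_worker_gb < 1:
        raise ValueError("machine is too small for that many workers")
    return [
        f"SPARK_WORKER_INSTANCES={worker_instances}",
        f"SPARK_WORKER_CORES={cores_per_worker}",
        f"SPARK_WORKER_MEMORY={memory_per_worker_gb}g",
    ]


if __name__ == "__main__":
    # Hypothetical large machine: 32 cores, 128 GB, split into 4 workers.
    print("\n".join(worker_env_lines(32, 128, 4)))
```

You may want to leave some cores and memory for the operating system rather than dividing the whole machine, which this sketch does not do.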
This standalone cluster manager limitation should go away soon. According to SPARK-1706, this issue will be fixed and released in Spark 1.4.
As Lan was saying, the use of multiple worker instances is only relevant in standalone mode. There are two reasons why you might want multiple instances: (1) garbage collector pauses can hurt throughput for large JVMs, and (2) heap sizes of more than 32 GB can't use CompressedOops.
Read more about how to set up multiple worker instances.

Resources