how spark dynamic allocation clears queued task - apache-spark

I can open new vm on the fly based on that I am asking this question.
I am using spark dynamic allocation when I used
spark.dynamicAllocation.minExecutors=10 on sudden burst of data spark opens new executors very slowly results in long queues
when I changed spark.dynamicAllocation.minExecutors=200 to larger number on sudden burst it opens new executors very fast and the queue can clear up.
My question is do we have to set this to high value for such situation.

Vipin,
When you set the Dynamic Allocation in spark, as I can see you enable it and set the min of executors. But, when you need 200 executor to be faster the allocation has one configuration called spark.dynamicAllocation.schedulerBacklogTimeout this for default has 1s of timeout.
This mean that after 1s if you task didn't finished a task it will allocate more executors.
According to the documentation in spark, that says:
Spark requests executors in rounds. The actual request is triggered when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter if the queue of pending tasks persists. Additionally, the number of executors requested in each round increases exponentially from the previous round. For instance, an application will add 1 executor in the first round, and then 2, 4, 8 and so on executors in the subsequent rounds.
So for every seccond Spark allocate 2^n for n secconds of delay. To reach the 200 executor you need to wait at least 8 secconds to request the executors to Yarn. And few more secconds to solve that.
Maybe if you raise the number of cores it will help you. But if you are using the full cores of each node... Well thre is no solution.

Related

Spark UI Executor

In Spark UI, there are 18 executors are added and 6 executors are removed. When I checked the executor tabs, I've seen many dead and excluded executors. Currently, dynamic allocation is used in EMR.
I've looked up some postings about dead executors but these mostly related with job failure. For my case, it seems that the job itself is not failed but can see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) then what would be good way to improve the performance?
With dynamic alocation enabled spark is trying to adjust number of executors to number of tasks in active stages. Lets take a look at this example:
Job started, first stage is read from huge source which is taking some time. Lets say that this source is partitioned and Spark generated 100 task to get the data. If your executor has 5 cores, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel)
Lets say that on next step you are doing repartitioning or sor merge join, with shuffle partitions set to 200 spark is going to generated 200 tasks. He is smart enough to figure out that he has currently only 100 cores avilable so if new resources are avilable he will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel)
Now the join is done, in next stage you have only 50 partitions, to calculate this in parallel you dont need 40 executors, 10 is ok (10 executors x 5 cores = 50 tasks in paralell). Right now if process is taking enough of time Spark can free some resources and you are going to see deleted executors.
Now we have next stage which involves repartitioning. Number of partitions equals to 200. Withs 10 executors you can process in paralell only 50 partitions. Spark will try to get new executors...
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
Here you will find some hints, from my experience it is worth to set maxExecutors if you are going to run few jobs in parallel in the same cluster as most of the time it is not worth to starve other jobs just to get 100% efficiency from one job

How do I make my Spark job run faster using executors?

I know my code is free from antipatterns since I don't have any warnings in my Authoring code editor, so I know my code is doing PySpark operations that are distributed and scalable.
My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.
How do I make this job run faster?
Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.
There's a limit to the amount your job will increase in speed however, and this is a function of the max number of tasks in your stages. Note: with AQE, your max number of tasks will increase as you increase your executor count, so you will notice the task counts increasing up to a certain point.
For instance, if my data scale is such that I only ever have a maximum of 8 tasks (let's assume AQE is controlling this), assigning an executor count to run more than 8 tasks will waste resources and won't increase your job speed (see above note that AQE may adjust your task counts as you add executors since it's detecting that more work can be run in parallel).
The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
This means if your max task counts per stage in your job is 4, you won't benefit from boosting your number of executors. If, however, you observe your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as such:
16 max tasks, 1 core per task. -> 16 cores needed.
2 cores per executor -> 8 executors max.
We could therefore jump this example job up to 8 executors for maximum performance.
For the original question, you would bump the number of executors to 8 for maximum performance, assuming AQE hasn't increased your task counts following.
When AQE re-examines your job and your new count of Executors, it will detect that more tasks can be run in parallel and will therefore increase your task counts to try to match the infrastructure. However, when it does this, you might end up with tasks that are smaller than you would like.
The way AQE decides how big to make the tasks (and therefore how many tasks it will run with) is based on the setting spark.sql.adaptive.advisoryPartitionSizeInBytes and the total number of cores available in your job. If you have more cores than would be worth parallelizing (i.e. the shuffle partitions are too small), then these small partitions will be coalesced into a fewer count which will mean you then have the same wasted executor problem without AQE.
AQE will do the best it can with the executor counts you've given it, so you may see it get faster and faster with more executors up to a point. At the point more executors doesn't mean a faster job, this is because your partition sizes are too small to be worth splitting into smaller tasks, and you've started wasting executors

Why is there a delay in the launch of spark executors?

While trying to optimise a Spark job, I am having trouble understanding a delay of 3-4s in the launch of the second and of 6-7s third and fourth executors.
This is what I'm working with:
Spark 2.2
Two worker nodes having 8 cpu cores each. (master node separate)
Executors are configured to use 3 cores each.
Following is the screenshot of the jobs tab in Spark UI.
The job is divided into three stages. As seen, second, third and fourth executors are added only during the second stage.
Following is the snap of the Stage 0.
And following the snap of the Stage 1.
As seen in the image above, executor 2 (on the same worker as the first) takes around 3s to launch. Executors 3 and 4 (on the second worker) taken even longer, approximately 6s.
I tried playing around with the spark.locality.wait variable : values of 0s, 1s, 1ms. But there does not seem to be any change in the launch times of the executors.
Is there some other reason for this delay? Where else can I look to understand this better?
You might be interested to check Spark's executor request policy, and review the settings spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout for your application.
A Spark application with dynamic allocation enabled requests
additional executors when it has pending tasks waiting to be
scheduled. ...
Spark requests executors in rounds. The actual request is triggered
when there have been pending tasks for
spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then
triggered again every
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds
thereafter if the queue of pending tasks persists. Additionally, the
number of executors requested in each round increases exponentially
from the previous round. For instance, an application will add 1
executor in the first round, and then 2, 4, 8 and so on executors in
the subsequent rounds.
Another potential source for a delay could be spark.locality.wait. Since in Stage 1 you have quite a bit of tasks with sub-optimal locality levels (Rack local: 59), and the default for spark.locality.wait is 3 seconds, it could actually be the primary reason for the delays that you're seeing.
It takes time for the yarn to create the executors, Nothing can be done about this overhead. If you want to optimize you can set up a Spark server and then create requests for the server, And this saves the warm up time.

Spark Streaming memory usage doubts

I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the rest endpoint for Spark AllExecutors
and just summing up the metrics totalDuration * spark.executor.memory for every executor and then emitting the final sum as the memory usage.
But this is coming out to be very small for application which ran whole day , is something wrong with the logic.Also I am using dynamic allocation and executorIdleTimeout is 5 seconds.
Also I am also assuming that if some executor was removed for due to idle timeout and then was allocated to some other task then its totalDuration will be increased by the amount of time took by the executor to execute this new task.
But what is unusual is that as the duration for execution of increases the memory usage metrics is decreasing.

Max Executor failures in Spark dynamic allocation

I am using dynamic allocation feature of spark to run my spark job. It allocates around 50-100 executors. For some reason few executors are lost resulting in shutting down the job. Log shows that this happened due to max executor failures reached. It is set to 3 by default. Hence when 3 executors are lost the job gets killed even if other 40-50 executors are running.
I know that I can change the max executor failure limit but this seems like a workaround. Is there something else that I can try. All suggestions are welcome.

Resources