In the Spark UI, 18 executors were added and 6 executors were removed. When I checked the Executors tab, I saw many dead and excluded executors. Dynamic allocation is currently enabled on EMR.
I've looked up some posts about dead executors, but those were mostly related to job failures. In my case the job itself does not fail, yet I still see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) what would be a good way to improve it?
With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in the active stages. Let's take a look at an example:
The job starts, and the first stage reads from a huge source, which takes some time. Let's say that this source is partitioned and Spark generates 100 tasks to get the data. If your executor has 5 cores, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel).
Let's say that in the next step you are doing a repartition or a sort merge join; with shuffle partitions set to 200, Spark is going to generate 200 tasks. It is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel).
Now the join is done, and in the next stage you have only 50 partitions. To calculate this in parallel you don't need 40 executors; 10 are enough (10 executors x 5 cores = 50 tasks in parallel). If the stage takes long enough, Spark can free some resources, and this is where you see removed executors.
Next we have a stage that involves repartitioning, with the number of partitions equal to 200. With 10 executors you can process only 50 partitions in parallel, so Spark will try to get new executors...
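To make the arithmetic above concrete, here is a tiny back-of-the-envelope helper in plain Python (not a Spark API), assuming 5 cores per executor as in the walkthrough:

import math

def executors_for_full_parallelism(num_tasks, cores_per_executor=5):
    # Executors needed so every task in the stage can run at the same time
    return math.ceil(num_tasks / cores_per_executor)

for tasks in (100, 200, 50):   # stage sizes from the example above
    print(tasks, "tasks ->", executors_for_full_parallelism(tasks), "executors")
# 100 tasks -> 20 executors, 200 tasks -> 40 executors, 50 tasks -> 10 executors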
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that it requires you to set subproperties. Some example subproperties are spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors. Subproperties are required for most cases to use the right number of executors in a cluster for an application, especially when you need multiple applications to run simultaneously. Setting subproperties requires a lot of trial and error to get the numbers right. If they're not right, the capacity might be reserved but never actually used. This leads to wastage of resources or memory errors for other applications.
Here you will find some hints. From my experience it is worth setting maxExecutors if you are going to run a few jobs in parallel on the same cluster, as most of the time it is not worth starving other jobs just to get 100% efficiency from one job.
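As a rough illustration (the numbers are placeholders, not recommendations, and my_job.py is a hypothetical script), those subproperties can be passed to spark-submit like this:

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_job.py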
My code is free from antipatterns, since I don't have any warnings in my Authoring code editor, so I know it is doing PySpark operations that are distributed and scalable.
My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.
How do I make this job run faster?
Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.
There's a limit to the amount your job will increase in speed however, and this is a function of the max number of tasks in your stages. Note: with AQE, your max number of tasks will increase as you increase your executor count, so you will notice the task counts increasing up to a certain point.
For instance, if my data scale is such that I only ever have a maximum of 8 tasks (let's assume AQE is controlling this), then assigning enough executors to run more than 8 tasks at once will waste resources and won't increase your job speed (per the note above, AQE may adjust your task counts as you add executors if it detects that more work can be run in parallel).
The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
This means if your max task counts per stage in your job is 4, you won't benefit from boosting your number of executors. If, however, you observe your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as such:
16 max tasks, 1 core per task. -> 16 cores needed.
2 cores per executor -> 8 executors max.
We could therefore jump this example job up to 8 executors for maximum performance.
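The same calculation as a small helper, in plain Python; the defaults mirror this example's 2 cores per executor and 1 core per task:

import math

def max_useful_executors(max_tasks, cores_per_task=1, cores_per_executor=2):
    # Beyond this executor count, the widest stage cannot keep the extra cores busy
    return math.ceil(max_tasks * cores_per_task / cores_per_executor)

print(max_useful_executors(16))   # -> 8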
For the original question, you would bump the number of executors to 8 for maximum performance, assuming AQE hasn't increased your task counts in the meantime.
When AQE re-examines your job and your new count of Executors, it will detect that more tasks can be run in parallel and will therefore increase your task counts to try to match the infrastructure. However, when it does this, you might end up with tasks that are smaller than you would like.
The way AQE decides how big to make the tasks (and therefore how many tasks it will run with) is based on the setting spark.sql.adaptive.advisoryPartitionSizeInBytes and the total number of cores available in your job. If you have more cores than would be worth parallelizing over (i.e. the shuffle partitions would be too small), then these small partitions will be coalesced into a smaller number, which means you end up with the same wasted-executor problem you would have had without AQE.
AQE will do the best it can with the executor count you've given it, so you may see the job get faster and faster with more executors up to a point. Past that point, adding executors no longer makes the job faster because your partition sizes are too small to be worth splitting into smaller tasks, and you've started wasting executors.
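For reference, assuming a Spark 3.x session named spark (AQE is only available there), the relevant settings look roughly like this; the 128MB target is purely illustrative:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Target size of post-shuffle partitions; larger values mean fewer, bigger tasks
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")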
What happens when the number of Spark tasks is greater than the number of executor cores? How is this scenario handled by Spark?
Is this related to this question?
Anyway, you can check this Cloudera how-to. In the "Tuning Resource Allocation" section, it's explained that a Spark application can request executors by turning on the dynamic allocation property. It's also important to set cluster properties such as num-executors, executor-cores, and executor-memory so that Spark's requests fit into what your resource manager has available.
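As a sketch of what that sizing looks like on the command line (the numbers are placeholders and must fit within what YARN actually has available; my_app.py is a hypothetical script):

spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_app.py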
Yes, this scenario can happen. In that case some of the cores will be idle. Scenarios where this can happen:
You call coalesce or repartition with a number of partitions < number of cores.
You use the default number of spark.sql.shuffle.partitions (=200) and you have more than 200 cores available. This will be an issue for joins, sorting and aggregation. In this case you may want to increase spark.sql.shuffle.partitions.
Note that even if you have enough tasks, some (or most) of them could be empty. This can happen if you have large data skew or you do something like groupBy() or a Window without a partitionBy. In this case the empty partitions finish immediately, leaving most of your cores idle.
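A minimal sketch of raising the shuffle parallelism, assuming a SparkSession named spark and two illustrative DataFrames big_df and other_df:

# e.g. 75 executors x 4 cores = 300 cores, more than the default 200 shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "300")
joined = big_df.join(other_df, "id")   # the shuffle for this join now produces 300 tasks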
I think the question is a little off beam; what you are asking about is unlikely to happen. Why?
With a lot of data you will have many partitions and you may repartition.
Say you have 10,000 partitions which equates to 10,000 tasks.
An executor core serves a partition, effectively a task (a 1:1 mapping), and when finished moves on to the next task, until all tasks in the stage are finished; then the next stage starts (if it is in the plan / DAG).
It's more likely that at most places you will not have a cluster of 10,000 executor cores for your app, but there are sites that have that, that is true.
If you have more cores allocated than needed, they remain idle and unusable for others. But with dynamic resource allocation, executors can be relinquished. I have worked with YARN and Spark Standalone; how this works with Kubernetes I am not sure.
Transformations alter what you need in terms of resources. E.g. an order by may result in fewer partitions and thus may contribute to idleness.
Suppose I have an RDD with 1,000 elements and 10 executors. Right now I parallelize the RDD with 10 partitions and process 100 elements by each executor (assume 1 task per executor).
My difficulty is that some of these partitioned tasks may take much longer than others, so, say, 8 executors will be done quickly while the remaining 2 will be stuck doing something for longer. So the master process will be waiting for those 2 to finish before moving on, and 8 will be idling.
What would be a way to make the idling executors 'take' some work from the busy ones? Unfortunately I can't anticipate ahead of time which ones will end up 'busier' than others, so can't balance the RDD properly ahead of time.
Can I somehow make executors communicate with each other programmatically? I was thinking of sharing a DataFrame with the executors, but based on what I see I cannot manipulate a DataFrame inside an executor?
I am using Spark 2.2.1 and JAVA
Try using Spark dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload.
You can enable the properties below:
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
You can also consider configuring the properties below:
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.minExecutors
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
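Shown here as a PySpark sketch for brevity (from Java you would set the same properties via SparkConf or spark-submit); all values are placeholders, and the external shuffle service must actually be available on the cluster:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-allocation-example")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())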
I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
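If you want to sanity-check this, here is a quick PySpark sketch (assuming a session named spark; on YARN or standalone, defaultParallelism is usually the total core count across executors):

sc = spark.sparkContext
print(sc.defaultParallelism)            # roughly the number of tasks that can run at once
rdd = sc.parallelize(range(1000)).repartition(200)
print(rdd.getNumPartitions())           # 200 tasks, but only ~64 execute concurrently on 64 cores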
You can see more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?
I have two questions around performance tuning in Spark:
I understand one of the key things for controlling parallelism in the spark job is the number of partitions that exist in the RDD that is being processed, and then controlling the executors and cores processing these partitions. Can I assume this to be true:
# of executors * # of executor cores should be <= # of partitions, i.e. one partition is always processed in one core of one executor, so there is no point having more executors * cores than the number of partitions.
I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here's my second question, purely from a data-processing point of view: what is the difference between the two? For example, if I have a 10-node cluster, what would be the difference between these two jobs (assuming there's ample memory per node to process everything):
5 executors * 2 executor cores
2 executors * 5 executor cores
Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
Most of the time, using larger executors (more memory, more cores) is better. First: a larger executor with more memory can easily support broadcast joins and do away with shuffles. Second: since tasks are not created equal, statistically larger executors have a better chance of surviving OOM issues.
The only problem with large executors is GC pauses. G1GC helps.
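For example (illustrative values; my_job.py is a placeholder script), G1GC can be enabled for executors like this:

spark-submit \
  --executor-memory 16g \
  --executor-cores 5 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  my_job.py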
In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.