Spark executor & tasks concurrency - apache-spark

In Spark, an executor may run several tasks concurrently, perhaps 2, 5, or 6.
How does Spark figure out (or calculate) how many tasks can run concurrently in the same executor?
An executor may be executing one task while another task is placed to run concurrently on the same executor. What are the criteria for that?
An executor has a fixed number of cores and a fixed amount of memory. Since we do not specify memory and core requirements per task in Spark, how do we calculate how many tasks can run concurrently in an executor?

The number of tasks that run in parallel within an executor equals the number of cores configured for that executor (spark.executor.cores), assuming the default of one core per task (spark.task.cpus = 1).
You can always change this number through configuration.
The total number of tasks an executor runs overall (in parallel or sequentially) depends on the total number of tasks created (determined by the number of splits) and on the number of executors.
All tasks running in one executor share the memory configured for that executor. Internally, the executor simply launches as many task threads as it has cores.
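To make the "as many threads as cores" idea concrete, here is a minimal sketch in plain Python (not Spark code). It models one executor as a thread pool whose size is the number of task slots. The function name run_tasks and the sleep standing in for real work are assumptions made for illustration only.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_tasks(num_tasks, executor_cores, task_cpus=1):
    """Run dummy tasks on a pool sized like one executor's task slots.

    Returns the highest number of tasks observed running at the same time.
    """
    slots = executor_cores // task_cpus  # concurrent task slots in this executor
    lock = threading.Lock()
    running = 0
    peak = 0

    def task(_):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for real work on one partition
        with lock:
            running -= 1

    # The pool never runs more than `slots` tasks at once; the rest wait in a queue.
    with ThreadPoolExecutor(max_workers=slots) as pool:
        list(pool.map(task, range(num_tasks)))
    return peak


peak = run_tasks(num_tasks=20, executor_cores=4)
print(peak)  # never exceeds 4, however many tasks are queued
```

All 20 tasks complete, but at most 4 are in flight at any moment, which is the behaviour described above for an executor with 4 cores.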

One likely issue is skewed partitions in the RDD you are processing. If 2-6 partitions hold a lot of data, then, to reduce data shuffling over the network, Spark will try to have executors process the data that resides locally on their own nodes. So you'll see those 2-6 executors working for a long time, while the others finish their data in a few milliseconds.
You can find more about this in this Stack Overflow question.

Related

Spark UI Executor

In the Spark UI, 18 executors have been added and 6 executors removed. When I checked the Executors tab, I saw many dead and excluded executors. Dynamic allocation is currently used on EMR.
I've looked up some posts about dead executors, but these are mostly related to job failures. In my case, the job itself has not failed, but I can still see dead and excluded executors.
What are these "dead" and "excluded" executors?
How do they affect the performance of the current Spark cluster configuration?
(If they affect performance) what would be a good way to improve it?
With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in the active stages. Let's take a look at this example:
The job starts, and the first stage reads from a huge source, which takes some time. Let's say that this source is partitioned and Spark generates 100 tasks to get the data. If your executors have 5 cores each, Spark will try to request 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel).
Let's say that in the next step you are doing a repartition or a sort-merge join. With shuffle partitions set to 200, Spark is going to generate 200 tasks. It is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to request another 20 executors (40 executors x 5 cores = 200 tasks in parallel).
Now the join is done, and in the next stage you have only 50 partitions. To process them in parallel you don't need 40 executors; 10 is enough (10 executors x 5 cores = 50 tasks in parallel). At this point, if the stage takes long enough, Spark can free some resources, and you are going to see removed executors.
Now we have the next stage, which involves repartitioning. The number of partitions equals 200. With 10 executors you can process only 50 partitions in parallel, so Spark will try to get new executors...
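The arithmetic in this walk-through can be sketched in a few lines of plain Python. This is only an illustration of the numbers above, not Spark's actual allocation code; the function name executors_needed and the optional cap are assumptions made for the example.

```python
import math


def executors_needed(num_tasks, cores_per_executor, max_executors=None):
    """Executors required to run all tasks of a stage in parallel (one core per task)."""
    needed = math.ceil(num_tasks / cores_per_executor)
    if max_executors is not None:
        # An upper bound such as maxExecutors caps what can be requested.
        needed = min(needed, max_executors)
    return needed


# The four stages from the example above, with 5 cores per executor.
stages = [("read", 100), ("join", 200), ("post-join", 50), ("repartition", 200)]
for name, tasks in stages:
    print(name, executors_needed(tasks, cores_per_executor=5))
# read 20
# join 40
# post-join 10
# repartition 40
```

The swing between 10 and 40 executors is what shows up in the UI as executors being added and removed.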
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
Here you will find some hints. From my experience, it is worth setting maxExecutors if you are going to run several jobs in parallel on the same cluster, since most of the time it is not worth starving other jobs just to get 100% efficiency from one job.
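As a rough sketch of what setting these subproperties looks like, the snippet below builds the properties as a Python dict and turns them into spark-submit style --conf arguments. The property names are the ones quoted above; the numeric values are placeholders chosen for illustration, not recommendations, and the helper names check_bounds and to_submit_args are hypothetical.

```python
# Placeholder values for illustration only; tune them for your own cluster.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.initialExecutors": "4",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def check_bounds(conf):
    """Sanity check: minExecutors <= initialExecutors <= maxExecutors."""
    prefix = "spark.dynamicAllocation."
    low = int(conf[prefix + "minExecutors"])
    initial = int(conf[prefix + "initialExecutors"])
    high = int(conf[prefix + "maxExecutors"])
    return low <= initial <= high


def to_submit_args(conf):
    """Flatten the dict into ['--conf', 'key=value', ...] for a spark-submit command line."""
    return [arg for key, value in conf.items() for arg in ("--conf", f"{key}={value}")]


print(check_bounds(dynamic_allocation_conf))
print(" ".join(to_submit_args(dynamic_allocation_conf)))
```

Capping maxExecutors this way is what keeps one job from taking the whole cluster when several jobs share it.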

How do I make my Spark job run faster using executors?

I know my code is free from antipatterns since I don't have any warnings in my Authoring code editor, so I know my code is doing PySpark operations that are distributed and scalable.
My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.
How do I make this job run faster?
Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.
There's a limit to how much your job will speed up, however, and this is a function of the maximum number of tasks in your stages. Note: with AQE, your maximum number of tasks will increase as you increase your executor count, so you will notice the task counts increasing up to a certain point.
For instance, if my data scale is such that I only ever have a maximum of 8 tasks (let's assume AQE is controlling this), assigning an executor count to run more than 8 tasks will waste resources and won't increase your job speed (see above note that AQE may adjust your task counts as you add executors since it's detecting that more work can be run in parallel).
The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
This means if your max task counts per stage in your job is 4, you won't benefit from boosting your number of executors. If, however, you observe your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as such:
16 max tasks, 1 core per task. -> 16 cores needed.
2 cores per executor -> 8 executors max.
We could therefore jump this example job up to 8 executors for maximum performance.
For the original question, you would bump the number of executors to 8 for maximum performance, assuming AQE hasn't subsequently increased your task counts.
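The sizing rule used above (max tasks x cores per task, divided by cores per executor) can be written as a small helper. This is a sketch of the arithmetic only; the function name useful_executors is hypothetical, and it ignores AQE changing the task counts.

```python
import math


def useful_executors(max_tasks_per_stage, cores_per_executor, cores_per_task=1):
    """Largest executor count that can still be kept busy by the widest stage."""
    cores_needed = max_tasks_per_stage * cores_per_task
    return math.ceil(cores_needed / cores_per_executor)


print(useful_executors(16, cores_per_executor=2))  # 8
print(useful_executors(4, cores_per_executor=2))   # 2, so the default job gains nothing from more
```

Any executors requested beyond this number would sit idle during the widest stage.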
When AQE re-examines your job and your new count of executors, it will detect that more tasks can be run in parallel and will therefore increase your task counts to try to match the infrastructure. However, when it does this, you might end up with tasks that are smaller than you would like.
The way AQE decides how big to make the tasks (and therefore how many tasks it will run with) is based on the setting spark.sql.adaptive.advisoryPartitionSizeInBytes and the total number of cores available in your job. If you have more cores than would be worth parallelizing over (i.e. the shuffle partitions are too small), then these small partitions will be coalesced into fewer partitions, which means you then have the same wasted-executor problem as without AQE.
AQE will do the best it can with the executor counts you've given it, so you may see the job get faster and faster with more executors up to a point. Beyond that point, more executors don't mean a faster job, because your partitions are too small to be worth splitting into smaller tasks, and you've started wasting executors.
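To illustrate why small shuffle partitions end up as fewer tasks, here is a deliberately simplified sketch of coalescing: adjacent partitions are merged greedily until the advisory size would be exceeded. It is not Spark's implementation and it ignores the core-count part of the decision mentioned above; the function name coalesce_partitions and the sizes are assumptions for the example.

```python
def coalesce_partitions(sizes_in_bytes, advisory_size):
    """Greedily merge adjacent partitions while the merged size stays within advisory_size.

    Returns a list of groups, each group being the original partition indices it contains.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes_in_bytes):
        # Close the current group if adding this partition would overshoot the target.
        if current and current_size + size > advisory_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
sizes = [10 * MB] * 16  # 16 small shuffle partitions of 10 MB each
groups = coalesce_partitions(sizes, advisory_size=64 * MB)
print(len(groups))  # 3
```

Sixteen small partitions become three tasks here, so with 16 cores available 13 of them would have nothing to do in that stage.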

What is the relationship between a Node, Worker, Executor, Task and Partition

I am trying to understand the relationship between different components and elements in the Spark architecture but am unable to get a grip on it. Can someone please validate my assumptions and correct me where I am wrong.
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers.
Q - Can a node have multiple drivers (if I have multiple applications)?
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
Q. What metric determines the number of executors per worker?
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
Q. What is the relationship between a core and an executor?
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
For e.g., if I have a cluster with 2 nodes, 10 executors (5 executors in each node) and a dataframe with 20 partitions, I'm assuming I would have 2 partitions in each executor or is there a chance that partition distribution could be skewed? What would I need to do to ensure that all my partitions that have a certain partitioning key get co-located within the same worker so there is minimum network transfer when these partitions have to work together to, say, perform an aggregation or a join?
Q - What happens when a repartition() is performed. For e.g., if I have 20 partitions across 10 executors (say, 2 partitions in each) and I repartition(2). I will now have only 2 portions of data which I assume would be resting in a couple of executors. What happens to the remaining executors?
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
Q - Are these tasks performed by individual executors?
Q - If I have fewer executors (say, 10) than partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors?
Thanks in advance!
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers. (This is correct as a starting point.)
Q - Can a node have multiple drivers (if I have multiple applications)?
Yes, because a driver is just a process created for the program you have written, and you can have multiple processes running on the same node.
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
Your understanding here is only partly right: in everyday usage a worker is the node or machine itself, and "worker" and "worker node" mean the same thing. (In standalone mode, Worker is also the name of the daemon process that runs on that machine.)
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
An executor is a process on the worker node, and a single worker node can have multiple executors.
Q. What metric determines the number of executors per worker?
The configuration (number of cores and memory) of your worker node decides the maximum number of executors that can run on that worker node.
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
It is associated with the executor process. A Spark executor is a single JVM instance on a node that serves a single Spark application.
Q. What is the relationship between a core and an executor?
The cores property controls the number of concurrent tasks an executor can run. For example, if you request 2 executors with 2 cores each, then you can run 4 tasks concurrently during your job execution.
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
Generally, Spark performs its computation in memory. RAM is allocated at the executor level, while disk is allocated only at the worker node level. Spark spills data to disk only when it does not fit in memory.
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
These partitions could be anywhere, and in most cases they will not be equally distributed. It can happen that some executors do not hold a single partition while other executors hold more than 2.
To co-locate rows that share the same key, you have to repartition the data on the specific column in your dataframe. Spark then partitions your data by the values of that column and makes sure that rows with the same column value end up in the same partition.
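The idea behind repartitioning on a column can be sketched in plain Python: hash the key, take it modulo the number of partitions, and every row with the same key lands in the same partition. This is an illustration only. Spark uses its own hash function; zlib.crc32 is a stand-in here, and the names partition_for and repartition_by_key, as well as the sample rows, are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Deterministic partition index for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def repartition_by_key(rows, key_field, num_partitions):
    """Distribute rows so that equal keys always share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[key_field], num_partitions)].append(row)
    return partitions


rows = [
    {"customer": c, "amount": a}
    for c, a in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
]
parts = repartition_by_key(rows, "customer", num_partitions=4)
for index, part in enumerate(parts):
    print(index, part)
```

Nothing in this scheme balances the partitions: a very frequent key still makes one partition much larger than the others, which is the skew mentioned above.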
When you repartition the data into 2 partitions, Spark shuffles the data between all the executors and then breaks it into 2 partitions. That data could end up on any of the executors, and the other executors would be empty or idle in that case.
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
You would have 20 tasks for that specific stage, but the number will not necessarily stay the same across stages, because a new stage is created whenever a data shuffle has to happen. If your code causes no shuffle, Spark creates a single stage with 20 tasks.
Q - Are these tasks performed by individual executors? Yes.
Q - If I have fewer executors (say, 10) than partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors? Yes, when each executor has a single core. More precisely, parallelism is limited by the total number of cores (executors x cores per executor).

What is the relationship between tasks and partitions?

Can I say?
1. The number of the Spark tasks equal to the number of the Spark partitions?
2. The executor runs once (batch inside of executor) is equal to one task?
3. Every task produce only a partition?
4. (duplicate of 1.)
The degree of parallelism, or the number of tasks that can be run concurrently, is set by:
The number of Executor Instances (configuration)
The Number of Cores per Executor (configuration)
The Number of Partitions being used (coded)
Actual parallelism is the smaller of
executors * cores - which gives the amount of slots available to run tasks
partitions - each partition will translate to a task whenever a slot opens up.
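The "smaller of" rule above is a one-liner. A minimal sketch, with the function name actual_parallelism chosen for illustration and one core per task assumed:

```python
def actual_parallelism(num_executors, cores_per_executor, num_partitions):
    """Tasks that can really run at once: limited by slots and by available partitions."""
    slots = num_executors * cores_per_executor  # slots available to run tasks
    return min(slots, num_partitions)


print(actual_parallelism(10, 1, 20))  # 10: more partitions than slots, tasks queue up
print(actual_parallelism(2, 2, 3))    # 3: more slots than partitions, one slot idles
```

In the first case the remaining 10 partitions are picked up as slots free up; in the second, adding executors would not help.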
Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.
You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
So -
Is the number of the Spark tasks equal to the number of the Spark partitions?
No (see previous).
The executor runs once (batch inside of executor) is equal to one task?
An Executor is started as an environment for the tasks to run. Multiple tasks will run concurrently within that Executor (multiple threads).
Every task produce only a partition?
For a task, it is one Partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
The number of the Spark tasks equal to the number of the Spark partitions?
Same as (1.)
(¹) Assumption is that within your tasks, you are not multithreading yourself (never do that, otherwise core estimate will be off).
(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
Partitions are a feature of RDD and are only available at design time (before an action is called).
Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.
Is the number of the Spark tasks equal to the number of the Spark partitions?
Yes.
The executor runs once (batch inside of executor) is equal to one task?
That recursively uses "executor" and does not make much sense to me.
Every task produce only a partition?
Almost.
Every task produces an output by executing the code (it was created for) on the data in a partition.
The number of the Spark tasks equal to the number of the Spark partitions?
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
1. The number of the Spark tasks equal to the number of the Spark partitions?
Yes.
Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in the cluster. When reading files, the default maximum partition size is 128 MB. Partitions allow every executor to perform work in parallel. If you have only one partition, Spark will have a parallelism of only one, even if you have many executors.
If you have many partitions but only one executor (with a single core), Spark will still have a parallelism of only one. You need to balance the number of executors and partitions to get the desired parallelism. This means that each partition will be processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule is that the number of partitions should be larger than the number of executors on your cluster.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
2. The executor runs once (batch inside of executor) is equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Every task produce only a partition?
It depends on the transformation.
Spark has wide transformations and narrow transformations.
Wide transformation: input partitions contribute to many output partitions (aggregations, sorts, joins). This is often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.
Narrow transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: Reading a file is a narrow transformation because it does not require a shuffle, but when you read a splittable file such as Parquet, the file will be split into many partitions.
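The difference between the two kinds of transformation can be shown with plain Python lists standing in for partitions. This is a toy model, not Spark code: the function names narrow_filter and wide_group_by_key are hypothetical, and integer keys with a modulo are assumed in place of a real partitioner.

```python
def narrow_filter(partitions, predicate):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[x for x in part if predicate(x)] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide: every input partition may send records to every output partition (a shuffle)."""
    outputs = [{} for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            bucket = outputs[key % num_output_partitions]  # assumes integer keys
            bucket.setdefault(key, []).append(value)
    return outputs


parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

filtered = narrow_filter(parts, lambda kv: kv[0] != 3)
print(filtered)  # [[(1, 'a'), (2, 'b')], [(1, 'c')]]

grouped = wide_group_by_key(parts, num_output_partitions=2)
print(grouped)   # [{2: ['b']}, {1: ['a', 'c'], 3: ['d']}]
```

After the filter, key 1 is still split across two partitions; only the wide step brings its values together, and that is the step that moves data between executors.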

How the Number of partitions and Number of concurrent tasks in spark calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
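For the numbers in the question, a quick back-of-the-envelope sketch shows what to expect. It is plain Python, assumes one core per task and roughly equal task durations, and the function name scheduling_waves is made up for the example.

```python
import math


def scheduling_waves(num_partitions, total_cores, cpus_per_task=1):
    """Return (max concurrent tasks, number of waves needed to run all partitions)."""
    slots = total_cores // cpus_per_task
    max_concurrent = min(slots, num_partitions)
    waves = math.ceil(num_partitions / slots)
    return max_concurrent, waves


# 4 nodes x 16 cores, RDD repartitioned to 200 partitions.
print(scheduling_waves(200, total_cores=4 * 16))  # (64, 4)
```

So the 200 tasks all run, just not at the same time: 64 at once, in roughly four rounds, regardless of spark.default.parallelism.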
You can see more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?
