What is the relationship between a Node, Worker, Executor, Task and Partition - apache-spark

I am trying to understand the relationship between different components and elements in the Spark architecture but am unable to get a grip on it. Can someone please validate my assumptions and correct me where I am wrong.
My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers.
Q - Can a node have multiple drivers (if I have multiple applications)?
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
Q. What metric determines the number of executors per worker?
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
Q. What is the relationship between a core and an executor?
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
For e.g., if I have a cluster with 2 nodes, 10 executors (5 executors in each node) and a dataframe with 20 partitions, I'm assuming I would have 2 partitions in each executor or is there a chance that partition distribution could be skewed? What would I need to do to ensure that all my partitions that have a certain partitioning key get co-located within the same worker so there is minimum network transfer when these partitions have to work together to, say, perform an aggregation or a join?
Q - What happens when a repartition() is performed. For e.g., if I have 20 partitions across 10 executors (say, 2 partitions in each) and I repartition(2). I will now have only 2 portions of data which I assume would be resting in a couple of executors. What happens to the remaining executors?
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
Q - Are these tasks performed by individual executors?
Q - If I have fewer executors (say, 10) than partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors?
Thanks in advance!

My understanding - A node is the actual physical machine. One node can contain the main driver while others will contain the workers. (This is correct as a starting point.)
Q - Can a node have multiple drivers (if I have multiple applications)?
Yes, because a driver is just a process that gets created for the program you have written, and you can have multiple processes running on the same node.
My understanding - A worker is a process within a node. There can be multiple workers within each node though not recommended.
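Your understanding here is only partly right. In common usage, "worker" and "worker node" both refer to the machine that runs executors. In Spark's standalone cluster manager, a Worker is also a daemon process that runs on such a machine and launches executors, so the term can mean either depending on context.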
My understanding - An executor is a sub-process(?) within a worker process. Each worker can have multiple executors.
An executor is a process on the worker node, and a single worker node can have multiple executors.
Q. What metric determines the number of executors per worker?
The configuration of your worker node (number of cores and amount of memory) determines the maximum number of executors that can run on it.
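As a rough illustration of that limit, the sketch below computes how many executors fit on one node when both cores and memory are considered. The function name and the 16-core node are hypothetical; the 100 GB node and 20 GB executors come from the question. It also assumes no memory overhead and nothing reserved for the OS or cluster manager, so real figures will be lower.

```python
def max_executors_per_node(node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """Upper bound on the executors that fit on one worker node.

    Assumption: ignores per-executor memory overhead and any resources
    reserved for the OS or the cluster manager.
    """
    by_cores = node_cores // executor_cores
    by_memory = node_mem_gb // executor_mem_gb
    # The scarcer resource decides how many executors fit.
    return min(by_cores, by_memory)


# Hypothetical node: 16 cores, 100 GB RAM; executors of 4 cores and 20 GB each.
fit = max_executors_per_node(16, 100, 4, 20)
print(fit)  # 4: cores are the limiting resource here
```

With 32 cores on the same node, memory would become the limit instead and the result would be 5.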
Q. Is the idea of a JVM associated with an executor process or at a higher "worker" level?
It is associated with the executor process. A Spark executor is a single JVM instance on a node that serves a single Spark application.
Q. What is the relationship between a core and an executor?
The cores setting controls the number of concurrent tasks an executor can run. For example, if you request 2 executors with 2 cores each, you can run 4 tasks concurrently during your job execution.
Q - Can RAM and HD be allocated at an executor level?
For e.g., if I have a worker node with 100GB of RAM and 5 TB HD, can I allocate 20 GB RAM and 1 TB HDD per executor for that worker?
Spark generally performs its computation in memory. RAM is allocated at the executor level, while disk is allocated at the worker node level only. Spark spills data to disk only when it does not fit in memory.
My understanding - A partition is a portion of the actual data. This split could happen using hashing, round robin or range.
Q - What determines the location of these data partitions?
These partitions could be anywhere, and in most cases they will not be equally distributed. Some executors may hold no partitions at all while others hold more than 2.
To co-locate rows that share the same key, you have to repartition the data on a specific column of your dataframe. Spark then partitions the data by the values of that column and makes sure that rows with the same column value end up in the same partition.
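A minimal sketch of why partitioning on a key co-locates matching rows: each record's partition is derived from a hash of its key, so equal keys always map to the same partition. This is plain Python, not Spark code; `zlib.crc32` is only a stand-in for Spark's own hash function, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key (illustrative only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(records, num_partitions):
    """Group (key, value) records into partitions by key hash."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = hash_partition(records, num_partitions=4)

# Record which partitions each key landed in.
key_locations = defaultdict(set)
for pid, rows in partitions.items():
    for key, _ in rows:
        key_locations[key].add(pid)

# Each key appears in exactly one partition.
print(all(len(pids) == 1 for pids in key_locations.values()))  # True
```

Note that different keys can still share a partition, so partition sizes may be uneven when some keys are much more frequent than others.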
When you repartition the data into 2 partitions, Spark shuffles the data between the executors and splits it into 2 partitions. Those partitions can land on any of the executors, and the remaining executors will be empty or idle for that stage.
Assumption - A task is the lowest unit of work that performs the actual ask. The number of tasks depends on the number of partitions. So, if there are 20 partitions, I would have 20 tasks in each stage.
You would have 20 tasks for that specific stage, but the count will not necessarily stay the same across stages, because a new stage is created whenever a data shuffle is needed. If your code triggers no shuffle, Spark creates a single stage with 20 tasks.
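To make the stage boundary idea concrete, here is a simplified sketch that cuts a list of operation names into stages wherever a shuffle operation appears. It is a toy model, not how Spark's scheduler is implemented: the `WIDE` set and the `split_into_stages` helper are hypothetical, and the shuffle operation is simply placed at the start of the new stage.

```python
# Operations that need a shuffle (illustrative, not an exhaustive list).
WIDE = {"groupByKey", "reduceByKey", "join", "repartition", "sortByKey"}


def split_into_stages(operations):
    """Start a new stage at each shuffle operation."""
    stages = [[]]
    for op in operations:
        # A shuffle closes the current stage, unless it is still empty.
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(pipeline)
print(len(stages))  # 3
print(stages)
# [['map', 'filter'], ['reduceByKey', 'map'], ['join', 'filter']]
```

A pipeline with no shuffle operations stays in a single stage, which matches the case described above.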
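Q - Are these tasks performed by individual executors? Yes. Each task runs on a single executor.
Q - If I have fewer executors (say, 10) than partitions (say, 20), does it mean that only 10 tasks will be executed in parallel at any point? Is the degree of parallelism constrained by the number of executors? Yes, if each executor has a single core. More precisely, parallelism is bounded by the total number of cores across executors (executors × cores per executor), so 10 single-core executors run at most 10 of the 20 tasks at a time.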
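The arithmetic behind that answer can be sketched as follows. The `task_waves` helper is hypothetical, and the wave count assumes all tasks take about the same time, which real jobs rarely guarantee.

```python
import math


def task_waves(num_partitions, num_executors, cores_per_executor=1):
    """Return (tasks running at once, number of waves) for one stage.

    Assumption: one task per partition and roughly equal task durations.
    """
    slots = num_executors * cores_per_executor
    concurrent = min(slots, num_partitions)
    waves = math.ceil(num_partitions / slots)
    return concurrent, waves


print(task_waves(20, 10))     # (10, 2): two rounds of 10 tasks
print(task_waves(20, 10, 2))  # (20, 1): all 20 tasks run at once
```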

Related

Apache Spark: How many partitions can an executor hold in Spark? How are the partitions distributed (mechanism) among the executors?

I am interested to know the following nitty-gritty details of Spark parallelism and partitioning:
How many partitions can an executor hold in Spark?
How are the partitions distributed (mechanism) among the executors?
How do I set the size of a partition? I would like to know the relevant config parameter.
Does an executor store all the partitions in memory? If not, when it spills to disk, does it spill the entire partition or only a part of it?
5 When there are 2 cores per executor but there are 5 partition in that executor then
Not quite the right way to look at it. An Executor holds nothing, it just does work.
A Partition is processed by a Core that has been assigned to an Executor. An Executor typically has 1 core but can have more than 1 such Core.
An App has Actions that translate to 1 or more Jobs.
A Job has Stages (based on Shuffle Boundaries).
Stages have Tasks, the number of these depends on number of Partitions.
Parallel processing of the Partitions depends on number of Cores allocated to Executors.
Spark is scalable in terms of Cores, Memory and Disk. The latter two in relation to your questions means that if the Partitions cannot all fit into Memory on the Worker for your Job, then that Partition or more will spill in its entirety to Disk.

SparkDataframe.load(): when I execute a load command, where does my data actually get stored?

If I load a table from Cassandra using Spark dataframe.load(), where does my data get loaded? Is it in Spark memory, or in DataNode blocks if I am using the YARN resource manager?
It will try to store the data in memory, per number of partitions, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if not enough memory on the Worker Nodes.
Per number of Cores / Executors, processing will occur. E.g. if you have, say, 20 Executors with 1 Core each, your concurrency of processing is 20 and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Nodes is a better term here compared to Data Nodes, unless you have HDFS and processing locally, then Worker Node is equal to Data Node. Although you could argue what's in a name?
Of course, an Action will need to have been initiated.
And repartition and join or union latterly in the data pipeline affect things, but that goes without saying.

What is the relationship between tasks and partitions?

Can I say?
The number of the Spark tasks equal to the number of the Spark partitions?
The executor runs once (batch inside of executor) is equal to one task?
Every task produce only a partition?
(duplicate of 1.)
The degree of parallelism, or the number of tasks that can be run concurrently, is set by:
The number of Executor Instances (configuration)
The Number of Cores per Executor (configuration)
The Number of Partitions being used (coded)
Actual parallelism is the smaller of
executors * cores - which gives the number of slots available to run tasks
partitions - each partition will translate to a task whenever a slot opens up.
Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.
You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
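The executor-as-shared-environment idea can be mimicked in plain Python: one process stands in for the executor JVM, a thread pool for its cores, and a module-level dict for a broadcast value. This is only an analogy under those assumptions; Spark executors are JVM processes, and the names below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a broadcast variable: one copy per executor process.
broadcast_lookup = {"a": 1, "b": 2, "c": 3}


def run_task(partition):
    """One task: process one partition using the shared lookup table."""
    total = sum(broadcast_lookup[key] for key in partition)
    return total, id(broadcast_lookup)


partitions = [["a", "b"], ["c"], ["a", "c"], ["b", "b"]]

# Two worker threads play the role of an executor with 2 cores.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(run_task, partitions))

totals = [total for total, _ in results]
shared_ids = {obj_id for _, obj_id in results}
print(totals)           # [3, 3, 4, 4]
print(len(shared_ids))  # 1: every task read the same in-memory copy
```

Four tasks run through two threads, so at most two are active at once, and none of them needed its own copy of the lookup table.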
So -
Is the number of the Spark tasks equal to the number of the Spark partitions?
No (see previous).
The executor runs once (batch inside of executor) is equal to one task?
An Executor is started as an environment for the tasks to run. Multiple tasks will run concurrently within that Executor (multiple threads).
Every task produce only a partition?
For a task, it is one Partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
The number of the Spark tasks equal to the number of the Spark partitions?
Same as (1.)
(¹) Assumption is that within your tasks, you are not multithreading yourself (never do that, otherwise core estimate will be off).
(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
Partitions are a feature of RDD and are only available at design time (before an action is called).
Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.
Is the number of the Spark tasks equal to the number of the Spark partitions?
Yes.
The executor runs once (batch inside of executor) is equal to one task?
That recursively uses "executor" and does not make much sense to me.
Every task produce only a partition?
Almost.
Every task produces the output of executing the code it was created for on the data in a partition.
The number of the Spark tasks equal to the number of the Spark partitions?
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
1. The number of the Spark tasks equal to the number of the Spark partitions?
Yes.
Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size when reading files is 128 MB. Partitions allow every executor to perform work in parallel. With only one partition, you have a parallelism of only one, even if you have many executors.
Likewise, many partitions with only one executor give you a parallelism of only one (assuming that executor has a single core). You need to balance the number of executors and partitions to get the desired parallelism. Each partition is processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule is that the number of partitions should be larger than the number of executors on your cluster.
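As a back-of-the-envelope check of the 128 MB figure, the sketch below estimates how many partitions a file read would produce from the file size alone. The helper is hypothetical and deliberately simplified: in Spark SQL the limit is governed by `spark.sql.files.maxPartitionBytes`, and the real calculation also weighs other factors such as the number of available cores.

```python
import math

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB


def estimate_read_partitions(file_size_bytes,
                             max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Rough partition count for reading one splittable file."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))


one_gb = 1024 ** 3
print(estimate_read_partitions(one_gb))                    # 8
print(estimate_read_partitions(one_gb, 64 * 1024 * 1024))  # 16
```

Comparing the estimate with the number of task slots in the cluster is a quick way to see whether a read will keep every core busy.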
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
2. The executor runs once (batch inside of executor) is equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Every task produce only a partition?
It depends on the transformation.
Spark has wide transformations and narrow transformations.
Wide transformation: input partitions contribute to many output partitions (aggregations, sorts, joins). This is often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When Spark performs a shuffle, it writes the results to disk.
Narrow transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: reading a file is a narrow transformation because it does not require a shuffle, but when you read a splittable file such as Parquet, the file will be split into many partitions.
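The difference between the two kinds of transformation can be shown with partitions modelled as plain Python lists. This is a toy model with hypothetical helper names: the narrow step touches each partition independently, while the wide step has to move records between partitions so that equal keys meet.

```python
def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]


def wide_group_by_key(partitions, num_output_partitions):
    """Wide: records from any input partition may go to any output partition."""
    output = [dict() for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            target = key % num_output_partitions  # integer keys, simple modulo
            output[target].setdefault(key, []).append(value)
    return output


input_partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

mapped = narrow_map(input_partitions, lambda kv: (kv[0], kv[1].upper()))
print(mapped)   # [[(1, 'A'), (2, 'B')], [(1, 'C'), (3, 'D')]]

grouped = wide_group_by_key(input_partitions, 2)
print(grouped)  # [{2: ['b']}, {1: ['a', 'c'], 3: ['d']}]
```

Key 1 started out in two different input partitions and ends up in a single output partition, which is exactly the data movement a shuffle performs.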

How are the number of partitions and the number of concurrent tasks in Spark calculated?

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
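The behaviour described in this answer can be reproduced with a small scheduling simulation: every task is handed to whichever slot frees up first. The `simulate` function is a hypothetical sketch, not Spark's scheduler, and the 1.0 time unit per task is an arbitrary assumption.

```python
import heapq


def simulate(task_durations, num_slots):
    """Greedy scheduling: each task starts on the earliest free slot.

    Returns (peak concurrency, total time to finish all tasks).
    """
    # Each heap entry is the time at which a slot becomes free.
    slots = [0.0] * num_slots
    heapq.heapify(slots)
    for duration in task_durations:
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + duration)
    peak = min(num_slots, len(task_durations))
    return peak, max(slots)


# 200 partitions on 4 nodes x 16 cores, each task taking 1.0 time unit.
peak, makespan = simulate([1.0] * 200, 64)
print(peak, makespan)  # 64 4.0
```

Doubling the slots to 128 halves the total time to 2.0 in this model, whereas changing only the partition count would leave the peak at 64.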
You can find more details in the following thread:
How does Spark paralellize slices to tasks/executors/workers?

Spark performance tuning - number of executors vs number for cores

I have two questions around performance tuning in Spark:
I understand one of the key things for controlling parallelism in the spark job is the number of partitions that exist in the RDD that is being processed, and then controlling the executors and cores processing these partitions. Can I assume this to be true:
# of executors * # of executor cores should be <= # of partitions, i.e. one partition is always processed in one core of one executor. There is no point having more executors * cores than the number of partitions.
I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here's my second question: purely from a data processing point of view, what is the difference between the two? For example, if I have a 10 node cluster, what would be the difference between these two jobs (assuming there's ample memory per node to process everything):
5 executors * 2 executor cores
2 executors * 5 executor cores
Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
Most of the time, using larger executors (more memory, more cores) is better. First, a larger executor with more memory can easily support broadcast joins and do away with the shuffle. Second, since tasks are not created equal, larger executors statistically have a better chance of surviving OOM issues.
The only problem with large executors is GC pauses. G1GC helps.
In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.
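One way to compare the two layouts from the question is to tabulate what stays the same and what changes. The sketch below is a simplification with a hypothetical helper and a hypothetical 100 MB broadcast table; it relies on the point that a broadcast value is held once per executor, and it ignores GC behaviour and I/O effects.

```python
def describe_layout(num_executors, cores_per_executor, broadcast_mb):
    """Summarise a layout: task slots, JVM count, total broadcast memory."""
    return {
        "task_slots": num_executors * cores_per_executor,
        "jvms": num_executors,
        # One copy of the broadcast data is held per executor.
        "broadcast_memory_mb": num_executors * broadcast_mb,
    }


small = describe_layout(5, 2, broadcast_mb=100)
large = describe_layout(2, 5, broadcast_mb=100)

print(small)  # {'task_slots': 10, 'jvms': 5, 'broadcast_memory_mb': 500}
print(large)  # {'task_slots': 10, 'jvms': 2, 'broadcast_memory_mb': 200}
```

Both layouts offer 10 task slots, so raw parallelism is identical; the difference lies in how many JVMs there are and how much memory is duplicated across them.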
