SparkSQL Number of Tasks - apache-spark

I have a Spark Standalone cluster (which consists of two Workers with 2 cores each). I run an SQL query which joins 2 DataFrames and shows the result. I have some questions regarding the following simple example.
val df1 = spark.read.text(fn1).toDF()
val df2 = spark.read.text(fn2).toDF()
df1.createOrReplaceTempView("v1")
df2.createOrReplaceTempView("v2")
val df_join = spark.sql("SELECT * FROM v1,v2 WHERE v1.value=v2.value AND v2.value<1500").show()
DAG Scheduler - Number of Tasks
From what I've understood so far, when I spark-submit the application, a SparkContext is spawned to handle the job (where the job is the printing of the result rows). The SparkContext creates a TaskScheduler instance, which then creates a DAGScheduler. Through a simple event mechanism, the DAGScheduler handles the job for execution (the handleJobSubmitted function in the code). The SparkSQL query has been transformed into a physical execution plan (through the Catalyst Optimizer), and then into an RDD graph (with the toRdd function). The DAGScheduler receives the RDD graph and recursively creates all the stages.
I do not understand how it finds the number of tasks (before the execution of any stage) in the last stage, keeping in mind that the result stage is the one that performs the join (and prints the results). The amount of data (and the RDDs and the number of their partitions, which define the number of tasks) is unknown until the parent stages have finished their execution.
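One detail that may be relevant here (stated as an assumption: a Spark 2.x setup without adaptive query execution, and a plan that uses a shuffle-based join rather than a broadcast join): for DataFrame/SQL joins, the shuffle that feeds the join produces spark.sql.shuffle.partitions partitions (200 by default), so the task count of that last stage comes from configuration rather than from the size of the parents' output. A minimal sketch in the spark-shell:
println(spark.conf.get("spark.sql.shuffle.partitions"))   // "200" unless overridden
spark.conf.set("spark.sql.shuffle.partitions", "8")       // the join stage would then get 8 tasks
spark.sql("SELECT * FROM v1,v2 WHERE v1.value=v2.value AND v2.value<1500").show()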
Parallel Execution of Stages
Each of the first two stages is independent of the other, as it loads data from a different file. I have read many posts saying that stages which do not have dependencies between them MAY be executed in parallel by the cluster. What is the condition that determines whether the tasks of independent stages run in parallel?
Task Dependencies
Finally, I've read that the TaskScheduler does not know about stage dependencies. If I keep in mind that each stage in Spark is a TaskSet (i.e. a set of non-dependent tasks, each task having the same functionality, packaged up with a different partition of data), then the TaskScheduler does not know the dependencies between tasks of different stages either. As a result, how and when does a task know the data on which it will execute its function?
If, for example, the task knows a priori where to look for its input data, then it could be launched as soon as that data becomes available.

Related

After modification in dataframe, how many stages and tasks will create

When a DataFrame is split and then joined again on different columns, how many stages are created in the DAG, how are they created, and how are tasks created within the stages?
How does the DAG work in Spark?
The interpreter is the first layer: Spark uses a Scala interpreter, with some modifications, to interpret the code.
Spark creates an operator graph when you enter your code in the Spark console.
When we call an action on a Spark RDD, at a high level, Spark submits the operator graph to the DAG Scheduler.
The DAG Scheduler divides the operators into stages of tasks. A stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together; for example, map operators are scheduled in a single stage.
The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. The dependencies of the stages are unknown to the Task Scheduler.
The Workers execute the tasks on the slaves.
You can get more information from the link: https://data-flair.training/blogs/dag-in-apache-spark/
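A minimal sketch of the flow described above, with made-up data (the shuffle introduced by distinct is what creates the second stage, and nothing runs until the action):
val nums  = sc.parallelize(1 to 1000, 4)   // 4 partitions => 4 tasks in the first stage
val evens = nums.filter(_ % 2 == 0)        // narrow: pipelined into the same stage
val uniq  = evens.distinct()               // wide: shuffle => a second stage
uniq.count()                               // the action submits the operator graph to the DAG Scheduler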

Do stages in an application run parallel in spark?

I am unsure how stages execute in a Spark application. Is there any ordering in the execution of stages that can be defined by the programmer, or is it derived by the Spark engine?
Check the entities (stages, partitions) in this picture:
Do the stages in a job (Spark application?) run in parallel in Spark?
Yes, they can be executed in parallel if there is no sequential dependency.
Here, the partitions of Stage 1 and Stage 2 can be executed in parallel, but not the partitions of Stage 0, because of the dependency: the partitions of Stages 1 & 2 have to be processed first.
Is there any ordering in the execution of stages that can be defined by the programmer, or is it derived by the Spark engine?
A stage boundary is defined by where data shuffling happens among partitions (check the pink lines in the picture).
How do stages execute in a Spark job
Stages of a job can run in parallel if there are no dependencies among them.
In Spark, stages are split at boundaries. You have shuffle stages, which are boundary stages where transformations are split (e.g. at reduceByKey), and you have result stages, which are stages that are bound to yield a result without causing another shuffle (e.g. a map operation):
(Picture provided by Cloudera)
Since groupByKey causes a shuffle, you see the split into pink boxes, which marks a boundary.
Internally, a stage is further divided into tasks. E.g. in the picture above, the first row, which does textFile -> map -> filter, is split into multiple tasks, one for each partition, with the three transformations pipelined inside each task.
When one transformation's output is another transformation's input, we need serial execution. But if stages are unrelated, e.g. a separate hadoopFile -> groupByKey -> map pipeline, they can run in parallel. Once a dependency between them is declared, from that stage on they will continue execution serially.
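As a hedged illustration of that last point (the file paths are made up), two input branches with no dependency between them feed a join; their stages may be submitted together, while the join stage has to wait for both:
val a = sc.textFile("/tmp/a.txt").map(line => (line, 1))   // stage A
val b = sc.textFile("/tmp/b.txt").map(line => (line, 1))   // stage B (independent of A)
a.join(b).count()                                          // stage C depends on A and B
// A and B may run in parallel if there are free executor cores; C starts only after both finish.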

Where is the spark job of transformation and action done?

I have been using Spark + Python to do some work; it's great, but I have a question in my mind:
Where is the spark job of transformation and action done?
Is the transformation work done in the Spark Master (or Driver) while the action work is done in the Workers (Executors), or are both of them done in the Workers (Executors)?
Workers (aka slaves) are running Spark instances where executors live to execute tasks.
Transformations are performed at the workers; when an action method is called, the computed data is brought back to the driver.
An application in Spark is executed in three steps:
1. Create the RDD graph, i.e. a DAG (directed acyclic graph) of RDDs to represent the entire computation.
2. Create the stage graph, i.e. a DAG of stages that is a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
3. Based on the plan, schedule and execute tasks on workers.
Transformations run at executors.
Actions run at executors and the driver. Most of the work still happens in the executors, but the final steps, like reducing outputs, are executed in the driver.
When any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together.
The stages are passed on to the Task Scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler doesn't know about the dependencies of the stages.
The tasks (transformations) execute on the Workers (Executors), and when an action (take/collect) is called, the data is brought back to the Driver.
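A minimal sketch of that behaviour (the file path is illustrative): the transformations only record lineage, and it is the action that runs tasks on the executors and ships a small result back to the driver:
val words = sc.textFile("/tmp/words.txt")   // nothing executes yet
  .flatMap(_.split(" "))                    // still lazy: only lineage is recorded
  .filter(_.nonEmpty)
val total = words.count()                   // action: tasks run on the executors
println(total)                              // only a single Long is returned to the driver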

How are stages split into tasks in Spark?

Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is what I understand happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting a driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The execution job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How is the stage split into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in one task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? But even if that is correct, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) are within one task or are separated into one task per operation.
What others say
"What is a task in Spark? How does the Spark worker execute the jar file?" and "How does the Apache Spark scheduler split files into tasks?" are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside on distinct physical locations - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operations have completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) }
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}
val spj = sfi.join(sp)
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}
val sf = sm.filter{ case (k,v) => v % 10 == 0 }
sf.saveAsTextFile("/data/blah/out")
And here is the DAG of the result
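If the DAG image does not render here, the same lineage can be printed as text (a rough substitute for the picture, not a replacement):
println(sf.toDebugString)
// expect the two input branches feeding a shuffle for the join, with the
// mapPartitions and filter pipelined on top of it in the final stage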
Now: how many tasks? The number of tasks should be equal to
the sum, over all the stages, of the number of partitions in each stage.
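As a rough sanity check for the toy program above (the exact numbers depend on the input splits and on spark.default.parallelism):
sfi.getNumPartitions   // tasks of the stage that reads /data/blah/input
sp.getNumPartitions    // tasks of the stage that reads the parallelized range
spj.getNumPartitions   // tasks of the final stage (join + mapPartitions + filter + save)
// total tasks ≈ the three numbers added together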
This might help you better understand different pieces:
Stage: a collection of tasks; the same process running against different subsets of data (partitions).
Task: represents a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said, "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors, and each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks can be executed in parallel: number-of-tasks running = number-of-vcores being used.
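A back-of-the-envelope sketch with assumed numbers (none of these come from the question):
val numExecutors     = 2                               // assumption, e.g. --num-executors 2
val coresPerExecutor = 2                               // assumption, e.g. --executor-cores 2
val taskSlots        = numExecutors * coresPerExecutor // 4 tasks can run at the same time
// a stage with 10 partitions would then run its 10 tasks in ceil(10.0 / taskSlots) = 3 waves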
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
val rdd1 = sc.textFile( ... )
val rdd2 = rdd1.filter( ... )
val rdd3 = rdd2.map( ... )
val rdd3RowCount = rdd3.count()
so when rdd3 is (lazily) computed, Spark will generate a task per partition of rdd1 and each task will execute both the filter and the map per line to result in rdd3.
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of partitions number of rdd1 and rdd2 ).
( Most ) operations that change the number of partitions involve a shuffle. When we do, for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage so as to make rdd2 have exactly 1000 partitions ( how do they do it? Hash or Sort ). Tasks on this side are sometimes referred to as "Map ( side ) tasks".
A task that will later run on rdd2 will act on one partition ( of rdd2! ) and would have to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "Reduce ( side ) tasks".
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ) and the number of partitions of an rdd can change between stages ( by specifying the number of partitions to some shuffle causing operation for example ).
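A quick way to see both points in the shell (a small made-up pair RDD, mirroring the examples above):
val rdd1 = sc.parallelize(1 to 100).map(x => (x, x))
val rdd2 = rdd1.repartition(1000)
rdd2.getNumPartitions                            // 1000 => 1000 reduce-side tasks in the next stage
val rdd3 = rdd1.join(rdd2, numPartitions = 1000)
rdd3.getNumPartitions                            // 1000, regardless of rdd1's and rdd2's partitioning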
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.

How Spark works internally

I know that Spark can be operated using Scala, Python and Java. Also, that RDDs are used to store data.
But please explain, what's the architecture of Spark and how does it work internally.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Spark translates the RDD transformations into something called a DAG (Directed Acyclic Graph) and starts the execution.
At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together. E.g. many map operators can be scheduled in a single stage. The final result of a DAG scheduler is a set of stages.
The stages are passed on to the Task Scheduler. The task scheduler launches tasks via cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler doesn't know about dependencies of the stages.
The Worker/Slave executes the tasks.
Let's come to how Spark builds the DAG.
At a high level, there are two kinds of transformations that can be applied onto RDDs, namely narrow transformations and wide transformations. Wide transformations basically result in stage boundaries.
Narrow transformation - doesn't require the data to be shuffled across the partitions. For example, map, filter, etc.
Wide transformation - requires the data to be shuffled, for example, reduceByKey, etc.
Let's take an example of counting how many log messages appear at each level of severity.
Following is the log file that starts with the severity level:
INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message
and we create the following Scala code to extract the counts:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
                        .map(words => (words(0), 1))
                        .reduceByKey{(a,b) => a + b}
This sequence of commands implicitly defines a DAG of RDD objects (RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a; that's the lineage.
To display the lineage of an RDD, Spark provides a debug method, toDebugString(). For example, executing toDebugString() on the splitedLines RDD will output the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). Below is a more diagrammatic view of the DAG created from the given RDD.
Once the DAG is built, Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages, the stages are created based on the transformations. The narrow transformations will be grouped (pipe-lined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, suppose we have 4 partitions in this example; then there will be 4 sets of tasks created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in a bit more detail:
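To check these numbers yourself (the "(2)" at the start of the toDebugString lines above is the partition count), something along these lines should work; the exact counts depend on the input format and file size:
input.getNumPartitions                   // 2 in the toDebugString output above
splitedLines.getNumPartitions            // also 2, so the reduce stage gets 2 tasks
val input4 = sc.textFile("log.txt", 4)   // ask for at least 4 splits => roughly 4 tasks per stage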
For more detailed information I suggest you go through the following YouTube videos, where the Spark creators give in-depth details about the DAG, execution plan and lifetime.
Advanced Apache Spark- Sameer Farooqui (Databricks)
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Introduction to AmpLab Spark Internals
The diagram below shows how Apache Spark works internally:
Here is some Apache Spark jargon I will be using.
Job:- A piece of code which reads some input from HDFS or local, performs some computation on the data and writes some output data.
Stages:- Jobs are divided into stages. Stages are classified as map or reduce stages (it's easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; all computations (operators) cannot be completed in a single stage, so the work happens over many stages.
Tasks:- Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor(machine).
DAG:- DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
Executor:- The process responsible for executing a task.
Driver:- The program/process responsible for running the Job over the Spark Engine
Master:- The machine on which the Driver program runs
Slave:- The machine on which the Executor program runs
All jobs in spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. For instance let’s assume that you have to submit a Spark job which contains a map operation followed by a filter operation. Spark DAG optimizer would rearrange the order of these operators, as filtering would reduce the number of records to undergo map operation.
Spark has a small code base and the system is divided in various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter, Spark uses a Scala interpreter, with some modifications.
As you enter your code in spark console(creating RDD's and applying operators), Spark creates a operator graph.
When the user runs an action(like collect), the Graph is submitted to a DAG Scheduler. The DAG scheduler divides operator graph into (map and reduce) stages.
A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of a DAG scheduler is a set of stages.
The stages are passed on to the Task Scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler doesn't know about dependencies among stages.
The Worker executes the tasks on the Slave. A new JVM is started per JOB. The worker knows only about the code that is passed to it.
Spark caches the data to be processed, allowing it to be 100 times faster than Hadoop. Spark is highly configurable and is capable of utilizing components already existing in the Hadoop ecosystem. This has allowed Spark to grow exponentially, and in a little time many organisations are already using it in production.
