What does Spark actually do before an action is called? - apache-spark

Spark's transformations have to be triggered by calling actions. What exactly does Spark do if no action is called? And which parts or processes are involved in handling a lazy operation (e.g. a transformation) before its execution is triggered?

tl;dr Spark does almost nothing (given what it does in general).
Applying transformations creates an RDD lineage, i.e. a DAG of RDDs. That lineage is how an RDD lives up to the R in its name: it is resilient and can recover lost partitions, e.g. missing map outputs. No execution happens on executors, no serialization, no sending data over the wire, or any similar network activity. All Spark does at this point is create new RDDs out of existing ones, building a graph of RDDs.
Every transformation call returns a new RDD. You start with a SparkContext and build a "pipeline" by applying transformations.
Only when an action is called is a job submitted; the DAGScheduler then transforms the RDDs into stages of TaskSets/TaskSetManagers, which in turn are executed as parallel tasks on executors.
p.s. A couple of transformations, such as sortBy or zipWithIndex, do trigger a job. See https://issues.apache.org/jira/browse/SPARK-1021.
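To make this concrete, here is a minimal spark-shell sketch (the variable names are illustrative); only the last line submits a job:
val nums    = sc.parallelize(1 to 10)      // no job yet, just a new RDD
val doubled = nums.map(_ * 2)              // still no job, only lineage
val evens   = doubled.filter(_ % 4 == 0)   // still lazy
println(evens.toDebugString)               // inspect the lineage without running anything
val result  = evens.collect()              // action: DAGScheduler builds stages, tasks run on executors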

My understanding is that before any action is called, Spark is only building the DAG.
It's when you call an action that Spark executes the DAG it has built so far.
So if you don't call an action, no processing is done; Spark is only building the DAG.

Related

How is the DAG created in Spark

What is the relation between RDD lineage DAG, DAG Scheduler, stages and tasks in Spark, and which one creates which?
Hi Friends,
I am confused about the creation of the RDD lineage, DAG, DAG Scheduler, stages and tasks.
Please validate my understanding:
1) After we submit a job, before any action is called, whatever transformations are applied to an RDD in the code give the resulting RDD a history: which RDD is its parent, which transformations produced it, and what its dependencies are. This is called the lineage (logical execution plan).
2) When an action is called on an RDD, the lineage is converted into a DAG (physical execution plan).
3) The DAG (physical execution plan) is submitted to the DAG Scheduler, which in turn splits the DAG into stages.
4) Each stage has a list of tasks.
5) Each task runs in an executor (does one executor run one task on one partition?).
Also, I want to understand where the Catalyst optimizer and the Tungsten encoders come into the picture.
Is it the responsibility of the Catalyst optimizer to convert the RDD lineage into the best optimized execution plan (the DAG)?
Is it the responsibility of the Tungsten encoder to convert the Scala code into bytecode?
Please help me understand the above.
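For reference, a minimal spark-shell sketch that surfaces each of these pieces (the names are illustrative; the Catalyst/Tungsten plan printout applies to the DataFrame/Dataset API, not to plain RDDs):
val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
println(rdd.toDebugString)    // RDD lineage (the logical chain of parent RDDs)
rdd.count()                   // action: DAGScheduler builds stages, TaskScheduler launches tasks

val df = spark.range(100).selectExpr("id * 2 AS doubled").filter("doubled % 3 = 0")
df.explain(true)              // Catalyst: parsed -> analyzed -> optimized -> physical plan (with Tungsten codegen)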

The same set of tasks is repeated in multiple stages in a Spark job

A group of tasks consisting of filters & maps appears in the DAG visualization of multiple stages. Does this mean the same transformations are recomputed in all of those stages? If so, how can this be resolved?
For every action performed on a dataframe, all transformations will be recomputed. This is due to the transformations not being computed until an action is performed.
If you only have a single action then there is nothing you can do. However, in the case of multiple actions one after another, cache() can be used after the last transformation. With this method Spark keeps the dataframe in memory after the first computation, making subsequent actions much faster.
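A minimal sketch of that pattern, assuming a dataframe df with a numeric column named value (both purely illustrative) and a spark-shell session where spark.implicits._ is already in scope for the $ syntax:
val transformed = df.filter($"value" > 0)               // transformations: still lazy
                    .withColumn("doubled", $"value" * 2)
transformed.cache()                                     // mark the result to be kept in memory
val total  = transformed.count()                        // first action: computes and materializes the cache
val sample = transformed.take(5)                        // second action: served from the cached data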

What is the behavior of transformations and actions in Spark?

We're performing some tests to evaluate the behavior of transformations and actions in Spark with Spark SQL. In our tests, first we conceive a simple dataflow with 2 transformations and 1 action:
LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2)
The execution time for this first dataflow was 10 seconds. Next, we added another action to our dataflow:
LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) > COUNT(df_2)
Analyzing the second version of the dataflow, since all transformations are lazy and re-executed for each action (according to the documentation), executing the second count should require the execution of the two previous transformations (LOAD and SELECT ALL). Thus, we expected the execution of this second version of our dataflow to take around 20 seconds. However, the execution time was 11 seconds. Apparently, the results of the transformations required by the first count were cached by Spark for the second count.
Please, do you guys know what is happening?
Take a look at your jobs in the Spark UI; you may see skipped stages, which is a good thing. Spark recognizes that it still has the shuffle output from the previous job and reuses it rather than starting from the source data and re-shuffling the full dataset.
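A minimal sketch of the dataflow from the question, assuming a Parquet source (the path and format are illustrative), so you can watch the Spark UI while it runs:
val df_1 = spark.read.parquet("/data/input")   // LOAD (illustrative path)
val df_2 = df_1.select("*")                    // SELECT ALL: still lazy
df_2.count()                                   // first action: runs the full job
df_2.count()                                   // second action: per the answer above, check the
                                               // Spark UI for stages marked as skipped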
It is the Spark DAG scheduler that recognizes the data can be reused after it has been produced for an action. A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. When the driver runs, it converts this logical graph into a physical execution plan.
Actions force translation of the DAG into an execution plan.
When you call an action on an RDD, it must be computed. In your case you are performing one action and then another action on top of that, which requires computing its parent RDDs as well. Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage corresponds to one or more RDDs in the DAG; a single stage can correspond to multiple RDDs due to pipelining.
[Images: Spark job visualization and DAG]

Why does a Job entry show up in the Spark UI for an RDD with only transformations and no actions

I have a text file as the source:
key1,value1
key2,value2
key3,value3
key4,value4
I define the following RDD in the Scala shell:
val rdd = sc.textFile("sample.txt").map(_.split(",")).map(x => (x(0), x(1))).sortByKey()
As you can see, there are only transformations here and no actions. As per Spark's rules of lazy evaluation it should not trigger any job. But this declaration itself triggers a job, which I can confirm from a new job entry appearing in the Spark UI. Interestingly, this is somehow caused by the sortByKey operation. I understand that sortByKey causes shuffling across partitions, but that too should happen only when an action is ultimately called. Another mystery is that if I replace sortByKey with groupByKey, no job is triggered, even though both of these operations cause shuffling. So there are two key concerns here.
Why does the transformation sortByKey trigger a job?
Why does groupByKey not trigger a job while sortByKey does, when both of these transformations cause shuffling?
Because this is a bug...simple as that :)
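For comparison, a minimal spark-shell sketch of the two variants; the commonly cited reason (see SPARK-1021 linked above) is that sortByKey builds a RangePartitioner, which samples the keys to compute range boundaries and therefore runs a small job immediately:
val pairs   = sc.textFile("sample.txt").map(_.split(",")).map(x => (x(0), x(1)))
val grouped = pairs.groupByKey()   // lazy: no new job appears in the Spark UI
val sorted  = pairs.sortByKey()    // a small sampling job runs right away, before any action is called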

How Spark works internally

I know that Spark can be operated using Scala, Python and Java. Also, that RDDs are used to store data.
But please explain what the architecture of Spark is and how it works internally.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and starts the execution.
At a high level, when an action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage is composed of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.
The Worker/Slave executes the tasks.
Let's come to how Spark builds the DAG.
At a high level, there are two kinds of transformations that can be applied to RDDs, namely narrow transformations and wide transformations. Wide transformations result in stage boundaries.
Narrow transformation - doesn't require the data to be shuffled across the partitions. For example, map, filter, etc.
Wide transformation - requires the data to be shuffled, for example, reduceByKey, etc.
Let's take an example of counting how many log messages appear at each level of severity.
Following is a log file where each line starts with the severity level:
INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message
and the following Scala code extracts the counts:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
                        .map(words => (words(0), 1))
                        .reduceByKey((a, b) => a + b)
This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map(...) on an RDD, the RDD b keeps a reference to its parent a; that's the lineage.
To display the lineage of an RDD, Spark provides the toDebugString method. For example, calling toDebugString on the splitedLines RDD outputs the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). See below for a more diagrammatic view of the DAG created from the given RDD.
Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages; the stages are created based on the transformations. Narrow transformations are grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if we have 4 partitions, then there will be 4 sets of tasks created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in a bit more detail:
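As a minimal continuation of the example above, the action below is what actually submits the two-stage job (severityCounts is an illustrative name; the expected result follows from the three sample log lines):
val severityCounts = splitedLines.collect()   // action: submits the job
// e.g. Array((INFO,2), (WARN,1)) for the sample log above
println(input.getNumPartitions)               // number of tasks in the first (map) stage
println(splitedLines.getNumPartitions)        // number of tasks in the second (reduce) stage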
For more detailed information, I suggest you go through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and the job lifecycle.
Advanced Apache Spark- Sameer Farooqui (Databricks)
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Introduction to AmpLab Spark Internals
The diagram below shows how Apache Spark works internally:
Here are some Apache Spark terms I will be using.
Job: a piece of code which reads some input from HDFS or local storage, performs some computation on the data, and writes some output data.
Stages: jobs are divided into stages. Stages are classified as map or reduce stages (it's easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; not all computations (operators) can be completed in a single stage, so the work happens over many stages.
Tasks: each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
DAG: Directed Acyclic Graph; in the present context, a DAG of operators.
Executor: the process responsible for executing a task.
Driver: the program/process responsible for running the job over the Spark engine.
Master: the machine on which the driver program runs.
Slave: the machine on which the executor program runs.
All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. For instance, let's assume that you submit a Spark job which contains a map operation followed by a filter operation. The Spark DAG optimizer would rearrange the order of these operators, since filtering would reduce the number of records that undergo the map operation.
Spark has a small code base and the system is divided into various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph.
When the user runs an action (like collect), the graph is submitted to the DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages.
A stage is composed of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
The worker executes the tasks on the slave. A new JVM is started per job. The worker knows only about the code that is passed to it.
Spark caches the data to be processed, allowing it to be up to 100 times faster than Hadoop. Spark is highly configurable and capable of utilizing the components that already exist in the Hadoop ecosystem. This has allowed Spark to grow rapidly, and many organisations already use it in production.
