Can someone clarify the differences and similarities between RDD lineage and the DAG (directed acyclic graph)?
A DAG (directed acyclic graph) is the representation of the way Spark will execute your program: each vertex in the graph is a separate operation, and the edges represent the dependencies between operations. Your program (and thus the DAG that represents it) may operate on multiple entities (RDDs, DataFrames, etc.). RDD lineage is just the portion of the DAG (one or more operations) that leads to the creation of that particular RDD.
So one DAG (one Spark program) might create multiple RDDs, and each RDD will have its own lineage (i.e. the path in your DAG that leads to that RDD). If some partitions of your RDD are corrupted or lost, Spark can rerun the part of the DAG that leads to the creation of those partitions.
If the sole purpose of your Spark program is to create only one RDD and it's the last step, then the whole DAG is a lineage of that RDD.
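To make the "one DAG, several lineages" point concrete, here is a minimal sketch. It assumes spark-core is on the classpath and runs against a local master; the object name LineageVsDag and the sample data are made up for illustration. Two RDDs are derived from the same parent, and each reports only its own path through the graph.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageVsDag {
  // Returns the textual lineage of two RDDs built by the same program.
  def lineages(): (String, String) = {
    // Assumption: a throwaway local context, not a real cluster.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lineage-vs-dag"))
    try {
      val numbers = sc.parallelize(1 to 10, 2)           // shared ancestor
      val evens = numbers.filter(_ % 2 == 0)             // parallelize -> filter
      val sums = numbers.map(n => (n % 3, n))            // parallelize -> map
        .reduceByKey(_ + _)                              //   -> reduceByKey (shuffle)
      (evens.toDebugString, sums.toDebugString)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (evensLineage, sumsLineage) = lineages()
    println(evensLineage)
    println(sumsLineage)
  }
}
```

The lineage of evens never mentions the shuffle, because reduceByKey is not on the path that produces it, even though both RDDs belong to the same program.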
You can find out more here - https://data-flair.training/blogs/rdd-lineage/
In Simple Words
Lineage: Logical plan to derive one RDD from another; it is the result of a transformation.
DAG: Physical plan that will be executed as a result of an action on an RDD.
Related
Is there a defined standard for effective memory management in Spark?
What if I end up creating a couple of DataFrames or RDDs and then keep reducing that data with joins and aggregations?
Will these DataFrames or RDDs still be holding resources until the session or job is complete?
No, there is not. The lifetime of the main entity in Spark, the RDD, is defined by its lineage. When your job calls an action, the whole DAG starts executing. If the job executes successfully, Spark releases all reserved resources; otherwise it tries to re-execute the failed tasks, reconstructing the lost RDDs from their lineage.
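For data that you cache yourself, one option is to release it explicitly once it is no longer needed. The following is only a sketch under assumptions: a local master, made-up data, and a hypothetical object name CacheLifetime. It uses persist and unpersist from the RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLifetime {
  // Returns (row count, rows above 1000, whether the RDD is unmarked after unpersist).
  def run(): (Long, Long, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("cache-lifetime"))
    try {
      val doubled = sc.parallelize(1 to 1000, 4).map(_ * 2)
      // persist only marks the RDD; the first action materialises the cache.
      doubled.persist(StorageLevel.MEMORY_ONLY)
      val total = doubled.count()                   // computes and caches
      val large = doubled.filter(_ > 1000).count()  // can reuse the cached partitions
      // Release the cached blocks once the intermediate data is no longer needed.
      doubled.unpersist(blocking = true)
      (total, large, doubled.getStorageLevel == StorageLevel.NONE)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

RDDs that are never persisted hold no storage between jobs; only their lineage (a small amount of metadata on the driver) is kept.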
Please check the following resources to get familiar with these concepts:
What is Lineage In Spark?
What is the difference between RDD Lineage Graph and Directed Acyclic Graph (DAG) in Spark?
How lineage helps to recompute data?
For example, I have several nodes, each computing data for 30 minutes. If one fails after 15 minutes, can we use lineage to recover the data processed in those 15 minutes without spending the 15 minutes again?
Everything to understand about lineage is in the definition of RDD.
So let's review that:
RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure
So there are mainly two things to understand:
How does lineage get passed down in RDDs?
How does Spark work internally?
Unfortunately, these topics are too long to discuss in a single answer. I recommend taking some time to read about them, along with the following article about data lineage.
And now to answer your question and doubts:
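If an executor fails while computing your data after 15 minutes, Spark goes back to the last point from which the lost partitions can be rebuilt: a checkpoint, data cached in memory and/or on disk, or otherwise the original source.
Thus, unless the intermediate data was persisted or checkpointed, lineage will not save you those 15 minutes: the lost work is recomputed.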
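If recomputing from the source is too expensive, one option is to checkpoint an intermediate RDD so that recovery can restart from there. Below is a minimal sketch. It assumes a local master and a temporary local directory as the checkpoint location (on a cluster this would normally be a reliable store such as HDFS); the object name CheckpointDemo and the data are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  // Returns (was the RDD checkpointed, sum of squares, number of even squares).
  def run(): (Boolean, Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("checkpoint-demo"))
    try {
      // Assumption: a temp directory stands in for a reliable checkpoint store.
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

      // Stand-in for the slow part of the job.
      val expensive = sc.parallelize(1 to 100, 4).map(n => n.toLong * n)
      expensive.persist()     // avoids computing the RDD twice when checkpointing
      expensive.checkpoint()  // must be requested before the first action on this RDD

      val sumOfSquares = expensive.reduce(_ + _)  // the action also writes the checkpoint
      // Later stages (and any recovery) now start from the saved data.
      val evenSquares = expensive.filter(_ % 2 == 0).count()
      (expensive.isCheckpointed, sumOfSquares, evenSquares)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Checkpointing trades extra I/O for a shorter recovery path, so it tends to pay off only after long or wide chains of transformations.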
When a transformation (map, filter, etc.) is called, Spark does not execute it immediately; instead, a lineage is recorded for each transformation. The lineage keeps track of all the transformations that have to be applied to that RDD, including the location from which the data has to be read.
Consider the following example:
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() are not executed immediately; they are executed only when an action is called on the RDD, here filteredRdd.count().
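The deferred execution described above can be observed with an accumulator that counts how many lines the filter function has actually seen. This is a sketch only: it assumes a local master, replaces the file with three in-memory lines, and uses a hypothetical object name LazyEvaluation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluation {
  // Returns (lines seen before the action, matching lines, lines seen after the action).
  def run(): (Long, Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation"))
    try {
      val seen = sc.longAccumulator("lines seen by filter")
      // Assumption: in-memory lines stand in for spam.txt.
      val lines = sc.parallelize(Seq("wonder woman", "spam", "no wonder"), 2)
      val filtered = lines.filter { line =>
        seen.add(1L)               // side effect, only to reveal when the filter runs
        line.contains("wonder")
      }
      val before = seen.sum        // nothing has executed yet
      val matches = filtered.count()  // the action triggers the whole lineage
      val after = seen.sum         // may exceed 3 if tasks are retried
      (before, matches, after)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Accumulator updates made inside transformations are not guaranteed to be applied exactly once, so this is a debugging aid rather than a reliable counter.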
An action is used either to save a result to some location or to display it. RDD lineage information can also be printed with filteredRdd.toDebugString (filteredRdd is the RDD here). Also, the DAG visualization shows the complete graph in a very intuitive manner as follows:
In Spark, the lineage graph is a graph of dependencies between existing RDDs and new RDDs.
This means that the dependencies between RDDs are recorded in a graph, rather than the actual data.
Source: What is Lineage Graph
DEF: The Spark lineage graph is the set of dependencies between RDDs
• Lineage graphs are maintained for each Spark application separately
• The lineage graph is used to recompute RDDs on demand and to recover lost data if parts of a persisted RDD are lost
• Note: be careful and do not confuse the lineage graph with the
Actions force the evaluation of all (upstream) transformations in the lineage graph of the RDD they are called on
When we talk about RDD graphs, does that mean the lineage graph, the DAG (directed acyclic graph), or both? And when is the lineage graph generated? Is it generated before the DAG of Spark tasks?
An RDD can depend on zero or more other RDDs. For example when you say x = y.map(...), x will depend on y. These dependency relationships can be thought of as a graph.
You can call this graph a lineage graph, as it represents the derivation of each RDD. It is also necessarily a DAG, since it cannot contain a cycle.
Narrow dependencies, where a shuffle is not required (think map and filter) can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler from the graph of RDD dependencies. Stages also depend on each other. The DAGScheduler builds and uses this dependency graph (which is also necessarily a DAG) to schedule the stages.
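The dependency graph described above can be inspected directly through the dependencies member of an RDD. The following sketch assumes a local master; the object name DependencyKinds and the helper kind are hypothetical. It classifies the first dependency of three RDDs in a small chain.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencyKinds {
  // Classifies an RDD by the type of its first dependency on a parent RDD.
  def kind(rdd: RDD[_]): String = rdd.dependencies.headOption match {
    case Some(_: ShuffleDependency[_, _, _]) => "wide (shuffle)"
    case Some(_: NarrowDependency[_])        => "narrow"
    case _                                   => "none (source RDD)"
  }

  def run(): Seq[String] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("dependency-kinds"))
    try {
      val y = sc.parallelize(1 to 10, 2)    // no parent
      val x = y.map(n => (n % 2, n))        // x depends on y, one partition to one
      val z = x.reduceByKey(_ + _)          // z depends on x through a shuffle
      Seq(kind(y), kind(x), kind(z))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The shuffle dependency between x and z is where the DAGScheduler would cut a stage boundary; y and x can be pipelined into the same stage.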
I know that Spark can be operated using Scala, Python and Java. Also, that RDDs are used to store data.
But please explain, what's the architecture of Spark and how does it work internally.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Spark translates the RDD transformations into something called a DAG (Directed Acyclic Graph) and starts the execution.
At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together. E.g. many map operators can be scheduled in a single stage. The final result of a DAG scheduler is a set of stages.
The stages are passed on to the Task Scheduler. The task scheduler launches tasks via cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler doesn't know about dependencies of the stages.
The Worker/Slave executes the tasks.
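Since each stage runs one task per partition, the partition counts of the RDDs on either side of a shuffle give the number of tasks in each stage. Here is a small sketch, assuming a local master and made-up data; the object name PartitionsAndTasks is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionsAndTasks {
  // Returns the partition counts of the input, the mapped RDD and the reduced RDD.
  def run(): (Int, Int, Int) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitions-and-tasks"))
    try {
      val input = sc.parallelize(1 to 100, 4)     // 4 partitions
      val pairs = input.map(n => (n % 5, 1))      // narrow: still 4 partitions
      val reduced = pairs.reduceByKey(_ + _, 2)   // shuffle into 2 partitions
      // An action on `reduced` would therefore run a first stage with 4 tasks
      // (one per input partition) and a second stage with 2 tasks.
      (input.getNumPartitions, pairs.getNumPartitions, reduced.getNumPartitions)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```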
Let's come to how Spark builds the DAG.
At a high level, there are two kinds of transformations that can be applied to RDDs, namely narrow transformations and wide transformations. Wide transformations basically result in stage boundaries.
Narrow transformation - doesn't require the data to be shuffled across the partitions. For example, map, filter, etc.
Wide transformation - requires the data to be shuffled, for example, reduceByKey, etc.
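The difference can be observed by tagging each record with the partition it started in. This is only an illustrative sketch: it assumes a local master, and the object name NarrowVsWide and the tagging scheme are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NarrowVsWide {
  // Returns (records stay in their partition after map,
  //          reduceByKey combines records from different partitions).
  def run(): (Boolean, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("narrow-vs-wide"))
    try {
      val nums = sc.parallelize(0 until 8, 2)
      // Tag every record with the index of the partition it starts in.
      val tagged = nums.mapPartitionsWithIndex((idx, it) => it.map(n => (n, idx)))

      // Narrow: after map, each record is still in its original partition.
      val narrowStays = tagged
        .map { case (n, src) => (n * 10, src) }
        .mapPartitionsWithIndex((idx, it) => it.map { case (_, src) => src == idx })
        .collect()
        .forall(identity)

      // Wide: reduceByKey groups records by key, so one output record can
      // combine inputs that started in different partitions.
      val wideMixes = tagged
        .map { case (n, src) => (n % 4, Set(src)) }
        .reduceByKey(_ ++ _, 2)
        .collect()
        .exists { case (_, sources) => sources.size > 1 }

      (narrowStays, wideMixes)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```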
Let's take an example of counting how many log messages appear at each level of severity.
Following is the log file that starts with the severity level:
INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message
and create the following Scala code to extract the same:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey { (a, b) => a + b }
This sequence of commands implicitly defines a DAG of RDD objects (RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents along with the metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on a RDD, the RDD b keeps a reference to its parent a, that's a lineage.
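The snippet above only defines the lineage; nothing is computed until an action runs. The sketch below adds the action. It assumes a local master and replaces log.txt with the three sample lines held in memory; the object name SeverityCount is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SeverityCount {
  // Counts log lines per severity level.
  def run(): Map[String, Int] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("severity-count"))
    try {
      // Assumption: in-memory lines stand in for log.txt.
      val input = sc.parallelize(Seq(
        "INFO I'm Info message",
        "WARN I'm a Warn message",
        "INFO I'm another Info message"), 2)
      val splitedLines = input.map(line => line.split(" "))
        .map(words => (words(0), 1))
        .reduceByKey { (a, b) => a + b }
      // collect() is the action that makes Spark build and run the stages.
      splitedLines.collect().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```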
To display the lineage of an RDD, Spark provides the debug method toDebugString (in Scala it is called without parentheses). For example, calling toDebugString on the splitedLines RDD outputs the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
 +-(2) MapPartitionsRDD[5] at map at <console>:24 []
    |  MapPartitionsRDD[4] at map at <console>:23 []
    |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
    |  log.txt HadoopRDD[0] at textFile at <console>:21 []
The first line (from the bottom) shows the input RDD, which we created by calling sc.textFile(). See below for a more diagrammatic view of the DAG created from the given RDD.
Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages; the stages are created based on the transformations. The narrow transformations are grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if we have 4 partitions, then 4 sets of tasks will be created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in a bit more detail:
For more detailed information, I suggest going through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan and its lifetime.
Advanced Apache Spark- Sameer Farooqui (Databricks)
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Introduction to AmpLab Spark Internals
The diagram below shows how Apache Spark works internally:
Here is some jargon from Apache Spark that I will be using.
Job:- A piece of code that reads some input from HDFS or local storage, performs some computation on the data and writes some output data.
Stages:- Jobs are divided into stages. Stages are classified as map or reduce stages (this is easier to understand if you have worked with Hadoop and want to correlate). Stages are divided based on computational boundaries; not all computations (operators) can be executed in a single stage, so a job runs over many stages.
Tasks:- Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
DAG:- DAG stands for Directed Acyclic Graph; in the present context it's a DAG of operators.
Executor:- The process responsible for executing a task.
Driver:- The program/process responsible for running the job over the Spark engine.
Master:- The machine on which the Driver program runs.
Slave:- The machine on which the Executor program runs.
All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by combining operators where possible. For instance, assume you submit a Spark job that contains a map operation followed by a filter operation: Spark pipelines the two, so each record passes through both in a single pass. Reordering them so that the filter runs first, reducing the number of records that undergo the map, is something the Catalyst optimizer can do for DataFrame/Dataset queries, where it understands the expressions; for plain RDDs the functions are opaque to Spark and are applied in the order written.
Spark has a small code base and the system is divided in various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph.
When the user runs an action (like collect), the graph is submitted to a DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages.
A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of a DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
The Worker executes the tasks on the Slave. A new JVM is started per job. The worker knows only about the code that is passed to it.
Spark caches the data to be processed, which can make it up to 100 times faster than Hadoop MapReduce for workloads that fit in memory. Spark is highly configurable and can use components that already exist in the Hadoop ecosystem. This has allowed Spark to grow quickly, and within a short time many organisations were already using it in production.
The Spark research paper prescribes a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially in machine learning. However, material that uncovers the internal mechanics of Resilient Distributed Datasets and the Directed Acyclic Graph seems to be lacking in this paper.
Should it be better learned by investigating the source code?
I have also been looking around the web to learn how Spark computes the DAG from the RDD and subsequently executes the tasks.
Beginning with Spark 1.4, visualization has been added through the following three components, which also provide a clear graphical representation of the DAG:
Timeline view of Spark events
Execution DAG
Visualization of Spark Streaming statistics
Refer to link for more information.