Does Spark use MapReduce internally? - apache-spark

Does Spark use MapReduce internally (its own map-reduce)?
The first time I heard somebody say "Spark uses map-reduce", I was confused; I had always learned that Spark was the great adversary of Hadoop MapReduce.
After searching Google I only found one website with a very short explanation: https://dzone.com/articles/how-does-spark-use-mapreduce
The rest of the Internet is about Spark vs. MapReduce.
Then somebody explained to me that when Spark creates an RDD, the data is split into different datasets, and that if you run, for example in Spark SQL, a query that should not need map-reduce, like:
select student
from Table_students
where name = "Enrique"
then internally Spark performs a map-reduce to retrieve the data from the different datasets.
Is that true?
If I'm using Spark MLlib for machine learning, I have always heard that machine learning is not compatible with map-reduce because it needs many iterations and map-reduce uses batch processing.
Is Spark internally using map-reduce in Spark MLlib too?

Spark features an advanced Directed Acyclic Graph (DAG) engine supporting acyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages - Map and Reduce - the DAGs created by Spark can contain any number of stages. The DAG is a strict generalization of the MapReduce model.
This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage and more complex jobs completing in a single run of many stages, rather than having to be split into multiple jobs.
So you can write a map-reduce-style program with Spark, but internally it is executed as a DAG.
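A rough sketch of that difference, assuming a spark-shell session (so sc exists) and a hypothetical input.txt: two wide transformations produce a three-stage DAG, something classic MapReduce would need two chained jobs to express.
val words  = sc.textFile("input.txt").flatMap(_.split("\\s+"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)        // shuffle #1 -> stage boundary
val byFreq = counts.map { case (w, n) => (n, w) }
                   .sortByKey(ascending = false)              // shuffle #2 -> stage boundary
println(byFreq.toDebugString)    // the lineage shows the shuffle dependencies that become stages
byFreq.take(10).foreach(println) // the action triggers the whole DAG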
Reference:
Directed Acyclic Graph DAG in Apache Spark
What is Directed Acyclic Graph in Apache Spark?
What are the Apache Spark concepts around its DAG execution engine, and its overall architecture?
How-to: Translate from MapReduce to Apache Spark

Related

What is the difference between MapReduce and Spark as execution engines in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.
The first one is using Spark directly as the engine, like this tutorial.
Another way is to use Spark as the backend engine for MapReduce, like this tutorial.
In the first tutorial, hive.execution.engine is spark, and I cannot see HDFS involved.
In the second tutorial, hive.execution.engine is still mr, but as there is no Hadoop process, it looks like the backend of mr is Spark.
Honestly, I'm a little bit confused about this. I guess the first one is recommended, as mr has been deprecated. But where is HDFS involved?
I understood it differently.
Normally Hive uses MR as its execution engine, unless you use Impala, but not all distros have that.
But for some time now Spark can be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce works with the native Map and Reduce stages. During execution in Spark, the logical dependencies are turned into physical dependencies.
Now, what is a DAG?
A DAG captures the logical dependencies before execution (think of it as a visual graph).
When we have multiple map and reduce steps, or the output of one reduce is the input to another map, the DAG helps speed up the jobs.
A DAG is built in Tez (right side of the photo) but not in MapReduce (left side).
NOTE:
Apache Spark works on a DAG but has stages in place of Map/Reduce. Tez has a DAG and works on Map/Reduce. To keep it simple I used the Map/Reduce context, but remember that Apache Spark has stages; the concept of the DAG remains the same.
Reason 2:
Map persists its output to disk (it buffers in memory too, but when around 90% of the buffer is filled the output spills to disk). From there the data goes to the merge phase.
But in Apache Spark intermediate data is persisted to memory, which makes it faster.
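As an illustration of that difference, here is a rough sketch (spark-shell style, sc assumed, hypothetical access.log path) that keeps an intermediate RDD in memory so later actions reuse it instead of recomputing from disk:
import org.apache.spark.storage.StorageLevel
val logs   = sc.textFile("access.log")                  // hypothetical input
val errors = logs.filter(_.contains("ERROR"))
                 .persist(StorageLevel.MEMORY_ONLY)     // keep the intermediate data in memory
val total  = errors.count()                             // first action materialises the cache
val byHost = errors.map(_.split(" ")(0)).countByValue() // reuses the in-memory partitions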

What does "cyclic data flow" mean in Apache Spark?

Spark is a DAG execution engine, yet it is also described as supporting cyclic data flow. Aren't "cyclic" and "DAG" opposite concepts? It's surprisingly hard to find the answer to this apparent contradiction.
As you can see here: Understanding your Apache Spark Application Through Visualization, it is possible to visualize the execution DAG using the Spark UI. However, none of the examples on that page shows a cyclic data flow. In the following image you can see one of these examples.
Spark execution DAG example
Can these iterations (cyclic data flows) happen outside the graph? I have read in MapR documentation that "Each Spark job creates a DAG of task stages to be performed on the cluster". Then maybe the cyclic data flow occurs between DAGs (jobs).
Thank you.
OK, it seems that it was a typo or something in the documentation. As of today, we can find this on the Spark homepage:
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
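In other words, iteration lives in the driver program: each pass of the loop submits another job whose DAG is still acyclic. A rough sketch (PageRank-style, assuming an existing links: RDD[(String, Seq[String])]):
var ranks = links.mapValues(_ => 1.0)
for (_ <- 1 to 10) {                       // the loop runs in the driver
  val contribs = links.join(ranks).values.flatMap {
    case (outEdges, rank) => outEdges.map(dst => (dst, rank / outEdges.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.take(5).foreach(println)             // each action ran an acyclic DAG; the "cycle" was only in driver code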

Spark SQL: how does it map to RDD operations?

While learning Spark SQL, I have a question in my mind:
As documented, the SQL execution result is a SchemaRDD, but what happens behind the scenes? How many transformations or actions are invoked in the optimized execution plan, which should be equivalent to hand-written plain RDD code?
If we write code by hand instead of SQL, it may generate some intermediate RDDs, e.g. a series of map() and filter() operations on the source RDD. But the SQL version would not generate intermediate RDDs, correct?
Depending on the SQL content, the generated bytecode also involves partitioning and shuffling, correct? But without intermediate RDDs, how could Spark schedule and execute them on worker machines?
In fact, I still cannot understand the relationship between Spark SQL and Spark Core. How do they interact with each other?
To understand how Spark SQL or the DataFrame/Dataset DSL maps to RDD operations, look at the physical plan Spark generates using explain:
sql("your SQL here").explain
myDataframe.explain
At the very core of Spark, RDD[_] is the underlying datatype that is manipulated using distributed operations. In Spark versions <= 1.6.x DataFrame is RDD[Row] and Dataset is separate. In Spark versions >= 2.x DataFrame becomes Dataset[Row]. That doesn't change the fact that underneath it all Spark uses RDD operations.
For a deeper dive into understanding Spark execution, read Understanding Spark Through Visualization.
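For instance, a rough sketch (Spark >= 2.x, assuming a SparkSession named spark and a hypothetical students table) shows the same query written in SQL and as hand-written RDD operations, with explain revealing the plan Spark actually runs:
val viaSql = spark.sql("SELECT student FROM students WHERE name = 'Enrique'")
viaSql.explain(true)                       // parsed -> analyzed -> optimized -> physical plan

val viaRdd = spark.table("students").rdd   // drop down to the underlying RDD[Row]
  .filter(row => row.getAs[String]("name") == "Enrique")
  .map(row => row.getAs[String]("student"))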

How Spark works internally

I know that Spark can be operated using Scala, Python and Java. Also, that RDDs are used to store data.
But could someone please explain what the architecture of Spark is and how it works internally?
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Spark translates the RDD transformations into something called a DAG (Directed Acyclic Graph) and starts the execution.
At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
The DAG scheduler divides operators into stages of tasks. A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together; e.g., many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.
The Worker/Slave executes the tasks.
Let's look at how Spark builds the DAG.
At a high level, there are two kinds of transformations that can be applied to RDDs, namely narrow transformations and wide transformations. Wide transformations result in stage boundaries.
Narrow transformation - doesn't require the data to be shuffled across the partitions. For example, map, filter, etc.
Wide transformation - requires the data to be shuffled, for example, reduceByKey, etc.
Let's take an example of counting how many log messages appear at each level of severity.
Following is a log file in which each line starts with the severity level:
INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message
and the following Scala code extracts the counts per level:
val input = sc.textFile("log.txt")
val splitLines = input.map(line => line.split(" "))  // split each line into words
  .map(words => (words(0), 1))                       // key by the severity level
  .reduceByKey { (a, b) => a + b }                   // count occurrences per level
This sequence of commands implicitly defines a DAG of RDD objects (RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents along with the metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on a RDD, the RDD b keeps a reference to its parent a, that's a lineage.
To display the lineage of an RDD, Spark provides the toDebugString() method. For example, executing toDebugString() on the splitLines RDD will output the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
The first line (from the bottom) shows the input RDD, which we created by calling sc.textFile(). Below is a more diagrammatic view of the DAG created from the given RDD.
Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages; the stages are created based on the transformations. The narrow transformations are grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if we have 4 partitions in this example, then there will be 4 sets of tasks created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in a bit more detail:
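A rough sketch of that relationship (spark-shell style, sc assumed): the partition counts you request on the input and on the shuffle determine how many tasks each stage submits.
val input = sc.textFile("log.txt", minPartitions = 4)    // request at least 4 input partitions
println(input.getNumPartitions)                          // e.g. 4 -> 4 tasks in the first stage
val counts = input.map(line => (line.split(" ")(0), 1))
                  .reduceByKey(_ + _, numPartitions = 2) // the shuffle stage runs 2 tasks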
For more detailed information I suggest going through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and the job lifecycle.
Advanced Apache Spark- Sameer Farooqui (Databricks)
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Introduction to AmpLab Spark Internals
The diagram below shows how Apache Spark works internally:
Here is some Apache Spark jargon I will be using (a small configuration sketch after this list ties the terms together).
Job:- A piece of code which reads some input from HDFS or local storage, performs some computation on the data, and writes some output data.
Stages:- Jobs are divided into stages. Stages can be thought of as map or reduce stages (it's easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; all computations (operators) cannot be executed in a single stage, so the work happens over many stages.
Tasks:- Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
DAG:- DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
Executor:- The process responsible for executing a task.
Driver:- The program/process responsible for running the job on the Spark engine.
Master:- The machine on which the Driver program runs.
Slave:- The machine on which the Executor program runs.
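A rough configuration sketch tying those terms together (the master URL and sizes are placeholders, and exact keys can vary by cluster manager): the driver is this program, the master is named in the configuration, and executors are the JVMs on the slaves that will run the tasks.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("internals-glossary-demo")
  .setMaster("spark://master-host:7077")  // hypothetical standalone master URL
  .set("spark.executor.memory", "2g")     // memory per executor JVM
  .set("spark.executor.cores", "2")       // task slots per executor

val sc = new SparkContext(conf)           // the driver process starts here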
All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. For instance, let's assume that you have to submit a Spark job which contains a map operation followed by a filter operation. The Spark optimizer can rearrange the order of these operators, since filtering reduces the number of records that undergo the map operation.
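One way to see this kind of reordering concretely is with the DataFrame API, where the Catalyst optimizer can push a filter ahead of later per-row work. A rough sketch (assuming a SparkSession named spark and a hypothetical events.parquet file with a level column):
import org.apache.spark.sql.functions.upper
import spark.implicits._

val events = spark.read.parquet("events.parquet")  // hypothetical input
val result = events
  .withColumn("level_uc", upper($"level"))         // per-row ("map"-like) work
  .filter($"level" === "ERROR")                    // written last, but may be evaluated earlier

result.explain(true)  // the physical plan shows where the Filter / PushedFilters ended up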
Spark has a small code base and the system is divided into various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph.
When the user runs an action (like collect), the graph is submitted to the DAG scheduler. The DAG scheduler divides the operator graph into (map- and reduce-like) stages.
A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
The worker executes the tasks on the slave. Executor JVMs are started per application and run the tasks of its jobs; the worker knows only about the code that is passed to it.
Spark can cache the data to be processed, allowing it to be up to 100 times faster than Hadoop. Spark is highly configurable and capable of utilizing the components already existing in the Hadoop ecosystem. This has allowed Spark to grow quickly, and in a short time many organisations are already using it in production.

How mapping/reducing phases work in Spark

I'm coming from a MapReduce background and I'm quite new to Spark. I could not find an article explaining the architectural difference between MapReduce and Spark. My understanding so far is that the only difference between MapReduce and Spark is the notion of 'in-memory' processing. That is, Spark has mapping/reducing phases and they might run on two different nodes within the cluster. Pairs with the same keys are transferred to the same reducer and there is a shuffling phase involved. Am I correct? Or is there some difference in the way mapping and reducing stages are done and...
I think it's directly on point, so I don't mind pointing you to a blog post I wrote:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Spark is a large superset of MapReduce, in the sense that you can express MapReduce with Spark operators, but a lot of other things too. It has a large set of small operations from which you construct pipelines. So there's not a 1:1 mapping, but you can identify how a lot of MapReduce elements correspond to Spark. Or, put differently: MapReduce actually gives you two operations that do a lot more than 'map' and 'reduce', which may not have been obvious so far.
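As a concrete illustration, here is a rough sketch of the classic MapReduce word count expressed with Spark operators (spark-shell style, sc assumed, placeholder HDFS paths): flatMap and map play the mapper's role, reduceByKey the shuffle plus reducer.
val counts = sc.textFile("hdfs:///data/input.txt")  // placeholder input path
  .flatMap(_.split("\\s+"))                         // mapper: emit one record per word
  .map(word => (word, 1))                           // mapper: emit (key, 1) pairs
  .reduceByKey(_ + _)                               // shuffle + reducer: sum per key
counts.saveAsTextFile("hdfs:///data/wordcounts")    // placeholder output path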
