Can someone explain which component of the Spark architecture converts a Spark application to a DAG?
Can someone also help me with where I can find the complete internal workings of the Spark architecture in absolute depth?
I am trying to understand the Apache Spark architecture in depth.
As a first step, this is what I understood:
A Spark application is converted to a DAG (Directed Acyclic Graph). This DAG is scheduled by the DAG Scheduler and executed according to the execution plan prepared by Spark's physical execution engine (Tungsten).
That would be the Catalyst Optimizer. This article discusses the Catalyst Optimizer in quite some detail.
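As a quick way to see Catalyst at work, here is a minimal, illustrative sketch (the dataset, column names, and object name are invented for this example): calling explain(true) on a DataFrame prints the parsed, analyzed, and optimized logical plans plus the physical plan that Catalyst produces.

import org.apache.spark.sql.SparkSession

object CatalystPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-peek")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny illustrative dataset; the names and values are invented.
    val people = Seq(("Enrique", 21), ("Maria", 23)).toDF("name", "age")

    // explain(true) prints the parsed/analyzed/optimized logical plans
    // and the physical plan that Catalyst hands to the execution layer.
    people.filter($"age" > 21).select($"name").explain(true)

    spark.stop()
  }
}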
Don't hesitate to look at the source code if you're after extreme detail; you'll always learn something new :D
Hope this helps!
I would like to confirm some aspects that, from reading all the blogs, Databricks sources, and the experts (Holden, Warren, et al.), still seem poorly explained. I also note some unanswered questions out there on the net regarding this topic.
My questions are then the following:
My understanding is that for RDDs we have the DAG Scheduler, which creates the Stages in a simple manner.
My understanding, based on reading elsewhere to date, is that for DFs and DSs we:
Use Catalyst instead of the DAG Scheduler. But I am not convinced.
Have Tungsten.
As the DAG applies to DFs and DSs as well (obviously), I am left with one question, just to be sure:
Is it Catalyst that creates the Stages as well?
This may seem a silly question, but I noted a question on Disable Spark Catalyst Optimizer here on SO. That would imply no.
In addition, as the Spark paradigm is Stage-based (shuffle boundaries), it seems to me that deciding Stages is not a Catalyst thing.
Therefore my conclusion is that the DAG Scheduler is still used for Stages with DFs and DSs, but I am looking for confirmation.
Moreover, this picture implies that there is still a DAG Scheduler.
This picture from the Databricks 2019 Summit seems to contrast with the statement found on a blog:
An important element helping Dataset to perform better is Catalyst Optimizer (CO), an internal query optimizer. It "translates" transformations used to build the Dataset to physical plan of execution. Thus, it's similar to DAG scheduler used to create physical plan of execution of RDD. ...
I see many unanswered questions on SO about DAGs with DFs etc., and a lot of the material out there is out of date as it is RDD-related. So, as a consequence, I asked around a few of my connections with Spark knowledge and noted they were remiss in providing a suitable answer.
Let me try to clarify these terms for you.
The Spark Scheduler is responsible for scheduling tasks for execution. It manages where jobs will be scheduled, whether they will run in parallel, etc. The Spark Scheduler works together with the Block Manager and the cluster backend to efficiently utilize cluster resources for high performance across various workloads. The DAGScheduler is a part of this.
Catalyst is the optimizer component of Spark SQL. It performs query optimization and generates multiple execution plans, out of which the most optimized one is selected for execution; that plan is ultimately expressed in terms of RDDs.
Tungsten is the umbrella project that was focused on improving the CPU and memory utilization of Spark applications.
The DAGScheduler is responsible for generating stages and scheduling them. It breaks each RDD graph at shuffle boundaries, depending on whether the dependencies are "narrow" or shuffle dependencies. It also determines where each task should be executed based on the current cache status.
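To make the shuffle-boundary point concrete, here is a minimal sketch (the data and object name are invented): toDebugString on an RDD prints the lineage, and each indented block corresponds to a stage that the DAGScheduler cuts at a shuffle dependency.

import org.apache.spark.sql.SparkSession

object StageBoundaries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundaries")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    val counts = words
      .map(w => (w, 1))                 // narrow dependency: same stage
      .reduceByKey(_ + _)               // shuffle dependency: stage boundary
      .filter { case (_, n) => n > 1 }  // narrow again: same stage as the shuffle read

    // The indented blocks in the lineage correspond to the stages
    // the DAGScheduler creates when the action below runs.
    println(counts.toDebugString)
    counts.collect()

    spark.stop()
  }
}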
Well, I searched a bit more and found a 'definitive' source: a Spark Summit 2019 slide from David Vrba. It is about Spark SQL and shows the DAG Scheduler. So we can conclude that Catalyst does not decide anything about Stages. Or is this wrong, and are the answer above and the statement below correct? My take is that the picture implies otherwise, so no.
The statement I read elsewhere on Catalyst: ...
An important element helping Dataset to perform better is Catalyst Optimizer (CO), an internal query optimizer. It "translates" transformations used to build the Dataset to physical plan of execution. Thus, it's similar to DAG scheduler used to create physical plan of execution of RDD. ...
was a little misleading. That said, checking elsewhere to be sure revealed no clear statements until this one.
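For what it is worth, here is a rough way to see that division of labour yourself (a minimal sketch; the DataFrame and object name are invented, and queryExecution is a developer-facing API): Catalyst selects a physical plan, that plan is compiled down to RDDs, and it is this RDD graph that the DAG Scheduler splits into stages at shuffle boundaries.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-stages")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    val agg = df.groupBy($"key").agg(sum($"value"))

    // Catalyst's physical plan: note the Exchange (shuffle) operator.
    agg.explain()

    // The same query compiled down to RDDs; the indentation in the
    // lineage marks the shuffle boundary where the DAGScheduler cuts a stage.
    println(agg.queryExecution.toRdd.toDebugString)

    spark.stop()
  }
}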
It looks like there are two ways to use Spark as the backend engine for Hive.
The first one is using Spark directly as the engine, like in this tutorial.
The other way is to use Spark as the backend engine for MapReduce, like in this tutorial.
In the first tutorial, hive.execution.engine is spark, and I cannot see HDFS involved.
In the second tutorial, hive.execution.engine is still mr, but as there is no Hadoop process, it looks like the backend of mr is Spark.
Honestly, I'm a little bit confused by this. I guess the first one is recommended, as mr has been deprecated. But where does HDFS come in?
I understood it differently.
Normally Hive uses MR as its execution engine, unless you use Impala, but not all distros have this.
But for a while now Spark can also be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
Apache Spark builds a DAG (Directed Acyclic Graph), whereas MapReduce sticks with its native Map and Reduce. During execution in Spark, the logical dependencies are turned into physical dependencies.
Now, what is a DAG?
A DAG captures the logical dependencies before execution. (Think of it as a visual graph.)
When we have multiple maps and reduces, or the output of one reduce is the input to another map, the DAG helps to speed up the jobs.
The DAG is built in Tez (right side of the photo) but not in MapReduce (left side).
NOTE:
Apache Spark works on a DAG but has stages in place of Map/Reduce. Tez has a DAG and works on Map/Reduce. To keep it simpler I used the Map/Reduce context, but remember that Apache Spark has stages; the concept of the DAG remains the same.
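As a small sketch of what that means in Spark (the data and object name are invented): a single action can trigger a chain of several shuffles, i.e. several stages, within one job, whereas classic MapReduce would need one Map/Reduce job per shuffle.

import org.apache.spark.sql.SparkSession

object MultiStageJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-stage-job")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("a b a", "b c", "a c c"))

    val grouped = lines
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)               // shuffle #1 -> stage boundary
      .map { case (w, n) => (n, w) }
      .groupByKey()                     // shuffle #2 -> another stage boundary
      .mapValues(_.toList.sorted)

    // One action, one job, three stages chained in a single DAG.
    grouped.collect().foreach(println)

    spark.stop()
  }
}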
Reason 2:
In MapReduce, the map phase persists its output to disk (it buffers in memory too, but when about 90% of the buffer is filled the output spills to disk). From there the data goes to the merge phase.
But in Apache Spark, intermediate data can be kept in memory, which makes it faster.
Check this link for details
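A minimal sketch of the in-memory point (the data and object name are invented): persisting an intermediate RDD in memory lets several later actions reuse it, whereas the equivalent intermediate output in MapReduce would be written out to disk between jobs.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryIntermediate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-memory-intermediate")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val base = sc.parallelize(1 to 1000000)

    // Keep the intermediate result in memory for reuse.
    val squares = base.map(x => x.toLong * x).persist(StorageLevel.MEMORY_ONLY)

    // Both actions below reuse the cached intermediate data
    // instead of recomputing it from the source.
    println(squares.sum())
    println(squares.max())

    squares.unpersist()
    spark.stop()
  }
}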
Is Spark using MapReduce internally? (its own map-reduce)
The first time I heard somebody tell me "Spark uses map-reduce", I was so confused; I had always learned that Spark was the great adversary of Hadoop MapReduce.
After checking on Google, I only found a website that gives a too-short explanation of this: https://dzone.com/articles/how-does-spark-use-mapreduce
But the rest of the Internet is about Spark vs MapReduce.
Then somebody explained to me that when Spark creates an RDD the data is split into different partitions, and that if you are using, for example, Spark SQL with a query that should not need map-reduce, like:
select student
from Table_students
where name = "Enrique"
then internally Spark is doing a map-reduce to retrieve the data (from the different partitions).
Is that true?
If I'm using Spark MLlib for machine learning: I have always heard that machine learning is not a good fit for map-reduce, because it needs many iterations and map-reduce uses batch processing.
Is Spark internally using map-reduce in MLlib too?
Spark features an advanced Directed Acyclic Graph (DAG) engine supporting acyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages, Map and Reduce, the DAGs created by Spark can contain any number of stages. The DAG is a strict generalization of the MapReduce model.
This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.
So, you can write a map-reduce-style program in Spark, but internally it actually uses a DAG.
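To tie this back to the SQL example in the question above, here is a minimal sketch (the table, data, and object name are invented): a plain filter compiles to a single, shuffle-free stage, while a grouped count needs a shuffle (an Exchange in the plan) and therefore an extra stage; there is no fixed Map+Reduce pair.

import org.apache.spark.sql.SparkSession

object SimpleVsShuffled {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("simple-vs-shuffled")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val students = Seq("Enrique", "Maria", "Enrique").toDF("name")
    students.createOrReplaceTempView("Table_students")

    // A plain filter: narrow transformations only, one stage, no shuffle.
    spark.sql("SELECT name AS student FROM Table_students WHERE name = 'Enrique'").explain()

    // A grouped count: needs a shuffle (Exchange), hence an extra stage.
    spark.sql("SELECT name, COUNT(*) AS n FROM Table_students GROUP BY name").explain()

    spark.stop()
  }
}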
Reference:
Directed Acyclic Graph DAG in Apache Spark
What is Directed Acyclic Graph in Apache Spark?
What are the Apache Spark concepts around its DAG execution engine, and its overall architecture?
How-to: Translate from MapReduce to Apache Spark
Spark is described as a DAG execution engine that supports cyclic data flow. Aren't cyclic and DAG opposite concepts? It's surprisingly hard to find the answer to this apparent contradiction.
As you can see here: Understanding your Apache Spark Application Through Visualization, it is possible to visualize the execution DAG using the Spark UI. However, none of the examples on that page shows a cyclic data flow. In the following image you can see one of those examples.
Spark execution DAG example
Can these iterations (cyclic data flows) happen outside the graph? I have read in the MapR documentation that "Each Spark job creates a DAG of task stages to be performed on the cluster". So maybe the cyclic data flow occurs between DAGs (jobs)?
Thank you.
OK, it seems that it was a typo or something in the documentation. As of today, we can find this on the Spark homepage:
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
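One way to square the two statements, as a rough sketch (the data and object name are invented): iteration lives in the driver program as an ordinary loop; each action inside the loop submits its own job, and each of those jobs has a strictly acyclic DAG.

import org.apache.spark.sql.SparkSession

object IterativeDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-driver")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()

    // The "cycle" is just a loop in the driver: each iteration builds a
    // new (acyclic) DAG and submits it as a job when count() runs.
    var threshold = 0.0
    for (i <- 1 to 3) {
      val remaining = data.filter(_ > threshold).count()
      println(s"iteration $i: $remaining elements above $threshold")
      threshold += 1.0
    }

    spark.stop()
  }
}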
I agree that iterative and interactive programming paradigms work much better with Spark than with map-reduce. I also agree that we can use HDFS, or any Hadoop data store such as HBase, as the storage layer for Spark.
Therefore, my question is: are there any real-world use cases where Hadoop MR is better than Apache Spark in those contexts? Here "better" is used in terms of performance, throughput, and latency. Is Hadoop MR still the better choice than Spark for BATCH processing?
If so, can anyone please explain the advantages of Hadoop MR over Apache Spark? Please keep the entire scope of the discussion to the COMPUTATION LAYER.
As you said, for iterative and interactive programming Spark is better than Hadoop. But Spark has a huge need for memory; if memory is not sufficient, it can easily throw OOM exceptions, whereas Hadoop deals with that situation very well because it has a good fault-tolerance mechanism.
Secondly, if data skew occurs, Spark may also collapse. I compare Spark and Hadoop on system robustness, because that decides the success of a job.
Recently I tested Spark and Hadoop performance using some benchmarks; according to the results, Spark's performance is not better than Hadoop's on some workloads, e.g. k-means and PageRank. Maybe memory is a limitation for Spark.