DAG Scheduler vs. Catalyst for Spark

I would like to confirm some aspects that, from reading blogs, Databricks sources, and the experts Holden, Warren et al., still seem poorly explained. I also note some unanswered questions out there on the net regarding this topic.
The following then:
My understanding is that for RDDs we have the DAG Scheduler, which creates the Stages in a straightforward manner.
My understanding, based on reading elsewhere to date, is that for DFs and DSs we:
Use Catalyst instead of the DAG Scheduler. But I am not convinced.
Have Tungsten.
As the DAG applies to DFs and DSs as well (obviously), I am left with one question, just to be sure:
Is it Catalyst that creates the Stages as well?
This may seem a silly question, but I noted a question on Disable Spark Catalyst Optimizer here on SO. That would imply no.
In addition, as the Spark paradigm is Stage based (shuffle boundaries), it seems to me that deciding Stages is not a Catalyst thing.
Therefore my conclusion is that the DAG Scheduler is still used for Stages with DF's and DS's, but I am looking for confirmation.
Moreover, this picture from the Databricks 2019 summit implies that there is still a DAG Scheduler, and it seems in contrast to the statement found on a blog:
An important element helping Dataset to perform better is Catalyst
Optimizer (CO), an internal query optimizer. It "translates"
transformations used to build the Dataset to physical plan of
execution. Thus, it's similar to DAG scheduler used to create physical
plan of execution of RDD. ...
I see many unanswered questions on SO on the DAGs with DFs etc., and a lot of the material is out of date as it is RDD-related. So, as a consequence, I asked around a few of my connections with Spark knowledge on this and noted they were remiss in providing a suitable answer.

Let me try to clear these terminologies for you.
Spark Scheduler is responsible for scheduling tasks for execution. It manages where jobs will be scheduled, whether they will be scheduled in parallel, etc. The Spark Scheduler works together with the Block Manager and the Cluster Backend to efficiently utilize cluster resources for high performance of various workloads. The DAGScheduler is a part of this.
Catalyst is the optimizer component of Spark. It performs query optimizations and creates multiple execution plans, out of which the most optimized one is selected for execution; that selected physical plan is expressed in terms of RDDs.
Tungsten is the umbrella project that was focused on improving the CPU and memory utilization of Spark applications.
DAGScheduler is responsible for the generation of stages and their scheduling. It breaks each RDD graph at shuffle boundaries, i.e. according to whether the dependencies are "narrow" or shuffle dependencies. It also determines where each task should be executed based on the current cache status.
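To make the split of responsibilities concrete, here is a minimal sketch (assuming a spark-shell style SparkSession named spark): explain() shows the plans Catalyst produces, and the Exchange operator in the physical plan is the shuffle boundary at which the DAGScheduler later cuts the job into stages.

```scala
// A minimal sketch, assuming a spark-shell style SparkSession named `spark`.
import spark.implicits._

val df  = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
val agg = df.groupBy("key").sum("value")

// Catalyst: prints the parsed, analyzed, optimized and physical plans.
// The "Exchange" operator in the physical plan marks a shuffle.
agg.explain(true)

// DAGScheduler: only when an action runs is the job cut into stages,
// at that same Exchange/shuffle boundary (visible in the Spark UI).
agg.collect()
```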

Well, I searched a bit more and found a 'definitive' source: a Spark Summit 2019 slide from David Vrba. It is about Spark SQL and shows the DAG Scheduler. So we can conclude that Catalyst does not decide anything about Stages. Or is this wrong, and is the above answer correct and the statement below correct? My take is that the picture implies differently, so no.
The statement I read elsewhere on Catalyst: ...
An important element helping Dataset to perform better is Catalyst
Optimizer (CO), an internal query optimizer. It "translates"
transformations used to build the Dataset to physical plan of
execution. Thus, it's similar to DAG scheduler used to create physical
plan of execution of RDD.
...
was a little misleading. That said, checking elsewhere to be sure revealed no clear statements until this.

Related

What exactly are the benefits of Spark over MapReduce if I'm doing batch processing?

I know Spark has an in-memory capability that is very useful for iterative jobs. But what if my requirement is traditional batch-processing ETL? Does Spark provide me any benefit there? Please give any pointers related to this; it will help me a lot.
How does Spark help me in case there is no iterative work and it's a batch process?
Is there any scenario where MapReduce would perform better than Spark?
Assuming you know MapReduce, then consider:
writing word counting in MR when you need to list the top N words: far more work over multiple steps in MR vs. 7 or 8 lines in Spark (see the sketch after this list).
for those doing dimension processing à la the dimensional model, a lot easier to do in Spark.
Spark Structured Streaming use cases...
certain tasks with extremely high amounts of data may well be better using MR if you cannot acquire enough hardware or Cloud compute resources, i.e. writing to disk and processing per functional step.
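For illustration, here is a minimal sketch of the top-N word count mentioned above, assuming a spark-shell style SparkContext sc and a hypothetical input file input.txt.

```scala
// A minimal sketch, assuming a SparkContext `sc` and a hypothetical "input.txt".
val topN = 10

val topWords = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)                 // one shuffle: count per word
  .sortBy(_._2, ascending = false)    // another shuffle: order by count
  .take(topN)                         // action: returns the top N to the driver

topWords.foreach(println)
```

The equivalent in classic MR typically needs at least two chained jobs (count, then sort/limit).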

What is the optimal number of Stages in a Spark App?

Is there some rule of thumb or best practice regarding the number of stages in a Spark job?
When do you consider breaking a job into smaller pieces?
I found smaller jobs easier to analyze and optimize, but on the other hand loading/extracting data between each job comes with a cost.
There is no hard rule about the optimal number of Stages for a Spark App.
It depends on your functionality, which dictates the number of Stages.
Certain aspects result in Stages due to the Spark Architecture - which makes sense.
Catalyst & Tungsten optimize and fuse code, but they cannot obviate "shuffle boundaries", each of which means a new Stage. That is also not their task. The DAG Scheduler (under the hood for DataFrames) does that.
You can .cache things to reduce re-computation for subsequent Actions in a Spark App, but that has a certain cost as well.
You can use things that reduce "shuffling", e.g. reduceByKey for legacy RDDs (see the sketch after this answer).
For DataFrames and Datasets, Spark will generate more optimal Execution Plans (in general) and indeed some extra Stages (e.g. for computing pivot values when using pivot).
You partially answer your own question with the aspect of writing and loading, but bucketBy can help with such an approach. However, I am not sure why the complexity is greater with a larger Spark App - unless you mean using intermediate tables with fewer JOINs and UNIONs in the smaller pieces. But the number of Stages is then only a consequence, and not so much a deciding factor.
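As a small illustration of the caching and shuffle-reduction points above, a sketch assuming an existing SparkContext sc:

```scala
// A sketch, assuming a SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Prefer reduceByKey: values are partially aggregated map-side before the shuffle.
val summed = pairs.reduceByKey(_ + _).cache()

summed.count()    // first action: computes the result and caches it
summed.collect()  // second action: served from the cache, no recomputation

// groupByKey sends every value across the shuffle; avoid it for simple aggregations.
val grouped = pairs.groupByKey().mapValues(_.sum)
```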

What are the differences between GraphX's memory-based shuffle and Spark Core's shuffle

From the paper "GraphX: Graph Processing in a Distributed Dataflow Framework" (Gonzalez et al. 2014) I learned that GraphX modified Spark shuffle:
Memory-based Shuffle: Spark’s default shuffle implementation materializes the temporary data to disk. We modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout.
(The paper does not explain anything more on this point.)
It seems that this change aims at optimizing shuffles in the context of highly iterative graph processing algorithms.
How does this "Memory-based shuffle" work exactly, how does it differ from Spark Core's, and what are the pros and cons: why is it well suited for GraphX use cases and not for other Spark jobs?
I failed to understand the big picture directly from GraphX/Spark sources, and I also struggled to find the information out there.
Apart from an ideal answer, comments with links to sources are welcome too.
I failed to understand the big picture directly from GraphX/Spark sources
Because it was never included in the mainstream distribution.
Back when the first GraphX version was developed, Spark used hash-based shuffle, which was rather inefficient. It was one of the main bottlenecks in Spark jobs, and there was significant research into developing alternative shuffle strategies.
Since GraphX algorithms are iterative and join-based, improving shuffle speed was an obvious path.
Since then, a pluggable shuffle manager has been introduced, as well as the new sort-based shuffle, which finally turned out to be fast enough to make both the hash-based shuffle and the ongoing work on a generic memory-based shuffle obsolete.
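For historical context only, a sketch of how the shuffle implementation used to be selected via configuration in the Spark 1.x era (the setting is gone now that sort-based shuffle is the only built-in one):

```scala
// Historical sketch only: Spark 1.x exposed the shuffle implementation as a
// configuration knob; since Spark 2.0 sort-based shuffle is the only built-in
// one, so this setting no longer applies.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-demo")
  .set("spark.shuffle.manager", "sort") // Spark 1.x-era values included "hash" and "sort"
```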

What is the difference between MapReduce and Spark as execution engines in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.
The first one is directly using Spark as the engine, like this tutorial.
Another way is to use Spark as the backend engine for MapReduce, like this tutorial.
In the first tutorial, hive.execution.engine is spark, and I cannot see HDFS involved.
In the second tutorial, hive.execution.engine is still mr, but as there is no Hadoop process, it looks like the backend of mr is Spark.
Honestly, I'm a little bit confused about this. I guess the first one is recommended, as mr has been deprecated. But where is HDFS involved?
I understood it differently.
Normally Hive uses MR as the execution engine, unless you use Impala, but not all distros have this.
But for some time now Spark can be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce goes with native Map and Reduce. During execution in Spark, the logical dependencies are turned into physical dependencies.
Now what is a DAG?
A DAG builds up the logical dependencies before execution (think of it as a visual graph).
When we have multiple maps and reduces, or the output of one reduce is the input to another map, a DAG helps to speed up the jobs.
A DAG is built in Tez but not in MapReduce.
NOTE:
Apache Spark works on a DAG but has Stages in place of Map/Reduce. Tez has a DAG and works on Map/Reduce. To keep it simple I used the Map/Reduce context, but remember that Apache Spark has Stages. The concept of the DAG remains the same.
Reason 2:
Map persists its output to disk (there is a buffer too, but once it is about 90% full the output spills to disk). From there the data goes to the merge phase.
But in Apache Spark the intermediate data is persisted to memory, which makes it faster.
Check this link for details
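A minimal sketch of that point, assuming a SparkContext sc and a hypothetical input file: several transformations are chained into one DAG and only materialized on an action, with the intermediate result kept in memory rather than written out between steps as in MapReduce.

```scala
// A minimal sketch, assuming a SparkContext `sc` and a hypothetical "events.log".
import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("events.log")
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
  .persist(StorageLevel.MEMORY_ONLY)   // keep the intermediate result in memory

// Both actions reuse the in-memory intermediate data; in MapReduce each step
// would typically be a separate job writing its output to disk/HDFS.
val total  = cleaned.count()
val errors = cleaned.filter(_.contains("error")).count()
```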

What does "cyclic data flow" mean in Apache Spark?

Spark is a DAG execution engine. Aren't cyclic and DAG opposite concepts? It's surprisingly hard to find the answer to this apparent contradiction.
As you can see here: Understanding Your Apache Spark Application Through Visualization, it is possible to visualize the execution DAG using the Spark UI. However, none of the examples on that page shows a cyclic data flow. In the following image you can see one of these examples.
Spark execution DAG example
Can these iterations (cyclic data flows) be outside the graph? I have read on MapR that "Each Spark job creates a DAG of task stages to be performed on the cluster". Then, maybe the cyclic data flow occurs between DAGs (jobs).
Thank you.
OK, it seems that it was a typo or something in the documentation. As of today, we can find this on the Spark homepage:
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
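To see why iteration does not contradict the acyclic part, here is a sketch assuming a SparkContext sc: the loop runs in the driver program, and each pass only builds new RDDs on top of the previous ones, so the dependency graph stays a DAG.

```scala
// A sketch, assuming a SparkContext `sc`: each iteration extends the lineage
// with new (acyclic) stages; no cycle is ever added to the graph itself.
var ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0), ("c", 1.0)))

for (_ <- 1 to 3) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)
}

ranks.collect().foreach(println)
```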
