How mapping/reducing phases work in Spark - apache-spark

I'm coming from a MapReduce background and I'm quite new to Spark. I could not find an article explaining the architectural difference between MapReduce and Spark. My understanding so far is that the only difference between MapReduce and Spark is the notion of 'in-memory' processing. That is, Spark has mapping/reducing phases, and they might run on two different nodes within the cluster. Pairs with the same keys are transferred to the same reducer, and there is a shuffling phase involved. Am I correct? Or is there some difference in the way the mapping and reducing stages are done and...

I think it's directly on point, so I don't mind pointing you to a blog post I wrote:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Spark is a large superset of MapReduce, in the sense that you can express MapReduce with Spark operators, but a lot of other things too. It has a large set of small operations from which you construct pipelines. So there's not a 1:1 mapping, but you can identify how a lot of MapReduce elements correspond to Spark. Or: MapReduce actually gives you two operations that do a lot more than 'map' and 'reduce', which may not have been obvious so far.
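To make the correspondence concrete, here is a minimal sketch (my own, not taken from the linked post) of the classic MapReduce word count expressed with Spark's RDD operators; the SparkContext and HDFS paths are placeholder assumptions.

from pyspark import SparkContext

# Placeholder setup; in a real job this would come from spark-submit or an existing session.
sc = SparkContext(appName="wordcount-sketch")

counts = (
    sc.textFile("hdfs:///input/text")        # roughly the InputFormat / RecordReader
      .flatMap(lambda line: line.split())    # the Mapper body: one record per word
      .map(lambda word: (word, 1))           # emit (key, value) pairs, as a Mapper would
      .reduceByKey(lambda a, b: a + b)       # shuffle + the Reducer body: sum per key
)

counts.saveAsTextFile("hdfs:///output/wordcount")  # roughly the OutputFormat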

Related

What exactly are the benefits of Spark over Mapreduce if I'm doing a batch processing?

I know Spark has an in-memory capability that is very useful for iterative jobs. But what if my requirement is traditional batch-processing ETL? Does Spark provide me any benefit there? Please give me any pointers related to this, it will help me a lot.
How does Spark help me when there is no iterative work and it's a batch process?
Is there any scenario where MapReduce would perform better than Spark?
Assuming you know Map Reduce, then consider:
writing Word Count in MR when you need to list the top N words: far more work over multiple steps in MR vs. 7 or 8 lines in Spark (see the sketch after this list).
for those doing dimension processing à la the dimensional model, it's a lot easier to do in Spark.
Spark Structured Streaming use cases...
Certain tasks with extremely high amounts of data may well be better off using MR if you cannot acquire enough hardware or cloud compute resources, i.e. writing to disk and processing per functional step.
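As a rough illustration of the first point, a top-N word count fits in a handful of DataFrame lines. This is a sketch of my own (not the answerer's code); the session, input path and N = 10 are placeholder assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-n-words-sketch").getOrCreate()

top_n = (
    spark.read.text("hdfs:///input/text")                          # one row per line
         .select(F.explode(F.split("value", r"\s+")).alias("word"))
         .where(F.col("word") != "")
         .groupBy("word").count()
         .orderBy(F.desc("count"))
         .limit(10)                                                 # top N = 10
)
top_n.show()

In classic MR this would typically be at least two chained jobs: one to count and one to sort and take the top N.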

What is optimal number of Stages in Spark App?

Is there some rule of thumb or best practice regarding the number of stages in a Spark job?
When do you consider breaking a job into smaller pieces?
I found smaller jobs easier to analyze and optimize, but on the other hand loading/extracting data between each job comes with a cost.
There is no hard rule about the optimal number of Stages for a Spark App.
It is your functionality that dictates the number of Stages.
Certain aspects result in Stages due to the Spark architecture - which makes sense.
Catalyst & Tungsten optimize and fuse code, but they cannot obviate 'shuffle boundaries', each of which means a new Stage. That is also not their task; the DAG Scheduler (working under the hood for DataFrames) handles that.
You can .cache things to reduce re-computation for subsequent Actions in a Spark App, but that has a certain cost as well.
You can use things that reduce "shuffling", e.g. reduceByKey for legacy RDDs.
For DataFrames and Datasets, Spark will generate more optimal Execution Plans (in general), and indeed some extra Stages (e.g. for computing pivot values when using pivot).
You partially answer your own question with the aspect of writing and loading, and bucketBy can help with such an approach. However, I am not sure why the complexity would be greater with a larger Spark App - unless you mean using intermediate tables with fewer JOINs and UNIONs in the smaller pieces. But the number of Stages is then only a consequence, not so much a deciding factor.
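To illustrate the shuffle and caching points above, here is a minimal sketch of my own (assuming an existing SparkContext sc and toy data):

# Placeholder data; in practice this comes from a real source.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 1000)

# groupByKey ships every (key, value) record across the shuffle boundary
# and only then sums on the other side.
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates within each partition first, so far less data
# crosses the shuffle boundary that starts the new Stage.
sums_reduced = pairs.reduceByKey(lambda a, b: a + b)

# Caching trades memory for avoided recomputation across subsequent Actions.
sums_reduced.cache()
print(sums_reduced.count())   # first Action materializes and caches the result
print(sums_reduced.take(2))   # second Action reuses the cached data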

What is the difference between MapReduce and Spark as the execution engine in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.
The first one is directly using Spark as the engine. Like this tutorial.
Another way is to use Spark as the backend engine for MapReduce. Like this tutorial.
In the first tutorial, the hive.execution.engine is spark. And I cannot see HDFS involved.
In the second tutorial, the hive.execution.engine is still mr, but as there is no Hadoop process, it looks like the backend of mr is Spark.
Honestly, I'm a little bit confused about this. I guess the first one is recommended as mr has been deprecated. But where is HDFS involved?
I understood it differently.
Normally Hive uses MR as its execution engine, unless you use Impala, but not all distros have this.
But for some time now Spark can also be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce works with the native Map and Reduce steps. During execution in Spark, the logical dependencies are turned into physical dependencies.
Now what is a DAG?
A DAG captures the logical dependencies before execution (think of it as a visual graph).
When we have multiple maps and reduces, or the output of one reduce is the input to another map, the DAG helps speed up the jobs.
A DAG is built in Tez, but not in classic MapReduce.
NOTE:
Apache Spark works on a DAG but has stages in place of Map/Reduce. Tez has a DAG and works on Map/Reduce. To keep it simpler I used the Map/Reduce context, but remember that Apache Spark has stages. The concept of the DAG remains the same.
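As a small illustration of the DAG idea (a sketch of my own; sc and the toy data are placeholders), Spark can print the lineage it builds before anything executes:

rdd = (
    sc.parallelize(["a b a", "b c"])
      .flatMap(lambda line: line.split())
      .map(lambda w: (w, 1))
      .reduceByKey(lambda a, b: a + b)     # shuffle dependency -> a second stage
      .map(lambda kv: (kv[1], kv[0]))      # narrow step, fused into the same stage
)

# toDebugString prints the lineage (the DAG of dependencies) before any job runs.
print(rdd.toDebugString().decode())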
Reason 2:
Map persists its output to disk (it buffers in memory too, but once about 90% of the buffer is filled, the output spills to disk); from there the data goes to the merge step.
But in Apache Spark, intermediate data is persisted in memory, which makes it faster.

Does Spark internally use MapReduce?

Is Spark using MapReduce internally (its own map-reduce)?
The first time I heard somebody tell me "Spark uses map-reduce", I was so confused; I had always learned that Spark was the great adversary of Hadoop MapReduce.
After checking on Google, I only found a website that gives a rather short explanation of this: https://dzone.com/articles/how-does-spark-use-mapreduce
But the rest of the Internet is about Spark vs MapReduce.
Then somebody explained to me that when Spark creates an RDD, the data is split into different datasets, and that if you are using, for example, Spark SQL with a query that should not need a map-reduce, like:
select student
from Table_students
where name = "Enrique"
then internally Spark is doing a map-reduce to retrieve the data (from the different datasets).
Is that true?
If I'm using Spark MLlib for machine learning, I have always heard that machine learning is not compatible with map-reduce because it needs so many iterations and map-reduce uses batch processing.
Is Spark internally using map-reduce in Spark MLlib too?
Spark features an advanced Directed Acyclic Graph (DAG) engine supporting cyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages - Map and Reduce - the DAGs created by Spark can contain any number of stages. The DAG is a strict generalization of the MapReduce model.
This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage, and more complex tasks completing in a single run of many stages, rather than having to be split into multiple jobs.
So you can write a map-reduce style program with Spark, but internally it is executed as a DAG.
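As a small sketch of my own (the table and column names come from the question; the rows here are made up), Spark's explain() shows that the simple filter query needs no shuffle stage, while an aggregation adds one:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

# Made-up rows, just to register the question's table name as a temp view.
spark.createDataFrame(
    [("Enrique", "S1"), ("Ana", "S2")], ["name", "student"]
).createOrReplaceTempView("Table_students")

# Filter + projection: narrow transformations only, so a single stage.
spark.sql("SELECT student FROM Table_students WHERE name = 'Enrique'").explain()

# A GROUP BY introduces an Exchange (shuffle), i.e. an extra stage in the DAG.
spark.sql("SELECT name, count(*) FROM Table_students GROUP BY name").explain()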
Reference:
Directed Acyclic Graph DAG in Apache Spark
What is Directed Acyclic Graph in Apache Spark?
What are the Apache Spark concepts around its DAG execution engine, and its overall architecture?
How-to: Translate from MapReduce to Apache Spark

How to achieve vertical parallelism in Spark?

Is it possible to run multiple calculations in parallel using spark?
Example cases that could benefit from that:
running column-wise tasks for large columns. Applying StringIndexer to 10K columns could benefit from keeping each column's calculation on a single worker and from having as many workers as possible, each working on a single column.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
    spark.read.load(in_path).select('column').distinct().write.save(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .Net Parallel.ForEach(), but for a cluster environment rather than simpler multi-threading case.
I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it'd take ages, but it would require quite a lot of effort, since it sits below the developer-facing APIs of Spark SQL, Spark MLlib, Structured Streaming and such.
You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for large columns. Applying StringIndexer to 10K columns could benefit from keeping each column's calculation on a single worker and from having as many workers as possible, each working on a single column.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).
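For what it's worth, a commonly used driver-side workaround for the second case (many small independent jobs) is to submit them from separate threads, since Spark schedules jobs coming from different threads concurrently. This is only a sketch under the question's placeholder names (spark, long_ds_list), not the custom-RDD approach described above:

from concurrent.futures import ThreadPoolExecutor

def run_one(paths):
    in_path, out_path = paths
    # The same per-dataset pipeline as in the question, using the generic load/save API.
    spark.read.load(in_path).select("column").distinct().write.save(out_path)

# Each thread submits an independent Spark job; the scheduler runs them concurrently
# as long as the cluster has free executors (a FAIR scheduler pool can help here).
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, long_ds_list))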
