How does the Spark shuffle operation work? - apache-spark

I'm learning Spark for my project and I'm stuck on the shuffle process in Spark. I want to know how this operation works internally. I found some keywords involved in this operation: ShuffleMapStage, ShuffleMapTask, ShuffledRDD, Shuffle Write, Shuffle Read....
My questions are:
1) Why do we need a ShuffleMapStage? When is this stage created and how does it work?
2) When is ShuffledRDD's compute method called?
3) What are Shuffle Read and Shuffle Write?

The shuffle operation redistributes coherent data across the workers (repartitioning) using a hash function on the data key (this addresses the data locality problem).
This operation involves transferring data over the network to reorganize it before an action is performed; reducing the number of shuffle operations improves performance.
Shuffle operations are triggered automatically by Spark between two transformations in order to execute the final action.
Some Spark transformations need a shuffle (like groupByKey, join, sortByKey).
Others do not (like union, map, filter; actions such as reduce and count also avoid it); see the sketch below.
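A minimal sketch of the difference, assuming a SparkContext named sc (the data is illustrative):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Narrow transformations: each output partition depends on a single input partition, no shuffle.
val doubled  = pairs.mapValues(_ * 2)
val filtered = doubled.filter { case (_, v) => v > 2 }
// Wide transformation: values with the same key must be brought together,
// which forces a shuffle and introduces a stage boundary.
val summed = filtered.reduceByKey(_ + _)
// toDebugString reveals the ShuffledRDD and the stage boundary it creates.
println(summed.toDebugString)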

Related

What algorithm Spark uses to bring the same keys together

What algorithm does Spark use to identify similar keys and push the data to the next stage?
Scenarios include:
When I apply distinct(), I know a pre-distinct is applied in the current stage and then the data is shuffled to the next stage. In this case, all the similar keys need to be in the same partition in the next stage.
When Dataset1 joins with Dataset2 (SortMergeJoin). In this case, all the similar keys in Dataset1 and Dataset2 need to be in the same partition in the next stage.
There are other scenarios as well, but the overall picture is this.
How does Spark do this efficiently? And will there be any time lag between Stage 1 and Stage 2 while identifying the similar keys?
The algorithm Spark uses to partition the data is hash partitioning by default. Also, stages don't push data; they pull it from the previous stage.
Spark creates a stage boundary whenever a shuffle is needed. The second stage waits until all the tasks in the first stage complete and write their output to temporary files. The second stage then starts pulling the data needed for its partitions from across the partitions written in stage 1.
Distinct, as you can see, isn't as simple as it looks. Spark implements distinct by applying aggregates. Shuffling is also needed because duplicates can sit in multiple partitions. One of the prerequisites for shuffling is that Spark needs a pair RDD; if your parent isn't one, it will create intermediary pair RDDs.
If you look at the logical plan of distinct, it is more or less:
Parent RDD ---> MappedRDD (record as key, null as value) ---> MapPartitionsRDD (running distinct at partition level) ---> ShuffledRDD (pulling the needed partition data) ---> MapPartitionsRDD (distinct across the segregated partitions for each key) ---> MappedRDD (collecting only the keys and discarding the null values for the result)
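This mirrors how RDD.distinct is implemented; a minimal sketch of the equivalent chain (the input data and the partition count 4 are illustrative):
val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val dedup = data
  .map(x => (x, null))             // record becomes the key, value is null
  .reduceByKey((x, _) => x, 4)     // partition-level combine, then shuffle by key
  .map(_._1)                       // keep only the keys, discard the null values
println(dedup.collect().toSet)     // Set(1, 2, 3)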
Spark uses RDD dependencies to determine how data is shuffled to the next stage, which is a fairly involved process.
The getDependencies function in RDD.scala is responsible for describing how an RDD obtains data from its parents:
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
Some RDDs have no parent RDD to read from, so their compute does not fetch parent data at all (for example a data source RDD).
A ShuffledRowRDD usually appears in the middle of a computation chain, so it typically has parent data to fetch.
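A minimal sketch of inspecting these dependencies from user code, assuming a SparkContext named sc (the data is illustrative):
val source   = sc.parallelize(Seq(("a", 1), ("b", 2)))   // no parent to fetch from
val shuffled = source.reduceByKey(_ + _)                 // introduces a ShuffleDependency
println(source.dependencies)     // List() -- nothing to pull from a parent
println(shuffled.dependencies)   // List(org.apache.spark.ShuffleDependency@...)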

When will the Spark join operation not cause a shuffle

In general, the join operation in Spark causes a shuffle. When will a join not cause a shuffle? And can anyone suggest some methods for optimizing joins in Spark?
join will not cause a shuffle directly if both data structures (either Dataset or RDD) are already co-partitioned. This means the data has already been shuffled with repartition / partitionBy or an aggregation, and the partitioning schemes are compatible (the same partitioning key and number of partitions).
join will not cause network traffic if both structures are co-partitioned and co-located. Since co-location happens only if the data has been previously shuffled in the same action, this is a borderline scenario.
A shuffle also doesn't occur when the join is expressed as a broadcast join.
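A minimal sketch of the co-partitioned case, assuming a SparkContext named sc (the data and the partition count 8 are illustrative):
import org.apache.spark.HashPartitioner
val part  = new HashPartitioner(8)
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
// Both sides share the same partitioner and partition count, so the join
// reuses the existing layout instead of shuffling again.
val joined = left.join(right)
println(joined.toDebugString)   // no new ShuffledRDD introduced by the join itself
For the broadcast case, the DataFrame API exposes the hint via org.apache.spark.sql.functions.broadcast, e.g. df1.join(broadcast(df2), "key").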

Spark memory usage

I have read the Spark documentation and I would like to be sure I am doing the right thing.
https://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join,
etc) build a hash table within each task to perform the grouping,
which can often be large.
How does this relate to the input file split size? My understanding is that a lot of tasks would create a lot of small files.
Should I repartition the data to a smaller number of partitions after a shuffle operation?
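The same tuning page recommends increasing the level of parallelism so that each task's hash table stays small; a minimal sketch, with illustrative paths and partition counts (400 and 50):
val pairs  = sc.textFile("hdfs:///input").map(line => (line.split(",")(0), 1))
val counts = pairs.reduceByKey(_ + _, 400)   // more reduce tasks => smaller hash table per task
// If 400 partitions later produce too many small output files,
// shrink the partition count before writing:
counts.coalesce(50).saveAsTextFile("hdfs:///output")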

Operations and methods to be careful about in Apache Spark?

What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
groupByKey
collectAsMap
Why?
Are there other methods?
There are what you could call 'expensive' operations in Spark: all those that require a shuffle (data reorganization) fall into this category. Checking for the presence of ShuffledRDD in the output of rdd.toDebugString gives those away.
If you mean "careful" as "with the potential of causing problems", some operations in Spark will cause memory-related issues when used without care:
groupByKey requires all values falling under one key to fit in memory on one executor. This means that large datasets grouped with low-cardinality keys have the potential to crash the execution of the job. (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> boom)
Favor the use of aggregateByKey or reduceByKey to apply map-side reduction before collecting values for a key; see the sketch at the end of this answer.
collect materializes the RDD (forces computation) and sends all the data to the driver. (think allTweets.collect -> boom)
If you want to trigger the computation of an RDD, favor the use of rdd.count.
To inspect the data of your RDD, use bounded operations like rdd.first (first element) or rdd.take(n) for n elements.
If you really need to collect, use rdd.filter or rdd.reduce to reduce its cardinality first.
collectAsMap is just collect behind the scenes.
cartesian creates the product of one RDD with another, potentially creating a very large RDD: oneKRdd.cartesian(oneKRdd).count = 1000000.
Consider adding keys and using join to combine two RDDs instead.
others?
In general, having an idea of the volume of data flowing through the stages of a Spark job, and of what each operation will do with it, will help you stay sane.
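A minimal sketch of the groupByKey vs reduceByKey point above (the data is illustrative):
val tweetCounts = sc.parallelize(Seq(("mon", 1), ("tue", 1), ("mon", 1)))
// groupByKey ships every value for a key across the network and holds them
// all in memory on one executor before the sum is applied:
val viaGroup = tweetCounts.groupByKey().mapValues(_.sum)
// reduceByKey combines values map-side first, so far less data is shuffled
// and no key's full value list ever has to sit in memory at once:
val viaReduce = tweetCounts.reduceByKey(_ + _)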

Is there a way to check if a variable in Spark is parallelizable?

So I am using the groupByKey function in Spark, but it's not being parallelized: I can see that during its execution only one core is being used. It seems that the data I'm working with doesn't allow parallelization. Is there a way in Spark to know whether the input data is amenable to parallelization, or whether it's not a proper RDD?
The unit of parallelization in Spark is the 'partition'. That is, RDDs are split into partitions and transformations are applied to each partition in parallel. How RDD data is distributed across partitions is determined by the Partitioner. By default, the HashPartitioner is used, which should work fine for most purposes.
You can check how many partitions your RDD is split into using:
rdd.partitions.size // number of partitions
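If the RDD turns out to live in a single partition, a minimal sketch of spreading it out before the wide transformation (the data and partition counts are illustrative):
val skewed = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 1)   // everything in one partition
println(skewed.partitions.size)      // 1 => a single task, a single core
val spread  = skewed.repartition(8)  // hash-redistributes the data into 8 partitions
val grouped = spread.groupByKey()    // now runs as parallel tasks
println(grouped.partitions.size)     // typically 8 (unless spark.default.parallelism overrides it)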
