when should groupByKey API used in spark programming? - apache-spark

GroupByKey suffers from shuffling the data.And GroupByKey functionality can be achieved either by using combineByKey or reduceByKey.So When should this API be used ? Is there any use case ?

Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider if you had to sum a numeric attribute by a group RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])] which you can then manually reduce using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to get that list - many rows being shuffled.
Because reduceByKey knows that what you are doing with the nums (ie. summing them), it can sum each individual partition before the shuffle - so you'd have at most one row per group being written out to shuffle partition/node.

According to the link below, GroupByKey should be avoided.
Avoid GroupByKey

Avoid GroupByKey when the data in the merge field will be reduced to single value . Eg. In case of sum for a particular key.
Use GroupByKey when you know that merge field is not going to be reduced to single value. Eg: List reduce(_++_) --> Avoid this.
The reason being reduce a list will create memory both map side and reduce side. Memory that is created on executor that doesn't own the key will be wasted during shuffle.
Good example would be TopN.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey

I woud say if groupByKey is last transformation in your chain of work (or you do anything after that has narrow dependency only), they you may consider it.
The reason reducebyKey is preferred is
1. Combine as alister mentioned above
2. ReduceByKey also partitions the data so that sum/agg becomes narrow ie can happen within partitions

Related

row_number and orderby window function narrow transformation equivalent in spark

df.withColumn("col1", functions.row_number().over(Window.partitionBy("col2").orderBy("col3")));
This operation I think is a wide transformation in nature where it will sort and partition and for every partition together. Due to which a lot of shuffling is occurring and causing performance issues.
I have a use case where data in each partition is independent of other partitions. So I want to run the above row number function for each partition only i.e I need a narrow transformation and I couldn't find a way to do it.
Is there a way to achieve it ?
I read orderBy vs sort methods which are wide and narrow respectively. Does that hold true in this case also?

Spark (large dataset) groupBy, sort, and then map

With a Spark rdd is there a way to groupByKey, then sort within each group, and then map for large datasets. The naive way of doing this maps over each group and creates a list for each group and sorts it. However this creation of a list will potentially cause out of memory problems for groups with many entries. Is there a way to have Spark do the sorting so as to avoid out of memory issues.
It sounds like you are getting a data skew error. This can happen when an executor runs out of memory because too much data is associated with that key. A solution to that problem would be to adjust/play with the number of executors and amount of RAM allocated to each...
However I believe this would be the solution to your problem:
JavaPairRDD<Key, Iterable<Value>> pair = ...;
JavaRDD<Iterable<Value>> values = pair.map(t2 -> t2._2());
JavaRDD<Value> onlyValues = values.flatMap(it -> it);
source: Convert iterable to RDD
Please follow up with this possible solution. I am genuinely curious.

Spark Dataset join performance

I receive a Dataset and I am required to join it with another Table. Hence the most simple solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join with out referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take etc operations try to apply them before joining the two datasets. Spark can't push down these kind of filters therefore you have to do by your own reducing as much as possible the amount of target records. Here an excellent source of information over the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using repartition function. The repartition should be based on the keys that participate in join i.e:
dogs.repartition(1024, "key_col1", "key_col2")
dogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it can fit in memory (or increase the value of spark.broadcast.blockSize). This consists a certain boost for the performance of your Spark program since it will ensure the co-existense of two datasets within the same node.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
You need to do an explain and see if predicate push down is used. Then you can judge your concern to be correct or not.
However, in general now, if no complex datatypes are used and/or datatype mismatches are not evident, then push down takes place. You can see that with simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Understanding shuffle managers in Spark

Let me help to clarify about shuffle in depth and how Spark uses shuffle managers. I report some very helpful resources:
https://trongkhoanguyenblog.wordpress.com/
https://0x0fff.com/spark-architecture-shuffle/
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
Reading them, I understood there are different shuffle managers. I want to focus about two of them: hash manager and sort manager(which is the default manager).
For expose my question, I want to start from a very common transformation:
val rdd = reduceByKey(_ + _)
This transformation causes map-side aggregation and then shuffle for bringing all the same keys into the same partition.
My questions are:
Is Map-Side aggregation implemented using internally a mapPartition transformation and thus aggregating all the same keys using the combiner function or is it implemented with a AppendOnlyMap or ExternalAppendOnlyMap?
If AppendOnlyMap or ExternalAppendOnlyMap maps are used for aggregating, are they used also for reduce side aggregation that happens into the ResultTask?
What exaclty the purpose about these two kind of maps (AppendOnlyMap or ExternalAppendOnlyMap)?
Are AppendOnlyMap or ExternalAppendOnlyMap used from all shuffle managers or just from the sortManager?
I read that after AppendOnlyMap or ExternalAppendOnlyMap are full, are spilled into a file, how exactly does this steps happen?
Using the Sort shuffle manager, we use an appendOnlyMap for aggregating and combine partition records, right? Then when execution memory is fill up, we start sorting map, spilling it to disk and then clean up the map, my question is : what is the difference between spill to disk and shuffle write? They consist basically in creating file on local file system, but they are treat differently, Shuffle write records, are not put into the appendOnlyMap.
Can you explain in depth what happen when reduceByKey being executed, explaining me all the steps involved for to accomplish that? Like for example all the steps for map side aggregation, shuffling and so on.
It follows the description of reduceByKey step-by-step:
reduceByKey calls combineByKeyWithTag, with identity combiner and identical merge value and create value
combineByKeyWithClassTag creates an Aggregator and returns ShuffledRDD. Both "map" and "reduce" side aggregations use internal mechanism and don't utilize mapPartitions.
Agregator uses ExternalAppendOnlyMap for both combineValuesByKey ("map side reduction") and combineCombinersByKey ("reduce side reduction")
Both methods use ExternalAppendOnlyMap.insertAllMethod
ExternalAppendOnlyMap keeps track of spilled parts and the current in-memory map (SizeTrackingAppendOnlyMap)
insertAll method updates in-memory map and checks on insert if size estimated size of the current map exceeds the threshold. It uses inherited Spillable.maybeSpill method. If threshold is exceeded this method calls spill as a side effect, and insertAll initializes clean SizeTrackingAppendOnlyMap
spill calls spillMemoryIteratorToDisk which gets DiskBlockObjectWriter object from the block manager.
insertAll steps are applied for both map and reduce side aggregations with corresponding Aggregator functions with shuffle stage in between.
As of Spark 2.0 there is only sort based manager: SPARK-14667

DataFrame orderBy followed by limit in Spark

I am having a program take generate a DataFrame on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when i collect the data in end, i find that it is causing the driver to OOM if I take a enough large top N
Also another observation is that if I just do sort and top, this problem will not happen. So this happen only when there is sort and top at the same time.
I am wondering why it could be happening? And particular, what is really going underneath this two combination of transforms? How does spark will evaluate query with both sorting and limit and what is corresponding execution plan underneath?
Also just curious does spark handle sort and top different between DataFrame and RDD?
EDIT,
Sorry i didn't mean collect,
what i original just mean that when i call any action to materialize the data, regardless of whether it is collect (or any action sending data back to driver) or not (So the problem is definitely not on the output size)
While it is not clear why this fails in this particular case there multiple issues you may encounter:
When you use limit it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect it almost as bad.
On top of that orderBy requires a full shuffle with range partitioning which can result in a different issues when data distribution is skewed.
Finally when you collect results can be larger than the amount of memory available on the driver.
If you collect anyway there is not much you can improve here. At the end of the day driver memory will be a limiting factor but there still some possible improvements:
First of all don't use limit.
Replace collect with toLocalIterator.
use either orderBy |> rdd |> zipWithIndex |> filter or if exact number of values is not a hard requirement filter data directly based on approximated distribution as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is handy approxQuantile method).

Resources