Spark (large dataset) groupBy, sort, and then map - apache-spark

With a Spark RDD, is there a way to groupByKey, then sort within each group, and then map, for large datasets? The naive way maps over each group, creates a list for it, and sorts that list. However, building the list can cause out-of-memory problems for groups with many entries. Is there a way to have Spark do the sorting so as to avoid out-of-memory issues?

It sounds like you are hitting a data skew problem: an executor runs out of memory because too much data is associated with a single key. One remedy is to tune the number of executors and the amount of RAM allocated to each.
However, I believe this would solve your problem:
JavaPairRDD<Key, Iterable<Value>> pair = ...;
JavaRDD<Iterable<Value>> values = pair.map(t2 -> t2._2());
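// Spark 2.x and later: the flatMap function must return an Iterator (Spark 1.x accepted the Iterable itself)
JavaRDD<Value> onlyValues = values.flatMap(it -> it.iterator());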
source: Convert iterable to RDD
Please follow up with this possible solution. I am genuinely curious.
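A commonly suggested way to get sorted groups without building a list per group is a secondary sort: make the value part of the sort key so that each group arrives already ordered, then stream through it. In Spark this is usually done with repartitionAndSortWithinPartitions on a composite key. The sketch below uses plain Java collections with no Spark dependency, only to show the streaming pass over records ordered by (key, value). The class SortedGroupScan and its methods are hypothetical names, and the in-memory sort stands in for the ordering that the shuffle would provide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Spark.
final class SortedGroupScan {

    // Convenience factory for one (key, value) record.
    static Map.Entry<String, Integer> entry(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Returns one "key:min..max" line per key.
    // Only the running state of the current group is kept, never the whole group.
    static List<String> minMaxPerKey(List<Map.Entry<String, Integer>> records) {
        // Assumption: this in-memory sort stands in for the sort done during the shuffle.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(records);
        Comparator<Map.Entry<String, Integer>> byKeyThenValue =
                Comparator.comparing((Map.Entry<String, Integer> e) -> e.getKey())
                          .thenComparing(e -> e.getValue());
        sorted.sort(byKeyThenValue);

        List<String> out = new ArrayList<>();
        String currentKey = null;
        int min = 0;
        int max = 0;
        for (Map.Entry<String, Integer> e : sorted) {
            if (!e.getKey().equals(currentKey)) {
                // A new key starts: emit the finished group, if there is one.
                if (currentKey != null) {
                    out.add(currentKey + ":" + min + ".." + max);
                }
                currentKey = e.getKey();
                min = e.getValue(); // first value of a sorted group is its minimum
            }
            max = e.getValue(); // latest value of a sorted group is its running maximum
        }
        if (currentKey != null) {
            out.add(currentKey + ":" + min + ".." + max);
        }
        return out;
    }
}
```

Because the records for one key are contiguous and ordered, the per-group work needs constant memory regardless of how many entries a group has.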

Related

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform a joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing the join at a later stage. Or will the query optimizer perform the join directly, without reading the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark cannot always push these operations below the join (limit and take in particular), so apply them yourself to reduce the number of target records as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, e.g.:
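import org.apache.spark.sql.functions.col
// repartition returns a new Dataset and takes Column expressions, so keep the result
val dogsByKey = dogs.repartition(1024, col("key_col1"), col("key_col2"))
dogsByKey.join(cats, Seq("key_col1", "key_col2"), "inner")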
c. Try to broadcast the smaller dataset if you are sure that it fits in memory (or raise spark.sql.autoBroadcastJoinThreshold so that Spark broadcasts it automatically). This can give a noticeable performance boost, since a copy of the smaller dataset is available on every node and the larger one does not have to be shuffled.
If you can't apply any of the above, Spark has no way of knowing which records should be excluded and will therefore scan all available rows from both datasets.
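To illustrate suggestion (c), here is a minimal sketch of the idea behind a broadcast join: the small side becomes an in-memory lookup table that every task holds, and the large side is streamed past it without being shuffled. This is plain Java with no Spark dependency. The class BroadcastJoinSketch, its methods, and the "dogName-catName" output format are hypothetical, and the sketch assumes at most one cat per key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Spark.
final class BroadcastJoinSketch {

    // Convenience factory for one (joinKey, name) row of the large side.
    static Map.Entry<Integer, String> row(int key, String name) {
        return new SimpleEntry<>(key, name);
    }

    // Inner join: smallSideByKey plays the role of the broadcast dataset,
    // largeSide is scanned once and never regrouped by key.
    static List<String> innerJoin(Map<Integer, String> smallSideByKey,
                                  List<Map.Entry<Integer, String>> largeSide) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> dog : largeSide) {
            String cat = smallSideByKey.get(dog.getKey());
            if (cat != null) { // rows without a match are dropped (inner join)
                joined.add(dog.getValue() + "-" + cat);
            }
        }
        return joined;
    }
}
```

The cost of this approach is one copy of the small side per executor, which is why it only pays off when that side comfortably fits in memory.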
Run explain on the query and check whether predicate pushdown is used. Then you can judge whether your concern is justified.
In general, though, pushdown takes place as long as no complex data types are involved and there are no data type mismatches. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Spark OutOfMemory while repartition

I am struggling with an OutOfMemory exception in Spark that is thrown during a repartition. The program performs the following steps:
JavaRDD<A> data = sc.objectFile(this.inSource);
JavaPairRDD<String, A> dataWithKey = data.mapToPair(d -> new Tuple2<>(d.getUjid(), d));
JavaPairRDD<ADesc, AStats> dataInformation = dataWithKey.groupByKey()
.flatMapToPair(v -> getDataInformation(v._2()));
dataInformation.groupByKey().repartition(PARTITIONS).map(v -> merge(v._1(), v._2()));
getDataInformation maps a group of datapoints with the same id to several new datapoints:
Iterable<Tuple2<ADesc, AStats>> getDataInformation(Iterable<A> aIterator)
E.g.:
(ID1, Data1), (ID1,Data2), (ID1,Data3) -> (Data_Description_1, Stats1), (Data_Description_2, Stats2)
Information:
A is a datastructure containing some information. It is a quite basic structure.
Each datapoint A has an ID, and several datapoints share a common ID. Therefore we map each datapoint to a tuple (ID, A).
We group the datapoints by ID and extract several new datapoints with getDataInformation.
Afterwards we want to group all statistics for the same data descriptions and merge them.
While merging we get an OutOfMemory error. We therefore inserted a repartition, and run out of memory there as well. All stages up to and including flatMapToPair work correctly. We tried different values for PARTITIONS, going up to 5000 tasks: most tasks have very little work to do, some have to process a few MB, and 3 tasks (independent of the number of partitions) always run out of memory. My question is: why does Spark shuffle the data so unevenly, and why does it run out of memory while doing a repartition?
I solved my problem and give a short overview. Maybe someone will find this useful in future.
The problem was in
dataInformation.groupByKey().repartition(PARTITIONS).map(v -> merge(v._1(), v._2()));
I had a lot of objects with the same key that had to be merged, so a lot of objects ended up on the same partition and the task went OOM. I changed the code to use reduceByKey and modified the merge function so that it no longer merges all objects with the same key at once but merges 2 objects with the same key. Because the function is associative, the result is the same.
In short: groupByKey sent too many objects to a single task.
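The change described above can be sketched in plain Java, without Spark: a merge that combines two statistics objects at a time gives the same result as merging a whole group, as long as it is associative, and it never needs the full group in memory. The Stats class and its fields are hypothetical stand-ins for AStats, whose real contents are not shown in the question.

```java
import java.util.List;

// Hypothetical illustration class; not part of Spark.
final class PairwiseMerge {

    // Stand-in for AStats: a few aggregates that can be merged pairwise.
    static final class Stats {
        final long count;
        final long sum;
        final long min;
        final long max;

        Stats(long count, long sum, long min, long max) {
            this.count = count;
            this.sum = sum;
            this.min = min;
            this.max = max;
        }

        // Statistics of a single observation.
        static Stats of(long value) {
            return new Stats(1, value, value, value);
        }

        // Associative two-argument merge, the shape reduceByKey expects.
        Stats merge(Stats other) {
            return new Stats(count + other.count,
                             sum + other.sum,
                             Math.min(min, other.min),
                             Math.max(max, other.max));
        }
    }

    // Folds the values one at a time; only one accumulator is alive at any point.
    // Returns null for an empty list.
    static Stats reduce(List<Long> values) {
        Stats acc = null;
        for (long v : values) {
            acc = (acc == null) ? Stats.of(v) : acc.merge(Stats.of(v));
        }
        return acc;
    }
}
```

Since merge is associative, partial results computed on different partitions can be merged in any grouping, which is what lets the work be spread over many tasks instead of landing on the one task that owns the key.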

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame, on which it then runs something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do only the sort or only the top, the problem does not happen. It happens only when sort and top are used together.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of transformations? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan?
I am also curious whether Spark handles sort and top differently between DataFrame and RDD.
EDIT:
Sorry, I didn't mean collect.
What I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or any other action that sends data back to the driver) or not. So the problem is definitely not the output size.
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can cause further issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be a limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
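As a sketch of why a top-N does not have to funnel everything through one partition: each partition can keep only its own N best candidates in a bounded heap, and the small candidate lists are then merged. The RDD methods top and takeOrdered work in a similar spirit. The code below is plain Java with no Spark dependency; the class BoundedTopN and its methods are hypothetical, and each inner list stands in for one partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of Spark.
final class BoundedTopN {

    // Keeps at most n elements in memory while scanning one partition.
    static List<Integer> smallestOfPartition(Iterable<Integer> values, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the root is the largest of the n smallest values seen so far.
        PriorityQueue<Integer> heap =
                new PriorityQueue<Integer>(Comparator.<Integer>reverseOrder());
        for (Integer v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v < heap.peek()) {
                heap.poll(); // evict the current largest candidate
                heap.add(v);
            }
        }
        return new ArrayList<>(heap);
    }

    // Merges the per-partition candidates; at most n * partitions values are combined.
    static List<Integer> smallestN(List<List<Integer>> partitions, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            candidates.addAll(smallestOfPartition(partition, n));
        }
        List<Integer> result = smallestOfPartition(candidates, n);
        Collections.sort(result);
        return result;
    }
}
```

Memory per task is bounded by N rather than by the size of the dataset, so this only helps while N itself stays small enough for the driver.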

When should the groupByKey API be used in Spark programming?

groupByKey suffers from shuffling the data, and its functionality can be achieved with either combineByKey or reduceByKey. So when should this API be used? Is there any use case?
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider summing a numeric attribute by group in an RDD[(group, num)]. groupByKey will give you an RDD[(group, Iterable[num])], which you can then reduce manually using map. The shuffle has to move every individual num to the destination partition/node to build that collection, so many rows are shuffled.
Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum within each partition before the shuffle, so at most one row per group is written out per partition to the shuffle.
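The difference in shuffle volume can be made concrete with a small plain-Java simulation (no Spark dependency). Each inner list stands in for one input partition of (group, num) records. The class MapSideCombine and its methods are hypothetical names, and the record counts are a simplified model of what would cross the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Spark.
final class MapSideCombine {

    // Convenience factory for one (group, num) record.
    static Map.Entry<String, Long> rec(String group, long num) {
        return new SimpleEntry<>(group, num);
    }

    // Pre-aggregates one partition: one partial sum per distinct group.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> partition) {
        Map<String, Long> partial = new TreeMap<>();
        for (Map.Entry<String, Long> r : partition) {
            partial.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return partial;
    }

    // groupByKey-style: every input record is shuffled.
    static int shuffledWithoutCombine(List<List<Map.Entry<String, Long>>> partitions) {
        int records = 0;
        for (List<Map.Entry<String, Long>> partition : partitions) {
            records += partition.size();
        }
        return records;
    }

    // reduceByKey-style: one record per group per partition is shuffled.
    static int shuffledWithCombine(List<List<Map.Entry<String, Long>>> partitions) {
        int records = 0;
        for (List<Map.Entry<String, Long>> partition : partitions) {
            records += combine(partition).size();
        }
        return records;
    }

    // Final result: the partial sums merged per group after the simulated shuffle.
    static Map<String, Long> sumByKey(List<List<Map.Entry<String, Long>>> partitions) {
        Map<String, Long> total = new TreeMap<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            for (Map.Entry<String, Long> e : combine(partition).entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }
}
```

With two partitions holding seven records for two groups, the uncombined path moves all seven records while the combined path moves four partial sums, and the gap widens as the number of records per group grows.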
According to the link below, GroupByKey should be avoided.
Avoid GroupByKey
Avoid groupByKey when the data in the merged field will be reduced to a single value, e.g. a sum for a particular key.
Use groupByKey when you know that the merged field is not going to be reduced to a single value. E.g. building a list with reduce(_ ++ _): avoid this and use groupByKey instead.
The reason is that reducing into a list allocates memory on both the map side and the reduce side, and the memory allocated on an executor that doesn't own the key is wasted during the shuffle.
A good example would be top-N.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
I would say that if groupByKey is the last transformation in your chain of work (or everything you do after it has narrow dependencies only), then you may consider it.
The reasons reduceByKey is preferred are:
1. It combines before the shuffle, as alister mentioned above.
2. reduceByKey also partitions the data by key, so that later sums/aggregations become narrow, i.e. they can happen within partitions.

How to find the number of keys created in map part?

I am trying to write a Spark application that finds the number of keys created in the map function. I could not find a function that would let me do that.
One way I've thought of is to use an accumulator, adding 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.
If you are looking for something like Hadoop counters in Spark, the closest equivalent is an Accumulator that you can increment in every task, but you do not get any information about the amount of data that Spark has processed so far.
If you only want to know how many distinct keys you have in your RDD, you could count the distinct mapped keys: rdd.map(t => t._1).distinct().count()
Hope this will be useful for you
