Spark OutOfMemory while repartition - apache-spark

I am struggling with an OutOfMemory exception in Spark that is thrown during a repartition. The program performs the following steps:
JavaRDD<A> data = sc.objectFile(this.inSource);
JavaPairRDD<String, A> dataWithKey = data.mapToPair(d -> new Tuple2<>(d.getUjid(), d));
JavaPairRDD<ADesc, AStats> dataInformation = dataWithKey.groupByKey()
.flatMapToPair(v -> getDataInformation(v._2()));
dataInformation.groupByKey().repartition(PARTITIONS).map(v -> merge(v._1(), v._2()));
getDataInformation maps a group of datapoints with the same id to several new datapoints:
Iterable<Tuple2<ADesc, AStats>> getDataInformation(Iterable<A> aIterator)
E.g.:
(ID1, Data1), (ID1,Data2), (ID1,Data3) -> (Data_Description_1, Stats1), (Data_Description_2, Stats2)
Information:
A is a fairly basic data structure containing some information.
Each data point A has an ID, and several data points share a common ID. Therefore we map each data point to a tuple (ID, A).
We group the data points by ID and extract several new data points with getDataInformation.
Afterwards we want to group all statistics for the same data description and merge them.
While merging we get an OutOfMemory error. We therefore inserted a repartition, but the job still runs out of memory. All stages up to and including flatMapToPair work correctly. We tried different values for PARTITIONS, up to 5000 tasks: most tasks have very little work to do, some have to process a few MB, and 3 tasks (independent of the number of partitions) always run out of memory. My question is: why does Spark shuffle the data so unevenly, and why does it run out of memory during a repartition?

I solved my problem and will give a short overview. Maybe someone will find this useful in the future.
The problem was in
dataInformation.groupByKey().repartition(PARTITIONS).map(v -> merge(v._1(), v._2()));
I had a lot of objects with the same key that had to be merged, so all of those objects ended up in the same partition and the task went OOM. I changed the code to use reduceByKey and modified the merge function so that it merges two objects with the same key at a time instead of all objects with that key at once. Because the function is associative, the result is the same.
In short: groupByKey sent too many objects to a single task.
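The difference between the two approaches can be sketched without a cluster. The following is a plain-Python simulation, not Spark code: crc32 stands in for the hash partitioner, merge_two is a hypothetical associative merge over (count, total) statistics, and the record counts are made up. It counts how many records the busiest task receives after the shuffle.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (assumption: crc32, not Spark's hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def merge_two(a, b):
    # Hypothetical associative merge of two (count, total) statistics.
    return (a[0] + b[0], a[1] + b[1])


def shuffle_and_merge(records, num_partitions):
    """Send each record to the task that owns its key, then merge per key.

    Returns the merged result and the record count of the busiest task.
    """
    tasks = defaultdict(list)
    for key, stats in records:
        tasks[partition_of(key, num_partitions)].append((key, stats))
    merged = {}
    for task_records in tasks.values():
        for key, stats in task_records:
            merged[key] = merge_two(merged[key], stats) if key in merged else stats
    busiest = max(len(task_records) for task_records in tasks.values())
    return merged, busiest


def group_by_key_style(records, num_partitions):
    # groupByKey style: every single record crosses the shuffle unmerged.
    return shuffle_and_merge(records, num_partitions)


def reduce_by_key_style(records, num_input_splits, num_partitions):
    # reduceByKey style: merge pairwise inside each input split first,
    # so each split sends at most one record per key across the shuffle.
    partials = [dict() for _ in range(num_input_splits)]
    for i, (key, stats) in enumerate(records):
        split = partials[i % num_input_splits]
        split[key] = merge_two(split[key], stats) if key in split else stats
    pre_merged = [item for split in partials for item in split.items()]
    return shuffle_and_merge(pre_merged, num_partitions)


# Made-up data: one hot key with 10,000 records plus 100 rare keys.
records = [("hot", (1, 1))] * 10_000 + [(f"k{i}", (1, 1)) for i in range(100)]

grouped_small, busiest_small = group_by_key_style(records, 10)
grouped_large, busiest_large = group_by_key_style(records, 5000)
reduced, busiest_reduced = reduce_by_key_style(records, 8, 5000)
```

In this simulation the busiest task in the groupByKey style receives at least the 10,000 hot-key records no matter how many partitions are used, which is consistent with a fixed number of tasks failing regardless of PARTITIONS. With the pairwise merge, the hot key contributes at most one record per input split, and the merged result is identical.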

Related

Hash partitioning dataframes to achieve co-partitioning during join in PySpark

I am trying to figure out the best way to achieve co-partitioning on my two datasets to eliminate join-related shuffles. I'm working with 2 dataframes A and B, where A contains minimal user data including a field for the event IDs they interacted with, and B contains detailed information about the events. I am trying to join on 3 fields: day, event_type, and event_id. A and B need to be read from disk as they will be written to and read from by external clients on an ongoing basis.
The main goal of the project I'm working on is to enable the ability to quickly:
1. Filter by event_type
2. Join raw event details to user IDs
I understand that in order to achieve #1 I probably need to partition my parquet files on event_type so that the directory structure achieves easier filtering. In order to achieve #2 I should try to minimize shuffles as much as possible by means of co-partitioning keys from the two dataframes.
The data I'm working with consists of 3 days of event data (~12M rows per event type) and the goal is to get this working efficiently for 1-3 years of data.
In order to improve my join I first begin by filtering on the event_type I am interested in to narrow down the data on both dataframes. I then do the actual join on day and event_id. This naturally will result in shuffles since there is no co-partitioning so I've tried to address that using hash partitioning.
I read that repartition implements hash partitioning on the specified columns. I save my dataframes to disk and also include a partitionBy('day', 'event_type') in order to achieve better performance on filtering/grouping operations.
A\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/A')
B\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/B')
...
...
from pyspark.sql.functions import col
A = spark.read.parquet('/path/to/A')
B = spark.read.parquet('/path/to/B')
A.filter(col('event_type') == 'X')\
.join(B.filter(col('event_type') == 'X'), on=['day', 'event_id'], how='inner')\
.show()
When I execute this I still see a shuffle exchange in the plan as well as shuffle writes which take up around 5-10GB each. I also see longer executor compute times of around 21-41s which might not seem much on 3 days of data but might blow up for yearly data.
I am wondering what's a better way I can go about doing this - or if it is even possible to eliminate shuffles when working with dataframes? Answers to this question seem to suggest that it might be possible but not a great idea?
I am not even sure that doing both a repartition and a partitionBy is the correct approach. Is the initial partitioning using repartition() preserved at all when I re-read the parquet files from disk? I have read that this might not be the case - overall the information available seems either conflicting or without explicit sources attached.
Thank you for taking the time to help.
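The co-partitioning idea behind this question can be illustrated in plain Python. This is only a sketch under stated assumptions, not Spark code: crc32 stands in for Spark's own hash function, the sample rows are made up, and partition_local_join is a hypothetical helper. It shows that when both sides are hash partitioned on (day, event_id) with the same function and the same partition count, matching rows share a partition index, so each partition can be joined on its own.

```python
import zlib


def partition_of(day, event_id, num_partitions):
    # Deterministic stand-in for hash partitioning on (day, event_id).
    # Assumption: crc32 is used for illustration; Spark uses its own hash.
    key = f"{day}|{event_id}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


def hash_partition(rows, num_partitions):
    # Place every row in the partition chosen by the hash of its join key.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row["day"], row["event_id"], num_partitions)].append(row)
    return parts


def partition_local_join(parts_a, parts_b):
    # Join partition i of A only against partition i of B:
    # no rows move between partitions.
    joined = []
    for part_a, part_b in zip(parts_a, parts_b):
        lookup = {(b["day"], b["event_id"]): b for b in part_b}
        for a in part_a:
            match = lookup.get((a["day"], a["event_id"]))
            if match is not None:
                joined.append({**a, **match})
    return joined


# Made-up sample rows.
A = [{"day": "2021-01-01", "event_id": i % 5, "user_id": i} for i in range(20)]
B = [{"day": "2021-01-01", "event_id": i, "detail": f"event-{i}"} for i in range(5)]

joined = partition_local_join(hash_partition(A, 4), hash_partition(B, 4))
```

Every row of A finds its match here because both sides used the same function and the same partition count; with different counts the partition indices would no longer line up. Whether Spark can exploit such a layout after the data has been written to and re-read from parquet is a separate matter that this sketch does not address.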

How to build a custom spark partitioner to avoid exchange / shuffle steps

Version: DBR 8.4 | Spark 3.1.2
While reading solutions to How to avoid shuffles while joining DataFrames on unique keys?, I've found a few mentions of the need to create a "custom partitioner", but I can't find any information on that.
I've noticed that in the ~4 hour job I'm currently trying to optimize, most of the time goes to exchanging terabytes of data from a temporary cross-join-and-reduce operation.
Here is a visualization of the current operation:
I'm hoping that if I can set up the cross-join operation with a "custom partitioner" I can force the ~29 billion rows from the cross join operation (which shares the same 2-column primary key with the left joined ~0.6 billion row table) to stay on the workers they were generated on until the whole dataset can be reduced to a mere 1 million rows. i.e. I'm hoping to avoid any shuffles during this time.
The steps in the operation are:
1. Generate 28 billion rows temporary "TableA" partitioned by 'columnA', keyed by ['columnA', 'columnB']
2. Left join 1 billion rows "TableB" also partitioned by 'columnA', keyed by ['columnA', 'columnB'] (kind of a sparse version of temp TableA)
3. Project a new column (TableC.columnC = TableA.columnC - Coalesce(TableB.columnC, 0) in this specific case)
4. Project a new row_order() column within each partition, e.g. F.row_number().over(Window.partitionBy(['columnA', 'columnB']).orderBy(F.col('columnC').desc()))
5. Take the top N (say 2): keep only the rows with rank (row_number) < 3, for example, throwing away the other 49,998 rows per partition.
Since all of these operations are independently performed within each partition ['columnA', 'columnB'] (no interactions between partitions), I was hoping there was some way that I can get through all 5 of those steps without ever reshuffling partitions between workers.
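To make the per-group independence of the join, projection, ranking, and filter steps concrete, here is a small plain-Python analogue (not PySpark). It rests on assumptions that are not in the description above: rows are tuples of (columnA, columnB, item, columnC), where item is a hypothetical extra column that identifies a row inside its ['columnA', 'columnB'] group and is used to match TableA rows to TableB rows. The sample values are made up.

```python
import heapq
from collections import defaultdict


def top_n_per_group(table_a, table_b, n=2):
    """table_a, table_b: iterables of (columnA, columnB, item, columnC)."""
    # Left join: index TableB by its key. Including `item` in the key is an assumption.
    b_lookup = {(a, b, item): c for a, b, item, c in table_b}

    groups = defaultdict(list)
    for a, b, item, c in table_a:
        # Projection: columnC = TableA.columnC - coalesce(TableB.columnC, 0)
        diff = c - b_lookup.get((a, b, item), 0)
        groups[(a, b)].append((diff, item))

    # Ranking and filter: order by columnC descending inside each group
    # and keep the first n rows (row_number <= n).
    return {key: heapq.nlargest(n, rows) for key, rows in groups.items()}


# Made-up sample data.
table_a = [("x", "y", 1, 10), ("x", "y", 2, 8), ("x", "y", 3, 7), ("p", "q", 1, 5)]
table_b = [("x", "y", 1, 9)]

result = top_n_per_group(table_a, table_b, n=2)
```

Each group is processed using only rows that carry the same (columnA, columnB) pair, which is the property the question hopes to exploit to keep the work on one worker.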
What I've tried:
I've tried not specifying any repartitioning instructions at all; this leads to the 3.5 hour runtime and the DAG below.
I've tried explicitly specifying .repartition(9600, 'columnA') on each data source on both sides of the join (excluding the broadcast join case), right before joining. (Note that '9600' is configured as the default number of shuffle partitions to use). This code change resulted in no changes to the query plan - there is still an exchange step happening both before and after the sort-merge-join.

Spark (large dataset) groupBy, sort, and then map

With a Spark RDD, is there a way to groupByKey, then sort within each group, and then map, for large datasets? The naive way of doing this maps over each group, creates a list for each group, and sorts it. However, creating this list can cause out-of-memory problems for groups with many entries. Is there a way to have Spark do the sorting so as to avoid out-of-memory issues?
It sounds like you are getting a data skew error. This can happen when an executor runs out of memory because too much data is associated with that key. A solution to that problem would be to adjust/play with the number of executors and amount of RAM allocated to each...
However I believe this would be the solution to your problem:
JavaPairRDD<Key, Iterable<Value>> pair = ...;
JavaRDD<Iterable<Value>> values = pair.map(t2 -> t2._2());
JavaRDD<Value> onlyValues = values.flatMap(it -> it);
source: Convert iterable to RDD
Please follow up with this possible solution. I am genuinely curious.
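A technique often suggested for this kind of problem is a secondary sort: make the sort field part of the key, partition on the grouping key only, and let the shuffle do the sorting, so that each group can be consumed as a stream instead of a list. Spark RDDs offer repartitionAndSortWithinPartitions for this. The plain-Python sketch below only imitates the idea locally: sorted stands in for the sorted shuffle, and sorted_groups and first_and_last are hypothetical helper names.

```python
from itertools import groupby
from operator import itemgetter


def sorted_groups(pairs):
    """Yield (key, iterator over that key's values in ascending order)."""
    # Sorting on (key, value) stands in for the sort done by the shuffle.
    ordered = sorted(pairs, key=itemgetter(0, 1))
    for key, group in groupby(ordered, key=itemgetter(0)):
        # The values are handed out lazily; no per-group list is built here.
        yield key, (value for _, value in group)


def first_and_last(pairs):
    # Example map step that consumes each sorted group as a stream.
    result = {}
    for key, values in sorted_groups(pairs):
        first = last = next(values)
        for last in values:
            pass
        result[key] = (first, last)
    return result


summary = first_and_last([("a", 3), ("b", 1), ("a", 1), ("a", 2)])
```

The map step sees the values of each key already in order and keeps only a constant amount of state per group, which is what avoids the per-group list from the naive approach.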

Spark Dataset join performance

I receive a Dataset and I am required to join it with another table. The simplest solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data as
SELECT * FROM dev_db.cat
and then doing a join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark may not be able to push every such operation below the join, so it is safest to reduce the number of target records yourself as far as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartition should be based on the keys that participate in the join, i.e.:
import org.apache.spark.sql.functions.col
val repartitionedDogs = dogs.repartition(1024, col("key_col1"), col("key_col2"))
repartitionedDogs.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it can fit in memory (or increase the value of spark.sql.autoBroadcastJoinThreshold). This can give a noticeable performance boost, since a copy of the smaller dataset is available on every node and the larger dataset does not have to be shuffled for the join.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
Run explain on the query and check whether predicate pushdown is used. Then you can judge whether your concern is justified.
In general, however, if no complex data types are used and there are no data type mismatches, pushdown takes place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html

Performance of hazelcast using executors on imap with million entries

We are applying a few predicates to an IMap containing just 100,000 objects in order to filter data. These predicates change per user. While doing a POC on my local machine (16 GB) with two nodes (each node shows 50,000 entries) and 100,000 records, I get the output in 30 sec, which is far slower than querying the database directly.
Will increasing the number of nodes reduce the time? I even tried PagingPredicate, but it takes around 20 sec for each page.
IMap objectMap = hazelcastInstance.getMap("myMap");
MultiMap resultMap = hazelcastInstance.getMultiMap("myResultMap");
/* Option 1: passing a Hazelcast predicate to IMap.values */
objectMap.values(predicate).parallelStream().forEach(entry -> resultMap.put(userId, entry));
/* Option 2: applying a Java predicate to entrySet or localKeySet */
objectMap.entrySet().parallelStream().filter(predicate).forEach(entry -> resultMap.put(userId, entry));
More nodes will help, but the improvement is difficult to quantify. It could be large, it could be small.
Part of the work in the code sample involves applying a predicate across 100,000 entries. If there is no index, the scan stage checks 50,000 entries per node when there are 2 nodes. Double that to 4 nodes and each node has 25,000 entries to scan, so the scan time will halve.
The scan time is only part of the query time; the overall result set also has to be formed from the partial results from each node. So doubling the number of nodes might nearly halve the run time in the best case, or it might not be a significant improvement.
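The arithmetic can be written down as a rough model. This is a sketch with hypothetical numbers: the per-entry scan cost and the fixed cost of combining partial results are assumptions chosen for illustration, not measurements.

```python
def entries_per_node(total_entries, nodes):
    # Assumes entries are spread evenly across the nodes.
    return total_entries / nodes


def estimated_query_time(total_entries, nodes, scan_cost_per_entry, merge_cost):
    # Hypothetical model: parallel scan time plus a fixed cost
    # for combining the partial results from each node.
    return entries_per_node(total_entries, nodes) * scan_cost_per_entry + merge_cost


# Best case: the scan dominates and combining results costs nothing.
best_case = [estimated_query_time(100_000, n, 0.0001, 0.0) for n in (2, 4)]

# Other extreme: combining results dominates, so extra nodes help little.
merge_heavy = [estimated_query_time(100_000, n, 0.0001, 20.0) for n in (2, 4)]
```

Under these assumed costs, going from 2 to 4 nodes halves the estimate in the first case and barely changes it in the second, which is why the improvement is hard to predict without measuring.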
Perhaps the bigger question here is: what are you trying to achieve?
objectMap.values(predicate) in the code sample retrieves the result set to a central point, which then has parallelStream() applied to try to merge the results in parallel into a MultiMap. So this looks like more of an ETL than a query.
Use of executors as per the title, and something like objectMap.localKeySet(predicate) might allow this to be parallelised out better, as there would be no central point holding intermediate results.
