Spark SQL - group by after repartitioning - apache-spark

I want to group all items in a source by a pre-defined category. The number of items per category could be on the order of millions. groupBy helps me achieve this, but I want to understand whether repartitioning on the category (product type) before grouping would be more efficient.
The source for the Spark jobs is Hive tables. The Spark version is 2.4.4. The problem I am solving is to run a customised similarity algorithm for every item against every other item in a given category, so that by the end of the operation I have, for every item, the 10 items most similar to it.
Since this involves a groupBy, and groupBy shuffles data, I thought I would first repartition the data by category. I can even set the number of partitions to the number of categories I have (on the order of hundreds).
Once the data is repartitioned and sent to individual workers, running groupBy on the same column should be a local operation. Is this assumption correct?
// For the demo I am reading from CSV. The final source is a Hive table.
Dataset<Row> rows = spark.read().option("sep", "\t")
        .csv("<some path>")
        .repartition(20, new Column("category"))
        .cache();
Dataset<Row> ids_grouped_by_category = rows.map((MapFunction<Row, Row>) items -> {
            // Some transformation returns a row in the format I need.
            return newRow; // placeholder for the transformed row
        }, <encoder>)
        .groupBy(functions.col("category"))
        // Collect the item ids (the column name "id" is assumed here); collecting "category" would only repeat the category value.
        .agg(functions.collect_list("id").as("ids"));
At the end of this operation, I have been able to group all item-ids for a given category into a list. Something like this:
+---------------------------+------------------------------------------+
|category | ids |
+---------------------------+------------------------------------------+
|category-1 | [id1, id2...] |
|category-2 | [idx, idy...] |
+---------------------------+------------------------------------------+
I have been able to get the data in the format I need, but I wanted to understand whether this way of doing a group-by is correct.
Also, what are the implications of the collect_list operation? Does it load everything into memory?

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results to S3 using the DataFrame writer and the query below. input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
    /*+ BROADCASTJOIN(input_table_2) */
    account,
    role,
    id,
    quantity,
    cast(start_date AS string) AS start_date
FROM
    input_table_1
INNER JOIN
    input_table_2
    ON array_contains(input_table_1.membership, input_table_2.membership_id)
Here the membership array column contains a list of member IDs.
Writing this dataset with the Spark DataFrame writer generates around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following Spark conf setting:
spark.sql.shuffle.partitions=20000
I have tried a coalesce before writing to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark Catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (note: in the execution plan, distinct is named Deduplicate).
This basically means df.distinct() is translated into a groupBy on all columns with an empty aggregation, roughly:
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String])
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
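To make the effect of hashing on all columns concrete, here is a minimal plain-Python sketch (no Spark involved). It uses zlib.crc32 as a deterministic stand-in for Spark's hash function, and the rows, column layout, and partition count are made up for illustration:

```python
import zlib

def partition_for(key, num_partitions):
    """Stand-in for a hash partitioner: hash the key, then take it modulo the partition count."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

# Hypothetical rows of the form (account, role, id).
rows = [("acct-1", "admin", i) for i in range(6)] + [("acct-2", "user", i) for i in range(6)]
NUM_PARTITIONS = 4

# Hashing on the account column alone: every row of an account maps to one partition.
by_account = {}
for row in rows:
    by_account.setdefault(row[0], set()).add(partition_for(row[0], NUM_PARTITIONS))

# Hashing on the whole row (what a distinct over all columns implies):
# rows of the same account can map to any partition.
by_full_row = {}
for row in rows:
    by_full_row.setdefault(row[0], set()).add(partition_for(row, NUM_PARTITIONS))

partitions_per_account_by_key = {acct: len(parts) for acct, parts in by_account.items()}
partitions_per_account_by_row = {acct: len(parts) for acct, parts in by_full_row.items()}
print(partitions_per_account_by_key)
print(partitions_per_account_by_row)
```

With the single-column key, each account always lands in exactly one partition. With the full row as the key, the target partition depends on every column value, so related rows are no longer kept together.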
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, which is to say your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. For example, if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run-length encoding). Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was much better fitted.
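The effect of row order on run-length encoding can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Parquet's actual encoder, and the column values are made up:

```python
import random
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, repetition_count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

# A column where equal values sit next to each other (well-ordered data).
column_sorted = ["a"] * 1000 + ["b"] * 1000 + ["c"] * 1000

# The same values after a random shuffle (similar to hashing on all columns).
column_shuffled = column_sorted[:]
random.Random(42).shuffle(column_shuffled)

runs_sorted = run_length_encode(column_sorted)
runs_shuffled = run_length_encode(column_shuffled)

print(len(runs_sorted), "runs when ordered")
print(len(runs_shuffled), "runs when shuffled")
```

The ordered column collapses into three runs, while the shuffled column needs far more runs to describe exactly the same values, which is the kind of blow-up that shows up as larger files.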
Solutions
There is no single answer to how to solve this, but here are a few pointers on what I'd try next:
What's causing the duplicates? Could they be removed upstream? Or is there a problem with the join condition that produces them?
A simple solution is to repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely to give you even better compression. However, this comes at the cost of an additional shuffle!
As @matt-andruff pointed out below, you can also achieve this in SQL using CLUSTER BY. That also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once in a meaningful way.

Joining two large tables which have large regions of no overlap

Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= cast(impressionTime as date) AND
    clickTime <= cast(impressionTime as date) + interval 1 day
    """)
)
Assume that both tables have trillions of rows for 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, similar to this: create 365 * 2 * 2 smaller dataframes so that there is 1 dataframe for each day of each table for 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient, and I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables, add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the stream writer, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day, but rather from the same hour, and there are very few records from 11pm to 1am? Does Spark know whether it is most efficient to partition by day, or whether something else would be even more efficient?
First, let me restate what I have understood from your question. You have two tables, each with two years' worth of data and around a trillion records. You want to join them efficiently for a given timeframe, for example a specific month of a year or a custom date range, and the job should read only that much data rather than all of it.
To answer your question, you can do the following:
First, when you write the data to create the tables, partition each table by a day column so that each day's data sits in its own directory/partition in both tables. Spark won't do that for you by default. You have to decide on the partitioning based on your dataset.
Second, when you read the data and perform the join, do not do it on the whole table. Read only the specific partitions by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the filter.
Once you have filtered the data at read time and stored it in dataframes, join those dataframes on the key relationship. That is the most efficient and performant way to do it as a first attempt.
If it is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases that is not required.
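The read-only-what-you-need idea behind partition pruning can be sketched without Spark. In this plain-Python illustration a dict keyed by day stands in for the per-day directories; the table contents, the ad_id field, and the helper names are all made up for the example:

```python
from datetime import date, timedelta

def build_partitioned_table(start, days, ads_per_day):
    """One 'directory' per day, mimicking a table written partitioned by a day column."""
    table = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        table[day] = [{"day": day, "ad_id": ad} for ad in range(ads_per_day)]
    return table

def read_days(table, first_day, last_day):
    """Read only the partitions whose day falls inside the filter (partition pruning)."""
    wanted = [day for day in table if first_day <= day <= last_day]
    rows = [row for day in wanted for row in table[day]]
    return rows, len(wanted)

start = date(2020, 1, 1)
impressions = build_partitioned_table(start, days=730, ads_per_day=3)
clicks = build_partitioned_table(start, days=730, ads_per_day=2)

# Filter first, so only two of the 730 daily partitions are touched per table.
first_day, last_day = date(2020, 3, 1), date(2020, 3, 2)
impression_rows, impression_parts = read_days(impressions, first_day, last_day)
click_rows, click_parts = read_days(clicks, first_day, last_day)

# Join only the pruned subsets on (day, ad_id).
click_keys = {(row["day"], row["ad_id"]) for row in click_rows}
joined = [row for row in impression_rows if (row["day"], row["ad_id"]) in click_keys]

print(impression_parts, "of", len(impressions), "impression partitions read")
print(len(joined), "joined rows")
```

The join only ever sees the rows of the selected days, which is the behaviour you get in Spark when the filter is on the partition column and is applied before the join.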

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such query in an efficient way?
As described in the question, the only relevant part of the dataframe is the user_id column (you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
To improve the performance I'd do the following:
Create two small dataframes which hold only the user_id column of each dataframe.
This dramatically reduces the size of each dataframe, as it holds only one column (the only relevant one).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates()).
This dramatically reduces the size of each dataframe, as each new dataframe holds only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the two minimal dataframes above to get the unique values in the intersection.
I think you can select the necessary columns first and perform the join afterwards. It should also help to move dropDuplicates before the join, since that removes user_ids that appear multiple times in either dataframe.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
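Why deduplicating before the join helps can be seen in a small plain-Python sketch. The id lists are made up, and the nested loop is only a stand-in for an inner join:

```python
def count_common_users(user_ids_a, user_ids_b):
    """Deduplicate each side first, then count the ids present on both sides."""
    return len(set(user_ids_a) & set(user_ids_b))

# Hypothetical user_id columns with repeated values.
ids_a = [1, 1, 2, 3, 3, 3, 4]
ids_b = [3, 4, 4, 5]

# Joining first produces one row per matching pair, duplicates included.
naive_join_rows = [a for a in ids_a for b in ids_b if a == b]

common = count_common_users(ids_a, ids_b)

print(len(naive_join_rows), "rows from joining before deduplication")
print(common, "distinct user_ids in the intersection")
```

Both approaches lead to the same final count, but the join-first variant has to materialise every duplicate pair before dropping them, which is where the time and memory go on large inputs.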

Running partition specific query in Spark Dataframe

I am working on a Spark Streaming application where I partition the data by a certain ID in the data.
For example: partition 0 -> contains all data with id 100
partition 1 -> contains all data with id 102
Next I want to execute a query on the whole dataframe to get the final result, but my query is specific to each partition.
For example, I need to run
select(col1 * 4) in the case of partition 0
while
select(col1 * 10) in the case of partition 1.
I have looked into the documentation but didn't find any clue. One solution I have is to create different RDDs/DataFrames for the different ids in the data, but that is not scalable in my case.
Any suggestions on how to run a query on a dataframe where the query can be specific to each partition?
Thanks
I think you should not couple your business logic with Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest adding an artificial column to your DataFrame that equals the partitionId value.
In any case, you can always do
df.rdd.mapPartitionsWithIndex{ case (partId, iter: Iterator[Row]) => ...}
See also the docs.
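One way to read the first suggestion is to key the per-group rule on a column value rather than on the physical partition index. A minimal plain-Python sketch of that idea follows; the multiplier table and the row layout are assumptions based on the example in the question:

```python
# Hypothetical rule table: id value -> multiplier to apply to col1.
MULTIPLIER_BY_ID = {100: 4, 102: 10}

# Hypothetical rows; the "id" field plays the role of the artificial column.
rows = [
    {"id": 100, "col1": 1},
    {"id": 100, "col1": 2},
    {"id": 102, "col1": 3},
]

def apply_rule(row):
    """Pick the multiplier from the row's own id, independent of where the row is stored."""
    return row["col1"] * MULTIPLIER_BY_ID[row["id"]]

results = [apply_rule(row) for row in rows]
print(results)
```

Because the rule depends only on the data, the result stays the same even if the rows are later repartitioned differently.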

Does Spark's RDD.combineByKey() preserve the order of a previously sorted DataFrame?

I've done this in PySpark:
Created a DataFrame using a SELECT statement to get asset data ordered by asset serial number and then time.
Used DataFrame.map() to convert the DataFrame to an RDD.
Used RDD.combineByKey() to collate all the data for each asset, using the asset's serial number as the key.
Question: Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
Time order is crucial for me (I need to calculate statistics over a moving time window across the data for each asset). When RDD.combineByKey() combines data from different nodes in the Spark cluster for a given key, is any order in that key's data retained? Or is the data from the different nodes combined in no particular order for a given key?
Can I be certain that the data for each asset will still be sorted in time order in the RDD resulting from the last step?
You cannot. When you apply sort across multiple dimensions (data ordered by asset serial number and then time) records for a single asset can be spread across multiple partitions. combineByKey will require a shuffle and the order in which these parts are combined is not guaranteed.
You can try with repartition and sortWithinPartitions (or its equivalent on RDDs):
df.repartition("asset").sortWithinPartitions("time")
or
df.repartition("asset").sortWithinPartitions("asset", "time")
or window functions with frame definition as follows:
w = Window.partitionBy("asset").orderBy("time")
In Spark >= 2.0 window functions can be used with UserDefinedFunctions so if you're fine with writing your own SQL extensions in Scala you can skip conversion to RDD completely.
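The ordering problem can be illustrated with a plain-Python sketch. Here the order in which the partition lists are processed stands in for the unpredictable arrival order after a shuffle; the data and the combine function are made up and are not the combineByKey API itself:

```python
from collections import defaultdict

# Hypothetical (asset, time) records spread over three partitions.
partitions = [
    [("asset-1", 3), ("asset-2", 1)],
    [("asset-1", 1), ("asset-1", 2)],
    [("asset-2", 2)],
]

def combine_by_key(parts):
    """Append each record's time to its asset's list, in the order partitions arrive."""
    combined = defaultdict(list)
    for part in parts:
        for key, timestamp in part:
            combined[key].append(timestamp)
    return dict(combined)

# Two possible arrival orders give two different per-asset sequences.
arrival_a = combine_by_key(partitions)
arrival_b = combine_by_key(reversed(partitions))

# Sorting each asset's records explicitly restores a well-defined time order.
sorted_a = {key: sorted(times) for key, times in arrival_a.items()}
sorted_b = {key: sorted(times) for key, times in arrival_b.items()}

print(arrival_a["asset-1"], arrival_b["asset-1"])
print(sorted_a == sorted_b)
```

The explicit per-key sort corresponds to what sortWithinPartitions or an ordered window gives you in Spark: the order is imposed after the data for a key has been brought together, rather than assumed to survive the shuffle.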
