Spark SQL Window functions - manual repartitioning necessary? - apache-spark

I am processing data partitioned by column "A" with PySpark.
Now, I need to use a window function over another column "B" to get the max value for this frame and count it up for new entries.
As it says here, "Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame."
Do I need to manually repartition the data by column "B" before applying the window, or does Spark do this automatically?
I.e. would I have to do:
data = data.repartition("B")
before:
w = Window().partitionBy("B").orderBy(col("id").desc())
Thanks a lot!

If you use Window.partitionBy(someCol) and you have not set a value for the shuffle partitions parameter, the partitioning will default to 200.
A similar, though not identical, post should provide guidance: spark.sql.shuffle.partitions of 200 default partitions conundrum.
So, in short, you do not need to perform the repartition explicitly; the shuffle partitions parameter is the more relevant setting.
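For illustration, here is a minimal sketch (assuming a SparkSession named spark and the data dataframe with columns "B" and "id" from the question) that applies the window without any manual repartition and only tunes the shuffle setting:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Tune the shuffle parallelism if the default of 200 does not fit your data volume.
spark.conf.set("spark.sql.shuffle.partitions", 400)  # 400 is just an example value

# No data.repartition("B") needed: partitionBy("B") already shuffles all rows
# sharing the same value of B onto the same partition before the frame is built.
w = Window.partitionBy("B").orderBy(F.col("id").desc())
data = data.withColumn("max_id_in_B", F.max("id").over(w))  # illustrative aggregate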

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results with a dataframe write into S3 using the query below, where input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has the columns account, membership and
input_table_2 has the columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership column is an array containing a list of member_ids.
This dataset write using Spark dataframe is generating around 1.1TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct().write.parquet("s3path") to remove them. The record count is reduced to almost 1/3rd of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String]).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, meaning your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run-length encoding). But Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up laid out pretty randomly after the shuffle that removes the duplicates. The original partitioning of input_table_1 was a much better fit.
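As a quick illustration (synthetic data; a SparkSession named spark and the /tmp paths are assumptions), writing the same values once in order and once randomly shuffled shows how much adjacency matters for Parquet's encodings:
from pyspark.sql import functions as F

n = 1_000_000
ordered = spark.range(n).withColumn('v', (F.col('id') / 1000).cast('int'))  # long runs of equal values
shuffled = ordered.orderBy(F.rand(seed=42))                                 # same values, random order

ordered.write.mode('overwrite').parquet('/tmp/parquet_ordered')
shuffled.write.mode('overwrite').parquet('/tmp/parquet_shuffled')
# Comparing the on-disk sizes of the two directories shows the difference
# run-length and dictionary encoding make when identical values sit next to each other.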
Solutions
There's no single answer for how to solve this, but here are a few pointers I'd suggest for next steps:
What's causing the duplicates? Could they be removed upstream? Or is there a problem with the join condition that creates duplicates?
A simple solution is to repartition the dataset after the distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) will likely give you even better compression (see the sketch after this list). However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using CLUSTER BY. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once, in a meaningful way.
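A minimal sketch of the repartition-plus-sort pointer above (the dataframe name, column names and output path are assumptions, not taken from the question):
deduped = joined_df.distinct()                 # joined_df stands in for the join result above
(deduped
    .repartition('account')                    # match the original partitioning key of input_table_1
    .sortWithinPartitions('account', 'id')     # secondary sort helps run-length encoding
    .write.parquet('s3://your-bucket/deduped/'))  # illustrative output path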

When I use repartition on a dataframe in PySpark, one partition's size is zero and two different keys are merged together

Here is an example of it. The cell 44 output shows the count of distinct keys, but when I find the partition sizes in cell 45 it combines 3 and 5 together. Also, on saving, the size of one partition is still zero. Any help would be appreciated.
By default, Spark applies a HashPartitioner to the values of the column overint. Apparently, the values 3 and 5 fall into the same partition after being hashed.
You may want to choose the RangePartitioner instead. Or, in case you need full flexibility, you could also write your own custom Partitioner class. However, this is only available on the RDD API, not the Structured API.
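A small sketch (assuming the dataframe df and its column overint from the notebook) that makes the hash bucketing visible by tagging each row with its partition id after the repartition:
from pyspark.sql import functions as F

repartitioned = df.repartition(4, 'overint')  # 4 partitions, hash-partitioned on overint
(repartitioned
    .groupBy(F.spark_partition_id().alias('partition'), 'overint')
    .count()
    .orderBy('partition')
    .show())
# If 3 and 5 hash to the same bucket they appear under the same partition id,
# and some partition ids may not appear at all (their size is zero).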

Spark SQL - orderBy decreases the number of partitions to value range, resulting in spill and ultimately no space on disk

I have a pretty straightforward pyspark SQL application (spark 2.4.4, EMR 5.29) that reads a dataframe of the schema topic, year, count:
df.show()
+--------+----+------+
| topic|year| count|
+--------+----+------+
|covid-19|2017|606498|
|covid-19|2016|454678|
|covid-19|2011| 10517|
|covid-19|2008| 6193|
|covid-19|2015|510391|
|covid-19|2013| 29551|
I then need to sort by year and collect the counts into a list so that they are in ascending order by year:
df.orderBy('year').groupBy('topic').agg(collect_list('count').alias('counts'))
The issue is that, since I order by year, the number of partitions used for this stage is the number of years in my dataset. I thus get a crazy bottleneck stage where only 15 out of 300 executors are utilised, leading to obvious memory and disk spills, and the stage eventually fails due to no space left on device for the overpopulated partitions.
Even more interesting is that I found a way to circumvent this which intuitively appears to be much less efficient, but actually does work, since no bottlenecks are created:
df.groupBy('topic').pivot('year', values=list(range(START, FINISH))).agg(first('count')) \
  .select('topic', array([col(str(c)) for c in range(START, FINISH)]).alias('counts'))
This leads to my desired output, which is an array of counts sorted by year.
Anyone with an explanation or idea why this happens, or how best to prevent this?
I found this answer and this JIRA where it is basically suggested to 'add noise' to the sort key to avoid these skew-related issues.
I think it is worth mentioning that the pivot method is a better resolution than adding noise, to my knowledge, whenever sorting by a column that has a small range of values. I would appreciate any info on this and alternate implementations.
Range partitioning is what Spark uses under the hood for sorting and ordering.
From the docs it is clear that the calculation for determining the number of partitions that will contain the ranges of data for the subsequent sort via mapPartitions
is based on sampling from the existing partitions, prior to computing a heuristically optimal number of partitions for these computed ranges.
These range-based partitions may decrease the number of partitions, since a range must be contained within a single partition for the order by / sort to work, via a mapPartitions-type approach.
This:
df.repartitionByRange(100, 'some_col1', 'some_colN')...
can help, or so might ordering by more columns, I suspect. But here that appears not to be the case, based on your DF.
The question has nothing to do with pyspark, BTW.
Interesting point, but explainable: the reduced number of partitions has to hold more data via the collect_list based on year, and there are obviously more topics than years.
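For completeness, one way to avoid the global sort altogether (a sketch assuming the question's topic/year/count schema, not taken from the answer above) is to collect (year, count) pairs and sort them inside the aggregation rather than ordering the whole dataframe first:
from pyspark.sql import functions as F

result = (df.groupBy('topic')
            .agg(F.sort_array(F.collect_list(F.struct('year', 'count'))).alias('yc'))
            .select('topic', F.col('yc.count').alias('counts')))  # counts ordered by year
# The only shuffle here is the ordinary hash shuffle of the groupBy on topic, so the
# stage keeps spark.sql.shuffle.partitions partitions instead of one per year range.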

Optimize Partitionning for billions of distinct keys

I'm processing a file each day with PySpark containing information about device navigation through the web. At the end of each month I want to use window functions in order to get the navigation journey for each device. It's a very slow job, even with many nodes, so I'm looking for ways to speed it up.
My idea was to partition the data, but I have 2 billion distinct keys, so partitionBy does not seem appropriate. Even bucketBy might not be a good choice, because I create n buckets each day, so the files are not appended; instead, for each day x part-files are created.
Does anyone have a solution?
So here is an example of the export for each day (inside each parquet file we find 9 partitions):
And here is the partitionBy query that we launch at the beginning of each month (compute_visit_number and compute_session_number are two UDFs that I've created in the notebook):
You want to ensure that each device's data is in the same partition, to prevent exchanges when you run your window function. Or at least minimise the number of partitions the data could be in.
To do this, I would create a column called partitionKey when you write the data, containing a mod of the mc_device column, where the number you mod by is the number of partitions you want. Base this number on the size of the cluster that will run the end-of-month query. (If mc_device is not an integer, create a checksum first.)
You can create a secondary partition on the date column if still needed.
Your end of month query should change:
w = Window.partitionBy('partitionKey', 'mc_device').orderBy('event_time')
If you kept the date as a secondary partition column, then repartition the dataframe on partitionKey only:
df = df.repartition('partitionKey')
At this point each device's data will be in the same partition and no exchanges should be needed. The sort should be faster, and your query will hopefully complete in a sensible time.
If it is still slow, you need more partitions when writing the data.
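A sketch of the write-side change described above (the dataframe name, the number of keys and the output path are assumptions):
from pyspark.sql import functions as F

NUM_KEYS = 200  # pick based on the cluster that will run the end-of-month query

daily = (daily_df  # the day's navigation data
         .withColumn('partitionKey',
                     F.abs(F.hash('mc_device')) % NUM_KEYS))  # hash doubles as the "checksum" for non-integer ids
(daily.write
      .partitionBy('partitionKey', 'date')  # date kept as the secondary partition
      .mode('append')
      .parquet('s3://your-bucket/navigation/'))  # illustrative output path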

Running partition specific query in Spark Dataframe

I am working on spark streaming application, where I partition the data as per a certain ID in the data.
For eg: partition 0-> contains all data with id 100
partition 1 -> contains all data with id 102
Next, I want to execute a query on the whole dataframe for the final result, but my query is specific to each partition.
For eg: I need to run
select(col1 * 4) in case of partition 0
while
select(col1 * 10) in case of partition 1.
I have looked into the documentation but didn't find any clue. One solution I have is to create different RDDs/DataFrames for the different ids in the data, but that is not scalable in my case.
Any suggestion on how to run a query on a dataframe where the query can be specific to each partition?
Thanks
I think you should not couple your business logic with Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest adding an artificial column to your DataFrame that holds the partition id value.
In any case, you can always do
df.rdd.mapPartitionsWithIndex { case (partId, iter: Iterator[Row]) => ... }
See also the docs.
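If the partitions really just correspond to an id column, a sketch of the suggested approach in PySpark (column names taken from the question; the multipliers are the question's examples) avoids per-partition queries entirely:
from pyspark.sql import functions as F

result = df.withColumn(
    'col1_adjusted',
    F.when(F.col('id') == 100, F.col('col1') * 4)    # rule for the data of partition 0
     .when(F.col('id') == 102, F.col('col1') * 10)   # rule for the data of partition 1
     .otherwise(F.col('col1')))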
