Table A has ~150M rows, while Table B has about 60 rows. In Table A, column_1 can and often does contain a large number of NULLs. This badly skews the data, and one executor ends up doing all of the work after the LEFT JOIN.
I've read several posts on a solution but I've been unable to wrap my head around the different approaches that span several different versions of Spark.
What operation do I need to take on Table A, and what operation do I need to take on Table B, to eliminate the skewed partitioning that results from the LEFT JOIN?
I'm using Spark 2.3.0 and writing in Python. In the code snippet below, I'm attempting to derive a new column that's devoid of NULLs (which would be used to execute the join), but I'm not sure where to take it (and I have no idea what to do with Table B):
from pyspark.sql.functions import col, rand, when

# Replace NULLs in column_1 with random values so the derived join key is never NULL
new_column1 = when(col('column_1').isNull(), rand()).otherwise(col('column_1'))
df1 = df1.withColumn('no_nulls_here', new_column1)
df1.persist().count()
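Here is one possible way to finish that idea (a sketch, not the authoritative answer; df2 stands for Table B and the 'skew_' prefix is my own addition). Rows whose column_1 is NULL can never match anything in Table B under a LEFT JOIN, so giving them distinct surrogate keys only spreads them across partitions without changing the join result. Table B itself needs no change; you just join its original column against the derived key.

from pyspark.sql.functions import col, concat, lit, rand, when

# Salt only the NULL keys; the prefix guards against collisions with real keys in Table B
df1 = df1.withColumn(
    'join_key',
    when(col('column_1').isNull(), concat(lit('skew_'), rand().cast('string')))
    .otherwise(col('column_1').cast('string'))
)

# Table B (df2 here) stays untouched; join its original column against the new key
result = df1.join(df2, df1['join_key'] == df2['column_1'].cast('string'), 'left')

Separately, since Table B only has ~60 rows, broadcasting it (pyspark.sql.functions.broadcast) would avoid the shuffle, and therefore the skew, altogether.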
I have a case where I am trying to write some results to S3 with a DataFrame write, using the query below. input_table_1 is about 13 GB and input_table_2 is about 1 MB.
input_table_1 has columns account and membership, and input_table_2 has columns role, id, membership_id, quantity, and start_date.
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of member_ids.
Writing this dataset with the Spark DataFrame writer produces around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following Spark conf setting:
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
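You can see this translation for yourself; the snippet below is just an illustrative PySpark example (any dataframe will do):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("v", col("id") % 3)

# The physical plan shows a HashAggregate whose grouping keys are all columns,
# preceded by an Exchange that hash-partitions the data on those same keys
df.distinct().explain()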
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, which is to say your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run-length encoding). But Parquet also uses various other compression strategies.
Unfortunately, after the shuffle to remove duplicates, your data ends up in pretty much random order. The original partitioning of input_table_1 was a much better fit.
Solutions
There's no single answer for how to solve this, but here are a few pointers on what I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression; see the sketch after this list. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once, in a meaningful way.
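As a sketch of the second suggestion, in PySpark (joined_df stands for the result of the query in the question, the column names are taken from it, and the S3 path is hypothetical):

deduped = joined_df.distinct()

(deduped
    .repartition("account")                        # bring rows for the same account back together
    .sortWithinPartitions("account", "id")         # a secondary sort helps run-length encoding
    .write
    .parquet("s3://bucket/deduped_output"))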
Why does the order of rows displayed differ, when I take a subset of the dataframe columns to display, via show?
Here is the original dataframe:
Here the dates are in the given order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of predict_df's columns into a new dataframe.
Because a Spark dataframe itself is unordered. This is due to the parallel processing principles which Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in different sequences.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then within that date subset the records will still be unordered. So in that case, in order to obtain strongly ordered data, we have to perform orderBy on a set of columns, and the combination of values in that set of columns must be unique for every row. E.g.:
df.orderBy(col("date").asc, col("other_column").desc)
In general, unordered datasets are a normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same results from one query, the DBMS does not guarantee that another execution will return the same result, especially when reading a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the beginning. Moreover, I tried this a few times and got both the same order and a different order from the one the questioner observed. Processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. And maybe the bottom line is: always do ordering explicitly - if it matters.
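A minimal sketch of that caching idea (predict_df and the selected column names are placeholders for the questioner's dataframe):

predict_df.cache()                                 # mark for caching; the first action below materialises it
predict_df.show()                                  # first action
predict_df.select('date', 'prediction').show()     # second action reuses the cached rows, so the order matches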
Like #Ihor Konovalenko and #mck mentioned, a Spark dataframe is unordered by its nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is to use monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id, which will keep your dataframe consistently ordered. However, if your dataframe is big, be aware that this function might take some time to generate an id for each row.
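A small sketch of that suggestion (again with predict_df as a placeholder; note the generated ids are only stable as long as the dataframe is not recomputed, so combining this with cache is sensible):

from pyspark.sql import functions as F

predict_df = predict_df.withColumn('row_id', F.monotonically_increasing_id()).cache()
predict_df.orderBy('row_id').show()
predict_df.select('row_id', 'date').orderBy('row_id').show()   # same row order as above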
I am trying to remove duplicates from Spark dataframes by using dropDuplicates() on a couple of columns. But the job is getting hung due to the large amount of shuffling involved and data skew. I have used 5 cores and 30 GB of memory to do this. The data on which I am performing dropDuplicates() is about 12 million rows.
Please suggest the most optimal way to remove duplicates in Spark, considering the data skew and shuffling involved.
Deleting duplicates is an expensive operation, as it compares values from one RDD to all the other RDDs and tries to consolidate the results. Considering the size of your data, this can be time consuming.
I would recommend a groupBy transformation on the columns of your dataframe, followed by an action. This way only the consolidated results from your RDD will be compared with the other RDDs, and lazily at that; you can then request the result through any action, such as count or show.
transactions.groupBy("col1", "col2").count.sort($"count".desc).show
distinct():
df.select(['id', 'name']).distinct().show()
dropDuplicates()
df.dropDuplicates(['id', 'name']).show()
dropDuplicates() is the way to go if you want to drop duplicates over a subset of columns, but at the same time you want to keep all the columns of the original structure.
I had a question related to pyspark's partitionBy() function, which I originally posted in a comment on this question. I was asked to post it as a separate question, so here it is:
I understand that df.write.partitionBy(COL) will write all the rows with each value of COL to their own folder, and that each folder will (assuming the rows were previously distributed across all the partitions by some other key) have roughly the same number of files as were previously in the entire table. I find this behavior annoying. If I have a large table with 500 partitions, and I use partitionBy(COL) on some attribute columns, I now have for example 100 folders which each contain 500 (now very small) files.
What I would like is the partitionBy(COL) behavior, but with roughly the same file size and number of files as I had originally.
As demonstration, the previous question shares a toy example where you have a table with 10 partitions and do partitionBy(dayOfWeek) and now you have 70 files because there are 10 in each folder. I would want ~10 files, one for each day, and maybe 2 or 3 for days that have more data.
Can this be easily accomplished? Something like df.repartition(COL).write.partitionBy(COL) seems like it might work, but I worry that (in the case of a very large table which is about to be partitioned into many folders) having to first combine it into some small number of partitions before doing the partitionBy(COL) is a bad idea.
Any suggestions are greatly appreciated!
You've got several options. In my code below I'll assume you want to write in parquet, but of course you can change that.
(1) df.repartition(numPartitions, *cols).write.partitionBy(*cols).parquet(writePath)
This will first use hash-based partitioning to ensure that a limited number of values from COL make their way into each partition. Depending on the value you choose for numPartitions, some partitions may be empty while others may be crowded with values -- for anyone not sure why, read this. Then, when you call partitionBy on the DataFrameWriter, each unique value in each partition will be placed in its own individual file.
Warning: this approach can lead to lopsided partition sizes and lopsided task execution times. This happens when values in your column are associated with many rows (e.g., a city column -- the file for New York City might have lots of rows), whereas other values are less numerous (e.g., values for small towns).
(2) df.sort(sortCols).write.parquet(writePath)
This option works great when you want (1) the files you write to be of nearly equal sizes and (2) exact control over the number of files written. This approach first globally sorts your data and then finds splits that break up the data into k evenly-sized partitions, where k is specified in the Spark config spark.sql.shuffle.partitions. This means that all values with the same sort key are adjacent to each other, but sometimes they'll span a split and end up in different files. Thus, if your use-case requires all rows with the same key to be in the same partition, then don't use this approach.
There are two extra bonuses: (1) by sorting your data, its size on disk can often be reduced (e.g., sorting all events by user_id and then by time will lead to lots of repetition in column values, which aids compression), and (2) if you write to a file format that supports it (like Parquet), then subsequent readers can read the data optimally by using predicate push-down, because the Parquet writer records the MAX and MIN values of each column in the metadata, allowing the reader to skip rows if the query specifies values outside of the partition's (min, max) range.
Note that sorting in Spark is more expensive than just repartitioning and requires an extra stage. Behind the scenes Spark will first determine the splits in one stage, and then shuffle the data into those splits in another stage.
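As a concrete (hypothetical) instance of option (2), with illustrative column names and a target of roughly 200 similarly sized files:

spark.conf.set("spark.sql.shuffle.partitions", 200)   # controls the number of sorted splits, i.e. output files

(df.sort("user_id", "event_time")                     # global sort keeps each user's rows contiguous
   .write
   .parquet(writePath))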
(3) df.rdd.partitionBy(customPartitioner).toDF().write.parquet(writePath)
If you're using Spark on Scala, then you can write a custom partitioner, which can get around the annoying gotchas of the hash-based partitioner. Not an option in PySpark, unfortunately. If you really want to write a custom partitioner in PySpark, I've found this is possible, albeit a bit awkward, by using rdd.repartitionAndSortWithinPartitions:
(df.rdd
   .keyBy(sort_key_function)                      # convert each row to a (key, row) pair
   .repartitionAndSortWithinPartitions(numPartitions=N_WRITE_PARTITIONS,
                                       partitionFunc=part_func)
   .values()                                      # get rid of the keys again
   .toDF().write.parquet(writePath))
Maybe someone else knows an easier way to use a custom partitioner on a dataframe in pyspark?
df.repartition(COL).write.partitionBy(COL)
will write out one file per partition. This will not work well if one of your partitions contains a lot of data, e.g. if one partition contains 100GB of data, Spark will try to write out a 100GB file and your job will probably blow up.
df.repartition(2, COL).write.partitionBy(COL)
will write out a maximum of two files per partition, as described in this answer. This approach works well for datasets that are not very skewed (because the optimal number of files per partition is roughly the same for all partitions).
This answer explains how to write out more files for the partitions that have a lot of data and fewer files for the small partitions.
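One common pattern for that (a sketch of the general idea, not necessarily the exact approach in the linked answer; COL stands for the partition column and ROWS_PER_FILE is a hypothetical target):

from pyspark.sql import functions as F

ROWS_PER_FILE = 5000000

# How many files each value of COL should get, based on its row count
counts = df.groupBy("COL").agg(F.ceil(F.count("*") / ROWS_PER_FILE).alias("n_files"))

# Give every row a salt in [0, n_files) for its COL value, then repartition on (COL, salt)
# so heavy partition values are spread over roughly n_files output files each
salted = (df.join(counts, "COL")
            .withColumn("salt", (F.rand() * F.col("n_files")).cast("int")))

(salted.repartition("COL", "salt")
       .drop("n_files", "salt")
       .write
       .partitionBy("COL")
       .parquet(writePath))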
What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically:
Depending on the size of the given list of values, when is it best, with respect to runtime, to use isin vs. an inner join vs. a broadcast join?
This question is the spark analogue of the following question in Pig:
Pig: efficient filtering by loaded list
Additional context:
Pyspark isin function
Considering
import pyspark.sql.functions as psf
There are two types of broadcasting:
sc.broadcast() to copy python objects to every node for a more efficient use of psf.isin
psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in pig).
In the context question, the filtering was done using the column of another dataframe, hence the possible solution with a join.
Keep in mind that if your filtering list is relatively big, the operation of searching through it will take a while, and since it has to be done for each row, it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.
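Here is a minimal sketch contrasting the two approaches (df, values, and small_df are hypothetical):

values = ['a', 'b', 'c']

# 1) isin: the list is a plain Python object shipped with the task closure
filtered_isin = df.filter(psf.col('key').isin(values))

# 2) broadcast join: the small dataframe is copied to every executor,
#    so the big dataframe is filtered without being shuffled
filtered_join = df.join(psf.broadcast(small_df), on='key', how='inner')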
Both join and isin work well for all my daily use cases.
isin works well for both small and moderately large (~1M) lists.
Note - if you have a large dataset (say ~500 GB) and you want to filter it and then process the filtered result, then with isin the data read/processing is significantly lower and faster. The whole 500 GB will not be loaded, because you have already filtered the smaller dataset with the .isin method.
But in the join case, the whole 500 GB will be loaded and processed, so the processing time will be much higher.
In my case, after filtering using isin and then processing and converting to a pandas DF, it took < 60 secs; with a JOIN and then processing and converting to a pandas DF, it took > 1 hour.