Optimizing Spark SQL joins - apache-spark

I have a requirement to perform multiple joins in a Spark SQL job. I have three data sources; let's call them a, b, and c. Each of them is read from S3 as plaintext CSV.
'a' and 'b' are large tables, with around 90m and 50m records respectively. Table 'c' is small, with just about 20k records.
The operation I am performing is
(a left join b where <filter something from a> group by <a.1,a.2,a.3,a.4>)
left join c
The final output has around 620k records, which should be the same as the number of filtered and grouped records from 'a' (the 'where' and 'group by' clauses above considerably reduce the amount of data from table 'a').
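For reference, here is a simplified PySpark sketch of the job; the column names, join keys, filter condition, and S3 paths below are placeholders, since the real ones are not shown:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
a = spark.read.option("header", True).csv("s3://bucket/a/")   # ~90m records
b = spark.read.option("header", True).csv("s3://bucket/b/")   # ~50m records
c = spark.read.option("header", True).csv("s3://bucket/c/")   # ~20k records

grouped = (a.join(b, on="ab_key", how="left")
             .where(F.col("a_filter_col") == "some_value")    # <filter something from a>
             .groupBy("a1", "a2", "a3", "a4")                 # <a.1,a.2,a.3,a.4>
             .agg(F.count("*").alias("cnt")))

# c is tiny (~20k records), so a broadcast join is the natural choice for the second join
result = grouped.join(F.broadcast(c), on="c_key", how="left")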
The operation is currently taking around 17 mins on my fairly powerful cluster. I would like to know if this can be reduced through optimization. Please advise.
Best Regards

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results to S3 using a DataFrame write with the query below; input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
Here, membership is an array column that contains a list of member_ids.
This dataset write using Spark dataframe is generating around 1.1TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost 1/3rd of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means that df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation, i.e. in Scala (with functions.col imported):
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String])
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
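You can see this translation in the physical plan yourself. A minimal PySpark sketch, assuming an active SparkSession named spark (the exact plan text varies by Spark version):
df = spark.range(100).selectExpr("id", "id % 3 AS v")
df.distinct().explain()
# expect HashAggregate nodes keyed on all columns (id, v) with no aggregate
# functions, plus an Exchange hashpartitioning(id, v, ...) step -- the shuffle
# discussed above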
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, meaning your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid out and ordered. For example, if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run-length encoding). But Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up distributed pretty randomly after the shuffle that removes the duplicates. The original partitioning of input_table_1 was a much better fit.
Solutions
There's no single answer for how to solve this, but here are a few things I'd suggest trying next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after the distinct to match the partitioning of your input data (see the sketch after this list). Adding a secondary sort (sortWithinPartitions) is likely to give you even better compression. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using CLUSTER BY. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group/shuffle the data just once, in a meaningful way.
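A rough PySpark sketch of the repartition-plus-sort suggestion, assuming the two inputs are registered as tables, an active SparkSession named spark, and that account/id are sensible layout keys (pick whatever keys your input data was originally organised by); the output path is a placeholder:
result_df = spark.sql("""
    SELECT /*+ BROADCASTJOIN(input_table_2) */
           account, role, id, quantity, CAST(start_date AS string) AS start_date
    FROM input_table_1
    JOIN input_table_2
      ON array_contains(input_table_1.membership, input_table_2.membership_id)
""")

(result_df.distinct()
    .repartition("account")                  # shuffle back onto a meaningful key
    .sortWithinPartitions("account", "id")   # secondary sort improves run-length encoding
    .write.mode("overwrite")
    .parquet("s3://bucket/deduped/"))        # hypothetical output path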

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so I know how to design the partitioning of larger tables that could have many more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case the details are important:
This is using S3, not HDFS
The data in the table is very small, and there are not many partitions
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Leaving aside all the other factors you did not mention (storage type, configuration, cluster capacity, the number of files in each case), your partitioning schema does not correspond to the use case.
The partitioning schema should be chosen based on how the data will be selected, how it will be written, or both. In your case, partitioning by year, month, and day separately is over-partitioning. Partitions in Hive are hierarchical folders, and all of them have to be traversed (even if only via metadata) to determine the data path; with a single date partition, only one directory level is read. The two extra folder levels (year + month + day instead of date) do not help with partition pruning, because the three columns are correlated and always used together in the WHERE clause.
Also, partition pruning probably does not work at all with 3 partition columns and a predicate that combines them, such as where concat(year, month, day) = <some date>.
Use EXPLAIN to check it, and compare with a predicate like where year='some year' and month='some month' and day='some day'.
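For example (a sketch assuming an active SparkSession named spark and that the table is registered as table_1; the literals are placeholders), compare the two plans:
spark.sql("""
    EXPLAIN SELECT * FROM table_1
    WHERE concat(year, month, day) = '20210101'
""").show(truncate=False)

spark.sql("""
    EXPLAIN SELECT * FROM table_1
    WHERE year = '2021' AND month = '01' AND day = '01'
""").show(truncate=False)
# the second plan should show PartitionFilters on year/month/day, while the first
# typically cannot prune and ends up listing every partition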
If, in most of your queries, you have one more column in the WHERE clause, say category, which does not correlate with date, and the data is big, then an additional partition on it makes sense; you will then benefit from partition pruning.

How do you eliminate data skew when joining large tables in pyspark?

Table A has ~150M rows while Table B has about 60. In Table A, column_1 can and often does contain a large number of NULLS. This causes the data to become badly skewed and one executor ends up doing all of the work after LEFT JOINING.
I've read several posts on a solution but I've been unable to wrap my head around the different approaches that span several different versions of Spark.
What operation do I need to take on Table A, and what operation do I need to take on Table B, to eliminate the skewed partitioning that occurs as a result of the LEFT JOIN?
I'm using Spark 2.3.0 and writing in Python. In the code snippet below, I'm attempting to derive a new column that's devoid of NULLs (which would be used to execute the join), but I'm not sure where to take it (and I have no idea what to do with Table B)
from pyspark.sql.functions import when, col, rand

# replace NULL join keys with random values so those rows spread across partitions
new_column1 = when(col('column_1').isNull(), rand()).otherwise(col('column_1'))
df1 = df1.withColumn('no_nulls_here', new_column1)
df1.persist().count()
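For what it's worth, a hedged sketch of how such a derived column is typically used in the join itself; df2 stands for Table B and key_col for its join key (both are assumptions, not names from the question):
# Table B usually needs no corresponding change: the randomly generated keys simply
# find no match, which is exactly what NULL keys would have produced in a LEFT JOIN,
# while the formerly-NULL rows are now spread across many partitions.
result = df1.join(df2, df1['no_nulls_here'] == df2['key_col'], 'left')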

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it helped for DataFrames since the Spark core and spark SQL engines optimize differently.
Let's say table 1 has 6mil records and we're joining to table 2 which has 600mil records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table3 = table1.join(table2, 'key2', 'left_outer')
Is it worth it to filter down table2's 600mil records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide a performance boost if you significantly reduce the number of records to be shuffled. For that:
It has to be highly selective.
Size of the key column is << size of all columns.
The first one is obvious: if there is no significant reduction, you do the extra pass for nothing.
The second is subtle: WHERE ... IN (SELECT ... FROM ...) requires a shuffle, the same as a join, so the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have 100 fold difference in the number of records, it might be better to consider broadcast join.
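A minimal PySpark sketch of the options discussed above: the pre-filter expressed as a left semi join, plus a broadcast variant in the spirit of the no-shuffle remark. table1, table2, and key2 are the names from the question; whether either variant actually helps has to be measured on real data:
from pyspark.sql.functions import broadcast

# Option 1: pre-filter table2 with a left semi join (the DataFrame equivalent of
# WHERE key2 IN (SELECT key2 FROM table1)); note the semi join shuffles the keys itself.
table2_filtered = table2.join(table1.select("key2"), "key2", "left_semi")
table3 = table1.join(table2_filtered, "key2", "left_outer")

# Option 2: broadcast table1's distinct keys into the semi join, so the pre-filter
# does not shuffle table2 at all.
table2_filtered = table2.join(broadcast(table1.select("key2").distinct()),
                              "key2", "left_semi")
table3 = table1.join(table2_filtered, "key2", "left_outer")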

Delete lots of rows from a very large Cassandra Table

I have a table Foo with 4 columns A, B, C, D. The partitioning key is A. The clustering key is B, C, D.
I want to scan the entire table and find all rows where D is in set (X, Y, Z).
Then I want to delete these rows but I don't want to "kill" Cassandra (because of compactions), I'd like these rows deleted with minimal disruption or risk.
How can I do this?
You have a big problem here. Indeed, you really can't find the rows without scanning all of your partitions. The real problem is that C* only lets you restrict your queries by the partition key and then by the clustering keys, in the order in which they appear in your PRIMARY KEY declaration. So if your PK is like this:
PRIMARY KEY (A, B, C, D)
then you'd need to filter by A first, then by B, C, and only at the end by D.
That being said, for the part about finding your rows, if this is something you have to run only once, you could:
1) Scan the entire table and do the comparison on D in your application logic.
2) If you know the values of A, query every partition in parallel and compare D in your application.
3) Attach a secondary index and try to exploit its speed from there.
Please note that, depending on how many nodes you have, option 3 is really not an option (secondary indexes don't scale).
If you need to perform such a task multiple times, I'd suggest creating another table that satisfies this query, something like PRIMARY KEY (D); you'd then just scan three partitions, which would be very fast.
As for deleting your rows, I think there's no way to do it without triggering compactions; they are part of C* and you have to live with them. If you really can't tolerate tombstone creation and/or compactions, the only alternative is not to delete rows from a C* cluster at all, and that often means rethinking the data model so it doesn't need deletes.
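A rough sketch of the "query table" idea using the Python driver; the contact point, keyspace name, and column types below are placeholders:
from cassandra.cluster import Cluster   # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# denormalised lookup table keyed by D, written to alongside Foo
session.execute("""
    CREATE TABLE IF NOT EXISTS foo_by_d (
        d text, a text, b text, c text,
        PRIMARY KEY (d, a, b, c)
    )
""")

# finding the matching rows is then three partition reads instead of a full scan
rows = session.execute("SELECT a, b, c, d FROM foo_by_d WHERE d IN ('X', 'Y', 'Z')")

# each delete still writes a tombstone in Foo, so pace these rather than issuing them all at once
for row in rows:
    session.execute(
        "DELETE FROM foo WHERE a=%s AND b=%s AND c=%s AND d=%s",
        (row.a, row.b, row.c, row.d),
    )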
