How to optimize joins between pyspark dataframes using different, multiple join keys? - apache-spark

General join optimization techniques mostly apply to joins between tables that share the same join key, which makes it possible to reuse the same partitioning across multiple tables and joins.
However, assume I have multiple large tables containing terabytes of data, for which broadcast joins won't work. Each table has different join keys to other tables, and some tables must be joined on two keys. For example, table A is joined to table B on key (ab), and table B is joined to table C on keys (bc1, bc2). How can I optimize this chain of joins across A, B, and C, given that each join uses different keys?
I've already tried:
1. Using the same partitioner for all 3 tables. These tables do have one column in common, so I partitioned on that column. However, I'm not sure this actually optimizes the joins, since that column is not a join key.
2. Using a two-pass method proposed in other posts. It's much faster than (1), but I'd like to improve on this method because it relies on a single partition column across all tables (see the sketch below).
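Since each join in the chain needs its own key alignment, one option is to repartition each side on the keys of the join it feeds rather than on a shared non-key column. Here is a minimal sketch of that idea, assuming hypothetical DataFrames df_a, df_b, df_c read from made-up paths and the key names from the example; treat it as an illustration under those assumptions, not the definitive answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-key-joins").getOrCreate()

N = 400  # assumed partition count; tune to data volume and cluster size

df_a = spark.read.parquet("/data/a")  # hypothetical inputs
df_b = spark.read.parquet("/data/b")
df_c = spark.read.parquet("/data/c")

# Align A and B on the A-B join key before the first join, so the sort-merge
# join can reuse this partitioning instead of adding its own shuffle.
ab = (
    df_a.repartition(N, "ab")
        .join(df_b.repartition(N, "ab"), on="ab")
)

# The A-B result still has to be re-shuffled on (bc1, bc2) for the join to C;
# no single partitioning scheme can serve both joins at once.
abc = (
    ab.repartition(N, "bc1", "bc2")
      .join(df_c.repartition(N, "bc1", "bc2"), on=["bc1", "bc2"])
)

If the tables are reused across many jobs, persisting them bucketed by their respective join keys (bucketBy on ab for A and B, and on bc1, bc2 for B and C) can move these shuffles to write time instead of repeating them per query.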

Related

How do column data types affect join performance in a Spark or Databricks environment?

I was recently introduced to the DBT tool. One downside of the tool is that you cannot create an identity column (surrogate key) as a sequence; you can only generate a hash of columns to uniquely identify rows.
For that reason, I was trying to find out what the impact would be of surrogate keys built as a hash of different columns (string data type) compared to sequence numbers (integer data type) when joining tables in a Spark or Databricks environment. (Fact tables have surrogate keys from dimension tables as foreign keys, so both columns participating in the join have the same data type.)
So far, I can only find optimisation techniques for joins that handle data skewness, broadcasting, reducing shuffling, etc. I haven't seen anything related to the impact of column types on joins.
For example, a best-practice recommendation for BigQuery is to "use INT64 data types in joins to reduce cost and improve comparison performance".
So, to elaborate my question: does an integer data type give better join performance than a string data type when joining tables with Databricks SQL (or Spark SQL)? Or do column types have almost no impact?
I have read a lot of blogs on performance optimisation for Spark and Databricks; none of them mentioned the importance of column data types.
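Not an answer to the performance question itself, but to make the comparison concrete: a hash surrogate key does not have to be a string. A 64-bit hash gives an integer (LongType) key, while a SHA digest gives a 64-character string key, which is simply more bytes to shuffle and compare per row. A small sketch, assuming DataFrames dim and fact that share natural key columns nk1 and nk2 (all names and paths here are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dim = spark.read.parquet("/data/dim")    # hypothetical inputs
fact = spark.read.parquet("/data/fact")

# Integer-style surrogate key: 64-bit hash of the natural key columns.
dim_int = dim.withColumn("sk", F.xxhash64("nk1", "nk2"))
fact_int = fact.withColumn("sk", F.xxhash64("nk1", "nk2"))
joined_int = fact_int.join(dim_int, on="sk")

# String-style surrogate key: hex digest of the concatenated natural key
# (a 64-character string per row). Either style can, in principle, collide.
dim_str = dim.withColumn("sk", F.sha2(F.concat_ws("||", "nk1", "nk2"), 256))
fact_str = fact.withColumn("sk", F.sha2(F.concat_ws("||", "nk1", "nk2"), 256))
joined_str = fact_str.join(dim_str, on="sk")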

What is the best practice of groupby in Spark SQL?

I have a Spark SQL query that groups by multiple columns. I was wondering whether the order of the columns matters to query performance.
Does placing the column with more distinct values earlier help? I assume the group by is based on some hash/shuffle algorithm. If the first group by can distribute the data into small subsets that can be held on one machine, can the later group bys be done locally? Is this true?
What is the best practice for group by?
group by, as you assumed, uses a hash function on the group-by columns to decide which set of group-by keys ends up in which partition.
You can use distribute by to tell Spark which columns to use - https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation of the data (like placing columns with more distinct values earlier), note that if you have two group by statements in your query, you end up with two shuffles, and the result of the first one is obviously quite big (since it's not the final aggregation). So I would try to have as few group by statements as possible.
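A short sketch of both points, assuming a table named events with columns user_id, country and amount (all invented for illustration): distribute by controls which columns drive the shuffle, and doing the aggregation in a single group by avoids a second shuffle.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# SQL form: shuffle explicitly on the grouping columns, then aggregate once.
totals_sql = spark.sql("""
    SELECT user_id, country, SUM(amount) AS total
    FROM (SELECT * FROM events DISTRIBUTE BY user_id, country) t
    GROUP BY user_id, country
""")

# DataFrame form: repartition on the grouping keys, then a single groupBy;
# the aggregation can reuse the hash partitioning instead of shuffling again.
events = spark.table("events")
totals_df = (
    events.repartition("user_id", "country")
          .groupBy("user_id", "country")
          .agg(F.sum("amount").alias("total"))
)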

Cassandra - Same partition key in different tables - when is it right?

I modeled my Cassandra schema in a way that I have a couple of tables with the same partition key - a UUID.
Each table has its partition key and other columns representing the data for a specific query I would like to run.
For example, table 1 has the UUID and a column for its status (no other clustering keys in this table), and table 2 contains the same UUID (also without clustering keys) but with different columns representing the data for this UUID.
Is this the right modeling? Is it wrong to duplicate the same partition key across tables in order to have each table hold the relevant columns for a specific use case? Or is it preferable to use only one table, query it, and pick out the relevant data for the specific use case in code?
There's nothing wrong with this modeling. Whether it is better, or worse, than the obvious alternative of having just one table with both pieces of data, depends on your workload:
For example, if you commonly need to read both status and data columns of the same uuid, then these reads will be more efficient if both things are in the same table, which only needs to be looked up once. If you always read just one but not both, then reads will be more efficient from separate tables. Also, if this workload is not read-mostly but rather write-mostly, then writing to just one table instead of two will be more efficient.
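To make the trade-off concrete, here is a sketch of both layouts using the DataStax Python driver; the keyspace, table and column names are invented for illustration and are not from the question.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # assumed keyspace "app"

# Option 1: two tables sharing the same partition key (the uuid).
session.execute("""
    CREATE TABLE IF NOT EXISTS item_status (
        id uuid PRIMARY KEY,
        status text
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS item_data (
        id uuid PRIMARY KEY,
        payload text,
        updated_at timestamp
    )
""")

# Option 2: a single table holding both groups of columns, so a query that
# needs status and payload together is a single partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS item (
        id uuid PRIMARY KEY,
        status text,
        payload text,
        updated_at timestamp
    )
""")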

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it also helps for DataFrames, since the Spark Core and Spark SQL engines optimize differently.
Let's say table 1 has 6 million records and we're joining it to table 2, which has 600 million records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table3 = table1.join(table2, 'key2', 'left_outer')
Is it worth filtering down table 2's 600 million records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide a performance boost if you significantly reduce the number of records to be shuffled. For that to happen:
The filter has to be highly selective.
The size of the key column has to be << the size of all columns.
The first one is obvious: if there is no significant reduction, you've done the extra work for nothing.
The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, the same as a join, so the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have a 100-fold difference in the number of records, it might be better to consider a broadcast join.
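A sketch of the options above, assuming DataFrames table1 (~6 million rows) and table2 (~600 million rows) that both carry the column key2; the paths are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

table1 = spark.read.parquet("/data/table1")  # hypothetical inputs
table2 = spark.read.parquet("/data/table2")

# Baseline from the question: plain left outer join.
table3 = table1.join(table2, "key2", "left_outer")

# Pre-filter table2 with a left semi join (the WHERE ... IN equivalent).
# Note the caveat above: the keys are shuffled once for the semi join and
# again for the outer join.
keys = table1.select("key2").distinct()
table2_filtered = table2.join(keys, "key2", "left_semi")
table3_filtered = table1.join(table2_filtered, "key2", "left_outer")

# With a ~100x size difference, broadcasting the small side's keys for the
# semi join avoids shuffling table2 at that step (if the keys fit in memory).
table2_bcast = table2.join(F.broadcast(keys), "key2", "left_semi")
table3_bcast = table1.join(table2_bcast, "key2", "left_outer")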

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or JOIN operators because I'm working with Cassandra (CQL).
Thanks in advance!
Frameworks like PlayOrm provide support for JOIN queries (INNER and LEFT JOINs) in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If you want to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you need to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - Cassandra doesn't cater for that kind of thing.
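For the "merge the results in code" route, here is a sketch using the DataStax Python driver; the keyspace, the second column family (otherCF) and its column names are made up for illustration.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")  # assumed keyspace

status_rows = session.execute(
    "SELECT key, status FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB')")
detail_rows = session.execute(
    "SELECT key, payload FROM otherCF WHERE key IN ('rowKeyA', 'rowKeyB')")

# Merge the two result sets on the row key client-side.
merged = {row.key: {"status": row.status} for row in status_rows}
for row in detail_rows:
    merged.setdefault(row.key, {})["payload"] = row.payload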
PlayOrm can do joins, but you may need to have PlayOrm partitioning turned on so you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically, instead, you join one partition with another partition, or a partition on the Account table with a partition on the Users table - i.e. make sure you still design for scale.
