Performing join operation on two huge tables using apache spark - apache-spark

I have 2 tables in my database. Each table is having 100 million rows.
Is there a way to join these 2 tables and extract data using apache spark in fastest way ?

I would say the most efficient way would be to use DataFrames and call join, followed by any other criteria. The benefit is that certain filters or selections will be pushed as far down as possible to cut down your network load...only the data that is needed will be pulled.
Without more information, that is the best suggestion I can give.

Related

What is the best practice of groupby in Spark SQL?

I have a Spark SQL that groupbys multiple columns. I was wondering if the order of the columns matter to the query performance.
Does placing the column with more distinct values earlier help? I assume the groupby is based on some hash/shuffle algorithm. If the first groupby can distribute data to small subsets that can be hold in one machine, the later groupbys can be done locally. Is this true?
What is the best practice of groupby?
group by, as you assumed, uses hash function on columns to decide which set of group by keys would end up in which partition.
You can use distribute by to tell spark which columns to use - https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
As for any other manipulation on the data (like placing more distinct values earlier), note that if have 2 group by statements in your query, you end up with 2 shuffles. And the result of the first one is obviously quite big (as it's not the final aggregation). So I would try to have as little group by statements as possible.

How do I create index on pyspark df?

I have bunch of hive tables.
I want to:
Pull the tables into a pyspark DF.
Do a UDF on them.
Join 4 tables based on customer id.
Is there a concept of indexing in spark to speed up the operation?
If so whats the command?
How do I create index on dataframe?
I understand your problem but the thing is, you acquire the data at the same time you process them. Therefore, calculating an index before joining is useless as it will take take more time to first create the index.
If you have several write operation, you may want to cache your data to speed up but otherwise, the index is not the solution to investigate.
There is maybe another thing you can try : df.repartition.
This will create partition on your df according to one column. But I have no idea if it can help.

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it helped for DataFrames since the Spark core and spark SQL engines optimize differently.
Let's say table 1 has 6mil records and we're joining to table 2 which has 600mil records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table 3 = table1.join(table2, 'key2', 'left_outer'),
Is it worth it to filter down table2's 600mil records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide performance boost if you significantly reduce number of records to be shuffled. To do that:
It has to be highly selective.
Size of the key column is << size of all columns.
The first one is obvious. If there is reduction you search for nothing.
The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, same a join. So the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have 100 fold difference in the number of records, it might be better to consider broadcast join.

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL)
Thanks in advance !
Framework like Playorm provide support for JOIN (INNER and LEFT JOINs)queries in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If your wanting to query multiple rows within the same column family you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If your needing to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - cassandra doesn't cater for that kind of thing.
PlayOrm can do joins, but you may need to have PlayOrm partitioning on so you still scale. (ie. you dont' want to join 1 billion rows with 1 billion rows). Typically instead you do a join of one partition with another partition or a partition on the Account table joining a partition on the Users table. ie. make sure you design for scale still.

How to get sorted counters from Cassandra

I have a row of counters and I want to get its' columns sorted by values. Is there any strategies or data models for that?
I'm afraid there is no way of having Cassandra do this for you; you will need to get the entire row from Cassandra (paging for large rows) and sort it in the client.
You could use a periodic MapReduce job to do this for you, and cache the result of the job back into Cassandra, if your solution can cope with non-uptodate results.

Resources