What should the join strategy be for multiple tables in Spark? - apache-spark

I have to perform a join between 3 tables in Spark: table 1 has ~2 billion records, table 2 has ~100 million records, and table 3 has ~200 million. What should the order of execution be? Can the CBO (cost-based optimizer) take care of it automatically?
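
A minimal sketch of letting the cost-based optimizer choose the join order, assuming Spark 2.2+ with the tables registered in the metastore (the FOR ALL COLUMNS statistics syntax needs Spark 3.x); table and key names here are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cbo-join-order")
             .config("spark.sql.cbo.enabled", "true")              # cost-based optimizer on
             .config("spark.sql.cbo.joinReorder.enabled", "true")  # allow join reordering
             .enableHiveSupport()
             .getOrCreate())

    # The CBO can only reorder joins when it has statistics, so compute them first.
    for t in ["table1", "table2", "table3"]:
        spark.sql(f"ANALYZE TABLE {t} COMPUTE STATISTICS FOR ALL COLUMNS")

    # With stats in place, write the joins naturally and let the optimizer pick
    # the cheapest order; result.explain(mode="cost") shows what it chose.
    result = spark.sql("""
        SELECT *
        FROM table1 t1
        JOIN table2 t2 ON t1.k = t2.k
        JOIN table3 t3 ON t1.k = t3.k
    """)

Without statistics the CBO falls back to the written order, so a common manual fallback is to join the two smaller tables first and bring in the 2-billion-row table last.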

Related

If multiple DataFrames are repartitioned on the same column with the same values, will they be colocated?

I have 10 DataFrames, and they all have a common column comm_col with a distinct count of, let's say, 100 per DataFrame.
If I repartition those 10 tables on comm_col, what are the chances that the partitions with the same values end up colocated on the same executor across DataFrames?
I am interested in colocation because I want to join all 10 tables, and repartitioning would break each table into common chunks, making the other parts of the table irrelevant to a given join task.
Let's say I have 5 executors, each with 3 cores and 10 GB of memory.
Each DataFrame has roughly 20 million rows.
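
A minimal sketch of the pattern in question, with generated stand-ins for the 10 tables (sizes and distinct counts follow the description above). Repartitioning every DataFrame the same way makes them co-partitioned, so the joins avoid further shuffles; whether matching partitions also land on the same executor is up to the scheduler, so co-location is not guaranteed:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("copartition-sketch").getOrCreate()

    # Stand-ins for the 10 real tables: ~20 million rows each, with a common
    # column comm_col holding 100 distinct values.
    dfs = [
        spark.range(20_000_000)
             .withColumn("comm_col", F.col("id") % 100)
             .withColumnRenamed("id", f"id_{i}")
        for i in range(10)
    ]

    # Same column and same partition count give identical hash partitioning,
    # so the joins below do not reshuffle either side (co-partitioned, not
    # necessarily co-located).
    parts = [df.repartition(100, "comm_col") for df in dfs]

    joined = parts[0]
    for other in parts[1:]:
        joined = joined.join(other, on="comm_col")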

A simple left join query taking a lot of time to produce output

In Azure Synapse I have two tables: table A with 6 million records and table B with 2 million. When I run a simple left join query it takes around 20 minutes to execute, but when I run the same join query on an on-premises SQL Server it returns output in 1 second. I have round-robin distribution in Synapse and the columns are indexed. What can be the reason for this?

Spark generates a lot of tasks although partition number is 1 (PySpark)

My code:
df = self.sql_context.sql(f"select max(id) as id from {table}")
return df.collect()[0][0]
My table is partitioned by id; it has 100M records but only 3 distinct ids.
I expected this query to run as 1 task and scan just the partition column (id).
I don't understand why I get 691 tasks for the collect line when there are just 3 partitions.
I guess the query is executing a full scan of the table, but I can't figure out why it doesn't just read the metadata.
Your df contains the result of an aggregation over the entire table; it holds only one row (with a single field, the max(id)), which is why it has only 1 partition.
But the original table's DataFrame may have many partitions (or only 1 partition whose computation needs ~600 stages, triggering 1 task per stage, which is not that common).
Without details on your parallelism configuration, input source type, and transformations, it is not easy to help more!
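
If the goal is only the largest partition value, one workaround (a sketch, assuming a Hive-style catalog table partitioned by id) is to list the partitions from the metastore instead of scanning the data files:

    # Partition values live in the metastore, so they can be read without
    # touching the data. Assumes `table` is a catalog table partitioned by id.
    def max_partition_id(spark, table):
        rows = spark.sql(f"SHOW PARTITIONS {table}").collect()
        # Each row holds a string like "id=42"; keep the value after '='.
        return max(int(r[0].split("=", 1)[1]) for r in rows)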

How to check the number of partitions that a table is distributed in a DolphinDB database?

How can I find out how many partitions of a DolphinDB database a table is distributed over? For example, if I created a database with 100 partitions and a table in the database only has data in 4 of them, how do I get the number 4?
This will do it:
sqlDS(<select * from t>).size()

Optimizing spark sql Joins

I have a requirement to perform multiple joins in a Spark SQL job. I have three data sources; let's call them a, b, and c. Each of them is read from S3 as plaintext CSV.
'a' and 'b' are large tables, with around 90 million and 50 million records respectively. Table 'c' is small, with just about 20k records.
The operation I am performing is
(a left join b where <filter something from a> group by <a.1,a.2,a.3,a.4>)
left join c
The final output has around 620k records, which should be the same as the number of filtered and grouped records from 'a' (the 'where' and 'group by' clauses above considerably reduce the amount of data from table 'a').
The operation currently takes around 17 minutes on my fairly powerful cluster. I would like to know if this can be reduced through optimization. Please advise.
Best Regards
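
Two standard levers for this shape of query are filtering 'a' before the join and broadcasting the 20k-row 'c' so the last join needs no shuffle. A sketch; the paths, keys, and column names below are placeholders, not from the original job:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-opt-sketch").getOrCreate()

    # Placeholder S3 paths; supplying an explicit schema instead of
    # inferSchema=True would also save a full extra pass over the CSVs.
    a = spark.read.csv("s3://bucket/a/", header=True, inferSchema=True)
    b = spark.read.csv("s3://bucket/b/", header=True, inferSchema=True)
    c = spark.read.csv("s3://bucket/c/", header=True, inferSchema=True)

    # Apply the filter on 'a' before the join so far less data is shuffled.
    a_filtered = a.where(F.col("filter_col") == "some_value")  # placeholder predicate

    # Left join then group, mirroring the original query shape.
    grouped = (a_filtered
               .join(b, on="join_key", how="left")             # placeholder key
               .groupBy("a1", "a2", "a3", "a4")
               .agg(F.count("*").alias("cnt")))

    # 'c' is ~20k rows: broadcasting it turns the final join into a broadcast
    # hash join, so the ~620k grouped rows are never reshuffled.
    result = grouped.join(broadcast(c), on="a1", how="left")   # placeholder key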