Power BI/Query avoid materialization of join results before GROUP BY - apache-spark

Objective
I'm using Power BI Desktop with DirectQuery against a Spark cluster. I want to join two tables and aggregate by the MONTH and DEP_NAME columns. The Fact table is 10GB+ (it contains the MONTH column), while the Department table is only a few KB (DEP_ID and DEP_NAME columns). The expected result is very small, about 100 rows.
Issue
Spark fails due to the following exception:
DataSource.Error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error
from server: error code: '0' error message: 'Error running query:
org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 10 tasks (4.1 GB) is bigger than
spark.driver.maxResultSize (4.0 GB)'.
I'm pretty sure Power BI tries to materialize the join result (10GB+) before applying the aggregation.
Question
Is there any way to keep Power BI from executing/materializing the join result before the aggregation is applied?
Power Query
let
    Source = ApacheSpark.Tables("https://xxxxxxxxx.azuredatabricks.net:443/sql/protocolv1/o/yyyyyyyyyyy", 2, [BatchSize=null]),
    #"Result" = Table.Group(
        Table.Join(
            Source{[Schema="default",Item="Fact"]}[Data],
            "DEP_ID",
            Table.RenameColumns(Source{[Schema="default",Item="Department"]}[Data], {"DEP_ID", "DEP_ID_"}),
            "DEP_ID_",
            JoinKind.Inner
        ),
        {"MONTH", "DEP_NAME"},
        {{"TOTAL_SALARY", each List.Sum([SALARY]), type number}}
    )
in
    #"Result"
Power Query failed job execution plan
From the Spark SQL execution plan you can see that there is no aggregation step, only the join! I think Power BI tries to load the join result (10GB+) through the Spark driver before applying the GROUP BY aggregation.
Expected execution plan
I can write the same job with PySpark:
from pyspark.sql import functions as F

dep = spark.read.csv(dep_path)
spark.read.parquet(fact_pat) \
    .join(F.broadcast(dep), ['DEP_ID']) \
    .groupBy('MONTH', 'DEP_NAME') \
    .agg(F.sum('SALARY')) \
    .show(1000)
The plan will be the following (pay attention to the hash-aggregate steps at the end):
P.S.
AFAIK, Power BI Desktop "View Native Query" is disabled for Spark DirectQuery.
UPD
Looks like the issue isn't in query folding: Power BI for some reason materializes the table before the GROUP BY even without the join. The following query leads to a full table load:
let
    Source = ApacheSpark.Tables("https://xxxxxxxx.azuredatabricks.net:443/sql/protocolv1/o/yyyyyyyyyyyy", 2, [BatchSize=null]),
    #"Result" = Table.Group(
        Source{[Schema="default",Item="Fact"]}[Data],
        {"MONTH", "DEP_ID"},
        {{"TOTAL_SALARY", each List.Sum([SALARY]), type number}}
    )
in
    #"Result"
Still, the full load happens only with the List.Sum function. List.Count and List.Max work fine, even with the table join before the GROUP BY.

Instead of joining then grouping, maybe you could do the reverse. Group by MONTH and DEP_ID from your Fact table and then join with the Department table to get the DEP_NAME.
Note: if multiple DEP_IDs map to the same DEP_NAME, you'll need to do one more group by after joining.
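
For illustration, here is a minimal PySpark sketch of that reordering (the file paths, the header option and the TOTAL_SALARY alias are placeholders, not from the question); the Power Query version would apply Table.Group before Table.Join in the same way:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dep = spark.read.csv("/path/to/department.csv", header=True)   # small dimension table (placeholder path)
fact = spark.read.parquet("/path/to/fact")                     # large fact table (placeholder path)

result = (
    fact
    .groupBy("MONTH", "DEP_ID")                                # aggregate before the join
    .agg(F.sum("SALARY").alias("TOTAL_SALARY"))
    .join(F.broadcast(dep), ["DEP_ID"])                        # join the tiny table last
    .select("MONTH", "DEP_NAME", "TOTAL_SALARY")
)

# If several DEP_IDs share one DEP_NAME, aggregate once more:
# result = result.groupBy("MONTH", "DEP_NAME").agg(F.sum("TOTAL_SALARY").alias("TOTAL_SALARY"))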

Related

Spark SQL performance issue on large table

We are connecting to Teradata from Spark SQL with the API below:
Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);
We are facing an issue when we execute the above logic on a large table with millions of rows: every time, we see the extra query below being executed, which results in a performance hit on the DB.
We got this information from the DBA; we don't have any logs on the Spark SQL side.
SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
1
1
1
1
1
1
1
1
1
Can you please clarify why this query is executed, or whether there is any chance that this type of query is issued from our own code while checking the row count of the DataFrame?
Please provide your inputs on this.
I searched various mailing lists and even created a Spark question.

Spark request only a partial sorting for row_number().over partitioned window

Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

max_rank = 500
ranking_order = (Window.partitionBy(['category', 'id'])
                 .orderBy(F.col('primary').desc(), F.col('secondary')))
df_ranked = (df
             .withColumn('rank', F.row_number().over(ranking_order))
             .where(F.col('rank') <= max_rank))
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4-worker x 40-core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my biggest bottleneck, so I'm trying to cut down the amount of data as early as possible.
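
For reference, the global top-n pattern mentioned above looks like the sketch below (the path and the value column are placeholders); its physical plan should show TakeOrderedAndProject rather than a full sort:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/data")        # placeholder path

# Global (not per-partition) top-n: the planner rewrites orderBy + limit
# into TakeOrderedAndProject, so the whole table is never fully sorted.
top_n = df.orderBy(F.desc("value")).limit(500)
top_n.explain()                                 # look for TakeOrderedAndProject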

Spark Join optimization

Let's say I have two dataframes, A and B, that I want to join using an inner join; each one has 100 columns and billions of rows.
If in my use case I'm only interested in 10 columns of A and 4 columns of B, does Spark do the optimization for me and shuffle only those 14 columns, or will it shuffle everything and then select the 14 columns?
Query 1:
A_select = A.select("{10 columns}").as("A")
B_select = B.select("{4 columns}").as("B")
result = A_select.join(B_select, $"A.id" === $"B.id")
Query 2:
A.join(B, $"A.id" === $"B.id").select("{14 columns}")
Is Query 1 == Query 2 in terms of behavior, execution time and data shuffling?
Thanks in advance for your answers.
Yes, Spark will handle the optimization for you. Due to its lazy evaluation, only the required attributes will be selected from the dataframes (A and B).
You can use the explain function to view the logical/physical plan:
result.explain()
Both queries will return the same physical plan, hence the execution time and data shuffling will be the same.
Reference: the PySpark documentation for the explain function.
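
A small sketch of that comparison (the paths and column names below are stand-ins, not from the question): both formulations should show the same column pruning pushed below the join.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
A = spark.read.parquet("/path/to/A")    # placeholder paths
B = spark.read.parquet("/path/to/B")

cols_a = ["id", "c1", "c2"]             # stand-ins for the 10 columns of A
cols_b = ["id", "d1"]                   # stand-ins for the 4 columns of B

q1 = A.select(*cols_a).join(B.select(*cols_b), on="id")
q2 = A.join(B, on="id").select("id", "c1", "c2", "d1")

q1.explain()    # compare the two physical plans: both should read only the
q2.explain()    # needed columns, so the shuffled data is the same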

Spark isolate time taken for reading from external DB

I am looking to measure how much time is taken in my Spark job by the IO part of reading from an external DB. My code is:
val query = s"""
    |(
    | select
    | ...
    |) as project_data_tmp """.stripMargin

sparkSession.time(
  sparkSession.read.jdbc(
    url = msqlURLWithCreds,
    table = query,
    properties = new Properties()
  )
)
sparkSession.time doesn't seem to do anything in-depth enough to measure the full load time of the SQL (no action is triggered inside it, so the data is not actually read).
The web UI is giving me timing for the entire stage.
The green box is my read-and-cache of the DataFrame.
The only way I could come up with to split the read into a separate stage was to perform an operation that required shuffling data, but that introduced its own overhead.
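
One rough way to isolate the read (a sketch in PySpark, assuming the same JDBC source; the URL and query strings are placeholders) is to wrap an action that materializes every column, such as the cache-then-count you already perform, in a plain wall-clock timer:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://..."                  # placeholder connection string with credentials
query = "( select ... ) as project_data_tmp"       # placeholder pushdown query

df = spark.read.jdbc(url=jdbc_url, table=query, properties={})   # lazy: nothing is read yet

start = time.time()
df.cache().count()   # count() on the freshly cached frame forces the full JDBC scan
print(f"JDBC read + cache took {time.time() - start:.1f}s")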
Thanks,
Brent

Error while running range query on multiple clustering columns using spark cassandra connector:

Following is the Cassandra table schema:
CREATE TABLE my_table (
    year  text,
    month text,
    day   text,
    hour  int,
    min   int,
    sec   int,
    PRIMARY KEY ((year, month, day), hour, min, sec)
)
If I run the following query using Cassandra CQL, it works:
SELECT * FROM my_table WHERE year ='2017' and month ='01' and day ='16' and (hour,min,sec) > (1,15,0) LIMIT 200
However, when I run the same query using the spark-cassandra-connector, it does not work:
sparkSession.read().format("org.apache.spark.sql.cassandra").options(map).load()
    .where("year = '2017' and month = '01' and day = '16' and (hour,min,sec) >= (1,15,0)");
I am getting the following exception in the logs:
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> cannot resolve '(struct(`hour`, `min`, `sec`) >= struct(1, 15, 0))'
> due to data type mismatch: differing types in '(struct(`hour`, `min`,
> `sec`) >= struct(1, 15, 0))' and (struct<hour:int,min:int,sec:int>
> struct<col1:int,col2:int,col3:int>).; line 1 pos 96
Spark-cassandra-connector version:2.0.0-M3
Spark-version:2.0.0
Any help is much appreciated
Quite simply, CQL is not Spark SQL or Catalyst compatible. What you are seeing is a conflict in syntax.
This where clause:
.where(year ='2017' and month ='01' and day ='16' and (hour,min,sec) >= (1,15,0)
is not directly pushed down to Cassandra. Instead, it is transformed into Catalyst predicates, and this is where you have a problem.
Catalyst sees this:
(hour,min,sec) >= (1,15,0)
and tries to assign types to it.
The left-hand side becomes
struct<hour:int,min:int,sec:int>
The right-hand side becomes
struct<col1:int,col2:int,col3:int>
These are not tuples but explicitly typed structs, and they cannot be directly compared, hence your error. In the DataFrame API you would just define a new struct with the correct types and make a literal of it, but I'm not sure how to express that in Spark SQL.
Regardless, this tuple predicate will not be pushed down to Cassandra. The struct you are defining from hour, min, sec is hidden from Cassandra because the underlying table doesn't provide a struct<hour, min, sec>, which means Spark thinks it needs to generate it after pulling the data from Cassandra.
You are better off just using separate clauses combined with AND, as mentioned by @AkashSethi; a minimal sketch follows below.
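
A sketch of that separate-clause form in PySpark (the Java/Scala call is analogous; the keyspace name is a placeholder). Note that independent per-column bounds are not exactly the same as CQL's lexicographic (hour, min, sec) >= (1, 15, 0) comparison, so tighten the result in Spark afterwards if you need the exact tuple semantics:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")    # placeholder keyspace
      .load())

result = df.where(
    "year = '2017' AND month = '01' AND day = '16' "
    "AND hour >= 1 AND min >= 15 AND sec >= 0"
)
# The partition-key equalities and the leading clustering-column range can be
# pushed down to Cassandra; the remaining predicates are applied in Spark.
result.explain()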
