Spark SQL view and partition column usage - apache-spark

I have a Databricks table (parquet not delta) "TableA" with a partition column "dldate", and it has ~3000 columns.
When I issue select * from TableA where dldate='2022-01-01', the query completes in seconds.
I have a view "view_tableA" which reads from "TableA" and performs some window functions on some of the columns.
When I issue select * from view_tableA where dldate='2022-01-01', the query runs forever.
Will the latter query effectively use the partition key of the table? If not, is there any optimization I can do to make sure the partition key is used?

If the partitioning of every window function is aligned with the table partitioning, the optimizer can push the predicate down to the table level and apply partition pruning.
For example:
SELECT *
FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
WHERE dldate = '2022-01-01';
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
   +- Sort [dldate#2932 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(dldate#2932, 200), ...
         +- Project [dldate#2932, a#2933]
            +- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
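Compare this with a query whose window function is not partitioned by dldate. Here the filter cannot be pushed below the Window, because filtering first would change the rows each window sees, so the scan reads every partition: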
SELECT *
FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
WHERE dldate = '2022-01-01';
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01)) << !!!
   +- Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
      +- Sort [a#2969 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(a#2969, 200), ...
            +- Project [dldate#2968, a#2969]
               +- FileScan parquet tablea PartitionFilters: [] << !!!
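A quick way to check a view query for pruning is to look at the PartitionFilters entry on the FileScan line of the explain output. The helper below is a minimal sketch in plain Python, assuming the plan text has already been captured as a string (for example by copying the EXPLAIN output). The names partition_filters and uses_partition_pruning are hypothetical helpers, not part of any Spark API, and the sample strings are shortened versions of the two plans above.

```python
import re

def partition_filters(plan_text):
    """Return the PartitionFilters contents of every FileScan line in a plan.

    Assumes the plan prints 'PartitionFilters: [...]' on the FileScan line,
    as in the plans shown above. Hypothetical helper, not a Spark API.
    """
    found = []
    for line in plan_text.splitlines():
        if "FileScan" in line:
            match = re.search(r"PartitionFilters: \[([^\]]*)\]", line)
            if match:
                found.append(match.group(1).strip())
    return found

def uses_partition_pruning(plan_text):
    """True only if every FileScan in the plan has a non-empty PartitionFilters list."""
    filters = partition_filters(plan_text)
    return bool(filters) and all(f != "" for f in filters)

# Window partitioned by dldate: the predicate reaches the scan.
aligned_plan = """\
+- Project [dldate#2932, a#2933]
   +- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
"""

# Window partitioned by another column: the scan has no partition filter.
unaligned_plan = """\
+- Project [dldate#2968, a#2969]
   +- FileScan parquet tablea PartitionFilters: []
"""

print(uses_partition_pruning(aligned_plan))    # True
print(uses_partition_pruning(unaligned_plan))  # False
```

Running the same check on the plan of select * from view_tableA where dldate='2022-01-01' would show whether the view's window functions block the pushdown.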

Related

Apache Spark: broadcast join behaviour: filtering of joined tables and temp tables

I need to join 2 tables in spark.
But instead of joining 2 tables completely, I first filter out a part of second table:
spark.sql("select * from a join b on a.key=b.key where b.value='xxx' ")
I want to use broadcast join in this case.
Spark has a parameter which defines max table size for broadcast join: spark.sql.autoBroadcastJoinThreshold:
Configures the maximum size in bytes for a table that will be
broadcast to all worker nodes when performing a join. By setting this
value to -1 broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the
command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been
run. http://spark.apache.org/docs/2.4.0/sql-performance-tuning.html
I have following questions about this setup:
1. Which table size will Spark compare with the value of autoBroadcastJoinThreshold: the FULL size, or the size AFTER applying the where clause?
2. I am assuming that Spark will apply the where clause BEFORE broadcasting, correct?
3. The doc says I need to run Hive's ANALYZE TABLE command beforehand. How will that work when I am using a temp view as a table? As far as I understand, I cannot run ANALYZE TABLE against a Spark temp view created via dataFrame.createOrReplaceTempView("b"). Can I broadcast the temp view contents?
Your understanding in question 2 is correct.
You cannot run ANALYZE TABLE on a temp view in Spark.
If you want to choose the DataFrame to broadcast yourself, instead of letting Spark decide, you can use the snippet below:
from pyspark.sql import functions as F
df = df1.join(F.broadcast(df2), df1.some_col == df2.some_col, "left")
I went ahead and did some small experiments to answer your 1st question.
Question 1 :
created a dataframe a with 3 rows [key,df_a_column]
created a dataframe b with 10 rows [key,value]
ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]
As expected, the smaller DataFrame a (3 rows) is broadcast.
Ran : spark.sql("SELECT * FROM a JOIN b ON a.key = b.key where b.value=\"bat\"").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]
Here DataFrame b is broadcast, which suggests that, at least in this experiment, Spark evaluates the size AFTER applying the where clause when choosing which side to broadcast.
Question 2 :
Yes, you are right. The previous output shows that the where clause is applied first.
Question 3 :
No, you cannot analyze a temp view, but you can still broadcast it by giving Spark a hint, even in SQL.
Example : spark.sql("SELECT /*+ BROADCAST(b) */ * FROM a JOIN b ON a.key = b.key")
If you look at the explain output now:
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]
Now DataFrame b is broadcast even though it has 10 rows.
In question 1, without the hint, a was broadcast.
Note: the broadcast hint in Spark SQL is available from version 2.2.
Tips for reading the physical plan:
Identify each DataFrame from the column list in LocalTableScan [ list of columns ].
The DataFrame in the subtree under BroadcastExchange is the one being broadcast.
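The two tips above can be sketched as a small plain-Python helper that reads an explain string and reports which columns sit under a BroadcastExchange. This is only an illustration: marker_depth and broadcast_columns are hypothetical names, not Spark APIs, and the parsing assumes the usual tree layout where children are introduced by "+-" or ":-" markers and deeper nodes are indented further.

```python
import re

def marker_depth(line):
    """Column at which the '+-' or ':-' tree marker starts; -1 for the root line."""
    match = re.match(r"([ :]*)[+:]- ", line)
    return len(match.group(1)) if match else -1

def broadcast_columns(plan_text):
    """Return the column names of each LocalTableScan under a BroadcastExchange."""
    lines = plan_text.splitlines()
    result = []
    for i, line in enumerate(lines):
        if "BroadcastExchange" not in line:
            continue
        depth = marker_depth(line)
        for child in lines[i + 1:]:
            if marker_depth(child) <= depth:
                break  # left the BroadcastExchange subtree
            scan = re.search(r"LocalTableScan \[([^\]]*)\]", child)
            if scan:
                # Drop the '#123' expression ids to keep only the column names.
                cols = [c.strip().split("#")[0] for c in scan.group(1).split(",")]
                result.append(cols)
    return result

plan_without_hint = """\
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]
"""

plan_with_hint = """\
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]
"""

print(broadcast_columns(plan_without_hint))  # [['key', 'df_a_column']]
print(broadcast_columns(plan_with_hint))     # [['key', 'value']]
```

The first result identifies DataFrame a (it has df_a_column) and the second identifies DataFrame b (it has value), matching the manual reading of the plans.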

Exchange on WINDOW operation when the data is already partitioned and exchanged

The query below EXCHANGES data to compute row_number() over a column on which the data in both data sets is already distributed.
The table crime_parquet_table is bucketed on the column Incident_Number with 30 buckets.
explain
select *
from (
    select *,
           row_number() over(partition by Incident_Number order by Incident_Number) as rnk
    from (
        select *
        from (
            select /*+ REPARTITION(30,Incident_Number) */ *
            from austin_crime_data_new
            union all
            select *
            from crime_parquet_table
        ) tmp
    ) t
) d
where rnk = 1
Below is the explain plan.
I believe the +- Exchange (7) is not necessary.
Let me know if you think otherwise, and why.
== Physical Plan ==
* Filter (10)
+- Window (9)
   +- * Sort (8)
      +- Exchange (7)
         +- Union (6)
            :- Exchange (3)
            :  +- * Project (2)
            :     +- Scan csv (1)
            +- * ColumnarToRow (5)
               +- Scan parquet default.crime_parquet_table (4)
I got this working with the alternative approach below.
UNION ALL does not treat this as a special case; it handles its inputs like any other tables, so their existing partitioning is not reused.
So first I saved my intermediate data as a bucketed table as well:
create table crime_parquet_table2
using parquet
CLUSTERED BY (Incident_Number) INTO 30 BUCKETS
location 'path2'
select /*+ REPARTITION(30,Incident_Number) */ *
from austin_crime_data_new
Now I create a table using LIKE over both paths (each holds bucketed data):
CREATE TABLE crime_data like crime_parquet_table2
location 'path1,path2'
;
The query on this new table now gives the expected plan:
explain
select *
from (
    select *,
           row_number() over(partition by Incident_Number order by Incident_Number) as rnk
    from crime_data
)
where rnk = 1
PLAN
== Physical Plan ==
* Filter (5)
+- Window (4)
   +- * Sort (3)
      +- * ColumnarToRow (2)
         +- Scan parquet default.crime_data (1)
So the extra work is persisting the intermediate data to storage, but the shuffle is eliminated.

How do group by and window functions interact in Spark SQL?

From this question, I learned that window functions are evaluated after the group by clause in PostgreSQL.
I'd like to know what happens when you use a group by and window function in the same query in Spark. I have the same questions as the poster from the previous question:
Are the selected rows grouped first, then considered by the window function ?
Or does the window function execute first, then the resulting values are grouped by the group by?
Something else?
If a window function and a group by appear in the same query, the group by is performed first and the window function is then applied to the grouped dataset.
You can check the query's explain plan for more details.
Example:
//sample data
spark.sql("select * from tmp").show()
//+-------+----+
//|trip_id|name|
//+-------+----+
//|      1|   a|
//|      2|   b|
//+-------+----+
spark.sql("select row_number() over(order by trip_id),trip_id,count(*) cnt from tmp group by trip_id").explain()
//== Physical Plan ==
//*(4) Project [row_number() OVER (ORDER BY trip_id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#150, trip_id#10, cnt#140L]
//+- Window [row_number() windowspecdefinition(trip_id#10 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS row_number() OVER (ORDER BY trip_id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#150], [trip_id#10 ASC NULLS FIRST]
//   +- *(3) Sort [trip_id#10 ASC NULLS FIRST], false, 0
//      +- Exchange SinglePartition
//         +- *(2) HashAggregate(keys=[trip_id#10], functions=[count(1)])
//            +- Exchange hashpartitioning(trip_id#10, 200)
//               +- *(1) HashAggregate(keys=[trip_id#10], functions=[partial_count(1)])
//                  +- LocalTableScan [trip_id#10]
*(2) HashAggregate: the group by is executed first.
The Window node, followed by the *(4) Project, is then applied to the grouped dataset.
If the window function is in a subquery and the outer query has the group by, the subquery (window) is executed first and the outer query (group by) next.
Ex:
spark.sql("select trip_id,count(*) from(select *,row_number() over(order by trip_id)rn from tmp)e group by trip_id ").explain()
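To make the two orderings concrete, the sketch below mimics both queries in plain Python, with no Spark involved. The rows follow the tmp table above, plus one extra row with a duplicate trip_id; that extra row is an assumption added purely so the two orders give visibly different results. The function names are hypothetical.

```python
from collections import Counter

# Rows of the sample tmp table as (trip_id, name).
# The duplicate trip_id 1 is an addition for illustration only.
rows = [(1, "a"), (1, "c"), (2, "b")]

def group_then_window(rows):
    """GROUP BY trip_id first, then row_number() OVER (ORDER BY trip_id)."""
    counts = Counter(trip_id for trip_id, _ in rows)
    grouped = sorted(counts.items())  # one row per trip_id: (trip_id, cnt)
    # The window only sees the grouped rows, so it numbers 2 rows.
    return [(rn, trip_id, cnt) for rn, (trip_id, cnt) in enumerate(grouped, start=1)]

def window_then_group(rows):
    """row_number() in a subquery first, then GROUP BY trip_id in the outer query."""
    ordered = sorted(rows, key=lambda r: r[0])
    # The window sees every input row, so it numbers 3 rows.
    numbered = [(trip_id, name, rn) for rn, (trip_id, name) in enumerate(ordered, start=1)]
    counts = Counter(trip_id for trip_id, _, _ in numbered)
    row_numbers = [rn for _, _, rn in numbered]
    return sorted(counts.items()), row_numbers

print(group_then_window(rows))   # [(1, 1, 2), (2, 2, 1)]
print(window_then_group(rows))   # ([(1, 2), (2, 1)], [1, 2, 3])
```

In the first case the row numbers run over the grouped result (two rows); in the second they run over the raw rows (three rows) before the outer group by collapses them.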

Spark is sorting already sorted partitions resulting in performance loss

For a cached dataframe, partitioned and sorted within partitions, I get good performance when querying the key with a where clause but bad performance when performing a join with a small table on the same key.
See example dataset dftest below with 10Kx44K = 438M rows.
sqlContext.sql(f'set spark.sql.shuffle.partitions={32}')
sqlContext.clearCache()
sc.setCheckpointDir('/checkpoint/temp')
import datetime
from pyspark.sql.functions import *
from pyspark.sql import Row
start_date = datetime.date(1900, 1, 1)
end_date = datetime.date(2020, 1, 1)
dates = [ start_date + datetime.timedelta(n) for n in range(int ((end_date - start_date).days))]
dfdates=spark.createDataFrame(list(map(lambda x: Row(date=x), dates))) # some dates
dfrange=spark.createDataFrame(list(map(lambda x: Row(number=x), range(10000)))) # some number range
dfjoin = dfrange.crossJoin(dfdates)
dftest = dfjoin.withColumn("random1", round(rand()*(10-5)+5,0)).withColumn("random2", round(rand()*(10-5)+5,0)).withColumn("random3", round(rand()*(10-5)+5,0)).withColumn("random4", round(rand()*(10-5)+5,0)).withColumn("random5", round(rand()*(10-5)+5,0)).checkpoint()
dftest = dftest.repartition("number").sortWithinPartitions("number", "date").cache()
dftest.count() # 438,290,000 rows
The following query now takes roughly a second (on a small cluster with 2 workers):
dftest.where("number = 1000 and date = \"2001-04-04\"").count()
However, when I write a similar condition as a join, it takes 2 minutes:
dfsub = spark.createDataFrame([(10,"1900-01-02",1),
(1000,"2001-04-04",2),
(4000,"2002-05-05",3),
(5000,"1950-06-06",4),
(9875,"1980-07-07",5)],
["number","date", "dummy"]).repartition("number").sortWithinPartitions("number", "date").cache()
df_result = dftest.join(dfsub, ( dftest.number == dfsub.number ) & ( dftest.date == dfsub.date ), 'inner').cache()
df_result.count() # takes 2 minutes (result = 5)
I would have expected this to be roughly equally fast. Especially since I would hope that the larger dataframe is already clustered and cached. Looking at the plan:
== Physical Plan ==
InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L]
   +- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *(3) SortMergeJoin [number#771L, cast(date#769 as string)], [number#945L, date#946], Inner
         :- *(1) Sort [number#771L ASC NULLS FIRST, cast(date#769 as string) ASC NULLS FIRST], false, 0
         :  +- *(1) Filter (isnotnull(number#771L) && isnotnull(date#769))
         :     +- InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], [isnotnull(number#771L), isnotnull(date#769)]
         :        +- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], StorageLevel(disk, memory, deserialized, 1 replicas)
         :           +- Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0
         :              +- Exchange hashpartitioning(number#771L, 32)
         :                 +- *(1) Scan ExistingRDD[number#771L,date#769,random1#775,random2#779,random3#784,random4#790,random5#797]
         +- *(2) Filter (isnotnull(number#945L) && isnotnull(date#946))
            +- InMemoryTableScan [number#945L, date#946, dummy#947L], [isnotnull(number#945L), isnotnull(date#946)]
               +- InMemoryRelation [number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- Sort [number#945L ASC NULLS FIRST, date#946 ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(number#945L, 32)
                        +- *(1) Scan ExistingRDD[number#945L,date#946,dummy#947L]
A lot of time seems to be spent sorting the larger dataframe by number and date (this line: Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0). It leaves me with the following questions:
within the partitions, the sort order for both the left and right side is exactly the same, and optimal for the JOIN clause, why is Spark still sorting the partitions again?
as the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
Let me try to answer your three questions:
within the partitions, the sort order for both the left and right side is exactly the same, and optimal for the JOIN clause, why is Spark still sorting the partitions again?
The sort order in the two DataFrames is NOT the same, because the sorting column date has different data types: it is StringType in dfsub and DateType in dftest. During the join Spark therefore sees a different ordering in the two branches and forces the Sort (note the cast(date#769 as string) in the plan).
as the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
During the query plan processing Spark does not know how many partitions are non-empty in the small DataFrame and thus it has to process all of them.
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
The Spark optimizer does use the information from repartition and sortWithinPartitions, but there are some caveats about how it works. To fix your query it is also important to repartition by the same columns (both of them) that you use in the join, not just one. In principle this should not be necessary, and there is a related JIRA in progress that is trying to solve it.
So here are my proposed changes to your query:
Change the type of date column to StringType in dftest (Or similarly change to DateType in dfsub):
dftest.withColumn("date", col("date").cast('string'))
In both DataFrames change
.repartition("number")
to
.repartition("number", "date")
After these changes you should get a plan like this:
*(3) SortMergeJoin [number#1410L, date#1653], [number#1661L, date#1662], Inner
:- Sort [number#1410L ASC NULLS FIRST, date#1653 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(number#1410L, date#1653, 200)
:     +- *(1) Project [number#1410L, cast(date#1408 as string) AS date#1653, random1#1540, random2#1544, random3#1549, random4#1555, random5#1562]
:        +- *(1) Filter (isnotnull(number#1410L) && isnotnull(cast(date#1408 as string)))
:           +- *(1) Scan ExistingRDD[number#1410L,date#1408,random1#1540,random2#1544,random3#1549,random4#1555,random5#1562]
+- Sort [number#1661L ASC NULLS FIRST, date#1662 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(number#1661L, date#1662, 200)
      +- *(2) Filter (isnotnull(number#1661L) && isnotnull(date#1662))
         +- *(2) Scan ExistingRDD[number#1661L,date#1662,dummy#1663L]
So there is only one Exchange and one Sort in each branch of the plan. Both come from the repartition and sortWithinPartitions calls in your transformations, and the join does not induce any further sorting or shuffling. Also notice that my plan has no InMemoryTableScan, since I did not use cache.

Spark.table() with limit seems to read the whole table

I am trying to read the first 20000 rows of a large table (10 billion+ rows) from Spark, so I use the following lines of code.
df = spark.table("large_table").limit(20000).repartition(1)
df.explain()
And the explain plan looks like this.
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(2) GlobalLimit 20000
   +- Exchange SinglePartition
      +- *(1) LocalLimit 20000
         +- *(1) FileScan parquet large_table[...]
But when I try to write this df into a new table, it seems to kick off a huge number of tasks that read the whole table first and only then begin writing to the final table. Why does Spark not read only the first few files to reach the row limit?
