Spark.table() with limit seems to read the whole table - apache-spark

I am trying to read the first 20,000 rows of a large table (10+ billion rows) from Spark, so I use the following lines of code.
df = spark.table("large_table").limit(20000).repartition(1)
df.explain()
And the explain plan looks like this.
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(2) GlobalLimit 20000
   +- Exchange SinglePartition
      +- *(1) LocalLimit 20000
         +- *(1) FileScan parquet large_table[...]
But when I try to write this df into a new table, it seems to kick off an insane number of tasks trying to read the whole table first and then begin writing to the final table! Why does spark not read only the first few files and get the row limit?

Related

Spark SQL view and partition column usage

I have a Databricks table (parquet not delta) "TableA" with a partition column "dldate", and it has ~3000 columns.
When I issue select * from TableA where dldate='2022-01-01', the query completes in seconds.
I have a view "view_tableA" which reads from "TableA" and performs some window functions on some of the columns.
When I issue select * from view_tableA where dldate='2022-01-01', the query runs forever.
Will the latter query effectively use the partition key of the table? If not, is there any optimization I can do to make sure the partition key is used?
If the partitioning of every window function in the view is aligned with the table partitioning, the optimizer can push the predicate down to the table level and apply partition pruning.
For example:
SELECT *
FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
WHERE dldate = '2022-01-01';
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
   +- Sort [dldate#2932 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(dldate#2932, 200), ...
         +- Project [dldate#2932, a#2933]
            +- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
Compare this with a query containing a window function that is not partitioned by dldate. Here the filter stays above the Window node and the scan gets no partition filters:
SELECT *
FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
WHERE dldate = '2022-01-01';
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01)) << !!!
   +- Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
      +- Sort [a#2969 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(a#2969, 200), ...
            +- Project [dldate#2968, a#2969]
               +- FileScan parquet tablea PartitionFilters: [] << !!!
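To see why the filter can only move below the window when the window is partitioned by dldate, here is a small pure-Python sketch. It does not use Spark; the toy rows and the helper names (window_sum, keep_date, same_result) are made up for illustration. It compares "filter after the window" with "filter before the window" for both partitioning choices.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for TableA: (dldate, a)
rows = [
    ("2022-01-01", 1),
    ("2022-01-01", 2),
    ("2022-01-02", 1),
    ("2022-01-02", 5),
]


def window_sum(rows, key_index):
    """Append sum(a) over (partition by the column at key_index) to each row."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[1]
    return [row + (totals[row[key_index]],) for row in rows]


def keep_date(rows, date):
    """Equivalent of WHERE dldate = date."""
    return [row for row in rows if row[0] == date]


def same_result(key_index, date="2022-01-01"):
    """True if filtering before the window gives the same rows as filtering after."""
    filter_after = keep_date(window_sum(rows, key_index), date)
    filter_before = window_sum(keep_date(rows, date), key_index)
    return filter_after == filter_before


pushdown_safe_by_dldate = same_result(key_index=0)  # partition by dldate
pushdown_safe_by_a = same_result(key_index=1)       # partition by a
print(pushdown_safe_by_dldate, pushdown_safe_by_a)
```

With partition by a, rows from other dates contribute to the sums, so filtering first would change the result. That is why the Filter has to stay above the Window in the second plan.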

How to get the execution time of EACH operator in Spark SQL

for example, I have a table like:
name | height
mike 80
dan 11
And I have a Spark SQL like "select distinct name from table where height = 80"
In that case, I will get a Physical Plan like this:
== Physical Plan ==
CollectLimit 21
+- HashAggregate(keys=[name#0], functions=[], output=[name#0])
   +- Exchange hashpartitioning(name#0, 200)
      +- HashAggregate(keys=[name#0], functions=[], output=[name#0])
         +- Project [name#0]
            +- Filter (isnotnull(height#1L) && (height#1L = 80))
               +- Scan ExistingRDD[name#0,height#1L]
Here is the problem: I'd like to get the execution time of EACH operator (such as Project, Filter...). I checked the Spark UI, and it seems that Spark records the execution time for each operator (the original post linked a screenshot of the Spark UI here).
I wish to get those results in my code.
So how can I get what I need?

Apache Spark: broadcast join behaviour: filtering of joined tables and temp tables

I need to join 2 tables in spark.
But instead of joining 2 tables completely, I first filter out a part of second table:
spark.sql("select * from a join b on a.key=b.key where b.value='xxx' ")
I want to use broadcast join in this case.
Spark has a parameter which defines max table size for broadcast join: spark.sql.autoBroadcastJoinThreshold:
Configures the maximum size in bytes for a table that will be
broadcast to all worker nodes when performing a join. By setting this
value to -1 broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the
command ANALYZE TABLE COMPUTE STATISTICS noscan has been
run. http://spark.apache.org/docs/2.4.0/sql-performance-tuning.html
I have following questions about this setup:
1. Which table size will Spark compare with autoBroadcastJoinThreshold's value: the FULL size, or the size AFTER applying the where clause?
2. I am assuming that Spark will apply the where clause BEFORE broadcasting, correct?
3. The doc says I need to run Hive's Analyze Table command beforehand. How will it work when I am using a temp view as a table? As far as I understand, I cannot run the Analyze Table command against a Spark temp view created via dataFrame.createOrReplaceTempView("b"). Can I broadcast temp view contents?
Your understanding in question 2 is correct.
You cannot analyze a temp view in Spark.
If you want to choose the DataFrame to broadcast yourself, instead of letting Spark decide, you can use the snippet below:
from pyspark.sql import functions as F
df = df1.join(F.broadcast(df2), df1.some_col == df2.some_col, "left")
I went ahead and did some small experiments to answer your 1st question.
Question 1 :
created a dataframe a with 3 rows [key,df_a_column]
created a dataframe b with 10 rows [key,value]
ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]
As expected, the smaller DataFrame a (3 rows) is broadcast.
Ran : spark.sql("SELECT * FROM a JOIN b ON a.key = b.key where b.value=\"bat\"").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]
Here you can see that DataFrame b is now the one being broadcast (BuildRight), which means Spark evaluates the size AFTER applying the where clause when choosing which side to broadcast.
Question 2 :
Yes, you are right. The previous output shows that the where clause is applied first.
Question 3 :
No, you cannot analyze a temp view, but you can still broadcast it by giving Spark a hint, even in SQL.
Example : spark.sql("SELECT /*+ BROADCAST(b) */ * FROM a JOIN b ON a.key = b.key")
If you look at the explain output now:
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]
Now DataFrame b is broadcast even though it has 10 rows.
In question 1, without the hint, a was broadcast.
Note: the broadcast hint in Spark SQL is available from Spark 2.2.
Tips for reading the physical plan:
Identify each DataFrame from the column list in LocalTableScan [list of columns].
The DataFrame that appears in the subtree under BroadcastExchange is the one being broadcast.
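The second tip can be automated with a few lines of plain Python. This is only a sketch: it assumes the plan text still has the indentation that explain() prints (children are indented further than their parent), the helper name broadcast_scans is made up, and the HashedRelationBroadcastMode arguments are abbreviated to "...".

```python
def broadcast_scans(plan_text):
    """Return the scan nodes that sit underneath a BroadcastExchange node."""
    found = []
    broadcast_depth = None  # depth of the BroadcastExchange we are inside, if any
    for line in plan_text.splitlines():
        # Drop the tree-drawing characters; what is left is the operator itself.
        node = line.lstrip(" :+-")
        if not node:
            continue
        depth = len(line) - len(node)
        if broadcast_depth is not None and depth <= broadcast_depth:
            broadcast_depth = None  # we have left the broadcast subtree
        if node.startswith("BroadcastExchange"):
            broadcast_depth = depth
        elif broadcast_depth is not None and "Scan" in node:
            found.append(node)
    return found


# Plan from the first experiment above, with indentation and abbreviated arguments.
plan = "\n".join([
    "*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false",
    ":- BroadcastExchange HashedRelationBroadcastMode(...)",
    ":  +- LocalTableScan [key#122, df_a_column#123]",
    "+- *(1) LocalTableScan [key#111, value#112]",
])

broadcast_side = broadcast_scans(plan)
print(broadcast_side)
```

The scan with df_a_column is the one reported, which matches the observation that DataFrame a is broadcast when no hint or filter is used.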

Spark is sorting already sorted partitions resulting in performance loss

For a cached dataframe, partitioned and sorted within partitions, I get good performance when querying the key with a where clause but bad performance when performing a join with a small table on the same key.
See example dataset dftest below with 10Kx44K = 438M rows.
sqlContext.sql(f'set spark.sql.shuffle.partitions={32}')
sqlContext.clearCache()
sc.setCheckpointDir('/checkpoint/temp')
import datetime
from pyspark.sql.functions import *
from pyspark.sql import Row
start_date = datetime.date(1900, 1, 1)
end_date = datetime.date(2020, 1, 1)
dates = [ start_date + datetime.timedelta(n) for n in range(int ((end_date - start_date).days))]
dfdates=spark.createDataFrame(list(map(lambda x: Row(date=x), dates))) # some dates
dfrange=spark.createDataFrame(list(map(lambda x: Row(number=x), range(10000)))) # some number range
dfjoin = dfrange.crossJoin(dfdates)
dftest = dfjoin.withColumn("random1", round(rand()*(10-5)+5,0)).withColumn("random2", round(rand()*(10-5)+5,0)).withColumn("random3", round(rand()*(10-5)+5,0)).withColumn("random4", round(rand()*(10-5)+5,0)).withColumn("random5", round(rand()*(10-5)+5,0)).checkpoint()
dftest = dftest.repartition("number").sortWithinPartitions("number", "date").cache()
dftest.count() # 438,290,000 rows
The following query now takes roughly a second (on a small cluster with 2 workers):
dftest.where("number = 1000 and date = \"2001-04-04\"").count()
However, when I write a similar condition as a join, it takes 2 minutes:
dfsub = spark.createDataFrame([(10,"1900-01-02",1),
(1000,"2001-04-04",2),
(4000,"2002-05-05",3),
(5000,"1950-06-06",4),
(9875,"1980-07-07",5)],
["number","date", "dummy"]).repartition("number").sortWithinPartitions("number", "date").cache()
df_result = dftest.join(dfsub, ( dftest.number == dfsub.number ) & ( dftest.date == dfsub.date ), 'inner').cache()
df_result.count() # takes 2 minutes (result = 5)
I would have expected this to be roughly equally fast. Especially since I would hope that the larger dataframe is already clustered and cached. Looking at the plan:
== Physical Plan ==
InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L]
+- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(3) SortMergeJoin [number#771L, cast(date#769 as string)], [number#945L, date#946], Inner
      :- *(1) Sort [number#771L ASC NULLS FIRST, cast(date#769 as string) ASC NULLS FIRST], false, 0
      :  +- *(1) Filter (isnotnull(number#771L) && isnotnull(date#769))
      :     +- InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], [isnotnull(number#771L), isnotnull(date#769)]
      :        +- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], StorageLevel(disk, memory, deserialized, 1 replicas)
      :           +- Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0
      :              +- Exchange hashpartitioning(number#771L, 32)
      :                 +- *(1) Scan ExistingRDD[number#771L,date#769,random1#775,random2#779,random3#784,random4#790,random5#797]
      +- *(2) Filter (isnotnull(number#945L) && isnotnull(date#946))
         +- InMemoryTableScan [number#945L, date#946, dummy#947L], [isnotnull(number#945L), isnotnull(date#946)]
            +- InMemoryRelation [number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- Sort [number#945L ASC NULLS FIRST, date#946 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(number#945L, 32)
                     +- *(1) Scan ExistingRDD[number#945L,date#946,dummy#947L]
A lot of time seems to be spent sorting the larger dataframe by number and date (this line: Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0). It leaves me with the following questions:
within the partitions, the sort order for both the left and right side is exactly the same, and optimal for the JOIN clause, why is Spark still sorting the partitions again?
as the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
Let me try to answer your three questions:
within the partitions, the sort order for both the left and right side is exactly the same, and optimal for the JOIN clause, why is Spark still sorting the partitions again?
The sort order in the two DataFrames is NOT the same, because the sorting column date has different data types: it is StringType in dfsub and DateType in dftest. To compare them, the join casts date to string on the dftest side (see cast(date#769 as string) in the plan), so Spark sees that the existing ordering does not match the ordering the join requires and forces the Sort.
as the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
During the query plan processing Spark does not know how many partitions are non-empty in the small DataFrame and thus it has to process all of them.
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
Spark optimizer is using the information from repartition and sortWithinPartitions but there are some caveats about how it works. To fix up your query it is also important to repartition by the same columns (both of them) that you are using in the join (not just one column). In principle this should not be necessary and there is a related jira in progress that is trying to solve that.
So here are my proposed changes to your query:
Change the type of date column to StringType in dftest (Or similarly change to DateType in dfsub):
dftest.withColumn("date", col("date").cast('string'))
In both DataFrames change
.repartition("number")
to
.repartition("number", "date")
After these changes you should get a plan like this:
*(3) SortMergeJoin [number#1410L, date#1653], [number#1661L, date#1662], Inner
:- Sort [number#1410L ASC NULLS FIRST, date#1653 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(number#1410L, date#1653, 200)
:     +- *(1) Project [number#1410L, cast(date#1408 as string) AS date#1653, random1#1540, random2#1544, random3#1549, random4#1555, random5#1562]
:        +- *(1) Filter (isnotnull(number#1410L) && isnotnull(cast(date#1408 as string)))
:           +- *(1) Scan ExistingRDD[number#1410L,date#1408,random1#1540,random2#1544,random3#1549,random4#1555,random5#1562]
+- Sort [number#1661L ASC NULLS FIRST, date#1662 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(number#1661L, date#1662, 200)
      +- *(2) Filter (isnotnull(number#1661L) && isnotnull(date#1662))
         +- *(2) Scan ExistingRDD[number#1661L,date#1662,dummy#1663L]
So there is only one Exchange and one Sort in each branch of the plan. Both come from the repartition and sortWithinPartitions calls in your transformations, and the join does not induce any additional sorting or shuffling. Also notice that my plan has no InMemoryTableScan, since I did not use cache.

question about the count method in spark dataset?

I was reading the book 'Spark: The Definitive Guide'.
It has an example like below.
val myRange = spark.range(1000).toDF("number")
val divisBy2 = myRange.where("number % 2 = 0")
divisBy2.count()
Below is the description for the three lines of code.
we started a Spark job that runs our filter transformation (a narrow
transformation), then an aggregation (a wide transformation) that performs the counts on a per
partition basis, and then a collect, which brings our result to a native object in the respective
language
I know that count is an action, not a transformation, since it returns an actual value and I cannot call 'explain' on the return value of count.
But I was wondering why count causes a wide transformation, and how I can see the execution plan of this count, since I cannot invoke 'explain' after count.
Thanks.
updated:
This is a Spark UI screenshot taken from a Databricks notebook.
It shows a shuffle write and a shuffle read operation; does that mean there is a wide transformation?
Here is the execution plan:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#7L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#10L])
      +- *(1) Project
         +- *(1) Filter ((id#0L % 2) = 0)
            +- *(1) Range (0, 1000, step=1, splits=8)
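What we can see here, reading the plan from the bottom up:
A partial count is computed inside each partition (partial_count(1)).
The partial counts are shuffled into a single partition (Exchange SinglePartition); this is the shuffle write and read shown in the UI.
The final count is computed by merging the partial counts (count(1)).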
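The three steps can be mimicked in plain Python. This is only an illustration of the two-phase aggregation, not Spark code: the helper split_range is made up, and the way it cuts the range into 8 chunks is an assumption, not necessarily how Spark assigns ids to splits.

```python
def split_range(n, splits):
    """Cut range(n) into `splits` contiguous chunks (assumed layout, for illustration)."""
    size, extra = divmod(n, splits)
    chunks, start = [], 0
    for i in range(splits):
        end = start + size + (1 if i < extra else 0)
        chunks.append(range(start, end))
        start = end
    return chunks


partitions = split_range(1000, 8)

# Stage 1: filter and partial count inside each partition, no data movement.
partial_counts = [sum(1 for x in part if x % 2 == 0) for part in partitions]

# Exchange SinglePartition: only the 8 partial counts are moved to one place.
# Stage 2: the final count merges the partial counts.
total = sum(partial_counts)
print(len(partial_counts), total)
```

Only one small number per partition crosses the shuffle, which is why the shuffle write and read in the UI are tiny even though count() needs a second stage.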

Resources