Question about the count method in Spark Dataset

I was reading the book 'Spark: The Definitive Guide'.
It has an example like the one below:
val myRange = spark.range(1000).toDF("number")
val divisBy2 = myRange.where("number % 2 = 0")
divisBy2.count()
Below is the description for the three lines of code.
we started a Spark job that runs our filter transformation (a narrow transformation), then an aggregation (a wide transformation) that performs the counts on a per partition basis, and then a collect, which brings our result to a native object in the respective language
I know that count is an action, not a transformation, since it returns an actual value and I cannot call 'explain' on the return value of count.
But I was wondering why count causes a wide transformation, and how I can know the execution plan of this count in this case, since I cannot invoke 'explain' after count.
Thanks.
Updated:
The image below is a Spark UI screenshot I took from a Databricks notebook.
It shows a shuffle write and a shuffle read operation; does that mean there is a wide transformation?

Here is the execution plan:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#7L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#10L])
      +- *(1) Project
         +- *(1) Filter ((id#0L % 2) = 0)
            +- *(1) Range (0, 1000, step=1, splits=8)
What we can see here:
Counting is done inside each partition
All partitions are then merged into a single one
The final count is computed
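If you want to see this plan yourself before running the action, one option is to express the same count as an aggregation that returns a DataFrame, so that explain() is still available. A minimal PySpark sketch (the equivalent Scala calls look identical), rebuilding the pipeline from the question with Python-style names:
# Rebuild the pipeline from the question (PySpark).
my_range = spark.range(1000).toDF("number")
divis_by_2 = my_range.where("number % 2 = 0")

# groupBy() with no grouping keys followed by count() performs the same
# aggregation as the count() action, but it returns a DataFrame instead of a
# number, so explain() can be called on it.
divis_by_2.groupBy().count().explain()
After the count() action has actually run, the executed plan can also be inspected in the SQL tab of the Spark UI; the Exchange step is what shows up as the shuffle write and shuffle read in your screenshot.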

Related

How to get the execution time of EACH operator in Spark SQL

For example, I have a table like:
name | height
mike | 80
dan  | 11
And I have a Spark SQL query like "select distinct name from table where height = 80".
In that case, I will get a Physical Plan like this:
== Physical Plan ==
CollectLimit 21
+- HashAggregate(keys=[name#0], functions=[], output=[name#0])
   +- Exchange hashpartitioning(name#0, 200)
      +- HashAggregate(keys=[name#0], functions=[], output=[name#0])
         +- Project [name#0]
            +- Filter (isnotnull(height#1L) && (height#1L = 80))
               +- Scan ExistingRDD[name#0,height#1L]
Here is the problem: I'd like to get the execution time of EACH operator (such as Project, Filter, ...), so I checked the Spark UI, and it seems that Spark records the execution time for each operator (see the Spark UI screenshot).
I wish to get those results in my code.
So how can I get what I need?

Apache Spark: broadcast join behaviour: filtering of joined tables and temp tables

I need to join 2 tables in Spark.
But instead of joining the 2 tables completely, I first filter out part of the second table:
spark.sql("select * from a join b on a.key=b.key where b.value='xxx' ")
I want to use broadcast join in this case.
Spark has a parameter which defines the maximum table size for a broadcast join, spark.sql.autoBroadcastJoinThreshold:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
http://spark.apache.org/docs/2.4.0/sql-performance-tuning.html
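For reference, the current threshold can be checked or overridden at runtime (a minimal PySpark sketch, assuming spark is the active SparkSession):
# Inspect the current broadcast threshold (the default is 10 MB).
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Raise the threshold, or set it to -1 to disable automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)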
I have the following questions about this setup:
Which table size will Spark compare with autoBroadcastJoinThreshold's value: the FULL size, or the size AFTER applying the where clause?
I am assuming that Spark will apply the where clause BEFORE broadcasting, correct?
The doc says I need to run Hive's ANALYZE TABLE command beforehand. How will it work when I am using a temp view as a table? As far as I understand, I cannot run the ANALYZE TABLE command against a Spark temp view created via dataFrame.createOrReplaceTempView("b"). Can I broadcast the temp view contents?
Your understanding for question 2 is correct.
You cannot ANALYZE a temp table in Spark.
In case you want to take the lead and specify which dataframe to broadcast yourself, instead of letting Spark decide, you can use the snippet below:
from pyspark.sql import functions as F
df = df1.join(F.broadcast(df2), df1.some_col == df2.some_col, "left")
I went ahead and did some small experiments to answer your first question.
Question 1:
I created a dataframe a with 3 rows: [key, df_a_column]
I created a dataframe b with 10 rows: [key, value]
Then I ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildLeft, false
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#168]
:  +- LocalTableScan [key#122, df_a_column#123]
+- *(1) LocalTableScan [key#111, value#112]
As expected, the smaller dataframe a with 3 rows is broadcasted.
Then I ran: spark.sql("SELECT * FROM a JOIN b ON a.key = b.key where b.value=\"bat\"").explain()
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#152]
   +- LocalTableScan [key#111, value#112]
Here you can notice that dataframe b is broadcasted, meaning Spark evaluates the size AFTER applying the where clause when choosing which side to broadcast.
Question 2:
Yes, you are right. It's evident from the previous output that it applies the where clause first.
Question 3:
No, you cannot analyze a temp view, but you can broadcast it by hinting Spark about it, even in SQL.
Example : spark.sql("SELECT /*+ BROADCAST(b) */ * FROM a JOIN b ON a.key = b.key")
And if you look at the explain output now:
== Physical Plan ==
*(1) BroadcastHashJoin [key#122], [key#111], Inner, BuildRight, false
:- *(1) LocalTableScan [key#122, df_a_column#123]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#184]
   +- LocalTableScan [key#111, value#112]
Now you can see that dataframe b is broadcasted even though it has 10 rows.
In question 1, without the hint, a was broadcasted.
Note: the broadcast hint in Spark SQL is available from version 2.2 onwards.
Tips to understand the physical plan:
Identify each dataframe by the column list in its LocalTableScan [list of columns].
The dataframe that appears under the BroadcastExchange subtree is the one being broadcasted.

Spark is sorting already sorted partitions resulting in performance loss

For a cached dataframe, partitioned and sorted within partitions, I get good performance when querying the key with a where clause but bad performance when performing a join with a small table on the same key.
See the example dataset dftest below, with 10K x 44K ≈ 438M rows.
sqlContext.sql(f'set spark.sql.shuffle.partitions={32}')
sqlContext.clearCache()
sc.setCheckpointDir('/checkpoint/temp')
import datetime
from pyspark.sql.functions import *
from pyspark.sql import Row
start_date = datetime.date(1900, 1, 1)
end_date = datetime.date(2020, 1, 1)
dates = [ start_date + datetime.timedelta(n) for n in range(int ((end_date - start_date).days))]
dfdates=spark.createDataFrame(list(map(lambda x: Row(date=x), dates))) # some dates
dfrange=spark.createDataFrame(list(map(lambda x: Row(number=x), range(10000)))) # some number range
dfjoin = dfrange.crossJoin(dfdates)
dftest = (dfjoin
    .withColumn("random1", round(rand()*(10-5)+5, 0))
    .withColumn("random2", round(rand()*(10-5)+5, 0))
    .withColumn("random3", round(rand()*(10-5)+5, 0))
    .withColumn("random4", round(rand()*(10-5)+5, 0))
    .withColumn("random5", round(rand()*(10-5)+5, 0))
    .checkpoint())
dftest = dftest.repartition("number").sortWithinPartitions("number", "date").cache()
dftest.count() # 438,290,000 rows
The following query now takes roughly a second (on a small cluster with 2 workers):
dftest.where("number = 1000 and date = \"2001-04-04\"").count()
However, when I write a similar condition as a join, it takes 2 minutes:
dfsub = spark.createDataFrame([(10,"1900-01-02",1),
(1000,"2001-04-04",2),
(4000,"2002-05-05",3),
(5000,"1950-06-06",4),
(9875,"1980-07-07",5)],
["number","date", "dummy"]).repartition("number").sortWithinPartitions("number", "date").cache()
df_result = dftest.join(dfsub, ( dftest.number == dfsub.number ) & ( dftest.date == dfsub.date ), 'inner').cache()
df_result.count() # takes 2 minutes (result = 5)
I would have expected this to be roughly equally fast. Especially since I would hope that the larger dataframe is already clustered and cached. Looking at the plan:
== Physical Plan ==
InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L]
+- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797, number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(3) SortMergeJoin [number#771L, cast(date#769 as string)], [number#945L, date#946], Inner
      :- *(1) Sort [number#771L ASC NULLS FIRST, cast(date#769 as string) ASC NULLS FIRST], false, 0
      :  +- *(1) Filter (isnotnull(number#771L) && isnotnull(date#769))
      :     +- InMemoryTableScan [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], [isnotnull(number#771L), isnotnull(date#769)]
      :        +- InMemoryRelation [number#771L, date#769, random1#775, random2#779, random3#784, random4#790, random5#797], StorageLevel(disk, memory, deserialized, 1 replicas)
      :           +- Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0
      :              +- Exchange hashpartitioning(number#771L, 32)
      :                 +- *(1) Scan ExistingRDD[number#771L,date#769,random1#775,random2#779,random3#784,random4#790,random5#797]
      +- *(2) Filter (isnotnull(number#945L) && isnotnull(date#946))
         +- InMemoryTableScan [number#945L, date#946, dummy#947L], [isnotnull(number#945L), isnotnull(date#946)]
            +- InMemoryRelation [number#945L, date#946, dummy#947L], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- Sort [number#945L ASC NULLS FIRST, date#946 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(number#945L, 32)
                     +- *(1) Scan ExistingRDD[number#945L,date#946,dummy#947L]
A lot of time seems to be spent sorting the larger dataframe by number and date (this line: Sort [number#771L ASC NULLS FIRST, date#769 ASC NULLS FIRST], false, 0). It leaves me with the following questions:
Within the partitions, the sort order for both the left and the right side is exactly the same, and optimal for the JOIN clause, so why is Spark still sorting the partitions again?
As the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
Let me try to answer your three questions:
Within the partitions, the sort order for both the left and the right side is exactly the same, and optimal for the JOIN clause, so why is Spark still sorting the partitions again?
The sort order in both DataFrames is NOT the same, because of the different datatypes of your sorting column date: in dfsub it is StringType and in dftest it is DateType. Therefore, during the join, Spark sees that the ordering in the two branches is different and thus forces the Sort.
as the 5 join records match (up to) 5 partitions, why are all partitions evaluated?
During the query plan processing Spark does not know how many partitions are non-empty in the small DataFrame and thus it has to process all of them.
It seems Catalyst is not using the info of repartition and sortWithinPartitions of the cached dataframe. Does it make sense to use sortWithinPartitions in cases like these?
The Spark optimizer does use the information from repartition and sortWithinPartitions, but there are some caveats about how it works. To fix your query, it is also important to repartition by the same columns (both of them) that you are using in the join, not just one of them. In principle this should not be necessary, and there is a related JIRA in progress that is trying to solve that.
So here are my proposed changes to your query:
Change the type of the date column to StringType in dftest (or, similarly, change it to DateType in dfsub):
dftest = dftest.withColumn("date", col("date").cast('string'))
In both DataFrames change
.repartition("number")
to
.repartition("number", "date")
After these changes you should get a plan like this:
*(3) SortMergeJoin [number#1410L, date#1653], [number#1661L, date#1662], Inner
:- Sort [number#1410L ASC NULLS FIRST, date#1653 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(number#1410L, date#1653, 200)
:     +- *(1) Project [number#1410L, cast(date#1408 as string) AS date#1653, random1#1540, random2#1544, random3#1549, random4#1555, random5#1562]
:        +- *(1) Filter (isnotnull(number#1410L) && isnotnull(cast(date#1408 as string)))
:           +- *(1) Scan ExistingRDD[number#1410L,date#1408,random1#1540,random2#1544,random3#1549,random4#1555,random5#1562]
+- Sort [number#1661L ASC NULLS FIRST, date#1662 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(number#1661L, date#1662, 200)
      +- *(2) Filter (isnotnull(number#1661L) && isnotnull(date#1662))
         +- *(2) Scan ExistingRDD[number#1661L,date#1662,dummy#1663L]
So there is only one Exchange and one Sort in each branch of the plan, both of them coming from the repartition and sortWithinPartitions that you call in your transformations, and the join does not induce any more sorting or shuffling. Also notice that in my plan there is no InMemoryTableScan, since I did not use cache.

How does count distinct work in Apache Spark SQL

I am trying to count the distinct number of entities over different date ranges.
I need to understand how Spark performs this operation.
val distinct_daily_cust_12month = sqlContext.sql(s"select distinct day_id,txn_type,customer_id from ${db_name}.fact_customer where day_id>='${start_last_12month}' and day_id<='${start_date}' and txn_type not in (6,99)")
val category_mapping = sqlContext.sql(s"select * from datalake.category_mapping");
val daily_cust_12month_ds =distinct_daily_cust_12month.join(broadcast(category_mapping),distinct_daily_cust_12month("txn_type")===category_mapping("id")).select("category","sub_category","customer_id","day_id")
daily_cust_12month_ds.createOrReplaceTempView("daily_cust_12month_ds")
val total_cust_metrics = sqlContext.sql(s"""select 'total' as category,
count(distinct(case when day_id='${start_date}' then customer_id end)) as yest,
count(distinct(case when day_id>='${start_week}' and day_id<='${end_week}' then customer_id end)) as week,
count(distinct(case when day_id>='${start_month}' and day_id<='${start_date}' then customer_id end)) as mtd,
count(distinct(case when day_id>='${start_last_month}' and day_id<='${end_last_month}' then customer_id end)) as ltd,
count(distinct(case when day_id>='${start_last_6month}' and day_id<='${start_date}' then customer_id end)) as lsm,
count(distinct(case when day_id>='${start_last_12month}' and day_id<='${start_date}' then customer_id end)) as ltm
from daily_cust_12month_ds
""")
There are no errors, but this is taking a lot of time. I want to know if there is a better way to do this in Spark.
Count distinct works by hash-partitioning the data, then counting distinct elements per partition, and finally summing the counts. In general it is a heavy operation due to the full shuffle, and there is no silver bullet for that in Spark or, most likely, in any fully distributed system; operations with distinct are inherently difficult to solve in a distributed system.
In some cases there are faster ways to do it:
If approximate values are acceptable, approx_count_distinct will usually be much faster, as it is based on HyperLogLog and the amount of data to be shuffled is much less than with the exact implementation (see the sketch below).
If you can design your pipeline in a way that the data source is already partitioned so that there can't be any duplicates between partitions, the slow step of hash-partitioning the data frame is not needed.
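As an illustration of the approximate option, here is a minimal PySpark sketch (the same function exists in Scala's org.apache.spark.sql.functions); the grouping and the rsd value are only illustrative, not a drop-in replacement for the date-range buckets in your query:
from pyspark.sql import functions as F

# Approximate distinct customers per category, based on HyperLogLog.
# rsd is the maximum allowed relative standard deviation (default 0.05);
# lower values are more accurate but shuffle more data.
daily_cust_12month_ds.groupBy("category").agg(
    F.approx_count_distinct("customer_id", rsd=0.01).alias("approx_customers")
).show()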
P.S. To understand how count distinct works, you can always use explain:
df.select(countDistinct("foo")).explain()
Example output:
== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(distinct foo#3)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(distinct foo#3)])
      +- *(2) HashAggregate(keys=[foo#3], functions=[])
         +- Exchange hashpartitioning(foo#3, 200)
            +- *(1) HashAggregate(keys=[foo#3], functions=[])
               +- LocalTableScan [foo#3]

Spark.table() with limit seems to read the whole table

I am trying to read the first 20000 rows of a large table (10 billion+ rows) from Spark, so I use the following lines of code:
df = spark.table("large_table").limit(20000).repartition(1)
df.explain()
And the explain plan looks like this.
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(2) GlobalLimit 20000
   +- Exchange SinglePartition
      +- *(1) LocalLimit 20000
         +- *(1) FileScan parquet large_table[...]
But when I try to write this df into a new table, it seems to kick off an insane number of tasks, reading the whole table first and only then beginning to write to the final table! Why does Spark not read only the first few files and stop once it reaches the row limit?
