My input data is stored in Cassandra, and as the source for Spark aggregations I use a table whose primary key is (year, month, day, hour).
My Spark application:
Joins two tables
Takes the joined tables and selects data by hour
Unions the selected chunks by hour
Runs aggregations on the resulting Dataset and saves it to Cassandra
Simplified:
val ds1 = spark.read.cassandraFormat(table1, keyspace).load().as[T]
val ds2 = spark.read.cassandraFormat(table2, keyspace).load().as[T]
val dsInput = ds1.join(ds2).coalesce(150)
val dsUnion = for (x <- hours) yield dsInput.where(col("hour") === x)
val dsResult = mySparkAggregation( dsUnion.reduce(_.union(_)).coalesce(10) )
dsResult.saveToCassandra
The resulting diagram looks like this (for 3 hours/unions):
Everything works fine when I do only a couple of unions, e.g. 24 (for one day), but when I started running that Spark job for one month (720 unions) I started getting this error:
Total size of serialized results of 1126 tasks (1024.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
The other alarming thing is that the job creates ~100k tasks, and the stage that caused the error above contains 74,400 tasks; it crashes because of maxResultSize after processing 1,125 of them. What is more, it seems it has to shuffle data for each hour (union).
I tried to coalesce the number of tasks after the union, but then it says that the task is too big.
I would be very grateful for any help or suggestions. I have a feeling that I am doing something wrong.
I did some investigation and came to some conclusions.
Let's say we have two tables
cb.people
CREATE TABLE cb.people (
id text PRIMARY KEY,
name text
)
and
cb.address
CREATE TABLE cb.address (
people_id text PRIMARY KEY,
name text
)
with the following data
cassandra@cqlsh> select * from cb.people;
id | name
----+---------
3 | Mariusz
2 | Monica
1 | John
cassandra@cqlsh> select * from cb.address;
people_id | name
-----------+--------
3 | POLAND
2 | USA
1 | USA
Now I would like to get the joined result for ids 1 and 2. There are two possible approaches.
Union two selects (for id 1 and id 2) from the people table and then join with the address table:
scala> val people = spark.read.cassandraFormat("people", "cb").load()
scala> val usPeople = people.where(col("id") === "1") union people.where(col("id") === "2")
scala> val address = spark.read.cassandraFormat("address", "cb").load()
scala> val joined = usPeople.join(address, address.col("people_id") === usPeople.col("id"))
Join the two tables and then union two selects (for id 1 and id 2):
scala> val peopleAddress = address.join(usPeople, address.col("people_id") === usPeople.col("id"))
scala> val joined2 = peopleAddress.where(col("id") === "1") union peopleAddress.where(col("id") === "2")
Both return the same result:
+---------+----+---+------+
|people_id|name| id| name|
+---------+----+---+------+
| 1| USA| 1| John|
| 2| USA| 2|Monica|
+---------+----+---+------+
But looking at the explain output I can see a big difference:
scala> joined.explain
== Physical Plan ==
*SortMergeJoin [people_id#10], [id#0], Inner
:- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(people_id#10, 200)
: +- *Filter (((people_id#10 = 1) || (people_id#10 = 2)) && isnotnull(people_id#10))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [Or(EqualTo(people_id,1),EqualTo(people_id,2)), IsNotNull(people_id)], ReadSchema: struct<people_id:string,name:string>
+- *Sort [id#0 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0, 200)
+- Union
:- *Filter isnotnull(id#0)
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
+- *Filter isnotnull(id#0)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
scala> joined2.explain
== Physical Plan ==
Union
:- *SortMergeJoin [people_id#10], [id#0], Inner
: :- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(people_id#10, 200)
: : +- *Filter isnotnull(people_id#10)
: : +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [*EqualTo(people_id,1), IsNotNull(people_id)], ReadSchema: struct<people_id:string,name:string>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- Union
: :- *Filter isnotnull(id#0)
: : +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
: +- *Filter (isnotnull(id#0) && (id#0 = 1))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2), EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
+- *SortMergeJoin [people_id#10], [id#0], Inner
:- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(people_id#10, 200)
: +- *Filter isnotnull(people_id#10)
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [IsNotNull(people_id), *EqualTo(people_id,2)], ReadSchema: struct<people_id:string,name:string>
+- *Sort [id#0 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0, 200)
+- Union
:- *Filter (isnotnull(id#0) && (id#0 = 2))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1), EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
+- *Filter isnotnull(id#0)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
Now it's quite clear to me that what I did was the joined2 version: in my loop, a join was executed for each union. I thought Spark would be smart enough to reduce that to the first version...
Now the current graph looks much better.
I hope other people will not make the same mistake I did :) Unfortunately my own abstraction layer hid this simple problem, so spark-shell helped a lot in modelling it.
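For completeness, here is roughly how the simplified pipeline from the question could be restructured along those lines, selecting and unioning the hourly chunks first and joining only once. This is a sketch only; table1, keyspace, hours, T and mySparkAggregation are the placeholders from the example above.
import org.apache.spark.sql.functions.col

val ds1 = spark.read.cassandraFormat(table1, keyspace).load().as[T]
val ds2 = spark.read.cassandraFormat(table2, keyspace).load().as[T]

// Select and union the hourly chunks first...
val dsHours = hours.map(h => ds1.where(col("hour") === h)).reduce(_ union _)

// ...then join once, so the plan contains a single join instead of one join per hour.
val dsResult = mySparkAggregation(dsHours.join(ds2).coalesce(10))

dsResult.saveToCassandra  // as in the simplified example above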
Related
Can anyone explain the behaviour below in a Spark SQL join? It does not matter whether I use full join / full outer / left / left outer: the physical plan always shows that an inner join is being used.
q1 = spark.sql("select count(*) from table_t1 t1 full join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q1.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#19, item_id#20, store_id#23], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#19 ASC NULLS FIRST, item_id#20 ASC NULLS FIRST, store_id#23 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#19, item_id#20, store_id#23, 200)
+- *(3) Project [anchor_page_id#19, item_id#20, store_id#23]
+- *(3) Filter ((isnotnull(anchor_page_id#19) && isnotnull(item_id#20)) && isnotnull(store_id#23))
+- *(3) FileScan parquet table_t1[anchor_page_id#19,item_id#20,store_id#23,date_id#36] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#36), (date_id#36 = 20220324)], PushedFilters: [IsNotNull(anchor_page_id), IsNotNull(item_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q2 = spark.sql("select count(*) from table_t1 t1 full outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = '20220323' and t2.date_id = '20220324'")
q2.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#42, item_id#43, store_id#46], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (date_id#18 = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#42 ASC NULLS FIRST, item_id#43 ASC NULLS FIRST, store_id#46 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#42, item_id#43, store_id#46, 200)
+- *(3) Project [anchor_page_id#42, item_id#43, store_id#46]
+- *(3) Filter ((isnotnull(store_id#46) && isnotnull(anchor_page_id#42)) && isnotnull(item_id#43))
+- *(3) FileScan parquet table_t1[anchor_page_id#42,item_id#43,store_id#46,date_id#59] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#59), (date_id#59 = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(anchor_page_id), IsNotNull(item_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q3 = spark.sql("select count(*) from table_t1 t1 left join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q3.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#65, item_id#66, store_id#69], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#65 ASC NULLS FIRST, item_id#66 ASC NULLS FIRST, store_id#69 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#65, item_id#66, store_id#69, 200)
+- *(3) Project [anchor_page_id#65, item_id#66, store_id#69]
+- *(3) Filter ((isnotnull(item_id#66) && isnotnull(store_id#69)) && isnotnull(anchor_page_id#65))
+- *(3) FileScan parquet table_t1[anchor_page_id#65,item_id#66,store_id#69,date_id#82] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#82), (cast(date_id#82 as int) = 20220324)], PushedFilters: [IsNotNull(item_id), IsNotNull(store_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
q4 = spark.sql("select count(*) from table_t1 t1 left outer join table_t1 t2 on t1.anchor_page_id = t2.anchor_page_id and t1.item_id = t2.item_id and t1.store_id = t2.store_id where t1.date_id = 20220323 and t2.date_id = 20220324")
q4.explain()
== Physical Plan ==
*(6) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *(5) HashAggregate(keys=[], functions=[partial_count(1)])
+- *(5) Project
+- *(5) SortMergeJoin [anchor_page_id#1, item_id#2, store_id#5], [anchor_page_id#88, item_id#89, store_id#92], Inner
:- *(2) Sort [anchor_page_id#1 ASC NULLS FIRST, item_id#2 ASC NULLS FIRST, store_id#5 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(anchor_page_id#1, item_id#2, store_id#5, 200)
: +- *(1) Project [anchor_page_id#1, item_id#2, store_id#5]
: +- *(1) Filter ((isnotnull(item_id#2) && isnotnull(anchor_page_id#1)) && isnotnull(store_id#5))
: +- *(1) FileScan parquet table_t1[anchor_page_id#1,item_id#2,store_id#5,date_id#18] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#18), (cast(date_id#18 as int) = 20220323)], PushedFilters: [IsNotNull(item_id), IsNotNull(anchor_page_id), IsNotNull(store_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
+- *(4) Sort [anchor_page_id#88 ASC NULLS FIRST, item_id#89 ASC NULLS FIRST, store_id#92 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(anchor_page_id#88, item_id#89, store_id#92, 200)
+- *(3) Project [anchor_page_id#88, item_id#89, store_id#92]
+- *(3) Filter ((isnotnull(store_id#92) && isnotnull(item_id#89)) && isnotnull(anchor_page_id#88))
+- *(3) FileScan parquet table_t1[anchor_page_id#88,item_id#89,store_id#92,date_id#105] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[gs://abc..., PartitionCount: 1, PartitionFilters: [isnotnull(date_id#105), (cast(date_id#105 as int) = 20220324)], PushedFilters: [IsNotNull(store_id), IsNotNull(item_id), IsNotNull(anchor_page_id)], ReadSchema: struct<anchor_page_id:string,item_id:string,store_id:string>
A full join is the same as a full outer join.
A WHERE clause that filters the nullable side of an 'outer join' is converted by the optimizer into an 'inner join'.
A WHERE predicate on any 'outer' table effectively makes it an 'inner' table: only rows for which that predicate can be evaluated pass the filter, so the null-extended rows the outer join would produce are discarded anyway, and the optimizer plans an inner join.
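If a genuine full outer join is needed, one option (a sketch in spark-shell form, mirroring the queries above) is to filter each side before the join instead of in the WHERE clause, so the predicates no longer touch the null-extended rows and the optimizer has no reason to downgrade the join:
spark.sql("""
  select count(*)
  from (select * from table_t1 where date_id = '20220323') t1
  full outer join (select * from table_t1 where date_id = '20220324') t2
    on t1.anchor_page_id = t2.anchor_page_id
   and t1.item_id = t2.item_id
   and t1.store_id = t2.store_id
""").explain()  // the SortMergeJoin should now show FullOuter instead of Inner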
Let's say we have two partitioned datasets
val partitionedPersonDS = personDS.repartition(200, personDS("personId"))
val partitionedTransactionDS = transactionDS.repartition(200, transactionDS("personId"))
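(Here personDS and transactionDS are assumed to be plain in-memory Datasets; the case classes below are only inferred from the column names in the plans that follow, not taken from the original code.)
// Assuming a SparkSession named `spark`, as in spark-shell.
case class Person(personId: String, name: String)
case class Transaction(transactionId: String, personId: String, itemList: Seq[String])

import spark.implicits._
val personDS = Seq(Person("1", "John"), Person("2", "Monica")).toDS()
val transactionDS = Seq(
  Transaction("t1", "1", Seq("itemA")),
  Transaction("t2", "2", Seq("itemB"))
).toDS()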
And we try to join them using joinWith on the same key over which they are partitioned
val transactionPersonDS: Dataset[(Transaction, Person)] = partitionedTransactionDS
.joinWith(
partitionedPersonDS,
partitionedTransactionDS.col("personId") === partitionedPersonDS.col("personId")
)
The physical plan shows that the already-partitioned Datasets were repartitioned as part of the Sort Merge Join:
InMemoryTableScan [_1#14, _2#15]
+- InMemoryRelation [_1#14, _2#15], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(5) SortMergeJoin [_1#14.personId], [_2#15.personId], Inner
:- *(2) Sort [_1#14.personId ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(_1#14.personId, 200)
: +- *(1) Project [named_struct(transactionId, transactionId#8, personId, personId#9, itemList, itemList#10) AS _1#14]
: +- Exchange hashpartitioning(personId#9, 200)
: +- LocalTableScan [transactionId#8, personId#9, itemList#10]
+- *(4) Sort [_2#15.personId ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_2#15.personId, 200)
+- *(3) Project [named_struct(personId, personId#2, name, name#3) AS _2#15]
+- Exchange hashpartitioning(personId#2, 200)
+- LocalTableScan [personId#2, name#3]
But when we perform the join using join, the already-partitioned Datasets are NOT repartitioned and are only sorted as part of the Sort Merge Join:
val transactionPersonDS: DataFrame = partitionedTransactionDS
.join (
partitionedPersonDS,
partitionedTransactionDS("personId") === partitionedPersonDS("personId")
)
InMemoryTableScan [transactionId#8, personId#9, itemList#10, personId#2, name#3]
+- InMemoryRelation [transactionId#8, personId#9, itemList#10, personId#2, name#3], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(3) SortMergeJoin [personId#9], [personId#2], Inner
:- *(1) Sort [personId#9 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(personId#9, 200)
: +- LocalTableScan [transactionId#8, personId#9, itemList#10]
+- *(2) Sort [personId#2 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(personId#2, 200)
+- LocalTableScan [personId#2, name#3]
Why does joinWith fail to honor a pre-partitioned Dataset, unlike join?
If I have columns [a, b, c] in df1 and [a, b, c] in df2, and also a column d in both, where d = concat_ws('_', *[a, b, c]), would there be a performance difference between:
df1.join(df2, [a,b,c])
df1.join(df2, d)
?
The question cannot be answered with a simple yes or no, as the answer depends on the details of the DataFrames.
The performance of a join depends to a large extent on how much shuffling is necessary to execute it. If both sides of the join are partitioned by the same column(s), the join will be faster. You can see the effect of partitioning by looking at the execution plan of the join.
We create two DataFrames df1 and df2 with the columns a, b, c and d:
val sparkSession = ...
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
import sparkSession.implicits._
val cols = Seq("a","b","c")
def createDf = (1 to 3).map(i => (i,i,i)).toDF(cols:_*).withColumn("d", concat_ws("_", cols.map(col):_*))
val df1 = createDf
val df2 = createDf
df1 and df2 both look the same:
+---+---+---+-----+
| a| b| c| d|
+---+---+---+-----+
| 1| 1| 1|1_1_1|
| 2| 2| 2|2_2_2|
| 3| 3| 3|3_3_3|
+---+---+---+-----+
When we partition both DataFrames by column d and use this column as the join condition
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), "d").explain()
we get the execution plan
== Physical Plan ==
*(3) Project [d#13, a#7, b#8, c#9, a#25, b#26, c#27]
+- *(3) SortMergeJoin [d#13], [d#31], Inner
:- *(1) Sort [d#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [d#31 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(d#13, 4)
Partitioning both DataFrames by d but joining over a, b and c
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), cols).explain()
leads to the execution plan
== Physical Plan ==
*(3) Project [a#7, b#8, c#9, d#13, d#31]
+- *(3) SortMergeJoin [a#7, b#8, c#9], [a#25, b#26, c#27], Inner
:- *(1) Sort [a#7 ASC NULLS FIRST, b#8 ASC NULLS FIRST, c#9 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#7, b#8, c#9, 200)
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [a#25 ASC NULLS FIRST, b#26 ASC NULLS FIRST, c#27 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(a#7, b#8, c#9, 200)
which contains one more Exchange hashpartitioning than the first plan. In this case the join by a, b, c would be slower.
On the other hand, if the DataFrames are partitioned by a, b and c, the join by a, b, c would be faster than a join by d.
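For example, reusing df1, df2 and cols from above (a sketch; the point is only that the repartitioning already satisfies the distribution the join needs):
// Repartition by the join columns themselves; no additional Exchange should be
// needed for the SortMergeJoin on a, b, c.
df1.repartition(4, col("a"), col("b"), col("c"))
  .join(df2.repartition(4, col("a"), col("b"), col("c")), cols)
  .explain()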
I'd suspect the join without the concatenation to be faster, because it's likely cheaper to hash the individual strings than to concatenate and then hash them. The former also involves fewer Java objects that need to be GC'd, but this isn't the full answer.
Keep in mind that this may not be the performance-limiting step of your query, in which case either way would be just as fast. When it comes to performance tuning it's best to test rather than guess without data.
Also, as mentioned above, leaving the columns unconcatenated gives the optimizer a chance to eliminate an exchange on the join if the input data is already partitioned correctly.
I have a simple join query:
test("SparkSQLTest 0005") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest 0005").getOrCreate()
spark.range(100, 100000).createOrReplaceTempView("t1")
spark.range(2000, 10000).createOrReplaceTempView("t2")
val df = spark.sql("select count(1) from t1 join t2 on t1.id = t2.id")
df.explain(true)
}
The output is as follows:
I have five questions, marked Q0~Q4 in the output. Could someone help explain them? Thanks!
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)] //Q0: Why does the first line have no +- or :-?
+- 'Join Inner, ('t1.id = 't2.id) //Q1: What does +- mean?
:- 'UnresolvedRelation `t1` //Q2: What does :- mean?
+- 'UnresolvedRelation `t2`
== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#9L]
+- Join Inner, (id#0L = id#2L)
:- SubqueryAlias t1
: +- Range (100, 100000, step=1, splits=Some(1)) //Q3 What does : +- mean?
+- SubqueryAlias t2
+- Range (2000, 10000, step=1, splits=Some(1))
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#9L]
+- Project
+- Join Inner, (id#0L = id#2L)
:- Range (100, 100000, step=1, splits=Some(1)) //Q4 These two Ranges are both Join's children, why one is :- and the other is +-
+- Range (2000, 10000, step=1, splits=Some(1)) //Q4
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#9L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#11L])
+- *(2) Project
+- *(2) BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight
:- *(2) Range (100, 100000, step=1, splits=1)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *(1) Range (2000, 10000, step=1, splits=1)
They are bullet points simply representing ordered, nested operations. A tree like this:
Header
  Child 1
    Grandchild 1
  Child 2
    Grandchild 2
    Grandchild 3
  Child 3
would be written as:
Header
:- Child 1
: +- Grandchild 1
:- Child 2
: :- Grandchild 2
: +- Grandchild 3
+- Child 3
+- A direct child, usually the last
:- A sibling of a direct child, but not the last
: +- The last grandchild, whose parent has a sibling
: :- A grandchild with a sibling, whose parent is non-final and also has a sibling
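If it helps, the same trees can also be printed programmatically from the Dataset (a small sketch using the df from the test above; numberedTreeString simply prefixes each node with its index):
// Print the optimized logical plan with node indices, and the physical plan tree.
println(df.queryExecution.optimizedPlan.numberedTreeString)
println(df.queryExecution.executedPlan.treeString)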
I'm doing a UNION of two temp tables and trying to order by a column, but Spark complains that the column I am ordering by cannot be resolved. Is this a bug or am I missing something?
lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType}
val oldOrders = Seq(
Seq("old_order_id1", "old_order_name1", "true"),
Seq("old_order_id2", "old_order_name2", "true")
)
val newOrders = Seq(
Seq("new_order_id1", "new_order_name1", "false"),
Seq("new_order_id2", "new_order_name2", "false")
)
val schema = new StructType()
.add("id", StringType)
.add("name", StringType)
.add("is_old", StringType)
val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)
oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")
// Ordering by a column not in the SELECT works if I'm not doing a UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|ORDER BY oo.is_old
""".stripMargin).show()
// Ordering by a column not in the SELECT doesn't work when I'm doing a UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|UNION
|SELECT no.id, no.name FROM new_orders no
|ORDER BY oo.is_old
""".stripMargin).show()
The output of the above code is:
+-------------+---------------+
| id| name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+
cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
So ordering by a column that's not in the SELECT clause works if I'm not doing a UNION and it fails if I'm doing a UNION of two tables.
The syntax of Spark SQL is very similar to SQL, but they work very differently: under the hood, Spark is all about RDDs/DataFrames.
After the UNION, a new DataFrame is generated, and we cannot refer to fields of the original tables/DataFrames that we did not select.
How to fix it:
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
| UNION
| SELECT no.id, no.name, no.is_old FROM new_orders no
| ORDER BY oo.is_old
| ) t
""".stripMargin).show()
Thanks.