If I have columns [a,b,c] in df1 and [a,b,c] in df2, and also a column d in both, where d = concat_ws('_', *[a,b,c]), would there be a performance difference between:
df1.join(df2, [a,b,c])
df1.join(df2, d)
?
The question cannot be answered with a simple yes or no, as the answer depends on the details of the DataFrames.
The performance of a join depends to a large extent on how much shuffling is necessary to execute it. If both sides of the join are partitioned by the same column(s), the join will be faster. You can see the effect of partitioning by looking at the execution plan of the join.
We create two DataFrames df1 and df2 with the columns a, b, c and d:
val sparkSession = ...
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
import sparkSession.implicits._
import org.apache.spark.sql.functions.{col, concat_ws}
val cols = Seq("a","b","c")
def createDf = (1 to 3).map(i => (i,i,i)).toDF(cols:_*).withColumn("d", concat_ws("_", cols.map(col):_*))
val df1 = createDf
val df2 = createDf
df1 and df2 both look the same:
+---+---+---+-----+
| a| b| c| d|
+---+---+---+-----+
| 1| 1| 1|1_1_1|
| 2| 2| 2|2_2_2|
| 3| 3| 3|3_3_3|
+---+---+---+-----+
When we partition both DataFrames by column d and use this column as join condition
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), "d").explain()
we get the execution plan
== Physical Plan ==
*(3) Project [d#13, a#7, b#8, c#9, a#25, b#26, c#27]
+- *(3) SortMergeJoin [d#13], [d#31], Inner
:- *(1) Sort [d#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [d#31 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(d#13, 4)
Partitioning both DataFrames by d but joining over a, b and c
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), cols).explain()
leads to the execution plan
== Physical Plan ==
*(3) Project [a#7, b#8, c#9, d#13, d#31]
+- *(3) SortMergeJoin [a#7, b#8, c#9], [a#25, b#26, c#27], Inner
:- *(1) Sort [a#7 ASC NULLS FIRST, b#8 ASC NULLS FIRST, c#9 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#7, b#8, c#9, 200)
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [a#25 ASC NULLS FIRST, b#26 ASC NULLS FIRST, c#27 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(a#7, b#8, c#9, 200)
which contains one more Exchange hashpartitioning step than the first plan. In this case the join on a, b and c would be slower.
On the other hand, if the DataFrames are partitioned by a, b and c, the join on a, b and c would be faster than a join on d.
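For illustration, a minimal sketch of that opposite case, reusing cols and col from above (the plans are not reproduced here, but by the same reasoning the join on d would now need the extra Exchange):
df1.repartition(4, cols.map(col): _*).join(df2.repartition(4, cols.map(col): _*), cols).explain()
df1.repartition(4, cols.map(col): _*).join(df2.repartition(4, cols.map(col): _*), "d").explain()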
I'd suspect the join without the concatenation to be faster because it's likely cheaper to just hash the individual strings than to concatenate and then hash. The former involves fewer Java objects that need to be GC'd, but this isn't the full answer.
Keep in mind that this may not be the performance-limiting step of your query, in which case either way would be just as fast. When it comes to performance tuning it's best to test rather than guess without data.
Also, as mentioned above, leaving the columns unconcatenated gives the optimizer a chance to eliminate an exchange on the join if the input data is already partitioned correctly.
I have a pyspark dataframe like this:
| name|segment_list|rung_list |
+--------------------+------------+-----------+
| Campaign 1 | [1.0, 5.0]| [L2, L3]|
| Campaign 1 | [1.1]| [L1]|
| Campaign 2 | [1.2]| [L2]|
| Campaign 2 | [1.1]| [L4, L5]|
+--------------------+------------+-----------+
I have another pyspark dataframe that has segment and rung for every customer:
+-----------+---------------+---------+
|customer_id| segment |rung |
+-----------+---------------+---------+
| 124001823| 1.0| L2|
| 166001989| 5.0| L2|
| 768002266| 1.1| L1|
+-----------+---------------+---------+
What I want is a final output that figures out the customers based on the segment and rung list. The final output should be something like the following:
| name|customer_id |
+--------------------+------------+
| Campaign 1 | 124001823 |
| Campaign 1 | 166001989 |
| Campaign 1 | 768002266 |
+--------------------+------------+
I tried using a udf but that approach didn't quite work. I would like to avoid using a for loop on a collect operation or going row by row. So I am primarily looking for a groupby operation on the name column.
So I want a better way to do the following:
for row in x.collect():
    y = eligible.filter(eligible.segment.isin(row['segment_list'])).filter(eligible.rung.isin(row['rung_list']))
You could try to use array_contains for the join conditions.
Here's an example:
from pyspark.sql import functions as func

data1_sdf. \
    join(data2_sdf,
         func.expr('array_contains(segment_list, segment)') & func.expr('array_contains(rung_list, rung)'),
         'left'
         ). \
    select('name', 'customer_id'). \
    dropDuplicates(). \
    show(truncate=False)
# +----------+-----------+
# |name |customer_id|
# +----------+-----------+
# |Campaign 1|166001989 |
# |Campaign 1|124001823 |
# |Campaign 1|768002266 |
# |Campaign 2|null |
# +----------+-----------+
Pasting the query plan Spark produced:
== Parsed Logical Plan ==
Deduplicate [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Analyzed Logical Plan ==
name: string, customer_id: string
Deduplicate [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Optimized Logical Plan ==
Aggregate [name#123, customer_id#129], [name#123, customer_id#129]
+- Project [name#123, customer_id#129]
+- Join LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- LogicalRDD [name#123, segment_list#124, rung_list#125], false
+- Filter (isnotnull(segment#130) AND isnotnull(rung#131))
+- LogicalRDD [customer_id#129, segment#130, rung#131], false
== Physical Plan ==
*(4) HashAggregate(keys=[name#123, customer_id#129], functions=[], output=[name#123, customer_id#129])
+- Exchange hashpartitioning(name#123, customer_id#129, 200), ENSURE_REQUIREMENTS, [id=#267]
+- *(3) HashAggregate(keys=[name#123, customer_id#129], functions=[], output=[name#123, customer_id#129])
+- *(3) Project [name#123, customer_id#129]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, (array_contains(segment_list#124, segment#130) AND array_contains(rung_list#125, rung#131))
:- *(1) Scan ExistingRDD[name#123,segment_list#124,rung_list#125]
+- BroadcastExchange IdentityBroadcastMode, [id=#261]
+- *(2) Filter (isnotnull(segment#130) AND isnotnull(rung#131))
+- *(2) Scan ExistingRDD[customer_id#129,segment#130,rung#131]
It seems this is not well optimized; I'm thinking there may be other, better-optimized methods.
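One alternative worth trying (a sketch, assuming the same data1_sdf and data2_sdf names; note it uses an inner join, so campaigns without matching customers are dropped rather than kept with null): explode the arrays first so the join becomes a plain equi-join, which lets Spark pick a hash or sort-merge join instead of a BroadcastNestedLoopJoin.
from pyspark.sql import functions as func

# explode both array columns into one row per (segment, rung) combination
exploded_sdf = data1_sdf. \
    withColumn('segment', func.explode('segment_list')). \
    withColumn('rung', func.explode('rung_list'))

# equi-join on the exploded columns
exploded_sdf. \
    join(data2_sdf, ['segment', 'rung'], 'inner'). \
    select('name', 'customer_id'). \
    dropDuplicates(). \
    show(truncate=False)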
Say we have the following dataframe (which is borrowed from 'PySpark by Examples' website):
simpleData = [("James","Sales","NY",90000,34,10000), \
    ("Michael","Sales","NY",86000,56,20000), \
    ("Robert","Sales","CA",81000,30,23000), \
    ("Maria","Finance","CA",90000,24,23000), \
    ("Raman","Finance","CA",99000,40,24000), \
    ("Scott","Finance","NY",83000,36,19000), \
    ("Jen","Finance","NY",79000,53,15000), \
    ("Jeff","Marketing","CA",80000,25,18000), \
    ("Kumar","Marketing","NY",91000,50,21000) \
    ]
columns = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema=columns)
Then, if we run the two following sort (orderBy) commands:
df.sort("department","state").show(truncate=False)
or
df.sort(col("department"),col("state")).show(truncate=False)
We get the same result:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
I know the first one takes the DataFrame column name as a string and the second one takes columns of Column type. But is there a difference between the two for things like processing or future use? Is one of them better than the other, or the standard PySpark form? Or are they just aliases?
PS: In addition to the above, one of the reasons I'm asking this question is that someone told me there is a 'standard' business form for using Spark. For example, 'alias' is more popular than 'withColumnRenamed' in the business. Of course, this doesn't sound right to me.
If you look at the explain plan you'll see that both queries generate the same physical plan, so processing-wise they are identical.
from pyspark.sql.functions import col

df_sort1 = df.sort("department", "state")
df_sort2 = df.sort(col("department"), col("state"))
df_sort1.explain()
df_sort2.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#8]
+- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#18]
+- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]
Businesses might have coding guidelines in which they specify what to use. If they exist, follow them. If not and you're working on existing code, it's usually best to follow what is there already. Otherwise it's mainly preference; I'm not aware of a 'standard business form' of PySpark.
In the case of alias vs withColumnRenamed there is an argument to be made in favor of alias if you're renaming multiple columns: selecting with alias will generate a single projection in the parsed logical plan, whereas multiple withColumnRenamed calls will generate multiple projections.
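For instance, a minimal sketch using the df from the question (column names taken from the sample data; note that the select variant also drops any columns that are not listed):
from pyspark.sql.functions import col

# one projection for all renames
renamed_select = df.select(
    col("employee_name").alias("name"),
    col("department").alias("dept"),
    col("state").alias("st"),
)

# one projection per withColumnRenamed call
renamed_chained = df \
    .withColumnRenamed("employee_name", "name") \
    .withColumnRenamed("department", "dept") \
    .withColumnRenamed("state", "st")

# compare the parsed logical plans
renamed_select.explain(extended=True)
renamed_chained.explain(extended=True)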
To be certain that the two versions do the same thing, we can have a look at the source code of dataframe.py. Here is the signature of the sort method:
def sort(
self, *cols: Union[str, Column, List[Union[str, Column]]], **kwargs: Any
) -> "DataFrame":
When you follow the various method calls, you end up on this line:
jcols = [_to_java_column(cast("ColumnOrName", c)) for c in cols]
This line converts all columns, whether they were passed as strings or as Column objects (cf. the method signature), to Java columns. From then on only the Java columns are used, regardless of how they were passed to the method, so the two versions of the sort method do exactly the same thing with exactly the same code.
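If you want a programmatic check rather than eyeballing the plans, recent PySpark versions (3.1+) also expose DataFrame.sameSemantics, which compares the canonicalized query plans (reusing df_sort1 and df_sort2 from the answer above):
# True when Spark considers the two query plans semantically identical
print(df_sort1.sameSemantics(df_sort2))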
I want to know if a Window used x times will perform x times shuffle of the data.
Example :
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('col_a').orderBy('date')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.withColumn('new_col_2', F.row_number().over(w))
Will this code perform 1 shuffle of the data because there's 1 Window?
Or 2 shuffles of the data because the Window is used twice?
If the answer is 2 shuffles, would repartitioning by col_a reduce the amount of shuffling to 1, as in the code example below?
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('col_a').orderBy('date')
df = df.repartition('col_a')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.withColumn('new_col_2', F.row_number().over(w))
If we display how Spark will compute this DataFrame with explain, we get the following execution plan:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('col_a').orderBy('date')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.withColumn('new_col_2', F.row_number().over(w))
df.explain()
# == Physical Plan ==
# Window [lag(col_b#2, -1, null) windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS new_col_1#19, row_number() windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_col_2#25], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- *(2) Sort [col_a#1L ASC NULLS FIRST, date#0 ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(col_a#1L, 200), ENSURE_REQUIREMENTS, [id=#23]
# +- *(1) Scan ExistingRDD[date#0,col_a#1L,col_b#2]
As you can see, there is only one Exchange (i.e. one shuffle) step. So there is only one shuffle if you reuse your window to compute several columns, as long as there is no shuffle between those computations. Moreover, there is only one Window step, meaning that the two columns using the window are actually computed during the same step and not one after the other.
Other cases
If we repartition by col_a before computing the window columns, the execution plan is the same as without the repartition:
w = Window.partitionBy('col_a').orderBy('date')
df = df.repartition('col_a')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.withColumn('new_col_2', F.row_number().over(w))
df.explain()
# == Physical Plan ==
# Window [lag(col_b#2, -1, null) windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS new_col_1#19, row_number() windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_col_2#25], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- *(2) Sort [col_a#1L ASC NULLS FIRST, date#0 ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(col_a#1L, 200), REPARTITION, [id=#26]
# +- *(1) Scan ExistingRDD[date#0,col_a#1L,col_b#2]
If we repartition by col_a between the two column computations that use window, the two columns are no longer computed in the same step:
w = Window.partitionBy('col_a').orderBy('date')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.repartition('col_a')
df = df.withColumn('new_col_2', F.row_number().over(w))
df.explain()
# == Physical Plan ==
# Window [row_number() windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_col_2#25], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- Window [lag(col_b#2, -1, null) windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS new_col_1#19], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- *(2) Sort [col_a#1L ASC NULLS FIRST, date#0 ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(col_a#1L, 200), ENSURE_REQUIREMENTS, [id=#33]
# +- *(1) Scan ExistingRDD[date#0,col_a#1L,col_b#2]
If we repartition by col_b between the two window column computations, we get 3 shuffles. So using the same window triggers only one shuffle as long as there is no repartition/shuffle on other columns between the window column computations:
w = Window.partitionBy('col_a').orderBy('date')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.repartition('col_b')
df = df.withColumn('new_col_2', F.row_number().over(w))
df.explain()
# == Physical Plan ==
# Window [row_number() windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_col_2#25], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- *(3) Sort [col_a#1L ASC NULLS FIRST, date#0 ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(col_a#1L, 200), ENSURE_REQUIREMENTS, [id=#42]
# +- Exchange hashpartitioning(col_b#2, 200), REPARTITION, [id=#41]
# +- Window [lag(col_b#2, -1, null) windowspecdefinition(col_a#1L, date#0 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS new_col_1#19], [col_a#1L], [date#0 ASC NULLS FIRST]
# +- *(2) Sort [col_a#1L ASC NULLS FIRST, date#0 ASC NULLS FIRST], false, 0
# +- Exchange hashpartitioning(col_a#1L, 200), ENSURE_REQUIREMENTS, [id=#36]
# +- *(1) Scan ExistingRDD[date#0,col_a#1L,col_b#2]
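If the repartition by col_b is really needed, a sketch of the workaround implied above: compute both window columns first and repartition afterwards, so the window part again triggers only one shuffle (plan not reproduced here):
w = Window.partitionBy('col_a').orderBy('date')
df = df.withColumn('new_col_1', F.lag('col_b').over(w))
df = df.withColumn('new_col_2', F.row_number().over(w))
df = df.repartition('col_b')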
My input data is stored in Cassandra and I use a table whose primary key is (year, month, day, hour) as a source for Spark aggregations.
My Spark application does the following:
Join two tables
Take the joined tables and select data by hour
Union the selected chunks by hour
Do aggregations on the resulting Dataset and save to Cassandra
Simplifying
val ds1 = spark.read.cassandraFormat(table1, keyspace).load().as[T]
val ds2 = spark.read.cassandraFormat(table2, keyspace).load().as[T]
val dsInput = ds1.join(ds2).coalesce(150)
val dsUnion = for (x <- hours) yield dsInput.where(col("hour") === x)
val dsResult = mySparkAggregation( dsUnion.reduce(_.union(_)).coalesce(10) )
dsResult.saveToCassandra
The result diagram looks like this (for 3 hours/unions)
Everything works fine when I do only a couple of unions, e.g. 24 (for one day), but when I started running that Spark job for 1 month (720 unions) I started getting this error:
Total size of serialized results of 1126 tasks (1024.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
The other alarming thing is that the job creates ~100k tasks, and one of the stages (the one which caused the error above) contains 74400 tasks; when it has processed 1125 of them it crashes because of maxResultSize. What is more, it seems it has to shuffle the data for each hour (union).
I tried to coalesce the number of tasks after the union - then it says that the task is too big.
I would be very grateful for any help or suggestions. I have a feeling that I am doing something wrong.
I did some investigation and came to some conclusions.
Let's say we have two tables
cb.people
CREATE TABLE cb.people (
id text PRIMARY KEY,
name text
)
and
cb.address
CREATE TABLE cb.address (
people_id text PRIMARY KEY,
name text
)
with the following data
cassandra#cqlsh> select * from cb.people;
id | name
----+---------
3 | Mariusz
2 | Monica
1 | John
cassandra#cqlsh> select * from cb.address;
people_id | name
-----------+--------
3 | POLAND
2 | USA
1 | USA
Now I would like to get the joined result for ids 1 and 2. There are two possible solutions.
Union two selects for ids 1 and 2 from the people table and then join with the address table:
scala> val people = spark.read.cassandraFormat("people", "cb").load()
scala> val usPeople = people.where(col("id") === "1") union people.where(col("id") === "2")
scala> val address = spark.read.cassandraFormat("address", "cb").load()
scala> val joined = usPeople.join(address, address.col("people_id") === usPeople.col("id"))
Join the two tables and then union two selects for ids 1 and 2:
scala> val peopleAddress = address.join(usPeople, address.col("people_id") === usPeople.col("id"))
scala> val joined2 = peopleAddress.where(col("id") === "1") union peopleAddress.where(col("id") === "2")
Both return the same result:
+---------+----+---+------+
|people_id|name| id| name|
+---------+----+---+------+
| 1| USA| 1| John|
| 2| USA| 2|Monica|
+---------+----+---+------+
But looking at the explain output I can see a big difference:
scala> joined.explain
== Physical Plan ==
*SortMergeJoin [people_id#10], [id#0], Inner
:- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(people_id#10, 200)
: +- *Filter (((people_id#10 = 1) || (people_id#10 = 2)) && isnotnull(people_id#10))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [Or(EqualTo(people_id,1),EqualTo(people_id,2)), IsNotNull(people_id)], ReadSchema: struct<people_id:string,name:string>
+- *Sort [id#0 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0, 200)
+- Union
:- *Filter isnotnull(id#0)
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
+- *Filter isnotnull(id#0)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
scala> joined2.explain
== Physical Plan ==
Union
:- *SortMergeJoin [people_id#10], [id#0], Inner
: :- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(people_id#10, 200)
: : +- *Filter isnotnull(people_id#10)
: : +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [*EqualTo(people_id,1), IsNotNull(people_id)], ReadSchema: struct<people_id:string,name:string>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- Union
: :- *Filter isnotnull(id#0)
: : +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
: +- *Filter (isnotnull(id#0) && (id#0 = 1))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2), EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
+- *SortMergeJoin [people_id#10], [id#0], Inner
:- *Sort [people_id#10 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(people_id#10, 200)
: +- *Filter isnotnull(people_id#10)
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#3077e4aa [people_id#10,name#11] PushedFilters: [IsNotNull(people_id), *EqualTo(people_id,2)], ReadSchema: struct<people_id:string,name:string>
+- *Sort [id#0 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0, 200)
+- Union
:- *Filter (isnotnull(id#0) && (id#0 = 2))
: +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,1), EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
+- *Filter isnotnull(id#0)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation#6846e4e8 [id#0,name#1] PushedFilters: [IsNotNull(id), *EqualTo(id,2)], ReadSchema: struct<id:string,name:string>
Now it's quite clear to me that what I did was the joined2 version: in the loop, join was called for each union. I thought that Spark would be smart enough to reduce that to the first version...
Now the current graph looks much better.
I hope that other people will not make the same mistake I made :) Unfortunately I had wrapped Spark in my own abstraction layer, which hid this simple problem, so spark-shell helped a lot to model it.
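For the record, a sketch of the shape that avoids the mistake, using the names from the original question: call join exactly once and restrict to the wanted hours with a single filter (or with a union of per-hour selects built before the join), instead of joining inside the loop for every hour:
import org.apache.spark.sql.functions.col

val dsInput = ds1.join(ds2).coalesce(150)
val dsWanted = dsInput.filter(col("hour").isin(hours: _*))
val dsResult = mySparkAggregation(dsWanted.coalesce(10))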
I'm doing a UNION of two temp tables and trying to order by a column, but Spark complains that the column I am ordering by cannot be resolved. Is this a bug or am I missing something?
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}
lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
val oldOrders = Seq(
Seq("old_order_id1", "old_order_name1", "true"),
Seq("old_order_id2", "old_order_name2", "true")
)
val newOrders = Seq(
Seq("new_order_id1", "new_order_name1", "false"),
Seq("new_order_id2", "new_order_name2", "false")
)
val schema = new StructType()
.add("id", StringType)
.add("name", StringType)
.add("is_old", StringType)
val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)
oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")
//ordering by column not in select works if I'm not doing UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|ORDER BY oo.is_old
""".stripMargin).show()
//ordering by column not in select doesn't work as I'm doing a UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|UNION
|SELECT no.id, no.name FROM new_orders no
|ORDER BY oo.is_old
""".stripMargin).show()
The output of the above code is:
+-------------+---------------+
| id| name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+
cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]
So ordering by a column that's not in the SELECT clause works if I'm not doing a UNION and it fails if I'm doing a UNION of two tables.
The syntax of Spark SQL is very similar to SQL, but they work very differently. Under the hood, Spark is all about RDDs/DataFrames.
After the UNION, a new DataFrame is generated, and we are not able to refer to fields from the original tables/DataFrames if we did not select them.
How to fix:
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
| UNION
| SELECT no.id, no.name, no.is_old FROM new_orders no
| ORDER BY oo.is_old
| ) t
""".stripMargin).show()
Thanks.