Union in Spark SQL query removing duplicates from Dataset

I am using the Java API for Apache Spark, and I have two Datasets, A and B.
The schema for both is the same: PhoneNumber, Name, Age, Address.
There is one record in each Dataset with the same PhoneNumber, but the other columns in that record are different.
I run the following SQL query on the two Datasets (after registering them as temporary views):
A.createOrReplaceTempView("A");
B.createOrReplaceTempView("B");
String query = "Select * from A UNION Select * from B";
Dataset<Row> result = sparkSession.sql(query);
result.show();
Surprisingly, the result contains only one record with that PhoneNumber; the other is removed.
I know that UNION in SQL is intended to remove duplicates, but to do that it needs to know the primary key on whose basis it decides what counts as a duplicate.
How does this query infer the "primary key" of my Dataset? (There is no concept of a primary key in Spark.)

You can use either UNION ALL:
Seq((1L, "foo")).toDF.createOrReplaceTempView("a")
Seq((1L, "bar"), (1L, "foo")).toDF.createOrReplaceTempView("b")
spark.sql("SELECT * FROM a UNION ALL SELECT * FROM b").explain
== Physical Plan ==
Union
:- LocalTableScan [_1#152L, _2#153]
+- LocalTableScan [_1#170L, _2#171]
or the Dataset.union method:
spark.table("a").union(spark.table("b")).explain
== Physical Plan ==
Union
:- LocalTableScan [_1#152L, _2#153]
+- LocalTableScan [_1#170L, _2#171]
How does this query infer the "Primary key" of my Dataset?
It doesn't, or at least not in the current version. It simply applies a HashAggregate over all available columns:
spark.sql("SELECT * FROM a UNION SELECT * FROM b").explain
== Physical Plan ==
*HashAggregate(keys=[_1#152L, _2#153], functions=[])
+- Exchange hashpartitioning(_1#152L, _2#153, 200)
   +- *HashAggregate(keys=[_1#152L, _2#153], functions=[])
      +- Union
         :- LocalTableScan [_1#152L, _2#153]
         +- LocalTableScan [_1#170L, _2#171]
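To make the deduplication rule concrete, here is a minimal sketch for the spark-shell (the column names are illustrative, not taken from the question): UNION behaves like UNION ALL followed by a de-duplication over all columns, so only rows that are identical in every column are collapsed.
// Two Datasets that share a PhoneNumber but differ in Name.
val a = Seq((100L, "Alice"), (200L, "Bob")).toDF("PhoneNumber", "Name")
val b = Seq((100L, "Carol")).toDF("PhoneNumber", "Name")

a.union(b).show()            // UNION ALL semantics: all 3 rows are kept
a.union(b).distinct().show() // UNION semantics: still 3 rows, since none is a full duplicate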

Related

Does Spark respect Kudu's hash partitioning, similar to bucketed joins on Parquet tables?

I'm trying out Kudu with Spark. I want to join two tables with the following schemas:
# This table has around 1 million records
TABLE dimensions (
id INT32 NOT NULL,
PRIMARY KEY (id)
)
HASH (id) PARTITIONS 32,
RANGE (id) (
PARTITION UNBOUNDED
)
OWNER root
REPLICAS 1
# This table has 500 million records
TABLE facts (
id INT32 NOT NULL,
date DATE NOT NULL,
PRIMARY KEY (id, date)
)
HASH (id) PARTITIONS 32,
RANGE (id, date) (
PARTITION UNBOUNDED
)
OWNER root
REPLICAS 1
I inserted data into these tables using the following script:
// Load data to spark dataframe
val dimensions_raw = spark.sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/root/dimensions.csv")
dimensions_raw.printSchema
dimensions_raw.createOrReplaceTempView("dimensions_raw")
// Set the primary key columns
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
def setNotNull(df: DataFrame, columns: Seq[String]) : DataFrame = {
val schema = df.schema
// Modify the [[StructField]] for the specified columns.
val newSchema = StructType(schema.map {
case StructField(c, t, _, m) if columns.contains(c) => StructField(c, t, nullable = false, m)
case y: StructField => y
})
// Apply new schema to the DataFrame
df.sqlContext.createDataFrame(df.rdd, newSchema)
}
val primaryKeyCols = Seq("id") // `primaryKeyCols` for `facts` table is `(id, date)`
val dimensions_prep = setNotNull(dimensions_raw, primaryKeyCols)
dimensions_prep.printSchema
dimensions_prep.createOrReplaceTempView("dimensions_prep")
// Create a kudu table
import collection.JavaConverters._
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu._
val kuduContext = new KuduContext("localhost:7051", spark.sparkContext)
// Delete the table if it already exists.
if(kuduContext.tableExists("dimensions")) {
kuduContext.deleteTable("dimensions")
}
kuduContext.createTable("dimensions", dimensions_prep.schema,
/* primary key */ primaryKeyCols,
new CreateTableOptions()
.setNumReplicas(1)
.addHashPartitions(List("id").asJava, 32))
// Load the kudu table from spark dataframe
kuduContext.insertRows(dimensions_prep, "dimensions")
// Create a DataFrame that points to the Kudu table we want to query.
val dimensions = spark.read
.option("kudu.master", "localhost:7051")
.option("kudu.table", "dimensions")
.format("kudu").load
dimensions.createOrReplaceTempView("dimensions")
I ran the above script for the facts table as well.
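For reference, a sketch of the parts of the script that change for the facts table, following the facts schema above (the CSV path and variable names here are assumptions, not taken from the original script):
// Load the facts CSV (path is hypothetical).
val facts_raw = spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/root/facts.csv")
// Composite primary key (id, date), as in the Kudu schema above.
val factsPrimaryKeyCols = Seq("id", "date")
val facts_prep = setNotNull(facts_raw, factsPrimaryKeyCols)
// Same 32 hash buckets on id as the dimensions table.
kuduContext.createTable("facts", facts_prep.schema,
  factsPrimaryKeyCols,
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("id").asJava, 32))
kuduContext.insertRows(facts_prep, "facts")
// DataFrame used for the join below.
val facts = spark.read
  .option("kudu.master", "localhost:7051")
  .option("kudu.table", "facts")
  .format("kudu").load
facts.createOrReplaceTempView("facts")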
I want to join the facts table with the dimensions table on id. I tried the following in Spark:
val query = facts.join(dimensions, facts.col("id") === dimensions.col("id"))
query.show()
// And I get the following Physical plan-
== Physical Plan ==
*(5) SortMergeJoin [id#0], [id#14], Inner
:- *(2) Sort [id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0, 200), true, [id=#43]
:     +- *(1) Scan Kudu facts [id#0,date#1] PushedFilters: [], ReadSchema: struct<id:int,date:date...
+- *(4) Sort [id#14 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#14, 200), true, [id=#49]
      +- *(3) Scan Kudu dimensions [id#14] PushedFilters: [], ReadSchema: struct<id:int>
My question is: how do I tell Spark that the tables are already sorted on id (the join key), so there is no need to sort again?
Moreover, the Exchange hashpartitioning step should not be needed, since the table is already bucketed on id.
The join query takes just under 100 seconds on a single machine running a single Kudu master and tablet server.
Am I doing something wrong here, or is this the expected speed with Kudu for this kind of query?

Steps in Spark physical plan not assigned to DAG step

I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan Spark has generated (with the Catalyst engine) looks to be corrupted: some of the steps in the physical plan have not been assigned an order id, and as a result the evaluation on the right side of the join is never completed in the Spark query.
Here is the example query:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ###
joined_df.explain(True)
Here is an example of the physical plan returned by Spark:
== Physical Plan ==
SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(encntr_id#0, 200)
:     +- *(1) Filter isnotnull(encntr_id#0)
:        +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#a6df563
+- Sort [encntr_id#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(encntr_id#12, 200)
      +- Filter isnotnull(encntr_id#12)
         +- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader#60dd22d9
Note that the data source scan, filter, exchange, and sort on the right side of the join have not been assigned an order id.
Can anyone shed some light on this issue for me? Why would a physical plan that looks correct not be assigned an evaluation order id?
Figured this out internally.
It turns out the Spark optimization routine can be affected by the configuration setting
spark.sql.codegen.maxFields
which has implications for how Spark optimizes the read from 'fat' (wide) tables.
In my case the setting was set low, which meant the DAG stages reading the right side of the join (the "fat" table) were performed without being assigned to a whole-stage codegen stage.
It is important to note that the read of the Hive data returned the same results in either case, just with a different optimization applied to the physical plan.
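For anyone hitting the same behaviour, a minimal sketch (Scala shell; the equivalent spark.conf calls exist in PySpark) of inspecting and raising the setting; the value 200 is arbitrary and the default depends on your Spark version:
// Read the current whole-stage codegen field limit (falling back to the usual default of 100).
println(spark.conf.get("spark.sql.codegen.maxFields", "100"))
// Raise it so wide ("fat") tables still get assigned to a whole-stage codegen stage.
spark.conf.set("spark.sql.codegen.maxFields", "200")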

What is SubqueryAlias node in analyzed logical plan?

I have a simple SQL query, as follows:
test("SparkSQLTest") {
val spark = SparkSession.builder().master("local").appName("SparkSQLTest").getOrCreate()
spark.range(1, 100).createOrReplaceTempView("t1")
val df = spark.sql("select id from t1 where t1.id = 10")
df.explain(true)
}
The output for the analyzed logical plan is:
== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- Filter (id#0L = cast(10 as bigint))
   +- SubqueryAlias t1 // don't understand here
      +- Range (1, 100, step=1, splits=Some(1))
Why does the SubqueryAlias show up in the logical plan? In my SQL, I don't have any alias-related operations.
Could someone help explain? Thanks!
SubqueryAlias is a unary logical operator that gives an alias to the (child) subquery it was created for. The alias can be used in another part of a structured query for a correlated subquery.
SubqueryAlias (and aliases in general) are available until Spark Optimizer has finished query optimization phase (using EliminateSubqueryAliases optimization rule).
Quoting the EliminateSubqueryAliases optimization rule:
Subqueries are only required to provide scoping information for attributes and can be removed once analysis is complete.
In your query the subquery is the part before createOrReplaceTempView("t1").
spark.range(1, 100).createOrReplaceTempView("t1")
You could rewrite the above structured query as follows; it changes nothing, but makes the point more explicit.
val q = spark.range(1, 100)
q.createOrReplaceTempView("t1")
So q could be any other structured query, hence the need for an alias to reference any output attribute of the subquery.
Once the query is optimized, you won't see any SubqueryAlias nodes when you explain it (and that's not only because the logical query plan gets planned to a physical query plan where physical operators are used).
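To watch the rule at work, here is a minimal sketch (the exact plan text varies by Spark version) that prints the analyzed and optimized plans of the query from the question:
// The analyzed plan still contains SubqueryAlias t1; the optimized plan does not,
// because EliminateSubqueryAliases has removed it.
val df = spark.sql("select id from t1 where t1.id = 10")
println(df.queryExecution.analyzed.numberedTreeString)
println(df.queryExecution.optimizedPlan.numberedTreeString)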

How to enable Catalyst Query Optimiser in Spark SQL?

Whether I use Spark SQL directly or the Spark shell, I don't know how to explicitly check that the Spark Catalyst Query Optimizer is in operation.
For example, assume that I create a HiveContext as follows:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Then, when I try to process a query as:
hiveContext.sql("""
| SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,SUM(ss_ext_sales_price) sum_agg
| FROM date_dim dt, store_sales, item
| WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
| AND store_sales.ss_item_sk = item.i_item_sk
| AND item.i_manufact_id = 128
| AND dt.d_moy=11
| GROUP BY dt.d_year, item.i_brand, item.i_brand_id
| ORDER BY dt.d_year, sum_agg desc, brand_id
| LIMIT 100
""").collect().foreach(println)
Is there a way to check that the Catalyst optimizer is being used?
If not, how can I enable the Catalyst optimizer for HiveContext?
Catalyst Query Optimizer is always enabled in Spark 2.0. It is a part of the optimizations you get for free when you work with Spark 2.0's Datasets (and one of the many reasons you should really be using Datasets before going low level with RDDs).
If you want to see the optimizations Catalyst Query Optimizer applied to your query, use TRACE logging level for SparkOptimizer in conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.SparkOptimizer=TRACE
With that in place, whenever you trigger execution of your query (through show, collect, or a mere explain) you'll see plenty of log output showing the work the Catalyst Query Optimizer is doing for you.
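If you'd rather not edit conf/log4j.properties, a sketch of setting the same logger level programmatically from the shell (this assumes the log4j 1.x API bundled with Spark 2.x):
import org.apache.log4j.{Level, Logger}
// Raise only the optimizer's logger to TRACE, leaving other loggers untouched.
Logger.getLogger("org.apache.spark.sql.execution.SparkOptimizer").setLevel(Level.TRACE)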
Let's see the ColumnPruning optimization rule in action.
// the business object
case class Person(id: Long, name: String, city: String)
// the dataset to query over
val dataset = Seq(Person(0, "Jacek", "Warsaw")).toDS
// the query
// Note that we work with names only (out of 3 attributes in Person)
val query = dataset.groupBy(upper('name) as 'name).count
scala> query.explain(extended = true)
...
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L] Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
!+- LocalRelation [id#125L, name#126, city#127] +- Project [name#126]
! +- LocalRelation [id#125L, name#126, city#127]
...
== Parsed Logical Plan ==
'Aggregate [upper('name) AS name#160], [upper('name) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Analyzed Logical Plan ==
name: string, count: bigint
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [id#125L, name#126, city#127]
== Optimized Logical Plan ==
Aggregate [upper(name#126)], [upper(name#126) AS name#160, count(1) AS count#166L]
+- LocalRelation [name#126]
== Physical Plan ==
*HashAggregate(keys=[upper(name#126)#171], functions=[count(1)], output=[name#160, count#166L])
+- Exchange hashpartitioning(upper(name#126)#171, 200)
   +- *HashAggregate(keys=[upper(name#126) AS upper(name#126)#171], functions=[partial_count(1)], output=[upper(name#126)#171, count#173L])
      +- LocalTableScan [name#126]

Why is Spark SQL in Spark 1.6.1 not using broadcast join in CTAS?

I have a query in Spark SQL which uses a broadcast join as expected, since my table b is smaller than spark.sql.autoBroadcastJoinThreshold.
However, if I put the exact same select query into a CTAS statement, it does NOT use a broadcast join for some reason.
The select query looks like this:
select id,name from a join b on a.name = b.bname;
And the explain for it looks like this:
== Physical Plan ==
Project [id#1,name#2]
+- BroadcastHashJoin [name#2], [bname#3], BuildRight
   :- Scan ParquetRelation: default.a[id#1,name#2] InputPaths: ...
   +- ConvertToUnsafe
      +- HiveTableScan [bname#3], MetastoreRelation default, b, Some(b)
Then my CTAS looks like this:
create table c as select id,name from a join b on a.name = b.bname;
And the explain for this one returns:
== Physical Plan ==
ExecutedCommand CreateTableAsSelect [Database:default}, TableName: c, InsertIntoHiveTable]
+- Project [id#1,name#2]
   +- Join Inner, Some((name#2 = bname#3))
      :- Relation[id#1,name#2] ParquetRelation: default.a
      +- MetastoreRelation default, b, Some(b)
Is it expected NOT to use a broadcast join for the select query that is part of a CTAS? If so, is there a way to force the CTAS to use a broadcast join?
If your question is about why Spark creates two different physical plans, this answer won't help. I have observed plenty of sensitivity in Spark's optimizer, where the same SQL snippets result in meaningfully different physical plans even when it is not obvious why that is the case.
However, if your question is ultimately about how to execute the CTAS with a broadcast join, here is a simple workaround I have used many times: register the query with the plan you like as a temporary table (or view, if you are using the SQL console) and then use SELECT * FROM tmp_tbl as the query that feeds the CTAS.
In other words, something like:
sql("select id, name from a join b on a.name = b.bname").registerTempTable("tmp_joined")
sql("create table c as select * from tmp_joined")
