Hive on spark, partition pruning, better undertanding - apache-spark

I have a spark 1.6.2 code using SQL/HQL language.
I really tried to understand if my job is doing partition pruning or not.
Data is partitioned by date (cdate field)
the explain plan is :
== Physical Plan ==
Project [coalesce(cdate#74,cdate#38) AS cdate#29,coalesce(account_key#75,account_key#34) AS account_key#30,coalesce(product#76,product#35) AS product#31,(coalesce(amount#77,0.0) + coalesce(amount#36,0.0)) AS amount#32,(coalesce(volume#78L,0) + cast(coalesce(volume#37,0) as bigint)) AS volume#33L]
+- SortMergeOuterJoin [account_key#34,cdate#38,product#35], [account_key#75,cdate#74,product#76], FullOuter, None
:- Sort [account_key#34 ASC,cdate#38 ASC,product#35 ASC], false, 0
: +- TungstenExchange hashpartitioning(account_key#34,cdate#38,product#35,200), None
: +- Project [volume#37,product#35,cdate#38,account_key#34,amount#36]
: +- BroadcastHashJoin [cdate#38], [cdate#24], BuildLeft
: :- Scan ParquetRelation[account_key#34,product#35,amount#36,volume#37,cdate#38] InputPaths: hdfs://hdp1.voicelab.local:8020/apps/hive/warehouse/my.db/daily_profiles
: +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
: +- TungstenExchange hashpartitioning(cdate#24,200), None
: +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
: +- Project [cdate#24]
: +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#24])
: +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
: +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#20,accountKey#21,product#22])
: +- Project [cdate#20,accountKey#21,product#22]
: +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
+- Sort [account_key#75 ASC,cdate#74 ASC,product#76 ASC], false, 0
+- TungstenExchange hashpartitioning(account_key#75,cdate#74,product#76,200), None
+- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Final,isDistinct=false),(count(1),mode=Final,isDistinct=false)], output=[cdate#74,account_key#75,product#76,amount#77,volume#78L])
+- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
+- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Partial,isDistinct=false),(count(1),mode=Partial,isDistinct=false)], output=[cdate#20,accountKey#21,product#22,sum#54,count#55L])
+- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
How can I figure out if my job is using the metastore in order to do partition pruning.
Can you elaborate about Scan ParquetRelation? how can I know that the scan using partition pruning/discovery ?
what is the meaning for the field#SOME_NUMBER i.e account_key#34
The use case is aggregating data per date,account,product

Look for PartitionFilters: [... ] in the Physical plan. If the array has a non empty value, it's using otherwise no. I couldn't find in your plan, unless I missed it or could not find it.

Related

pyspark joinWithCassandraTable refactor without maps

Im new to using spark/scala here and im having trouble with a refactor of some of my code here. Im running Scala 2.11 using pyspark and in a spark/yarn setup. The following is working but id like to clean it up, and to get the max performance out of this. I read elsewhere that pyspark udf and lambdas can cause huge performance impact so im trying to reduce or remove them were possible.
# Reduce ingest df1 data by joining on allowed table df2
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.rdd\
.map(lambda r: Row(tag=r['tag_id'], user_uuid=r['user_uuid']))
# Type column fixed to type=2, and tag==key
ready_to_join = to_process.map(lambda r: (r[0], 2, r[1]))
# Join with cassandra table to find matches
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
the cassandra table is such that
CREATE TABLE keyspace.table3 (
user_uuid uuid,
type int,
key text,
value text,
PRIMARY KEY (user_uuid, type, key)
) WITH CLUSTERING ORDER BY (type ASC, key ASC)
currently ive got
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select('user_uuid', 'type', col('tag').alias("key"))\
.rdd\
.map(lambda x: Row(x))
# planning on using repartitionByCassandraReplica here after I get it logically working
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
but im getting errors like
2020-10-30 15:10:42 WARN TaskSetManager:66 - Lost task 148.0 in stage 22.0 (TID ----, ---, executor 9): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
looking help from any spark gurus out there to point me to anything stupid I am doing here.
Update
Thanks to Alex's suggestion using the spark-cassandra-connector v2.5+ gives the ability for dataframes to join directly. I updated my code to use this instead.
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select(col('user_uuid').alias('c1_user_uuid'), 'type', col('tag').alias("key"))\
cass_table = spark_session
.read \
.format("org.apache.spark.sql.cassandra") \
.options(table=config.table, keyspace=config.keyspace) \
.load()
exists_in_cass = ready_to_join\
.join(
cass_table,
[(cass_table["user_uuid"] == ready_to_join["c1_user_uuid"]) &
(cass_table["key"] == ready_to_join["key"]) &
(cass_table["type"] == ready_to_join["type"])])\
.select(col("c1_user_uuid").alias("user_uuid"))
exists_in_cass.explain()
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
As far as I know, in theory this should be alot faster ! But im getting errors during runtime with the database timing out.
WARN TaskSetManager:66 - Lost task 827.0 in stage 12.0 (TID 9946, , executor 4): java.io.IOException: Exception during execution of SELECT "user_uuid", "key" FROM "keyspace"."table3" WHERE token("user_uuid") > ? AND token("user_uuid") <= ? AND "type" = ? ALLOW FILTERING: Query timed out after PT2M
TaskSetManager:66 - Lost task 125.0 in stage 12.0 (TID 9215, , executor 7): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2M
etc
I have the config for spark setup to allow for the spark extensions
--packages mysql:mysql-connector-java:5.1.47,com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
The DAG from spark shows all nodes completely maxed out. Should I be partitioning my data before running my join here?
The explain for this also doesnt show a direct join (explain has more code than snippet above)
== Physical Plan ==
*(6) Project [c1_user_uuid#124 AS user_uuid#158]
+- *(6) SortMergeJoin [c1_user_uuid#124, key#125L], [user_uuid#129, cast(key#131 as bigint)], Inner
:- *(3) Sort [c1_user_uuid#124 ASC NULLS FIRST, key#125L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(c1_user_uuid#124, key#125L, 200)
: +- *(2) Project [id#0 AS c1_user_uuid#124, tag_id#101L AS key#125L]
: +- *(2) BroadcastHashJoin [secondary_id#60], [secondary_id#100], Inner, BuildRight
: :- *(2) Filter (isnotnull(secondary_id#60) && isnotnull(id#0))
: : +- InMemoryTableScan [secondary_id#60, id#0], [isnotnull(secondary_id#60), isnotnull(id#0)]
: : +- InMemoryRelation [secondary_id#60, id#0], StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- *(7) Project [secondary_id#60, id#0]
: : +- Generate explode(split(secondary_ids#1, \|)), [id#0], false, [secondary_id#60]
: : +- *(6) Project [id#0, secondary_ids#1]
: : +- *(6) SortMergeJoin [id#0], [guid#46], Inner
: : :- *(2) Sort [id#0 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#0, 200)
: : : +- *(1) Filter (isnotnull(id#0) && id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12})
: : : +- InMemoryTableScan [id#0, secondary_ids#1], [isnotnull(id#0), id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}]
: : : +- InMemoryRelation [id#0, secondary_ids#1], StorageLevel(disk, memory, deserialized, 1 replicas)
: : : +- Exchange RoundRobinPartitioning(3840)
: : : +- *(1) Filter AtLeastNNulls(n, id#0,secondary_ids#1)
: : : +- *(1) FileScan csv [id#0,secondary_ids#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[inputdata_file, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,secondary_ids:string>
: : +- *(5) Sort [guid#46 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(guid#46, 200)
: : +- *(4) Filter (guid#46 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12} && isnotnull(guid#46))
: : +- Generate explode(set_guid#36), false, [guid#46]
: : +- *(3) Project [set_guid#36]
: : +- *(3) Filter (isnotnull(allowed#39) && (allowed#39 = 1))
: : +- *(3) FileScan orc whitelist.whitelist1[set_guid#36,region#39,timestamp#43] Batched: false, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://file, PartitionCount: 1, PartitionFilters: [isnotnull(timestamp#43), (timestamp#43 = 18567)], PushedFilters: [IsNotNull(region), EqualTo(region,1)], ReadSchema: struct<set_guid:array<string>,region:int>
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
FROM TAG as T
JOIN MAP as M
ON T.tag_id = M.tag_id
WHERE (expire >= NOW() OR expire IS NULL)
ORDER BY T.tag_id) AS subset) [numPartitions=1] [secondary_id#100,tag_id#101L] PushedFilters: [*IsNotNull(secondary_id), *IsNotNull(tag_id)], ReadSchema: struct<secondary_id:string,tag_id:bigint>
+- *(5) Sort [user_uuid#129 ASC NULLS FIRST, cast(key#131 as bigint) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(user_uuid#129, cast(key#131 as bigint), 200)
+- *(4) Project [user_uuid#129, key#131]
+- *(4) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user_uuid#129,key#131] PushedFilters: [*EqualTo(type,2)], ReadSchema: struct<user_uuid:string,key:string>
Im not getting the direct joins working which is causing time outs.
Update 2
I think this isnt resolving to direct joins as my datatypes in the dataframes are off. Specifically the uuid type
Instead of using RDD API with PySpark, I suggest to take Spark Cassandra Connector (SCC) 2.5.x or 3.0.x (release announcement) that contain the implementation of the join of Dataframe with Cassandra - in this case you won't need to go down to RDDs, but just use normal Dataframe API joins.
Please note that this is not enabled by default, so you will need to start your pyspark or spark-submit with special configuration, like this:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
You can find more about joins with Cassandra in my recent blog post on this topic (although it uses Scala, Dataframe part should be translated almost one to one to PySpark)

Apache Spark 2.2: broadcast join not working when you already cache the dataframe which you want to broadcast

I have mulitple large dataframes(around 30GB) called as and bs, a relatively small dataframe(around 500MB ~ 1GB) called spp.
I tried to cache spp into memory in order to avoid reading data from database or files multiple times.
But I find if I cache spp, the physical plan shows it won't use broadcast join even though spp is enclosed by broadcast function.
However, If I unpersist the spp, the plan shows it uses broadcast join.
Anyone familiar with this?
scala> spp.cache
res38: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> val as = acs.join(broadcast(spp), $"idsegment" === $"idAdnetProductSegment")
as: org.apache.spark.sql.DataFrame = [idsegmentpartner: bigint, ssegmentsource: string ... 44 more fields]
scala> as.explain
== Physical Plan ==
*SortMergeJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner
:- *Sort [idsegment#286L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(idsegment#286L, 200)
: +- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- *Sort [idAdnetProductSegment#91L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idAdnetProductSegment#91L, 200)
+- *Filter isnotnull(idAdnetProductSegment#91L)
+- InMemoryTableScan [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], [isnotnull(idAdnetProductSegment#91L)]
+- InMemoryRelation [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
scala> spp.unpersist
res40: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> as.explain
== Physical Plan ==
*BroadcastHashJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner, BuildRight
:- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[4, bigint, true]))
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] PushedFilters: [*IsNotNull(idAdnetProductSegment)], ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
This happens when the Analyzed plan tries to use the cache data. It swallows the ResolvedHint information supplied by the user(code).
If we try to do a df.explain(true), we will see that hint is lost between Analyzed and optimized plan, which is where Spark tries to use the cached data.
This issue has been fixed in the latest version of Spark(in multiple attempts).
latest jira: https://issues.apache.org/jira/browse/SPARK-27674 .
Code where the fix(to consider the hint when using cached tables) : https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L219

Detected cartesian product for INNER join on literal column in PySpark

The following code raises "Detected cartesian product for INNER join" exception:
first_df = spark.createDataFrame([{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}, ])
second_df = spark.createDataFrame([{"some_value": "????"}, ])
second_df = second_df.withColumn("second_id", F.lit("1"))
# If the next line is uncommented, then the JOIN is working fine.
# second_df.persist()
result_df = first_df.join(second_df,
first_df.first_id == second_df.second_id,
'inner')
data = result_df.collect()
result_df.explain()
and shows me that the logical plan is as shown below:
Filter (first_id#0 = 1)
+- LogicalRDD [first_id#0], false
and
Project [some_value#2, 1 AS second_id#4]
+- LogicalRDD [some_value#2], false
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
It looks like for a reason there is no a column existing in the JOIN condition for those logical plans when RuleExecutor applies optimization rule set called CheckCartesianProducts (see https://github.com/apache/spark/blob/v2.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1114).
But, if I use "persist" method before JOIN it works and the Physical Plan is:
*(3) SortMergeJoin [first_id#0], [second_id#4], Inner
:- *(1) Sort [first_id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(first_id#0, 10)
: +- Scan ExistingRDD[first_id#0]
+- *(2) Sort [second_id#4 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(second_id#4, 10)
+- InMemoryTableScan [some_value#2, second_id#4]
+- InMemoryRelation [some_value#2, second_id#4], true, 10000, StorageLevel(disk, memory, 1 replicas)
+- *(1) Project [some_value#2, 1 AS second_id#4]
+- Scan ExistingRDD[some_value#2]
So, may be someone can explain internal leading to such results, because persisting the data frame does not look as a solution.
The problem is, that once you persist your data, second_id is incorporated into the cached table and no longer considered constant. As a result planner can no longer infer that the query should be expressed a Cartesian product, and uses standard SortMergeJoin on hash partitioned second_id.
It would be trivial to achieve the same outcome, without persistence, using udf
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType
#pandas_udf('integer', PandasUDFType.SCALAR)
def identity(x):
return x
second_df = second_df.withColumn('second_id', identity(lit(1)))
result_df = first_df.join(second_df,
first_df.first_id == second_df.second_id,
'inner')
result_df.explain()
== Physical Plan ==
*(6) SortMergeJoin [cast(first_id#4 as int)], [second_id#129], Inner
:- *(2) Sort [cast(first_id#4 as int) ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(cast(first_id#4 as int), 200)
: +- *(1) Filter isnotnull(first_id#4)
: +- Scan ExistingRDD[first_id#4]
+- *(5) Sort [second_id#129 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(second_id#129, 200)
+- *(4) Project [some_value#6, pythonUDF0#154 AS second_id#129]
+- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#154]
+- *(3) Project [some_value#6]
+- *(3) Filter isnotnull(pythonUDF0#153)
+- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#153]
+- Scan ExistingRDD[some_value#6]
However SortMergeJoin is not what you should try to achieve here. With constant key, it would result in an extreme data skew, and likely fail, on anything but toy data.
Cartesian Product however, as expensive as it is, won't suffer from this issue, and should be preferred here. So it would recommend enabling cross joins or using explicit cross join syntax (spark.sql.crossJoin.enabled for Spark 2.x) and move on.
A pending question remains how to prevent undesired behavior when data is cached. Unfortunately I don't have an answer ready for that. I fairly sure it is possible to use custom optimizer rules, but this is not something that can be done with Python alone.

Efficiently caching data frames in Spark SQL

The use-case is to self-join a table multiple times.
// Hive Table
val network_file = spark.sqlContext.sql("SELECT * FROM
test.network_file")
// Cache
network_file.cache()
network_file.createOrReplaceTempView("network_design")
Now the following query does self-join multiple times.
val res = spark.sqlContext.sql("""select
one.sourcehub as source,
one.mappedhub as first_leg,
two.mappedhub as second_leg,
one.destinationhub as dest
from
(select * from network_design) one JOIN
(select * from network_design) two JOIN
(select * from network_design) three
ON (two.sourcehub = one.mappedhub )
AND (three.sourcehub = two.mappedhub)
AND (one.destinationhub = two.destinationhub )
AND (two.destinationhub = three.destinationhub)
group by source, first_leg, second_leg, dest
""")
Problem is that the Physical Plan of above query suggests on reading the table three times.
== Physical Plan ==
*HashAggregate(keys=[sourcehub#83, mappedhub#85, mappedhub#109, destinationhub#84], functions=[])
+- Exchange hashpartitioning(sourcehub#83, mappedhub#85, mappedhub#109, destinationhub#84, 200)
+- *HashAggregate(keys=[sourcehub#83, mappedhub#85, mappedhub#109, destinationhub#84], functions=[])
+- *Project [sourcehub#83, destinationhub#84, mappedhub#85, mappedhub#109]
+- *BroadcastHashJoin [mappedhub#109, destinationhub#108], [sourcehub#110, destinationhub#111], Inner, BuildRight
:- *Project [sourcehub#83, destinationhub#84, mappedhub#85, destinationhub#108, mappedhub#109]
: +- *BroadcastHashJoin [mappedhub#85, destinationhub#84], [sourcehub#107, destinationhub#108], Inner, BuildRight
: :- *Filter (isnotnull(destinationhub#84) && isnotnull(mappedhub#85))
: : +- InMemoryTableScan [sourcehub#83, destinationhub#84, mappedhub#85], [isnotnull(destinationhub#84), isnotnull(mappedhub#85)]
: : +- InMemoryRelation [sourcehub#83, destinationhub#84, mappedhub#85], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- HiveTableScan [sourcehub#0, destinationhub#1, mappedhub#2], HiveTableRelation `test`.`network_file`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [sourcehub#0, destinationhub#1, mappedhub#2]
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false], input[1, string, false]))
: +- *Filter ((isnotnull(sourcehub#107) && isnotnull(destinationhub#108)) && isnotnull(mappedhub#109))
: +- InMemoryTableScan [sourcehub#107, destinationhub#108, mappedhub#109], [isnotnull(sourcehub#107), isnotnull(destinationhub#108), isnotnull(mappedhub#109)]
: +- InMemoryRelation [sourcehub#107, destinationhub#108, mappedhub#109], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- HiveTableScan [sourcehub#0, destinationhub#1, mappedhub#2], HiveTableRelation `test`.`network_file`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [sourcehub#0, destinationhub#1, mappedhub#2]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false], input[1, string, false]))
+- *Filter (isnotnull(sourcehub#110) && isnotnull(destinationhub#111))
+- InMemoryTableScan [sourcehub#110, destinationhub#111], [isnotnull(sourcehub#110), isnotnull(destinationhub#111)]
+- InMemoryRelation [sourcehub#110, destinationhub#111, mappedhub#112], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- HiveTableScan [sourcehub#0, destinationhub#1, mappedhub#2], HiveTableRelation `test`.`network_file`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [sourcehub#0, destinationhub#1, mappedhub#2]
Shouldn't the Spark cache the table once and not read it multiple times?
How can we efficiently cache tables in spark for these self-join cases?
Spark Version - 2.2
Hive ORC is the store downstream.
This sequence of statements ignores the data frame that is to be cached:
network_file.cache() #the result of this is not being used at all
network_file.createOrReplaceTempView("network_design") #doesn't have the cached DF in lineage
You should either overwrite the variable or register the table on the returned data frame:
network_file = network_file.cache()
network_file.createOrReplaceTempView("network_design")
Or:
network_file.cache().createOrReplaceTempView("network_design")

Spark SQL - change query plan to bushy plan

I want to execute the following sql query in Spark SQL:
sqlContext.sql("SELECT c.name, c.nationkey, n.name, l.orderkey, o.orderdate
FROM customers c, nations n, orders o, lineitems l
WHERE n.nationkey=20 AND c.nationkey=n.nationkey AND c.custkey=o.custkey AND o.orderkey=l.orderkey");
Thus, 3 joins are to perform.
Catalyst, the Query-Analyzer and Optimizer in Spark SQL, returns following Optimized Logical and Physical Plans:
== Optimized Logical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- Join Inner, Some((orderkey#32 = orderkey#14))
:- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
: +- Join Inner, Some((custkey#3 = custkey#30))
: :- Project [name#25,custkey#3,nationkey#6,name#5]
: : +- Join Inner, Some((nationkey#6 = nationkey#26))
: : :- Project [custkey#3,nationkey#6,name#5]
: : : +- LogicalRDD [acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7], MapPartitionsRDD[3] at createDataFrame at Query.java:66
: : +- Project [nationkey#26,name#25]
: : +- Filter (nationkey#26 = 20)
: : +- LogicalRDD [comment#24,name#25,nationkey#26,regionkey#27], MapPartitionsRDD[11] at createDataFrame at Query.java:76
: +- Project [orderkey#32,orderdate#31,custkey#30]
: +- LogicalRDD [clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36], MapPartitionsRDD[15] at createDataFrame at Query.java:81
+- Project [orderkey#14]
+- LogicalRDD [comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23], MapPartitionsRDD[7] at createDataFrame at Query.java:71
== Physical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- SortMergeJoin [orderkey#32], [orderkey#14]
:- Sort [orderkey#32 ASC], false, 0
: +- TungstenExchange hashpartitioning(orderkey#32,200), None
: +- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
: +- SortMergeJoin [custkey#3], [custkey#30]
: :- Sort [custkey#3 ASC], false, 0
: : +- TungstenExchange hashpartitioning(custkey#3,200), None
: : +- Project [name#25,custkey#3,nationkey#6,name#5]
: : +- SortMergeJoin [nationkey#6], [nationkey#26]
: : :- Sort [nationkey#6 ASC], false, 0
: : : +- TungstenExchange hashpartitioning(nationkey#6,200), None
: : : +- Project [custkey#3,nationkey#6,name#5]
: : : +- Scan ExistingRDD[acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7]
: : +- Sort [nationkey#26 ASC], false, 0
: : +- TungstenExchange hashpartitioning(nationkey#26,200), None
: : +- Project [nationkey#26,name#25]
: : +- Filter (nationkey#26 = 20)
: : +- Scan ExistingRDD[comment#24,name#25,nationkey#26,regionkey#27]
: +- Sort [custkey#30 ASC], false, 0
: +- TungstenExchange hashpartitioning(custkey#30,200), None
: +- Project [orderkey#32,orderdate#31,custkey#30]
: +- Scan ExistingRDD[clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36]
+- Sort [orderkey#14 ASC], false, 0
+- TungstenExchange hashpartitioning(orderkey#14,200), None
+- Project [orderkey#14]
+- Scan ExistingRDD[comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23]
As you can see, the query plan is a left deep plan:
(Join(Join(Join(nationkey#6 = nationkey#26), custkey), orderkey))
Theoretically, in this case, a bushy plan could also be executed:
Join (over custkey)
/ \
Join(nationkey#6 = nationkey#26) Join(orderkey#32 = orderkey#14))
This would allow to execute 2 joins in parallel.
The question is: (How) Is it possible to manipulate Catalyst to generate bushy plans and run the join-leafs in parallel?
My motivation is to run independant (small or fast) joins in parallel instead of sequentially processing multiple joins and thus waiting for stranglers.

Resources