I have a question about Spark DataFrame partitioning. I'm currently using Spark 1.6 because of project requirements. This is my code excerpt:
sqlContext.getConf("spark.sql.shuffle.partitions") // 6
val df = sc.parallelize(List(("A",1),("A",4),("A",2),("B",5),("C",2),("D",2),("E",2),("B",7),("C",9),("D",1))).toDF("id_1","val_1")
df.rdd.getNumPartitions // 4
val df2 = sc.parallelize(List(("B",1),("E",4),("H",2),("J",5),("C",2),("D",2),("F",2))).toDF("id_2","val_2")
df2.rdd.getNumPartitions // 4
val df3 = df.join(df2,$"id_1" === $"id_2")
df3.rdd.getNumPartitions // 6
val df4 = df3.repartition(3,$"id_1")
df4.rdd.getNumPartitions // 3
The following explain plan was produced:
== Parsed Logical Plan ==
'RepartitionByExpression ['id_1], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Analyzed Logical Plan ==
id_1: string, val_1: int, id_2: string, val_2: int
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Optimized Logical Plan ==
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Physical Plan ==
TungstenExchange hashpartitioning(id_1#42,3), None
+- SortMergeJoin [id_1#42], [id_2#46]
   :- Sort [id_1#42 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(id_1#42,6), None
   :     +- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :        +- Scan ExistingRDD[_1#40,_2#41]
   +- Sort [id_2#46 ASC], false, 0
      +- TungstenExchange hashpartitioning(id_2#46,6), None
         +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
            +- Scan ExistingRDD[_1#44,_2#45]
As far as I know, a DataFrame is an abstraction over RDDs, so partitioning should be delegated to the Catalyst optimizer.
In fact, unlike RDDs, where many transformations accept a number-of-partitions parameter so that co-partitioning and co-location can be exploited whenever possible, with DataFrames the only way to alter partitioning is to call repartition; otherwise the number of partitions for joins and aggregations is taken from the configuration parameter spark.sql.shuffle.partitions.
From what I can see in the explain plan above, there seems to be a useless repartition (and therefore a shuffle) to 6 (the value of spark.sql.shuffle.partitions), followed by another repartition to the final value imposed by the repartition method.
I believe the optimizer could change the number of partitions of the join to the final value of 3.
Could someone help me clarify this point? Maybe I am missing something.
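To make the concern concrete, here is a small pure-Python sketch (not Spark code) of the two exchanges in the physical plan. It assumes a stand-in hash function (zlib.crc32) rather than Spark's own hash, and the helper name hash_partition is hypothetical. It only illustrates that the layout produced by the outer exchange depends on the key and the final partition count, not on the intermediate 6-way layout.

```python
import zlib

def hash_partition(rows, key_index, num_partitions):
    """Group rows into num_partitions buckets by a stable hash of the key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = str(row[key_index]).encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return partitions

# Inner-join output for the sample data in the question (id_1, val_1, id_2, val_2).
joined = [("B", 5, "B", 1), ("B", 7, "B", 1), ("C", 2, "C", 2),
          ("C", 9, "C", 2), ("D", 2, "D", 2), ("D", 1, "D", 2),
          ("E", 2, "E", 4)]

# Path 1: shuffle to 6 partitions (the join), then shuffle again to 3 (the repartition).
six_way = hash_partition(joined, 0, 6)
via_six = hash_partition([row for part in six_way for row in part], 0, 3)

# Path 2: shuffle straight to 3 partitions.
direct = hash_partition(joined, 0, 3)

# Compare partition contents, ignoring the order of rows inside each partition.
same_layout = [sorted(p) for p in via_six] == [sorted(p) for p in direct]
print(same_layout)
```

Both paths end with the same rows in the same partitions, which is why the intermediate 6-way shuffle looks redundant from the point of view of the final result.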

If you use Spark SQL, the number of partitions for the shuffles it plans itself (joins, aggregations) is always equal to spark.sql.shuffle.partitions. But if you enable spark.sql.adaptive.enabled, Spark adds an ExchangeCoordinator. Right now, the job of this coordinator is to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or more stages.
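As a rough illustration of what "determining the number of post-shuffle partitions" can mean, the following pure-Python sketch packs contiguous pre-shuffle partitions together until a target size would be exceeded. It is a simplified model, not Spark's actual code: the function name, the sample sizes, and the target value are all assumptions made for the example.

```python
def estimate_partition_start_indices(bytes_by_partition_per_stage, target_size):
    """Return the pre-shuffle partition index at which each post-shuffle partition starts.

    bytes_by_partition_per_stage holds one list of partition sizes per upstream stage;
    all stages are assumed to use the same number of pre-shuffle partitions.
    """
    num_pre_shuffle = len(bytes_by_partition_per_stage[0])
    start_indices = [0]
    current_size = 0
    for i in range(num_pre_shuffle):
        # Size of pre-shuffle partition i summed over every upstream stage.
        next_size = sum(stage[i] for stage in bytes_by_partition_per_stage)
        if i > 0 and current_size + next_size > target_size:
            # Adding partition i would exceed the target: start a new post-shuffle partition.
            start_indices.append(i)
            current_size = next_size
        else:
            current_size += next_size
    return start_indices

# Hypothetical sizes for the two sides of a join, each shuffled into 6 partitions.
left_stage = [10, 20, 30, 40, 50, 60]
right_stage = [10, 10, 10, 10, 10, 10]
starts = estimate_partition_start_indices([left_stage, right_stage], target_size=100)
print(len(starts), starts)
```

In this model the six pre-shuffle partitions are read as four post-shuffle partitions, so the count is derived from data sizes at runtime instead of being fixed by configuration alone.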


Spark does not use Direct Join over DSE

I'm developing a Spark Streaming task that joins data from a stream with a Cassandra table. As you can see in the explain plan, Direct Join is not used.
According to the DSE documentation, Direct Join is used when (table size * directJoinSizeRatio) > size of keys.
In my case the table has millions of records and the keys amount to a single record (from the stream), so I expect Direct Join to be used.
The table radice_polizza has only the id_cod_polizza column as its partition key.
Connector version: 2.5.1.
DSE version: 6.7.6.
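The rule quoted from the documentation can be written out as a one-line check. This is only a sketch of the inequality as stated above, not connector code: the function name is hypothetical, and the numbers (including the ratio of 0.9) are illustrative assumptions rather than values read from my cluster.

```python
def direct_join_expected(table_size, key_count, direct_join_size_ratio):
    """Apply the quoted rule: (table size * directJoinSizeRatio) > size of keys."""
    return table_size * direct_join_size_ratio > key_count

# A table with millions of rows joined against a single key from the stream.
expected = direct_join_expected(table_size=5_000_000, key_count=1,
                                direct_join_size_ratio=0.9)
print(expected)
```

By this reading the condition holds comfortably in my case, which is why I expected the plan to contain a Direct Join.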
*Project [id_cod_polizza#86L, progressivo#11, id3_numero_polizza#25, id3_cod_compagnia#21]
+- *SortMergeJoin [id_cod_polizza#86L], [id_cod_polizza#10L], Inner
   :- *Sort [id_cod_polizza#86L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id_cod_polizza#86L, 200)
   :     +- *Project [value#84L AS id_cod_polizza#86L]
   :        +- *SerializeFromObject [input[0, bigint, false] AS value#84L]
   :           +- Scan ExternalRDDScan[obj#83L]
   +- *Sort [id_cod_polizza#10L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id_cod_polizza#10L, 200)
         +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id_cod_polizza#10L,progressivo#11,id3_numero_polizza#25,id3_cod_compagnia#21] ReadSchema: struct<id_cod_polizza:bigint,progressivo:string,id3_numero_polizza:string,id3_cod_compagnia:string>
Here is my code:
var radice_polizza = spark
  .read
  .format("org.apache.spark.sql.cassandra") // assumed: the reader lines are missing from the excerpt
  .options(Map("table" -> "radice_polizza", "keyspace" -> "preferred_temp"))
  .load()
if (mode == LoadMode.DIFF) {
  val altered_data_df = altered_data.idCodPolizzaList.toDF("id_cod_polizza")
  radice_polizza = altered_data_df.join(radice_polizza, Seq("id_cod_polizza"))
}
Forcing Direct Join works:
radice_polizza = altered_data_df.join(radice_polizza.directJoin(AlwaysOn), Seq("id_cod_polizza"))
== Physical Plan ==
*Project [id_cod_polizza#58L, progressivo#11, id3_numero_polizza#25, id3_cod_compagnia#21]
+- DSE Direct Join [id_cod_polizza = id_cod_polizza#58L] preferred_temp.radice_polizza - Reading (id_cod_polizza, progressivo, id3_numero_polizza, id3_cod_compagnia) Pushed {}
   +- *Project [value#56L AS id_cod_polizza#58L]
      +- *SerializeFromObject [input[0, bigint, false] AS value#56L]
         +- Scan ExternalRDDScan[obj#55L]
Why is Direct Join not used automatically?
Thank you
DSE Direct Join is enabled automatically when you develop your application against the DSE Analytics dependencies, which are provided when you run your job on DSE Analytics. You need to specify the following dependency for that, and not use the Spark Cassandra Connector:
If you run your job on an external Spark cluster, then you need to enable Direct Join explicitly by setting the Spark configuration property spark.sql.extensions to com.datastax.spark.connector.CassandraSparkExtensions.
I have a long blog post on joining data with Cassandra that explains all of this.

Apache Spark 2.2: broadcast join not working when you already cache the dataframe which you want to broadcast

I have multiple large dataframes (around 30GB), called as and bs, and a relatively small dataframe (around 500MB ~ 1GB) called spp.
I tried to cache spp in memory in order to avoid reading the data from the database or files multiple times.
But I found that if I cache spp, the physical plan shows that a broadcast join is not used, even though spp is wrapped in the broadcast function.
However, if I unpersist spp, the plan shows that a broadcast join is used.
Is anyone familiar with this?
scala> spp.cache
res38: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> val as = acs.join(broadcast(spp), $"idsegment" === $"idAdnetProductSegment")
as: org.apache.spark.sql.DataFrame = [idsegmentpartner: bigint, ssegmentsource: string ... 44 more fields]
scala> as.explain
== Physical Plan ==
*SortMergeJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner
:- *Sort [idsegment#286L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(idsegment#286L, 200)
: +- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- *Sort [idAdnetProductSegment#91L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idAdnetProductSegment#91L, 200)
+- *Filter isnotnull(idAdnetProductSegment#91L)
+- InMemoryTableScan [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], [isnotnull(idAdnetProductSegment#91L)]
+- InMemoryRelation [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
scala> spp.unpersist
res40: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> as.explain
== Physical Plan ==
*BroadcastHashJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner, BuildRight
:- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[4, bigint, true]))
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] PushedFilters: [*IsNotNull(idAdnetProductSegment)], ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
This happens when the analyzed plan is rewritten to use the cached data: the rewrite swallows the ResolvedHint information supplied by the user's code.
If we run df.explain(true), we will see that the hint is lost between the analyzed and the optimized plan, which is where Spark substitutes the cached data.
This issue has been fixed (over multiple attempts) in later versions of Spark.
latest jira: .
Code with the fix (which makes Spark consider the hint when using cached tables):

Detected cartesian product for INNER join on literal column in PySpark

The following code raises a "Detected cartesian product for INNER join" exception:
from pyspark.sql import functions as F

first_df = spark.createDataFrame([{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}, ])
second_df = spark.createDataFrame([{"some_value": "????"}, ])
second_df = second_df.withColumn("second_id", F.lit("1"))
# If the next line is uncommented, then the JOIN is working fine.
# second_df.persist()
result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')  # join type reconstructed from the error message
data = result_df.collect()
The error message shows me the logical plans below:
Filter (first_id#0 = 1)
+- LogicalRDD [first_id#0], false
Project [some_value#2, 1 AS second_id#4]
+- LogicalRDD [some_value#2], false
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
It looks like, for some reason, no column is present in the JOIN condition for those logical plans when the RuleExecutor applies the optimization rule set called CheckCartesianProducts.
But if I call the persist method before the JOIN, it works, and the physical plan is:
*(3) SortMergeJoin [first_id#0], [second_id#4], Inner
:- *(1) Sort [first_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(first_id#0, 10)
:     +- Scan ExistingRDD[first_id#0]
+- *(2) Sort [second_id#4 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#4, 10)
      +- InMemoryTableScan [some_value#2, second_id#4]
         +- InMemoryRelation [some_value#2, second_id#4], true, 10000, StorageLevel(disk, memory, 1 replicas)
            +- *(1) Project [some_value#2, 1 AS second_id#4]
               +- Scan ExistingRDD[some_value#2]
So maybe someone can explain the internals leading to these results, because persisting the data frame does not look like a solution.
The problem is that once you persist your data, second_id is incorporated into the cached table and is no longer considered constant. As a result, the planner can no longer infer that the query should be expressed as a Cartesian product, and uses a standard SortMergeJoin on hash-partitioned second_id.
It would be trivial to achieve the same outcome without persistence, using a udf:
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

@pandas_udf('integer', PandasUDFType.SCALAR)
def identity(x):
    return x

second_df = second_df.withColumn('second_id', identity(lit(1)))
result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')
result_df.explain()
== Physical Plan ==
*(6) SortMergeJoin [cast(first_id#4 as int)], [second_id#129], Inner
:- *(2) Sort [cast(first_id#4 as int) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cast(first_id#4 as int), 200)
:     +- *(1) Filter isnotnull(first_id#4)
:        +- Scan ExistingRDD[first_id#4]
+- *(5) Sort [second_id#129 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#129, 200)
      +- *(4) Project [some_value#6, pythonUDF0#154 AS second_id#129]
         +- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#154]
            +- *(3) Project [some_value#6]
               +- *(3) Filter isnotnull(pythonUDF0#153)
                  +- ArrowEvalPython [identity(1)], [some_value#6, pythonUDF0#153]
                     +- Scan ExistingRDD[some_value#6]
However, SortMergeJoin is not what you should try to achieve here. With a constant key, it would result in extreme data skew, and would likely fail on anything but toy data.
A Cartesian product, however, as expensive as it is, won't suffer from this issue, and should be preferred here. So I would recommend enabling cross joins or using the explicit cross join syntax (spark.sql.crossJoin.enabled for Spark 2.x) and moving on.
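The skew concern can be illustrated without Spark. The sketch below is a pure-Python model that assumes a stand-in hash (zlib.crc32) and a hypothetical helper name, partition_sizes; it only shows how rows spread over 200 hash partitions when every row shares one key versus when keys are distinct.

```python
import zlib
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many rows land in each hash partition for the given join keys."""
    counts = Counter(zlib.crc32(str(k).encode("utf-8")) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

constant_keys = ["1"] * 1000                   # every row has the literal key "1"
distinct_keys = [str(i) for i in range(1000)]  # every row has its own key

skewed = partition_sizes(constant_keys, 200)
spread = partition_sizes(distinct_keys, 200)

# With a constant key a single partition receives all rows; the other 199 stay empty.
print(max(skewed), sum(1 for size in skewed if size > 0))
print(max(spread), sum(1 for size in spread if size > 0))
```

With a constant key, one task would have to sort and join the entire dataset on its own, however many shuffle partitions are configured.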
A pending question remains: how to prevent this undesired behavior when data is cached. Unfortunately, I don't have an answer ready for that. I'm fairly sure it is possible to use custom optimizer rules, but this is not something that can be done with Python alone.

Spark SQL - does renaming columns affect partitioning?

I have written an explicitJoin API which renames the columns in a Dataset with either an l_ or r_ prefix, both to disambiguate them and to solve problems with Spark lineage, i.e. columnName1#77 not found in columnName1#123, columnName2#55....
Part of the code is shown below:
def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
  // Prefix every column so both sides of the join have unambiguous names.
  val left = dataset.toDF(dataset.columns.map("l_" + _): _*)
  val right = other.toDF(other.columns.map("r_" + _): _*)
  new ExplicitJoinExt(left.join(right, joinExpr, joinType))
}
Users may then pass a join expression such as $"l_columnName1" === $"r_columnName1" && ... so that they are 100% explicit about which columns they are joining on.
I am experiencing a new issue where partitions are too large to load into memory (org.apache.spark.shuffle.FetchFailedException: Too large frame....), yet there was no problem reading the input (partitioned) Datasets.
Can renaming columns affect the underlying partitioning of the input Datasets/DataFrames?
Example 1 - regular join
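case class A(a: Int, b: String)
val l = (0 to 1000000).map(i => A(i, i.toString))
val r = (0 to 1000000).map(i => A(i, i.toString))
// The start of the next two lines was lost in the excerpt; l.toDF.as[A] and r.toDF.as[A] are an assumed reconstruction.
val ds1 = l.toDF.as[A].repartition(100, $"a")
val ds2 = r.toDF.as[A].repartition(100, $"a")
val joined = ds1.join(ds2, Seq("a"), "inner")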
== Physical Plan ==
*Project [a#2, b#3, b#15]
+- *SortMergeJoin [a#2], [a#14], Inner
   :- *Sort [a#2 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a#2, 100)
   :     +- LocalTableScan [a#2, b#3]
   +- *Sort [a#14 ASC NULLS FIRST], false, 0
      +- ReusedExchange [a#14, b#15], Exchange hashpartitioning(a#2, 100)
Example 2 - Using my (possibly misguided) ExplicitJoinExt involving renames
val joined = ds1
.explicitJoin(ds2, $"l_a" === $"r_a", "inner") // Pimped on conversion to ExplicitJoin type, columns prefixed by l_ or r_. DS joined by expr and join type
.selectLeft // Select just left prefixed columns
.toDF // Convert back from ExplicitJoinExpr to DF
== Physical Plan ==
*Project [l_a#24 AS a#53, l_b#25 AS b#54]
+- *BroadcastHashJoin [l_a#24], [r_a#29], Inner, BuildRight
   :- *Project [a#2 AS l_a#24, b#3 AS l_b#25]
   :  +- Exchange hashpartitioning(a#2, 100)
   :     +- LocalTableScan [a#2, b#3]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- *Project [a#14 AS r_a#29]
         +- Exchange hashpartitioning(a#14, 100)
            +- LocalTableScan [a#14]
So, for the second join it would appear that we are repartitioning again. Is that correct?
No. I checked in Spark 2.3.1: renaming does not affect partitioning, at least not with this approach:
val ds11 = ds1.repartition(4)
No, I checked this as well. Renaming does not affect partitioning, at least not with this approach:
val ds11 = ds1.repartition(2, $"cityid")
The EXPLAIN output for
val j = left.join(right, $"l_personid" === $"r_personid", "inner").explain
reveals, in my case, 2 and 4 as the numbers of partitions:
== Physical Plan ==
*(2) BroadcastHashJoin [l_personid#641], [r_personid#647], Inner, BuildRight, false
:- *(2) Project [personid#612 AS l_personid#641, personname#613 AS l_personname#642, cityid#614 AS l_cityid#643]
:  +- Exchange hashpartitioning(cityid#614, 2)
:     +- LocalTableScan [personid#612, personname#613, cityid#614]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *(1) Project [personid#612 AS r_personid#647, personname#613 AS r_personname#648, cityid#614 AS r_cityid#649]
      +- Exchange hashpartitioning(personid#612, 4)
         +- LocalTableScan [personid#612, personname#613, cityid#614]
One can see that the renamed columns are mapped back to their original names.
In a test for a post elsewhere, we were able to ascertain that new actions relying on aggregations or JOINs will default to 200 partitions unless
sqlContext.setConf("spark.sql.shuffle.partitions", "some val")
is issued in the code to set this to the required value. If only a small set of data is being JOINed, etc., the results may differ.
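For those still encountering this issue: renaming columns does affect partitioning in Spark < 3.0.
Seq((1, 2))
  .toDF("a", "b")
  .repartition(10, $"b")  // reconstructed from the inner Exchange in the plan
  .withColumnRenamed("b", "c")
  .repartition(10, $"c")  // reconstructed from the outer Exchange in the plan
  .explain()
gives the following plan: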
== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
   +- Exchange hashpartitioning(b#37, 10)
      +- LocalTableScan [a#36, b#37]
This was fixed in this PR.
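One way to read the plan above is that the planner compares the existing partitioning (on b) with the required one (on c) without looking through the alias introduced by the Project. The toy model below expresses that reading in pure Python. It is not Spark's planner, and the function name and arguments are hypothetical.

```python
def needs_exchange(existing, required, aliases, alias_aware):
    """Decide whether another shuffle is needed to satisfy the required partitioning.

    existing and required are (column, num_partitions) pairs;
    aliases maps an original column name to the name given by the Project.
    """
    column, num_partitions = existing
    if alias_aware:
        # Translate the partitioning column through the rename before comparing.
        column = aliases.get(column, column)
    return (column, num_partitions) != required

existing = ("b", 10)    # Exchange hashpartitioning(b, 10)
required = ("c", 10)    # repartition on the renamed column
aliases = {"b": "c"}    # Project [b AS c]

extra_shuffle_without_alias_tracking = needs_exchange(existing, required, aliases, alias_aware=False)
extra_shuffle_with_alias_tracking = needs_exchange(existing, required, aliases, alias_aware=True)
print(extra_shuffle_without_alias_tracking, extra_shuffle_with_alias_tracking)
```

In this model the second Exchange only disappears once the comparison follows the rename, which matches the two stacked Exchange nodes shown in the plan.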

Spark SQL: can I get the total map-reduce steps when Spark runs its SQL?

When I run
spark.sql("select bill_no, count(icode) from bigmart.o_sales group by bill_no").explain(true);
I get only this much explanation:
== Parsed Logical Plan ==
'Aggregate ['bill_no], ['bill_no AS bill#0, 'count('icode) AS icode#1]
+- 'UnresolvedRelation `bigmart`.`o_sales`
== Analyzed Logical Plan ==
bill: string, icode: bigint
Aggregate [bill_no#15], [bill_no#15 AS bill#0, count(icode#12) AS icode#1L]
+- MetastoreRelation bigmart, o_sales
== Optimized Logical Plan ==
Aggregate [bill_no#15], [bill_no#15 AS bill#0, count(icode#12) AS icode#1L]
+- Project [icode#12, bill_no#15]
   +- MetastoreRelation bigmart, o_sales
== Physical Plan ==
*HashAggregate(keys=[bill_no#15], functions=[count(icode#12)], output=[bill#0, icode#1L])
+- Exchange hashpartitioning(bill_no#15, 200)
   +- *HashAggregate(keys=[bill_no#15], functions=[partial_count(icode#12)], output=[bill_no#15, count#19L])
      +- HiveTableScan [icode#12, bill_no#15], MetastoreRelation bigmart, o_sales
Is this all that explain() can offer, or are there other methods that give more details? I want to learn how map and reduce are done behind the scenes by Spark.
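For what it's worth, this is how I currently read the physical plan, written as a pure-Python sketch rather than Spark code: a partial count per input partition (the map side), an exchange that routes each bill_no by hash, and a final count that merges the partial results (the reduce side). The function names, the sample rows, and the use of zlib.crc32 as the hash are my own assumptions.

```python
import zlib
from collections import Counter

def partial_count(partition):
    """Map side: count non-null icode values per bill_no within one input partition."""
    counts = Counter()
    for bill_no, icode in partition:
        counts[bill_no] += 1 if icode is not None else 0
    return counts

def exchange(partials, num_partitions):
    """Shuffle: route each (bill_no, partial count) pair by a hash of bill_no."""
    shuffled = [[] for _ in range(num_partitions)]
    for counts in partials:
        for bill_no, count in counts.items():
            target = zlib.crc32(bill_no.encode("utf-8")) % num_partitions
            shuffled[target].append((bill_no, count))
    return shuffled

def final_count(shuffled_partition):
    """Reduce side: merge the partial counts that arrived for each bill_no."""
    totals = Counter()
    for bill_no, count in shuffled_partition:
        totals[bill_no] += count
    return totals

# Two hypothetical input partitions of (bill_no, icode) rows.
input_partitions = [
    [("b1", "i1"), ("b1", "i2"), ("b2", "i3")],
    [("b1", "i4"), ("b2", None), ("b3", "i5")],
]
partials = [partial_count(p) for p in input_partitions]
result = {}
for part in exchange(partials, num_partitions=4):
    result.update(final_count(part))
print(sorted(result.items()))
```

If this reading is right, the single Exchange in the plan is the boundary between the map-like and the reduce-like step, and what I am missing is a way to see that boundary reported explicitly.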
