perform massively parallelizable tasks in pyspark - apache-spark

I often find myself performing massively parallelizable tasks in spark, but for some reason, spark keeps on dying. For instance, right now I have two tables (both stored on s3) that are essentially just collections of (unique) strings. I want to cross join, compute levenshtein distance, and write it out to s3 as a new table. So my code looks like:
OUT_LOC = 's3://<BUCKET>/<PREFIX>/'
if __name__ == '__main__':
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName('my-app') \
.config("hive.metastore.client.factory.class",
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport() \
.getOrCreate()
tb0 = spark.sql('SELECT col0 FROM db0.table0')
tb1 = spark.sql('SELECT col1 FROM db1.table1')
spark.sql("set spark.sql.crossJoin.enabled=true")
tb0.join(tb1).withColumn("levenshtein_distance",
F.levenshtein(F.col("col0"), F.col("col1"))) \
.write.format('parquet').mode('overwrite') \
.options(path=OUT_LOC, compression='snappy', maxRecordsPerFile=10000) \
.saveAsTable('db2.new_table')
It seems to me that this is massively parallelizable, and spark should be able to chug through this while only reading in a minimal amount of data at a time. But for some reason, the tasks keep on ghost dying.
So my questions are:
Is there a setting I'm missing? Or just more generally, what's going on here?
There's no reason for the whole thing to be stored locally, right?
What are some best practices here that I should consider?
For what it's worth, I have googled around extensively, but couldn't find anyone else with this issue. Maybe my google-foo isn't strong enough, or maybe I'm just doing something stupid.
edit
To #egordoe's advice...
I ran the explain and got back the following...
== Parsed Logical Plan ==
'Project [col0#0, col1#3, levenshtein('col0, 'col1) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Analyzed Logical Plan ==
col0: string, col1: string, levenshtein_distance: int
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Optimized Logical Plan ==
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Relation[col0#0] parquet
+- Relation[col1#3] parquet
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- BroadcastNestedLoopJoin BuildRight, Inner
:- FileScan parquet db0.table0[col0#0] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:string>
+- BroadcastExchange IdentityBroadcastMode
+- FileScan parquet db1.table1[col1#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:string>
========== finished
Seems reasonable to me, but the explanation doesn't include the actual writing of the data. I assume that's because it likes to build up a cache of results locally then ship the whole thing to s3 as a table after? That would be pretty lame.
edit 1
I also ran the foreach example you suggested with a simple print statement in there. It hung around for 40 minutes without printing anything before I killed it. I'm now running the job with a function that does nothing (it's just a pass statement) to see if it even finishes.

Related

Applying Pandas UDF without shuffling

I am trying to apply a pandas UDF to each partition of a Spark (3.3.0) DataFrame separately so as to avoid any shuffling requirements. However, when I run the query below, a lot of data is getting shuffled around. The execution plan contains a SORT stage; this might be the culprit.
from pyspark.sql.functions import spark_partition_id
query = df.groupBy(spark_partition_id())\
.applyInPandas(lambda x: pd.DataFrame([x.shape]), "n_rows long, n_cols long")
query.explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- FlatMapGroupsInPandas [SPARK_PARTITION_ID()#1562], <lambda>(id#0L, date#1L, feature#2, partition_id#926)#1561, [nr#1563L, nc#1564L]
+- Sort [SPARK_PARTITION_ID()#1562 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(SPARK_PARTITION_ID()#1562, 200), ENSURE_REQUIREMENTS, [id=#748]
+- Project [SPARK_PARTITION_ID() AS SPARK_PARTITION_ID()#1562, id#0L, date#1L, feature#2, partition_id#926]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
In contrast, if I request the execution plan for a very similar query below, the SORT stage is not there and I detect no shuffling upon execution.
df.groupBy(spark_partition_id()).count().explain()
Output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[_nondeterministic#1532], functions=[count(1)])
+- Exchange hashpartitioning(_nondeterministic#1532, 200), ENSURE_REQUIREMENTS, [id=#704]
+- HashAggregate(keys=[_nondeterministic#1532], functions=[partial_count(1)])
+- Project [SPARK_PARTITION_ID() AS _nondeterministic#1532]
+- Scan ExistingRDD[id#0L,date#1L,feature#2,partition_id#926]
What is happening here and how do I achieve the goal I had stated? Thank you!
After some tinkering, it seems I am able to do what I want as follows, although probably it is not ideal.
spark_session.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch",'0')
def get_shape(iterator):
for pdf in iterator:
yield pd.DataFrame([pdf.shape])
df.mapInPandas(get_shape, "nr long, nc long").toPandas()

pyspark joinWithCassandraTable refactor without maps

Im new to using spark/scala here and im having trouble with a refactor of some of my code here. Im running Scala 2.11 using pyspark and in a spark/yarn setup. The following is working but id like to clean it up, and to get the max performance out of this. I read elsewhere that pyspark udf and lambdas can cause huge performance impact so im trying to reduce or remove them were possible.
# Reduce ingest df1 data by joining on allowed table df2
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.rdd\
.map(lambda r: Row(tag=r['tag_id'], user_uuid=r['user_uuid']))
# Type column fixed to type=2, and tag==key
ready_to_join = to_process.map(lambda r: (r[0], 2, r[1]))
# Join with cassandra table to find matches
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
the cassandra table is such that
CREATE TABLE keyspace.table3 (
user_uuid uuid,
type int,
key text,
value text,
PRIMARY KEY (user_uuid, type, key)
) WITH CLUSTERING ORDER BY (type ASC, key ASC)
currently ive got
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select('user_uuid', 'type', col('tag').alias("key"))\
.rdd\
.map(lambda x: Row(x))
# planning on using repartitionByCassandraReplica here after I get it logically working
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
but im getting errors like
2020-10-30 15:10:42 WARN TaskSetManager:66 - Lost task 148.0 in stage 22.0 (TID ----, ---, executor 9): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
looking help from any spark gurus out there to point me to anything stupid I am doing here.
Update
Thanks to Alex's suggestion using the spark-cassandra-connector v2.5+ gives the ability for dataframes to join directly. I updated my code to use this instead.
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select(col('user_uuid').alias('c1_user_uuid'), 'type', col('tag').alias("key"))\
cass_table = spark_session
.read \
.format("org.apache.spark.sql.cassandra") \
.options(table=config.table, keyspace=config.keyspace) \
.load()
exists_in_cass = ready_to_join\
.join(
cass_table,
[(cass_table["user_uuid"] == ready_to_join["c1_user_uuid"]) &
(cass_table["key"] == ready_to_join["key"]) &
(cass_table["type"] == ready_to_join["type"])])\
.select(col("c1_user_uuid").alias("user_uuid"))
exists_in_cass.explain()
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
As far as I know, in theory this should be alot faster ! But im getting errors during runtime with the database timing out.
WARN TaskSetManager:66 - Lost task 827.0 in stage 12.0 (TID 9946, , executor 4): java.io.IOException: Exception during execution of SELECT "user_uuid", "key" FROM "keyspace"."table3" WHERE token("user_uuid") > ? AND token("user_uuid") <= ? AND "type" = ? ALLOW FILTERING: Query timed out after PT2M
TaskSetManager:66 - Lost task 125.0 in stage 12.0 (TID 9215, , executor 7): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2M
etc
I have the config for spark setup to allow for the spark extensions
--packages mysql:mysql-connector-java:5.1.47,com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
The DAG from spark shows all nodes completely maxed out. Should I be partitioning my data before running my join here?
The explain for this also doesnt show a direct join (explain has more code than snippet above)
== Physical Plan ==
*(6) Project [c1_user_uuid#124 AS user_uuid#158]
+- *(6) SortMergeJoin [c1_user_uuid#124, key#125L], [user_uuid#129, cast(key#131 as bigint)], Inner
:- *(3) Sort [c1_user_uuid#124 ASC NULLS FIRST, key#125L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(c1_user_uuid#124, key#125L, 200)
: +- *(2) Project [id#0 AS c1_user_uuid#124, tag_id#101L AS key#125L]
: +- *(2) BroadcastHashJoin [secondary_id#60], [secondary_id#100], Inner, BuildRight
: :- *(2) Filter (isnotnull(secondary_id#60) && isnotnull(id#0))
: : +- InMemoryTableScan [secondary_id#60, id#0], [isnotnull(secondary_id#60), isnotnull(id#0)]
: : +- InMemoryRelation [secondary_id#60, id#0], StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- *(7) Project [secondary_id#60, id#0]
: : +- Generate explode(split(secondary_ids#1, \|)), [id#0], false, [secondary_id#60]
: : +- *(6) Project [id#0, secondary_ids#1]
: : +- *(6) SortMergeJoin [id#0], [guid#46], Inner
: : :- *(2) Sort [id#0 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#0, 200)
: : : +- *(1) Filter (isnotnull(id#0) && id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12})
: : : +- InMemoryTableScan [id#0, secondary_ids#1], [isnotnull(id#0), id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}]
: : : +- InMemoryRelation [id#0, secondary_ids#1], StorageLevel(disk, memory, deserialized, 1 replicas)
: : : +- Exchange RoundRobinPartitioning(3840)
: : : +- *(1) Filter AtLeastNNulls(n, id#0,secondary_ids#1)
: : : +- *(1) FileScan csv [id#0,secondary_ids#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[inputdata_file, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,secondary_ids:string>
: : +- *(5) Sort [guid#46 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(guid#46, 200)
: : +- *(4) Filter (guid#46 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12} && isnotnull(guid#46))
: : +- Generate explode(set_guid#36), false, [guid#46]
: : +- *(3) Project [set_guid#36]
: : +- *(3) Filter (isnotnull(allowed#39) && (allowed#39 = 1))
: : +- *(3) FileScan orc whitelist.whitelist1[set_guid#36,region#39,timestamp#43] Batched: false, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://file, PartitionCount: 1, PartitionFilters: [isnotnull(timestamp#43), (timestamp#43 = 18567)], PushedFilters: [IsNotNull(region), EqualTo(region,1)], ReadSchema: struct<set_guid:array<string>,region:int>
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
FROM TAG as T
JOIN MAP as M
ON T.tag_id = M.tag_id
WHERE (expire >= NOW() OR expire IS NULL)
ORDER BY T.tag_id) AS subset) [numPartitions=1] [secondary_id#100,tag_id#101L] PushedFilters: [*IsNotNull(secondary_id), *IsNotNull(tag_id)], ReadSchema: struct<secondary_id:string,tag_id:bigint>
+- *(5) Sort [user_uuid#129 ASC NULLS FIRST, cast(key#131 as bigint) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(user_uuid#129, cast(key#131 as bigint), 200)
+- *(4) Project [user_uuid#129, key#131]
+- *(4) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user_uuid#129,key#131] PushedFilters: [*EqualTo(type,2)], ReadSchema: struct<user_uuid:string,key:string>
Im not getting the direct joins working which is causing time outs.
Update 2
I think this isnt resolving to direct joins as my datatypes in the dataframes are off. Specifically the uuid type
Instead of using RDD API with PySpark, I suggest to take Spark Cassandra Connector (SCC) 2.5.x or 3.0.x (release announcement) that contain the implementation of the join of Dataframe with Cassandra - in this case you won't need to go down to RDDs, but just use normal Dataframe API joins.
Please note that this is not enabled by default, so you will need to start your pyspark or spark-submit with special configuration, like this:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
You can find more about joins with Cassandra in my recent blog post on this topic (although it uses Scala, Dataframe part should be translated almost one to one to PySpark)

left join on a key if there is no match then join on a different right key to get value

I have two spark dataframes, say df_core & df_dict:
There are more cols in df_core but it has nothing to do with the question here
df_core:
id
1_ghi
2_mno
3_xyz
4_abc
df_dict:
id_1 id_2 cost
1_ghi 1_ghi 12
2_mno 2_rst 86
3_def 3_xyz 105
I want to get the value from df_dict.cost by joining the 2 dfs.
Scenario: join on df_core.id == df_dict.id_1
If there is a no match for df_core.id for the foreign key df_dict.id_1 (for above example: 3_xyz) then, the join should happen on df_dict.id_2
I am able to achieve the join for the first key but have not sure about how to achieve the scenario
final_df = df_core.alias("df_core_alias").join(df_dict, df_core.id== df_dict.id_1, 'left').select('df_core_alias.*', df_dict.cost)
The solution need not be a dataframe operation. I can create Temp Views out of the dataframes & then run SQL on it if that's easy and/or optimized.
I also have a SQL solution in-mind (not tested):
SELECT
core.id,
dict.cost
FROM
df_core core LEFT JOIN df_dict dict
ON core.id = dict.id_1
OR core.id = dict.id_2
Expected df:
id cost
1_ghi 12
2_mno 86
3_xyz 105
4_abc
Well the project plan is too big to add in the comment so I've to question here
below is the spark plan for isin:
== Physical Plan ==
*(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
The Filter in BroadcastNestedLoopJoin is coming from previous df_core transformations but as we know spark's lazy-evaluation, we're seeing it here in the project plan
Moreover, I just realized that the final_df.show() works fine for any solution I use. But what's taking infinite time to process is the next transformation that I'm doing over the final_df which is my actual expected_df. Here's my next transformation:
expected_df = spark.sql("select region_type, cost, core_sector_value, count(core_id) from final_df_view group by region_type, cost, core_sector_value order by region_type, cost, core_sector_value")
& here's the plan for the expected_df:
== Physical Plan ==
*(5) Sort [region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST, 200)
+- *(4) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[count(core_id#22)])
+- Exchange hashpartitioning(region_type#26, cost#13, core_sector_value#21, 200)
+- *(3) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[partial_count(core_id#22)])
+- *(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
Seeing the plan, I think that the transformations are getting too heavy for in-memory on spark local. Is it best practice to perform so many different step transformations or should I try to come up with a single query that would encompass all the business logic?
Additionally, could you please direct to any resource for understanding the Spark Plans we get using explain() function? Thanks
Seems like in left_outer operation:
# Final DF will have all columns from df1 and df2
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer')
final_df.show()
+-----+-----+-----+----+
| id| id_1| id_2|cost|
+-----+-----+-----+----+
|1_ghi|1_ghi|1_ghi| 12|
|2_mno|2_mno|2_rst| 86|
|3_xyz|3_def|3_xyz| 105|
|4_abc| null| null|null|
+-----+-----+-----+----+
# Select the required columns like id, cost etc.
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer').select('id','cost')
final_df.show()
+-----+----+
| id|cost|
+-----+----+
|1_ghi| 12|
|2_mno| 86|
|3_xyz| 105|
|4_abc|null|
+-----+----+
You can join twice and use coalesce
import pyspark.sql.functions as F
final_df = df_core\
.join(df_dict.select(F.col("id_1"), F.col("cost").alias("cost_1")), df_core.id== df_dict.id_1, 'left')\
.join(df_dict.select(F.col("id_2"), F.col("cost").alias("cost_2")), df_core.id== df_dict.id_2, 'left')\
.select(*[F.col(c) for c in df_core.columns], F.coalesce(F.col("cost_1"), F.col("cost_2")))

Apache Spark 2.2: broadcast join not working when you already cache the dataframe which you want to broadcast

I have mulitple large dataframes(around 30GB) called as and bs, a relatively small dataframe(around 500MB ~ 1GB) called spp.
I tried to cache spp into memory in order to avoid reading data from database or files multiple times.
But I find if I cache spp, the physical plan shows it won't use broadcast join even though spp is enclosed by broadcast function.
However, If I unpersist the spp, the plan shows it uses broadcast join.
Anyone familiar with this?
scala> spp.cache
res38: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> val as = acs.join(broadcast(spp), $"idsegment" === $"idAdnetProductSegment")
as: org.apache.spark.sql.DataFrame = [idsegmentpartner: bigint, ssegmentsource: string ... 44 more fields]
scala> as.explain
== Physical Plan ==
*SortMergeJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner
:- *Sort [idsegment#286L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(idsegment#286L, 200)
: +- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- *Sort [idAdnetProductSegment#91L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idAdnetProductSegment#91L, 200)
+- *Filter isnotnull(idAdnetProductSegment#91L)
+- InMemoryTableScan [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], [isnotnull(idAdnetProductSegment#91L)]
+- InMemoryRelation [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
scala> spp.unpersist
res40: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> as.explain
== Physical Plan ==
*BroadcastHashJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner, BuildRight
:- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[4, bigint, true]))
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] PushedFilters: [*IsNotNull(idAdnetProductSegment)], ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
This happens when the Analyzed plan tries to use the cache data. It swallows the ResolvedHint information supplied by the user(code).
If we try to do a df.explain(true), we will see that hint is lost between Analyzed and optimized plan, which is where Spark tries to use the cached data.
This issue has been fixed in the latest version of Spark(in multiple attempts).
latest jira: https://issues.apache.org/jira/browse/SPARK-27674 .
Code where the fix(to consider the hint when using cached tables) : https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L219

Spark 1.6 DataFrame optimize join partitioning

I have a question about Spark DataFrame partitioning, I'm currently using Spark 1.6 for project requirements.This is my code excerpt:
sqlContext.getConf("spark.sql.shuffle.partitions") // 6
val df = sc.parallelize(List(("A",1),("A",4),("A",2),("B",5),("C",2),("D",2),("E",2),("B",7),("C",9),("D",1))).toDF("id_1","val_1")
df.rdd.getNumPartitions // 4
val df2 = sc.parallelize(List(("B",1),("E",4),("H",2),("J",5),("C",2),("D",2),("F",2))).toDF("id_2","val_2")
df2.rdd.getNumPartitions // 4
val df3 = df.join(df2,$"id_1" === $"id_2")
df3.rdd.getNumPartitions // 6
val df4 = df3.repartition(3,$"id_1")
df4.rdd.getNumPartitions // 3
df4.explain(true)
The following is the explain plan has been created:
== Parsed Logical Plan ==
'RepartitionByExpression ['id_1], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Analyzed Logical Plan ==
id_1: string, val_1: int, id_2: string, val_2: int
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Optimized Logical Plan ==
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
:- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26
== Physical Plan ==
TungstenExchange hashpartitioning(id_1#42,3), None
+- SortMergeJoin [id_1#42], [id_2#46]
:- Sort [id_1#42 ASC], false, 0
: +- TungstenExchange hashpartitioning(id_1#42,6), None
: +- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
: +- Scan ExistingRDD[_1#40,_2#41]
+- Sort [id_2#46 ASC], false, 0
+- TungstenExchange hashpartitioning(id_2#46,6), None
+- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
+- Scan ExistingRDD[_1#44,_2#45]
As far I know, DataFrame represent an abstraction interface over RDD, so partitioning should be delegated to the Catalyst optimizer.
Infact compared to RDD where many transformations accept a number of partitions parameter, in order to optimize co-partitioning and co-locating whenever possible, with DataFrame the only chance to alter partitioning, is invoking the method repartition, otherwise the number of partitions for join and aggregations is inferred using the configuration param spark.sql.shuffle.partitions.
From what I can see and understand from the explain plan above it seems there is an useless repartition(so shuffle indeed) to 6 (the default value) after then repartitioning again to the final value imposed by the method repartition.
I believe the Optimizer could change the number of partitions of the join to the final value of 3.
Could someone help me to clarify that point? Maybe I missing something.
If you use spark sql, your shuffle partitions is always equal to spark.sql.shufle.partitions.But if you enable this spark.sql.adaptive.enabled it will add EchangeCoordinator.Right now, the work of this coordinator is to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or multiple stages.

Resources