pyspark joinWithCassandraTable refactor without maps

pyspark joinWithCassandraTable refactor without maps - apache-spark

Im new to using spark/scala here and im having trouble with a refactor of some of my code here. Im running Scala 2.11 using pyspark and in a spark/yarn setup. The following is working but id like to clean it up, and to get the max performance out of this. I read elsewhere that pyspark udf and lambdas can cause huge performance impact so im trying to reduce or remove them were possible.
# Reduce ingest df1 data by joining on allowed table df2
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.rdd\
.map(lambda r: Row(tag=r['tag_id'], user_uuid=r['user_uuid']))
# Type column fixed to type=2, and tag==key
ready_to_join = to_process.map(lambda r: (r[0], 2, r[1]))
# Join with cassandra table to find matches
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
the cassandra table is such that
CREATE TABLE keyspace.table3 (
user_uuid uuid,
type int,
key text,
value text,
PRIMARY KEY (user_uuid, type, key)
) WITH CLUSTERING ORDER BY (type ASC, key ASC)
currently ive got
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select('user_uuid', 'type', col('tag').alias("key"))\
.rdd\
.map(lambda x: Row(x))
# planning on using repartitionByCassandraReplica here after I get it logically working
exists_in_cass = ready_to_join\
.joinWithCassandraTable(keyspace, table3)\
.on("user_uuid", "type")\
.select("user_uuid")
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
but im getting errors like
2020-10-30 15:10:42 WARN TaskSetManager:66 - Lost task 148.0 in stage 22.0 (TID ----, ---, executor 9): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
looking help from any spark gurus out there to point me to anything stupid I am doing here.
Update
Thanks to Alex's suggestion using the spark-cassandra-connector v2.5+ gives the ability for dataframes to join directly. I updated my code to use this instead.
to_process = df2\
.join(
sf.broadcast(df1),
df2.secondary_id == df1.secondary_id,
how="inner")\
.select(col("user_uuid"), col("tag_id").alias("tag"))
ready_to_join = to_process\
.withColumn("type", sf.lit(2))\
.select(col('user_uuid').alias('c1_user_uuid'), 'type', col('tag').alias("key"))\
cass_table = spark_session
.read \
.format("org.apache.spark.sql.cassandra") \
.options(table=config.table, keyspace=config.keyspace) \
.load()
exists_in_cass = ready_to_join\
.join(
cass_table,
[(cass_table["user_uuid"] == ready_to_join["c1_user_uuid"]) &
(cass_table["key"] == ready_to_join["key"]) &
(cass_table["type"] == ready_to_join["type"])])\
.select(col("c1_user_uuid").alias("user_uuid"))
exists_in_cass.explain()
log.error(f"TEST PRINT - [{exists_in_cass.count()}]")
As far as I know, in theory this should be alot faster ! But im getting errors during runtime with the database timing out.
WARN TaskSetManager:66 - Lost task 827.0 in stage 12.0 (TID 9946, , executor 4): java.io.IOException: Exception during execution of SELECT "user_uuid", "key" FROM "keyspace"."table3" WHERE token("user_uuid") > ? AND token("user_uuid") <= ? AND "type" = ? ALLOW FILTERING: Query timed out after PT2M
TaskSetManager:66 - Lost task 125.0 in stage 12.0 (TID 9215, , executor 7): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2M
etc
I have the config for spark setup to allow for the spark extensions
--packages mysql:mysql-connector-java:5.1.47,com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
The DAG from spark shows all nodes completely maxed out. Should I be partitioning my data before running my join here?
The explain for this also doesnt show a direct join (explain has more code than snippet above)
== Physical Plan ==
*(6) Project [c1_user_uuid#124 AS user_uuid#158]
+- *(6) SortMergeJoin [c1_user_uuid#124, key#125L], [user_uuid#129, cast(key#131 as bigint)], Inner
:- *(3) Sort [c1_user_uuid#124 ASC NULLS FIRST, key#125L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(c1_user_uuid#124, key#125L, 200)
: +- *(2) Project [id#0 AS c1_user_uuid#124, tag_id#101L AS key#125L]
: +- *(2) BroadcastHashJoin [secondary_id#60], [secondary_id#100], Inner, BuildRight
: :- *(2) Filter (isnotnull(secondary_id#60) && isnotnull(id#0))
: : +- InMemoryTableScan [secondary_id#60, id#0], [isnotnull(secondary_id#60), isnotnull(id#0)]
: : +- InMemoryRelation [secondary_id#60, id#0], StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- *(7) Project [secondary_id#60, id#0]
: : +- Generate explode(split(secondary_ids#1, \|)), [id#0], false, [secondary_id#60]
: : +- *(6) Project [id#0, secondary_ids#1]
: : +- *(6) SortMergeJoin [id#0], [guid#46], Inner
: : :- *(2) Sort [id#0 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#0, 200)
: : : +- *(1) Filter (isnotnull(id#0) && id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12})
: : : +- InMemoryTableScan [id#0, secondary_ids#1], [isnotnull(id#0), id#0 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}]
: : : +- InMemoryRelation [id#0, secondary_ids#1], StorageLevel(disk, memory, deserialized, 1 replicas)
: : : +- Exchange RoundRobinPartitioning(3840)
: : : +- *(1) Filter AtLeastNNulls(n, id#0,secondary_ids#1)
: : : +- *(1) FileScan csv [id#0,secondary_ids#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[inputdata_file, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,secondary_ids:string>
: : +- *(5) Sort [guid#46 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(guid#46, 200)
: : +- *(4) Filter (guid#46 RLIKE [0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12} && isnotnull(guid#46))
: : +- Generate explode(set_guid#36), false, [guid#46]
: : +- *(3) Project [set_guid#36]
: : +- *(3) Filter (isnotnull(allowed#39) && (allowed#39 = 1))
: : +- *(3) FileScan orc whitelist.whitelist1[set_guid#36,region#39,timestamp#43] Batched: false, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://file, PartitionCount: 1, PartitionFilters: [isnotnull(timestamp#43), (timestamp#43 = 18567)], PushedFilters: [IsNotNull(region), EqualTo(region,1)], ReadSchema: struct<set_guid:array<string>,region:int>
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
FROM TAG as T
JOIN MAP as M
ON T.tag_id = M.tag_id
WHERE (expire >= NOW() OR expire IS NULL)
ORDER BY T.tag_id) AS subset) [numPartitions=1] [secondary_id#100,tag_id#101L] PushedFilters: [*IsNotNull(secondary_id), *IsNotNull(tag_id)], ReadSchema: struct<secondary_id:string,tag_id:bigint>
+- *(5) Sort [user_uuid#129 ASC NULLS FIRST, cast(key#131 as bigint) ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(user_uuid#129, cast(key#131 as bigint), 200)
+- *(4) Project [user_uuid#129, key#131]
+- *(4) Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user_uuid#129,key#131] PushedFilters: [*EqualTo(type,2)], ReadSchema: struct<user_uuid:string,key:string>
Im not getting the direct joins working which is causing time outs.
Update 2
I think this isnt resolving to direct joins as my datatypes in the dataframes are off. Specifically the uuid type

Instead of using RDD API with PySpark, I suggest to take Spark Cassandra Connector (SCC) 2.5.x or 3.0.x (release announcement) that contain the implementation of the join of Dataframe with Cassandra - in this case you won't need to go down to RDDs, but just use normal Dataframe API joins.
Please note that this is not enabled by default, so you will need to start your pyspark or spark-submit with special configuration, like this:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
You can find more about joins with Cassandra in my recent blog post on this topic (although it uses Scala, Dataframe part should be translated almost one to one to PySpark)

Related

perform massively parallelizable tasks in pyspark

I often find myself performing massively parallelizable tasks in spark, but for some reason, spark keeps on dying. For instance, right now I have two tables (both stored on s3) that are essentially just collections of (unique) strings. I want to cross join, compute levenshtein distance, and write it out to s3 as a new table. So my code looks like:
OUT_LOC = 's3://<BUCKET>/<PREFIX>/'
if __name__ == '__main__':
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName('my-app') \
.config("hive.metastore.client.factory.class",
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport() \
.getOrCreate()
tb0 = spark.sql('SELECT col0 FROM db0.table0')
tb1 = spark.sql('SELECT col1 FROM db1.table1')
spark.sql("set spark.sql.crossJoin.enabled=true")
tb0.join(tb1).withColumn("levenshtein_distance",
F.levenshtein(F.col("col0"), F.col("col1"))) \
.write.format('parquet').mode('overwrite') \
.options(path=OUT_LOC, compression='snappy', maxRecordsPerFile=10000) \
.saveAsTable('db2.new_table')
It seems to me that this is massively parallelizable, and spark should be able to chug through this while only reading in a minimal amount of data at a time. But for some reason, the tasks keep on ghost dying.
So my questions are:
Is there a setting I'm missing? Or just more generally, what's going on here?
There's no reason for the whole thing to be stored locally, right?
What are some best practices here that I should consider?
For what it's worth, I have googled around extensively, but couldn't find anyone else with this issue. Maybe my google-foo isn't strong enough, or maybe I'm just doing something stupid.
edit
To #egordoe's advice...
I ran the explain and got back the following...
== Parsed Logical Plan ==
'Project [col0#0, col1#3, levenshtein('col0, 'col1) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Analyzed Logical Plan ==
col0: string, col1: string, levenshtein_distance: int
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Project [col0#0]
: +- Project [col0#0]
: +- SubqueryAlias `db0`.`table0`
: +- Relation[col0#0] parquet
+- Project [col1#3]
+- Project [col1#3]
+- SubqueryAlias `db1`.`table1`
+- Relation[col1#3] parquet
== Optimized Logical Plan ==
Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- Join Inner
:- Relation[col0#0] parquet
+- Relation[col1#3] parquet
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [col0#0, col1#3, levenshtein(col0#0, col1#3) AS levenshtein_distance#14]
+- BroadcastNestedLoopJoin BuildRight, Inner
:- FileScan parquet db0.table0[col0#0] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:string>
+- BroadcastExchange IdentityBroadcastMode
+- FileScan parquet db1.table1[col1#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://REDACTED], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:string>
========== finished
Seems reasonable to me, but the explanation doesn't include the actual writing of the data. I assume that's because it likes to build up a cache of results locally then ship the whole thing to s3 as a table after? That would be pretty lame.
edit 1
I also ran the foreach example you suggested with a simple print statement in there. It hung around for 40 minutes without printing anything before I killed it. I'm now running the job with a function that does nothing (it's just a pass statement) to see if it even finishes.

Spark not use DirectJoin over DSE

I'm developing a Spark streaming task that joins data from stream with a Cassandra Table. As you can see in Explain Plan Direct Join is not used.
According to DSE doc Direct Join is used when (table size * directJoinSizeRatio) > size of keys.
In my case Table has millions of record and keys are only one record (form streaming), so i'm expecting Diret Join is used.
Table radice_polizza has only id_cod_polizza column as partition jey.
Connector version:2.5.1.
DSE version: 6.7.6.
*Project [id_cod_polizza#86L, progressivo#11, id3_numero_polizza#25, id3_cod_compagnia#21]
+- *SortMergeJoin [id_cod_polizza#86L], [id_cod_polizza#10L], Inner
:- *Sort [id_cod_polizza#86L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id_cod_polizza#86L, 200)
: +- *Project [value#84L AS id_cod_polizza#86L]
: +- *SerializeFromObject [input[0, bigint, false] AS value#84L]
: +- Scan ExternalRDDScan[obj#83L]
+- *Sort [id_cod_polizza#10L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id_cod_polizza#10L, 200)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id_cod_polizza#10L,progressivo#11,id3_numero_polizza#25,id3_cod_compagnia#21] ReadSchema: struct<id_cod_polizza:bigint,progressivo:string,id3_numero_polizza:string,id3_cod_compagnia:string>
Here is my code:
var radice_polizza = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "radice_polizza", "keyspace" -> "preferred_temp"))
.load().select(
"id_cod_polizza",
"progressivo",
"id3_numero_polizza",
"id3_cod_compagnia")
if(mode == LoadMode.DIFF){
val altered_data_df = altered_data.idCodPolizzaList.toDF("id_cod_polizza")
radice_polizza = altered_data_df.join(radice_polizza, Seq("id_cod_polizza"))
radice_polizza.explain()
}
Forcing Direct Join it works.
radice_polizza = altered_data_df.join(radice_polizza.directJoin(AlwaysOn), Seq("id_cod_polizza"))
== Physical Plan ==
*Project [id_cod_polizza#58L, progressivo#11, id3_numero_polizza#25, id3_cod_compagnia#21]
+- DSE Direct Join [id_cod_polizza = id_cod_polizza#58L] preferred_temp.radice_polizza - Reading (id_cod_polizza, progressivo, id3_numero_polizza, id3_cod_compagnia) Pushed {}
+- *Project [value#56L AS id_cod_polizza#58L]
+- *SerializeFromObject [input[0, bigint, false] AS value#56L]
+- Scan ExternalRDDScan[obj#55L]
Why Direct Join is not used automatically?
Thnak you

DSE Direct Join is enabled automatically when you're developing application using DSE Analytics dependencies that are provided when you run your job on DSE Analytics. You need to specify following dependency for that, and don't use Spark Cassandra Connector:
<dependency>
<groupId>com.datastax.dse</groupId>
<artifactId>dse-spark-dependencies</artifactId>
<version>${dse.version}</version>
<scope>provided</scope>
</dependency>
if you run your job on external Spark, then you need to explicitly enable direct join by specifying Spark configuration property spark.sql.extensions with value of com.datastax.spark.connector.CassandraSparkExtensions.
I have a long blog post on the joining data with Cassandra that explains all this things.

How to filter rows where key is not present in a large dataframe

Suppose I have a streaming dataframe A and a large static dataframe B. Assume that typically A is of size < 10000 records. However, B is a much larger dataframe with size in the range of millions.
Lets assume both A and B have a 'key' column. I want to filter rows in A where A.key is not present in B. What is the best way to achieve this.
Right now, I have tried A.join(B, Seq("key"), "left_anti"). However, the performance is not upto the mark. Is there anyway I can fasten up the process
Physical plan:
== Physical Plan ==
SortMergeJoin [domainName#461], [domain#147], LeftAnti
:- *(5) Sort [domainName#461 ASC NULLS FIRST], false, 0
: +- StreamingDeduplicate [domainName#461], state info [ checkpoint = hdfs://MTPrime-CO4-fed/MTPrime-CO4-0/projects/BingAdsAdQuality/Test/WhoIs/WhoIsStream/checkPoint/state, runId = 9d09398b-efda-41cb-ab77-1b5550cd5da9, opId = 0, ver = 63, numPartitions = 400], 0
: +- Exchange hashpartitioning(domainName#461, 400)
: +- Union
: :- *(2) Project [value#460 AS domainName#461]
: : +- *(2) Filter isnotnull(value#460)
: : +- *(2) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#460]
: : +- MapPartitions <function1>, obj#459: java.lang.String
: : +- MapPartitions <function1>, obj#436: MTInterfaces.Fraud.RiskEntity
: : +- DeserializeToObject newInstance(class scala.Tuple3), obj#435: scala.Tuple3
: : +- Exchange RoundRobinPartitioning(600)
: : +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#142, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, assertnotnull(input[0, scala.Tuple3, true])._2, true, false) AS _2#143, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true, false) AS _3#144]
: : +- *(1) MapElements <function1>, obj#141: scala.Tuple3
: : +- *(1) MapElements <function1>, obj#132: scala.Tuple3
: : +- *(1) DeserializeToObject createexternalrow(Body#60.toString, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class java.sql.Timestamp), toJavaTimestamp, EventTime#37, true, false), Timestamp#48L, Offset#27L, Partition#72.toString, PartitionKey#84.toString, Publisher#96.toString, SequenceNumber#108L, StructField(Body,StringType,true), StructField(EventTime,TimestampType,true), StructField(Timestamp,LongType,true), StructField(Offset,LongType,true), StructField(Partition,StringType,true), StructField(PartitionKey,StringType,true), StructField(Publisher,StringType,true), StructField(SequenceNumber,LongType,true)), obj#131: org.apache.spark.sql.Row
: : +- *(1) Project [cast(body#608 as string) AS Body#60, enqueuedTime#612 AS EventTime#37, cast(enqueuedTime#612 as bigint) AS Timestamp#48L, cast(offset#610 as bigint) AS Offset#27L, partition#609 AS Partition#72, partitionKey#614 AS PartitionKey#84, publisher#613 AS Publisher#96, sequenceNumber#611L AS SequenceNumber#108L]
: : +- Scan ExistingRDD[body#608,partition#609,offset#610,sequenceNumber#611L,enqueuedTime#612,publisher#613,partitionKey#614,properties#615,systemProperties#616]
: +- *(4) Project [value#453 AS domainName#455]
: +- *(4) Filter isnotnull(value#453)
: +- *(4) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#453]
: +- *(4) MapElements <function1>, obj#452: java.lang.String
: +- MapPartitions <function1>, obj#436: MTInterfaces.Fraud.RiskEntity
: +- DeserializeToObject newInstance(class scala.Tuple3), obj#435: scala.Tuple3
: +- ReusedExchange [_1#142, _2#143, _3#144], Exchange RoundRobinPartitioning(600)
+- *(8) Project [domain#147]
+- *(8) Filter (isnotnull(rank#284) && (rank#284 = 1))
+- Window [row_number() windowspecdefinition(domain#147, timestamp#151 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#284], [domain#147], [timestamp#151 DESC NULLS LAST]
+- *(7) Sort [domain#147 ASC NULLS FIRST, timestamp#151 DESC NULLS LAST], false, 0
+- Exchange hashpartitioning(domain#147, 400)
+- *(6) Project [domain#147, timestamp#151]
+- *(6) Filter isnotnull(domain#147)
+- *(6) FileScan csv [domain#147,timestamp#151] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://MTPrime-CO4-fed/MTPrime-CO4-0/projects/BingAdsAdQuality/Test/WhoIs], PartitionFilters: [], PushedFilters: [IsNotNull(domain)], ReadSchema: struct<domain:string,timestamp:string>
Snapshots of query graph:
EDIT
Right now I have moved the lookup data to a Cosmos DB store and created a TempView on top of it (say lookupdata). Now, I need to filter the ones that are not present in the store. I am exploring the following options:
1. create tempview on top of the streaming data as well and query
spark.sql(SELECT * FROM streamingdata s LEFT ANTI JOIN lookupdata l ON s.key = l.key")
Same as 1 but do inner sub-query instead of left anti join. i.e spark.sql("SELECT s.* FROM streamingdata s WHERE s.key NOT IN (SELECT key FROM lookupdata l)")
Retain the streaming df as it is and do a filter op:
df.filter(x => { val key = x.getAs[String])("key")
spark.sql("SELECT * FROM lookupdata l WHERE l.key = '"+key+"'").isEmpty
})
which one would work better?

Please try
from pyspark.sql.functions import broadcast
A.join(broadcast(B), Seq("key"), "left_anti")

It is not the recommended approach to do this with (Structured) Streaming. Imagine you are a Chinese company with 100M customers. How do you see that working on B with a 100M rows?
From my last assignment: If large dataset for reference data evident, use Hbase, or some other other key value store like Cassandra, with mapPartitions if volitatile or non-volatile. This is more difficult though. It was no easy task the data engineer, designer told me. Indeed, it is not that easy. But the way to go.

Apache Spark 2.2: broadcast join not working when you already cache the dataframe which you want to broadcast

I have mulitple large dataframes(around 30GB) called as and bs, a relatively small dataframe(around 500MB ~ 1GB) called spp.
I tried to cache spp into memory in order to avoid reading data from database or files multiple times.
But I find if I cache spp, the physical plan shows it won't use broadcast join even though spp is enclosed by broadcast function.
However, If I unpersist the spp, the plan shows it uses broadcast join.
Anyone familiar with this?
scala> spp.cache
res38: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> val as = acs.join(broadcast(spp), $"idsegment" === $"idAdnetProductSegment")
as: org.apache.spark.sql.DataFrame = [idsegmentpartner: bigint, ssegmentsource: string ... 44 more fields]
scala> as.explain
== Physical Plan ==
*SortMergeJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner
:- *Sort [idsegment#286L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(idsegment#286L, 200)
: +- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- *Sort [idAdnetProductSegment#91L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idAdnetProductSegment#91L, 200)
+- *Filter isnotnull(idAdnetProductSegment#91L)
+- InMemoryTableScan [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], [isnotnull(idAdnetProductSegment#91L)]
+- InMemoryRelation [id#87L, idPartner#88, idSegmentPartner#89, sSegmentSourceArray#90, idAdnetProductSegment#91L, idPartnerProduct#92L, idFeed#93, idGlobalProduct#94, sBrand#95, sSku#96, sOnlineID#97, sGTIN#98, sProductCategory#99, sAvailability#100, sCondition#101, sDescription#102, sImageLink#103, sLink#104, sTitle#105, sMPN#106, sPrice#107, sAgeGroup#108, sColor#109, dateExpiration#110, sGender#111, sItemGroupId#112, sGoogleProductCategory#113, sMaterial#114, sPattern#115, sProductType#116, sSalePrice#117, sSalePriceEffectiveDate#118, sShipping#119, sShippingWeight#120, sShippingSize#121, sUnmappedAttributeList#122, sStatus#123, createdBy#124, updatedBy#125, dateCreate#126, dateUpdated#127, sProductKeyName#128, sProductKeyValue#129], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...
scala> spp.unpersist
res40: spp.type = [id: bigint, idPartner: int ... 41 more fields]
scala> as.explain
== Physical Plan ==
*BroadcastHashJoin [idsegment#286L], [idAdnetProductSegment#91L], Inner, BuildRight
:- *Filter isnotnull(idsegment#286L)
: +- HiveTableScan [idsegmentpartner#282L, ssegmentsource#287, idsegment#286L], CatalogRelation `default`.`tblcustomsegmentcore`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [idcustomsegment#281L, idsegmentpartner#282L, ssegmentpartner#283, skey#284, svalue#285, idsegment#286L, ssegmentsource#287, datecreate#288]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[4, bigint, true]))
+- *Scan JDBCRelation(tblSegmentPartnerProduct) [numPartitions=1] [id#87L,idPartner#88,idSegmentPartner#89,sSegmentSourceArray#90,idAdnetProductSegment#91L,idPartnerProduct#92L,idFeed#93,idGlobalProduct#94,sBrand#95,sSku#96,sOnlineID#97,sGTIN#98,sProductCategory#99,sAvailability#100,sCondition#101,sDescription#102,sImageLink#103,sLink#104,sTitle#105,sMPN#106,sPrice#107,sAgeGroup#108,sColor#109,dateExpiration#110,sGender#111,sItemGroupId#112,sGoogleProductCategory#113,sMaterial#114,sPattern#115,sProductType#116,sSalePrice#117,sSalePriceEffectiveDate#118,sShipping#119,sShippingWeight#120,sShippingSize#121,sUnmappedAttributeList#122,sStatus#123,createdBy#124,updatedBy#125,dateCreate#126,dateUpdated#127,sProductKeyName#128,sProductKeyValue#129] PushedFilters: [*IsNotNull(idAdnetProductSegment)], ReadSchema: struct<id:bigint,idPartner:int,idSegmentPartner:int,sSegmentSourceArray:string,idAdnetProductSegm...

This happens when the Analyzed plan tries to use the cache data. It swallows the ResolvedHint information supplied by the user(code).
If we try to do a df.explain(true), we will see that hint is lost between Analyzed and optimized plan, which is where Spark tries to use the cached data.
This issue has been fixed in the latest version of Spark(in multiple attempts).
latest jira: https://issues.apache.org/jira/browse/SPARK-27674 .
Code where the fix(to consider the hint when using cached tables) : https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L219

Hive on spark, partition pruning, better undertanding

I have a spark 1.6.2 code using SQL/HQL language.
I really tried to understand if my job is doing partition pruning or not.
Data is partitioned by date (cdate field)
the explain plan is :
== Physical Plan ==
Project [coalesce(cdate#74,cdate#38) AS cdate#29,coalesce(account_key#75,account_key#34) AS account_key#30,coalesce(product#76,product#35) AS product#31,(coalesce(amount#77,0.0) + coalesce(amount#36,0.0)) AS amount#32,(coalesce(volume#78L,0) + cast(coalesce(volume#37,0) as bigint)) AS volume#33L]
+- SortMergeOuterJoin [account_key#34,cdate#38,product#35], [account_key#75,cdate#74,product#76], FullOuter, None
:- Sort [account_key#34 ASC,cdate#38 ASC,product#35 ASC], false, 0
: +- TungstenExchange hashpartitioning(account_key#34,cdate#38,product#35,200), None
: +- Project [volume#37,product#35,cdate#38,account_key#34,amount#36]
: +- BroadcastHashJoin [cdate#38], [cdate#24], BuildLeft
: :- Scan ParquetRelation[account_key#34,product#35,amount#36,volume#37,cdate#38] InputPaths: hdfs://hdp1.voicelab.local:8020/apps/hive/warehouse/my.db/daily_profiles
: +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
: +- TungstenExchange hashpartitioning(cdate#24,200), None
: +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
: +- Project [cdate#24]
: +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#24])
: +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
: +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#20,accountKey#21,product#22])
: +- Project [cdate#20,accountKey#21,product#22]
: +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
+- Sort [account_key#75 ASC,cdate#74 ASC,product#76 ASC], false, 0
+- TungstenExchange hashpartitioning(account_key#75,cdate#74,product#76,200), None
+- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Final,isDistinct=false),(count(1),mode=Final,isDistinct=false)], output=[cdate#74,account_key#75,product#76,amount#77,volume#78L])
+- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
+- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Partial,isDistinct=false),(count(1),mode=Partial,isDistinct=false)], output=[cdate#20,accountKey#21,product#22,sum#54,count#55L])
+- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
How can I figure out if my job is using the metastore in order to do partition pruning.
Can you elaborate about Scan ParquetRelation? how can I know that the scan using partition pruning/discovery ?
what is the meaning for the field#SOME_NUMBER i.e account_key#34
The use case is aggregating data per date,account,product

Look for PartitionFilters: [... ] in the Physical plan. If the array has a non empty value, it's using otherwise no. I couldn't find in your plan, unless I missed it or could not find it.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

pyspark joinWithCassandraTable refactor without maps - apache-spark

Related

perform massively parallelizable tasks in pyspark

Spark not use DirectJoin over DSE

How to filter rows where key is not present in a large dataframe

Apache Spark 2.2: broadcast join not working when you already cache the dataframe which you want to broadcast

Hive on spark, partition pruning, better undertanding

Categories

Resources