Spark: Weird partitioning on join - apache-spark

I have a Spark SQL query that works somewhat like this (actual fields have been omitted):
SELECT
a1.fieldA,
a1.fieldB,
a1.fieldC,
a1.fieldD,
a1.joinType
FROM sample_a a1
WHERE a1.joinType != "test"
UNION
SELECT
a2.fieldA,
a2.fieldB,
a2.fieldC,
b.fieldD,
a2.joinType
FROM sample_a a2
INNER JOIN sample_b b ON b.joinField = a2.joinField
WHERE a2.joinType = "test"
This works perfectly fine, but Spark reads sample_a twice (from cache or disk).
I'm trying to get rid of the union and came up with the following solution:
SELECT
a.fieldA,
a.fieldB,
a.fieldC,
a.joinType,
CASE WHEN a.joinType = "test" THEN b.fieldD ELSE a.fieldD END as fieldD
FROM sample_a a
LEFT JOIN sample_b b ON a.joinType = "test" AND a.joinField = b.joinField
WHERE a.joinType != "test" OR (a.joinType = "test" AND b.joinField IS NOT NULL)
This should basically do the same thing, but Spark behaves very strangely with it. While the first query keeps the same number of partitions as sample_a (~1200), the second one goes down to 200 partitions, which is what sample_b has. It also puts a lot of data into a single partition (around 90% of the data ends up in one of the 200 partitions).
The input data is stored in parquet files and not partitioned in any way. While sample_a has a much bigger file size, the joinField values for our joinType = "test" part are a subset of the joinField values in sample_b.
Edit: The physical plans look like this.
First Query:
Union
:- *(1) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
:  +- *(1) Filter (isnotnull(joinType#2) && NOT (joinType#2 = test))
:     +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), Not(EqualTo(joinType,test))], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(6) Project [fieldA#0, fieldD#4, joinType#2, joinField#3]
   +- *(6) SortMergeJoin [joinField#3], [joinField#5], Inner
      :- *(3) Sort [joinField#3 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(joinField#3, 200)
      :     +- *(2) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
      :        +- *(2) Filter ((isnotnull(joinType#2) && (joinType#2 = test)) && isnotnull(joinField#3))
      :           +- *(2) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), EqualTo(joinType,test), IsNotNull(joinField)], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
      +- *(5) Sort [joinField#5 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(joinField#5, 200)
            +- *(4) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>
Second Query:
*(5) Project [fieldA#0, CASE WHEN (joinType#2 = test) THEN fieldD#4 ELSE fieldD#1 END AS fieldD#6, joinType#2, joinField#3]
+- *(5) Filter (NOT (joinType#2 = test) || ((joinType#2 = test) && isnotnull(joinField#5)))
   +- SortMergeJoin [joinField#3], [joinField#5], LeftOuter, (joinType#2 = test)
      :- *(2) Sort [joinField#3 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(joinField#3, 200)
      :     +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
      +- *(4) Sort [joinField#5 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(joinField#5, 200)
            +- *(3) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>
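One way to read the two plans: in the second query the Exchange hashpartitioning(joinField#3, 200) sits directly on top of the unfiltered scan of sample_a, so every row is shuffled by joinField, whereas in the first query only the joinType = "test" rows are. The 200 also matches the default value of spark.sql.shuffle.partitions. The sketch below is a plain-Python illustration, not Spark code. It assumes (hypothetically, the question does not say this) that most non-"test" rows have a null joinField, and it uses CRC32 as a stand-in for Spark's hash function, to show how such rows would all land in one partition.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # the 200 in "Exchange hashpartitioning(joinField, 200)"


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Toy hash partitioner (CRC32 here; Spark itself uses a different hash)."""
    # Equal keys, including null/None, always map to the same partition.
    data = b"" if key is None else str(key).encode("utf-8")
    return zlib.crc32(data) % num_partitions


def partition_sizes(keys):
    """Count how many rows each partition would receive."""
    return Counter(partition_of(k) for k in keys)


# Hypothetical data: 90% of the rows have no joinField (non-"test" rows),
# 10% have well-spread keys ("test" rows).
rows = [None] * 9000 + [f"key-{i}" for i in range(1000)]
sizes = partition_sizes(rows)
largest_share = max(sizes.values()) / len(rows)
print(f"partitions used: {len(sizes)}, largest holds {largest_share:.0%} of the rows")
```

If the real data looks like this, checking the distribution of joinField among the non-"test" rows (for example with a GROUP BY count) would confirm or rule out this explanation.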

Related

How to prevent a sort on a groupby.applyInPandas using hash partitioning on the upstream dataset?

In my main transform, I'm running an algorithm by doing a groupby and then applyInPandas in Foundry. The build takes very long, and one idea is to organize the files to prevent shuffle reads and sorting, using Hash partitioning/bucketing.
For a mcve, I have the following dataset:
def example_df():
    return spark.createDataFrame(
        [("1","2", 1.0), ("1","3", 2.0), ("2","4", 3.0), ("2","5", 5.0), ("2","2", 10.0)],
        ("id_1","id_2", "v"))
The transform I want to apply is:
def df1(example_df):
    def subtract_mean(pdf):
        # pdf is a pandas.DataFrame
        v = pdf.v
        return pdf.assign(v=v - v.mean())
    return example_df.groupby("id_1","id_2").applyInPandas(subtract_mean, schema="id_1 string, id_2 string, v double")
When I look at the original query plan, with no partitioning, it looks like the following:
Physical Plan:
Execute FoundrySaveDatasetCommand `ri.foundry.main.transaction.00000059-eb1b-61f4-bdb8-a030ac6baf0a#master`.`ri.foundry.main.dataset.eb664037-fcae-4ce2-b92b-bd103cd504b3`, ErrorIfExists, [id_1, id_2, v], ComputedStatsServiceV2Blocking{_endpointChannelFactory=DialogueChannel#3127a629{channelName=dialogue-nonreloading-ComputedStatsServiceV2Blocking, delegate=com.palantir.dialogue.core.DialogueChannel$Builder$$Lambda$713/0x0000000800807c40#70f51090}, runtime=com.palantir.conjure.java.dialogue.serde.DefaultConjureRuntime#6c67a62a}, com.palantir.foundry.spark.catalog.caching.CachingSchemaService#7d881feb, com.palantir.foundry.spark.catalog.caching.CachingMetadataService#57a1ef9e, com.palantir.foundry.spark.catalog.FoundrySparkResolver#4d38f6f5, com.palantir.foundry.spark.auth.DefaultFoundrySparkAuthSupplier#21103ab4
+- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
*(3) BasicStats `ri.foundry.main.transaction.00000059-eb1b-61f4-bdb8-a030ac6baf0a#master`.`ri.foundry.main.dataset.eb664037-fcae-4ce2-b92b-bd103cd504b3`
+- FlatMapGroupsInPandas [id_1#487, id_2#488], subtract_mean(id_1#487, id_2#488, v#489), [id_1#497, id_2#498, v#499]
+- *(2) Sort [id_1#487 ASC NULLS FIRST, id_2#488 ASC NULLS FIRST], false, 0
+- AQEShuffleRead coalesced
+- ShuffleQueryStage 0
+- Exchange hashpartitioning(id_1#487, id_2#488, 200), ENSURE_REQUIREMENTS, [id=#324]
+- *(1) Project [id_1#487, id_2#488, id_1#487, id_2#488, v#489]
+- *(1) ColumnarToRow
+- FileScan parquet !ri.foundry.main.transaction.00000059-eb12-f234-b25f-57e967fbc68e:ri.foundry.main.transaction.00000059-eb12-f234-b25f-57e967fbc68e#00000003-99f9-3d2d-814f-e4db9c920cc2:master.ri.foundry.main.dataset.237cddc5-0835-425c-bfbe-e62c51779dc2[id_1#487,id_2#488,v#489] Batched: true, BucketedScan: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[sparkfoundry:///datasets/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id_1:string,id_2:string,v:double>, ScanMode: RegularMode
My goal is to eliminate the Sort, the shuffle (AQEShuffleRead and ShuffleQueryStage) and the Exchange from the query plan.
To achieve this, I hash partition an intermediate dataset, bucketing by the id columns I'm going to groupBy later:
def example_df_bucketed(example_df):
    example_df = example_df.repartition(2,"id_1","id_2")
    output = Transforms.get_output()
    output_fs = output.filesystem()
    output.write_dataframe(example_df,bucket_cols=["id_1","id_2"], sort_by=["id_1","id_2"], bucket_count=2)
I then run the same logic, this time with the bucketed dataset as the input:
def df2(example_df_bucketed):
    def subtract_mean(pdf):
        # pdf is a pandas.DataFrame
        v = pdf.v
        return pdf.assign(v=v - v.mean())
    return example_df_bucketed.groupby("id_1","id_2").applyInPandas(subtract_mean, schema="id_1 string, id_2 string, v double")
This results in the query plan not having a shuffle (hash partition), but it is still sorting.
Physical Plan:
Execute FoundrySaveDatasetCommand `ri.foundry.main.transaction.00000059-ec4c-26f7-a058-98be3f26018c#master`.`ri.foundry.main.dataset.02990b20-95f2-4605-9e7c-578ba071535d`, ErrorIfExists, [id_1, id_2, v], ComputedStatsServiceV2Blocking{_endpointChannelFactory=DialogueChannel#3127a629{channelName=dialogue-nonreloading-ComputedStatsServiceV2Blocking, delegate=com.palantir.dialogue.core.DialogueChannel$Builder$$Lambda$713/0x0000000800807c40#70f51090}, runtime=com.palantir.conjure.java.dialogue.serde.DefaultConjureRuntime#6c67a62a}, com.palantir.foundry.spark.catalog.caching.CachingSchemaService#3db2ee77, com.palantir.foundry.spark.catalog.caching.CachingMetadataService#8086bb, com.palantir.foundry.spark.catalog.FoundrySparkResolver#7ebc329, com.palantir.foundry.spark.auth.DefaultFoundrySparkAuthSupplier#46d15de1
+- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
*(2) BasicStats `ri.foundry.main.transaction.00000059-ec4c-26f7-a058-98be3f26018c#master`.`ri.foundry.main.dataset.02990b20-95f2-4605-9e7c-578ba071535d`
+- FlatMapGroupsInPandas [id_1#603, id_2#604], subtract_mean(id_1#603, id_2#604, v#605), [id_1#613, id_2#614, v#615]
+- *(1) Sort [id_1#603 ASC NULLS FIRST, id_2#604 ASC NULLS FIRST], false, 0
+- *(1) Project [id_1#603, id_2#604, id_1#603, id_2#604, v#605]
+- *(1) ColumnarToRow
+- FileScan parquet !ri.foundry.main.transaction.00000059-ec22-2287-a0e3-5d9c48a39a83:ri.foundry.main.transaction.00000059-ec22-2287-a0e3-5d9c48a39a83#00000003-99fc-8b63-8d17-b7e45fface86:master.ri.foundry.main.dataset.bbada128-5538-4c7f-b5ba-6d16b15da5bf[id_1#603,id_2#604,v#605] Batched: true, BucketedScan: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[sparkfoundry://foundry/datasets/..., PartitionFilters: [], Partitioning: hashpartitioning(id_1#603, id_2#604, 2), PushedFilters: [], ReadSchema: struct<id_1:string,id_2:string,v:double>, ScanMode: RegularMode, SelectedBucketsCount: 2 out of 2
Since I'm already setting the sort_by when I bucket upstream, why is there still a sort in the query plan? Is there something I can do to avoid this sort?
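The Sort sits directly under FlatMapGroupsInPandas in both plans, which suggests that the operator expects the rows of each group to arrive as one consecutive run. The plain-Python sketch below (hypothetical rows, with itertools.groupby standing in for the grouping step, no Spark or pandas involved) shows what goes wrong when that ordering is not guaranteed: the same key is split into several runs and each run gets its own mean.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows (id_1, id_2, v); the two keys are deliberately interleaved.
rows = [("1", "2", 1.0), ("2", "4", 3.0), ("1", "2", 2.0), ("2", "4", 5.0)]
group_key = itemgetter(0, 1)


def subtract_mean_per_run(input_rows):
    """Subtract the mean of v within each consecutive run of equal keys."""
    out = []
    for _, run in groupby(input_rows, key=group_key):
        run = list(run)
        mean = sum(r[2] for r in run) / len(run)
        out.extend((r[0], r[1], r[2] - mean) for r in run)
    return out


# Without sorting, every run has length 1, so each value minus its own mean is 0.
unsorted_result = subtract_mean_per_run(rows)
# With sorting on the grouping keys, each key forms exactly one run.
sorted_result = subtract_mean_per_run(sorted(rows, key=group_key))

print(unsorted_result)
print(sorted_result)
```

Whether Spark can skip its own Sort therefore depends on whether the scan can prove that the input is already ordered by the grouping keys, not only on how the files were written.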

How to filter rows where key is not present in a large dataframe

Suppose I have a streaming dataframe A and a large static dataframe B. Assume that A typically holds fewer than 10000 records, while B is much larger, with a size in the range of millions of records.
Let's assume both A and B have a 'key' column. I want to filter rows in A where A.key is not present in B. What is the best way to achieve this?
Right now, I have tried A.join(B, Seq("key"), "left_anti"). However, the performance is not up to the mark. Is there any way I can speed up the process?
Physical plan:
== Physical Plan ==
SortMergeJoin [domainName#461], [domain#147], LeftAnti
:- *(5) Sort [domainName#461 ASC NULLS FIRST], false, 0
: +- StreamingDeduplicate [domainName#461], state info [ checkpoint = hdfs://MTPrime-CO4-fed/MTPrime-CO4-0/projects/BingAdsAdQuality/Test/WhoIs/WhoIsStream/checkPoint/state, runId = 9d09398b-efda-41cb-ab77-1b5550cd5da9, opId = 0, ver = 63, numPartitions = 400], 0
: +- Exchange hashpartitioning(domainName#461, 400)
: +- Union
: :- *(2) Project [value#460 AS domainName#461]
: : +- *(2) Filter isnotnull(value#460)
: : +- *(2) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#460]
: : +- MapPartitions <function1>, obj#459: java.lang.String
: : +- MapPartitions <function1>, obj#436: MTInterfaces.Fraud.RiskEntity
: : +- DeserializeToObject newInstance(class scala.Tuple3), obj#435: scala.Tuple3
: : +- Exchange RoundRobinPartitioning(600)
: : +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#142, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, assertnotnull(input[0, scala.Tuple3, true])._2, true, false) AS _2#143, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true, false) AS _3#144]
: : +- *(1) MapElements <function1>, obj#141: scala.Tuple3
: : +- *(1) MapElements <function1>, obj#132: scala.Tuple3
: : +- *(1) DeserializeToObject createexternalrow(Body#60.toString, staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, ObjectType(class java.sql.Timestamp), toJavaTimestamp, EventTime#37, true, false), Timestamp#48L, Offset#27L, Partition#72.toString, PartitionKey#84.toString, Publisher#96.toString, SequenceNumber#108L, StructField(Body,StringType,true), StructField(EventTime,TimestampType,true), StructField(Timestamp,LongType,true), StructField(Offset,LongType,true), StructField(Partition,StringType,true), StructField(PartitionKey,StringType,true), StructField(Publisher,StringType,true), StructField(SequenceNumber,LongType,true)), obj#131: org.apache.spark.sql.Row
: : +- *(1) Project [cast(body#608 as string) AS Body#60, enqueuedTime#612 AS EventTime#37, cast(enqueuedTime#612 as bigint) AS Timestamp#48L, cast(offset#610 as bigint) AS Offset#27L, partition#609 AS Partition#72, partitionKey#614 AS PartitionKey#84, publisher#613 AS Publisher#96, sequenceNumber#611L AS SequenceNumber#108L]
: : +- Scan ExistingRDD[body#608,partition#609,offset#610,sequenceNumber#611L,enqueuedTime#612,publisher#613,partitionKey#614,properties#615,systemProperties#616]
: +- *(4) Project [value#453 AS domainName#455]
: +- *(4) Filter isnotnull(value#453)
: +- *(4) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#453]
: +- *(4) MapElements <function1>, obj#452: java.lang.String
: +- MapPartitions <function1>, obj#436: MTInterfaces.Fraud.RiskEntity
: +- DeserializeToObject newInstance(class scala.Tuple3), obj#435: scala.Tuple3
: +- ReusedExchange [_1#142, _2#143, _3#144], Exchange RoundRobinPartitioning(600)
+- *(8) Project [domain#147]
+- *(8) Filter (isnotnull(rank#284) && (rank#284 = 1))
+- Window [row_number() windowspecdefinition(domain#147, timestamp#151 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#284], [domain#147], [timestamp#151 DESC NULLS LAST]
+- *(7) Sort [domain#147 ASC NULLS FIRST, timestamp#151 DESC NULLS LAST], false, 0
+- Exchange hashpartitioning(domain#147, 400)
+- *(6) Project [domain#147, timestamp#151]
+- *(6) Filter isnotnull(domain#147)
+- *(6) FileScan csv [domain#147,timestamp#151] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://MTPrime-CO4-fed/MTPrime-CO4-0/projects/BingAdsAdQuality/Test/WhoIs], PartitionFilters: [], PushedFilters: [IsNotNull(domain)], ReadSchema: struct<domain:string,timestamp:string>
EDIT
Right now I have moved the lookup data to a Cosmos DB store and created a TempView on top of it (say lookupdata). Now I need to filter the ones that are not present in the store. I am exploring the following options:
1. Create a TempView on top of the streaming data as well and query it:
spark.sql("SELECT * FROM streamingdata s LEFT ANTI JOIN lookupdata l ON s.key = l.key")
2. Same as 1, but use a subquery instead of a left anti join, i.e. spark.sql("SELECT s.* FROM streamingdata s WHERE s.key NOT IN (SELECT key FROM lookupdata l)")
3. Retain the streaming df as it is and do a filter op:
df.filter(x => { val key = x.getAs[String]("key")
spark.sql("SELECT * FROM lookupdata l WHERE l.key = '"+key+"'").isEmpty
})
Which one would work better?
Please try
import org.apache.spark.sql.functions.broadcast
A.join(broadcast(B), Seq("key"), "left_anti")
It is not the recommended approach to do this with (Structured) Streaming. Imagine you are a Chinese company with 100M customers. How do you see that working with 100M rows in B?
From my last assignment: if the reference dataset is evidently large, use HBase or some other key-value store such as Cassandra, queried from mapPartitions, whether the data is volatile or non-volatile. This is more difficult, though. The data engineer and designer told me it was no easy task. Indeed, it is not that easy, but it is the way to go.
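The key-value-store idea can be sketched in plain Python as follows. This is only an illustration of the per-partition, batched lookup pattern, under stated assumptions: FakeKeyValueStore, existing_keys, keep_missing and batch_size are hypothetical names invented here, and a real HBase or Cassandra client would look different. A function shaped like keep_missing is the kind of thing one would pass to mapPartitions.

```python
class FakeKeyValueStore:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self, keys):
        self._keys = set(keys)
        self.round_trips = 0  # counts simulated network calls

    def existing_keys(self, candidates):
        """Return the subset of candidates present in the store (one round trip)."""
        self.round_trips += 1
        return {k for k in candidates if k in self._keys}


def _missing_in_batch(batch, store):
    """Keep the rows of one batch whose key the store does not know."""
    present = store.existing_keys({row["key"] for row in batch})
    return [row for row in batch if row["key"] not in present]


def keep_missing(partition_rows, store, batch_size=500):
    """Yield rows whose 'key' is absent from the store, one lookup per batch."""
    batch = []
    for row in partition_rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _missing_in_batch(batch, store)
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield from _missing_in_batch(batch, store)


store = FakeKeyValueStore(["a", "b"])
stream_rows = [{"key": "a"}, {"key": "c"}, {"key": "d"}]
print(list(keep_missing(stream_rows, store, batch_size=2)))
```

Batching keeps the number of round trips proportional to the size of A rather than B, which is the point of moving the reference data out of the join.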

left join on a key if there is no match then join on a different right key to get value

I have two spark dataframes, say df_core & df_dict:
There are more cols in df_core but it has nothing to do with the question here
df_core:
id
1_ghi
2_mno
3_xyz
4_abc
df_dict:
id_1 id_2 cost
1_ghi 1_ghi 12
2_mno 2_rst 86
3_def 3_xyz 105
I want to get the value from df_dict.cost by joining the 2 dfs.
Scenario: join on df_core.id == df_dict.id_1
If there is no match for df_core.id in the foreign key df_dict.id_1 (in the example above: 3_xyz), then the join should happen on df_dict.id_2.
I am able to achieve the join on the first key but am not sure how to achieve the fallback scenario.
final_df = df_core.alias("df_core_alias").join(df_dict, df_core.id== df_dict.id_1, 'left').select('df_core_alias.*', df_dict.cost)
The solution need not be a dataframe operation. I can create Temp Views out of the dataframes & then run SQL on it if that's easy and/or optimized.
I also have a SQL solution in-mind (not tested):
SELECT
core.id,
dict.cost
FROM
df_core core LEFT JOIN df_dict dict
ON core.id = dict.id_1
OR core.id = dict.id_2
Expected df:
id cost
1_ghi 12
2_mno 86
3_xyz 105
4_abc
The physical plan is too big to add in a comment, so I'm adding it to the question here.
below is the spark plan for isin:
== Physical Plan ==
*(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
The Filter below the BroadcastNestedLoopJoin comes from earlier df_core transformations; because Spark evaluates lazily, it shows up here in the physical plan.
Moreover, I just realized that final_df.show() works fine with any of the solutions. What takes forever is the next transformation I apply to final_df, which produces my actual expected_df. Here's that transformation:
expected_df = spark.sql("select region_type, cost, core_sector_value, count(core_id) from final_df_view group by region_type, cost, core_sector_value order by region_type, cost, core_sector_value")
& here's the plan for the expected_df:
== Physical Plan ==
*(5) Sort [region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST, 200)
+- *(4) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[count(core_id#22)])
+- Exchange hashpartitioning(region_type#26, cost#13, core_sector_value#21, 200)
+- *(3) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[partial_count(core_id#22)])
+- *(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
Seeing the plan, I think that the transformations are getting too heavy for in-memory on spark local. Is it best practice to perform so many different step transformations or should I try to come up with a single query that would encompass all the business logic?
Additionally, could you please direct me to any resource for understanding the Spark plans we get from the explain() function? Thanks
It seems a left_outer join with isin does what you want:
# Final DF will have all columns from df1 and df2
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer')
final_df.show()
+-----+-----+-----+----+
| id| id_1| id_2|cost|
+-----+-----+-----+----+
|1_ghi|1_ghi|1_ghi| 12|
|2_mno|2_mno|2_rst| 86|
|3_xyz|3_def|3_xyz| 105|
|4_abc| null| null|null|
+-----+-----+-----+----+
# Select the required columns like id, cost etc.
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer').select('id','cost')
final_df.show()
+-----+----+
| id|cost|
+-----+----+
|1_ghi| 12|
|2_mno| 86|
|3_xyz| 105|
|4_abc|null|
+-----+----+
You can join twice and use coalesce:
import pyspark.sql.functions as F
# Join once on id_1 and once on id_2, then take the first non-null cost.
final_df = df_core\
    .join(df_dict.select(F.col("id_1"), F.col("cost").alias("cost_1")), df_core.id == df_dict.id_1, 'left')\
    .join(df_dict.select(F.col("id_2"), F.col("cost").alias("cost_2")), df_core.id == df_dict.id_2, 'left')\
    .select(*[F.col(c) for c in df_core.columns], F.coalesce(F.col("cost_1"), F.col("cost_2")).alias("cost"))
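To make the intended semantics concrete, here is a small plain-Python sketch using the sample data from the question. It is not Spark code: the dictionaries by_id_1 and by_id_2 and the helper lookup_cost are names made up for this illustration. It mirrors the join-twice-and-coalesce approach, where id_1 is tried first and id_2 is only the fallback.

```python
# Sample data copied from the question.
df_core = ["1_ghi", "2_mno", "3_xyz", "4_abc"]
df_dict = [
    ("1_ghi", "1_ghi", 12),
    ("2_mno", "2_rst", 86),
    ("3_def", "3_xyz", 105),
]

# One lookup table per candidate key column.
by_id_1 = {id_1: cost for id_1, _, cost in df_dict}
by_id_2 = {id_2: cost for _, id_2, cost in df_dict}


def lookup_cost(core_id):
    """Prefer a match on id_1; fall back to id_2; None if neither matches."""
    cost = by_id_1.get(core_id)
    return cost if cost is not None else by_id_2.get(core_id)


result = [(core_id, lookup_cost(core_id)) for core_id in df_core]
for core_id, cost in result:
    print(core_id, cost)
```

Unlike a single join with an OR (or isin) condition, this priority order never yields two rows for one df_core.id when the id matches id_1 in one dictionary row and id_2 in another.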

spark data frame left outer join is taking lot time

I have two DataFrames, ipwithCounryName (12 MB) and ipLogs (1 GB). I would like to join them on the common column ipRange. I have broadcast the ipwithCounryName DataFrame. Below is my code.
val ipwithCounryName_df = Init.iptoCountryBC.value
ipwithCounryName_df.createOrReplaceTempView("inputTable")
ipLogs.createOrReplaceTempView("ipTable")
val joined_table= Init.getSparkSession.sql("SELECT hostname,date,path,status,content_size,inputTable.countryName FROM ipasLong Left JOIN inputTable ON ipasLongValue >= StartingRange AND ipasLongValue <= Endingrange")
=====Physical plan===
*Project [hostname#34, date#98, path#36, status#37, content_size#105L,
countryName#5]
+- BroadcastNestedLoopJoin BuildRight, Inner, ((ipasLongValue#354L >=
StartingRange#2L) && (ipasLongValue#354L <= Endingrange#3L))
:- *Project [UDF:IpToInt(hostname#34) AS IpasLongValue#354L, hostname#34,
date#98, path#36, status#37, content_size#105L]
: +- *Filter ((isnotnull(isIp#112) && isIp#112) &&
isnotnull(UDF:IpToInt(hostname#34)))
: +- InMemoryTableScan [path#36, content_size#105L, isIp#112,
hostname#34, date#98, status#37], [isnotnull(isIp#112), isIp#112,
isnotnull(UDF:IpToInt(hostname#34))]
: +- InMemoryRelation [hostname#34, date#98, path#36, status#37,
content_size#105L, isIp#112], true, 10000, StorageLevel(disk, memory,
deserialized, 1 replicas)
: +- *Project [hostname#34, cast(unix_timestamp(date#35,
dd/MMM/yyyy:HH:mm:ss ZZZZ, Some(Asia/Calcutta)) as timestamp) AS date#98,
path#36, status#37, CASE WHEN isnull(content_size#38L) THEN 0 ELSE
content_size#38L END AS content_size#105L, UDF(hostname#34) AS isIp#112]
: +- *Filter (isnotnull(isBadData#45) && NOT isBadData#45)
: +- InMemoryTableScan [isBadData#45, hostname#34,
status#37, path#36, date#35, content_size#38L], [isnotnull(isBadData#45), NOT
isBadData#45]
: +- InMemoryRelation [hostname#34, date#35,
path#36, status#37, content_size#38L, isBadData#45], true, 10000,
StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *Project [regexp_extract(val#26,
^([^\s]+\s), 1) AS hostname#34, regexp_extract(val#26, ^.*
(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1) AS date#35,
regexp_extract(val#26, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1) AS path#36,
cast(regexp_extract(val#26, ^.*"\s+([^\s]+), 1) as int) AS status#37,
cast(regexp_extract(val#26, ^.*\s+(\d+)$, 1) as bigint) AS content_size#38L,
UDF(named_struct(hostname, regexp_extract(val#26, ^([^\s]+\s), 1), date,
regexp_extract(val#26, ^.*(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}), 1),
path, regexp_extract(val#26, ^.*"\w+\s+([^\s]+)\s*[(HTTP)]*.*", 1), status,
cast(regexp_extract(val#26, ^.*"\s+([^\s]+), 1) as int), content_size,
cast(regexp_extract(val#26, ^.*\s+(\d+)$, 1) as bigint))) AS isBadData#45]
: +- *FileScan csv [val#26] Batched:
false, Format: CSV, Location:
InMemoryFileIndex[file:/C:/Users/M1047320/Desktop/access_log_Jul95],
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<val:string>
+- BroadcastExchange IdentityBroadcastMode
+- *Project [StartingRange#2L, Endingrange#3L, CountryName#5]
+- *Filter (isnotnull(StartingRange#2L) && isnotnull(Endingrange#3L))
+- *FileScan csv [StartingRange#2L,Endingrange#3L,CountryName#5] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/M1047320/Documents/Spark-301/Documents/GeoIPCountryWhois.csv], PartitionFilters: [], PushedFilters: [IsNotNull(StartingRange), IsNotNull(Endingrange)], ReadSchema: struct<StartingRange:bigint,Endingrange:bigint,CountryName:string>
The join takes a long time (>30 minutes). I have another inner join on two different DataFrames of the same size where the join condition is "=", and it takes only 5 minutes. How should I improve my code? Please suggest.
Please keep the filter condition in the WHERE clause and join the tables on a common column name. I assumed countryName is common to both DataFrames.
val joined_table= Init.getSparkSession.sql("SELECT hostname,date,path,status,content_size,inputTable.countryName FROM ipasLong Left JOIN inputTable ON ipasLong.countryName=inputTable.countryName WHERE ipasLongValue >= StartingRange AND ipasLongValue <= Endingrange")
You can also join the DataFrames directly.
val result=ipLogs.join(broadcast(ipwithCounryName),"joincondition","left_outer").where($"ipasLongValue" >= $"StartingRange" && $"ipasLongValue" <= $"Endingrange").select("select columns")
Hope it helps you.
You can try increasing your JVM parameters to the capacity of your system to fully utilize it like below:
spark-submit --driver-memory 12G --conf spark.driver.maxResultSize=3g --executor-cores 6 --executor-memory 16G
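Another angle, offered as a sketch rather than a tested fix: the plan in the question shows a BroadcastNestedLoopJoin, which compares each log row against every range, while an "=" join can use hash or sort-merge strategies. That may explain the gap between 30 and 5 minutes. Since the range table is small, one alternative is to sort the ranges once and binary-search them per row, for example inside a UDF over the broadcast list. The plain-Python code below shows only the lookup itself. The range values are made up, country_for is a hypothetical helper name, and the approach assumes the ranges do not overlap.

```python
from bisect import bisect_right

# Hypothetical, non-overlapping (StartingRange, Endingrange, CountryName) rows.
ranges = sorted([
    (400, 499, "C"),
    (100, 199, "A"),
    (200, 299, "B"),
])
starts = [start for start, _, _ in ranges]  # sorted range starts for bisect


def country_for(ip_as_long):
    """Return the country whose range contains ip_as_long, or None."""
    # Index of the last range whose start is <= ip_as_long.
    idx = bisect_right(starts, ip_as_long) - 1
    if idx < 0:
        return None
    _, end, country = ranges[idx]
    return country if ip_as_long <= end else None


for value in (150, 200, 300, 499):
    print(value, country_for(value))
```

Each lookup costs O(log n) in the number of ranges instead of scanning all of them, which is where the saving over a nested loop join would come from.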

Cache not preventing multiple filescans?

I have a question regarding the usage of the DataFrame API's cache. Consider the following query:
val dfA = spark.table(tablename)
.cache
val dfC = dfA
.join(dfA.groupBy($"day").count,Seq("day"),"left")
dfA is used twice in this query, so I thought caching it would be beneficial. But I'm confused about the plan: the table is still scanned twice (FileScan appears twice):
dfC.explain
== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
   :- *Sort [day#8232 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(day#8232, 200)
   :     +- InMemoryTableScan [day#8232, i#8233]
   :        +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :           +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
   +- *Sort [day#8255 ASC NULLS FIRST], false, 0
      +- *HashAggregate(keys=[day#8255], functions=[count(1)])
         +- Exchange hashpartitioning(day#8255, 200)
            +- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
               +- InMemoryTableScan [day#8255]
                  +- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
Why isn't the table cached? I'm using Spark 2.1.1
Try calling count() right after cache, so that you trigger one action and the caching is done before the plan of the second action is computed.
As far as I know, the first action triggers the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because the table is not cached until that action executes).
If the above doesn't work [and/or you are hitting the bug mentioned], it is probably related to the plan; you can also try converting the DF to an RDD and then back to a DataFrame (this way the plan will be 100% exact).
