Suppose I have a data model that runs daily and the sample HDFS path is
data_model/sales_summary/grass_date=2021-04-01
If I want to read all the models in Feb and March, what is the difference if I read in the following two ways:
A:
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*')
B:
spark.read.parquet('data_model/sales_summary/').filter(col('grass_date').between('2021-02-01', '2021-03-30'))
Are these two reading methods equivalent? If not, under what circumstances is one more efficient than the other?
Spark applies a partition filter when reading the files, so the performance of the two methods should be similar. The query plans below show how the partition filters are used in the FileScan operation.
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*').explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary/grass_date=2021-02-21, file:/tmp/data_model/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
spark.read.parquet('data_model/sales_summary/').filter(F.col('grass_date').between('2021-02-01', '2021-03-30')).explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#24,grass_date#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary], PartitionFilters: [isnotnull(grass_date#25), (grass_date#25 >= 18659), (grass_date#25 <= 18716)], PushedFilters: [], ReadSchema: struct<id:int>
But note that the partitioning column will be missing from the dataframe if you use the first method to read the files, so you'd probably prefer the second method.
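The integers in the PartitionFilters of the second plan are days since 1970-01-01, so they can be decoded to check what range B actually covers. The sketch below is plain Python, not Spark code: the list of partition directories is hypothetical, and the brace glob from A is modelled by expanding it by hand into two wildcard patterns. Under those assumptions it suggests the two reads are not strictly equivalent as written, because the upper bound '2021-03-30' in B leaves out the March 31 partition that A's glob picks up.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase

EPOCH = date(1970, 1, 1)


def from_epoch_days(days: int) -> date:
    """Convert a days-since-1970-01-01 integer, as printed in the plan, to a date."""
    return EPOCH + timedelta(days=days)


# Bounds copied from the PartitionFilters of the second plan.
lower = from_epoch_days(18659)
upper = from_epoch_days(18716)
print(lower, upper)  # 2021-02-01 2021-03-30

# Hypothetical partition directories, one per day from 2021-01-01 to 2021-04-30.
days = [date(2021, 1, 1) + timedelta(days=i) for i in range(120)]
dirs = [f"grass_date={d.isoformat()}" for d in days]

# Method A: the brace glob 2021-0{2,3}* expanded by hand into two wildcard patterns.
patterns = ["grass_date=2021-02*", "grass_date=2021-03*"]
selected_a = [p for p in dirs if any(fnmatchcase(p, pat) for pat in patterns)]

# Method B: keep partitions whose date falls inside the inclusive range.
selected_b = [f"grass_date={d.isoformat()}" for d in days if lower <= d <= upper]

only_in_a = sorted(set(selected_a) - set(selected_b))
print(only_in_a)  # ['grass_date=2021-03-31']
```

Changing the upper bound in B to '2021-03-31' would make the two selections cover the same partitions in this model.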
The first one (A) is the right choice.
A: We select only specific folders, so we read only the required data.
B: We read all the data and then apply the filter, which is costly.
How can you view the partition filters and pushed filters in Spark 3 (3.0.0-preview2)?
In Spark 2, the explain method printed details like this:
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
+- FileScan csv [first_name#12,last_name#13,country#14]
Batched: false,
Format: CSV,
Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
PartitionFilters: [],
PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
ReadSchema: struct
This would easily let you identify the PartitionFilters and PushedFilters.
In Spark 3, the explain output contains far less detail, even when the extended argument is set:
val path = new java.io.File("./src/test/resources/person_data.csv").getCanonicalPath
val df = spark.read.option("header", "true").csv(path)
df
.filter(col("person_country") === "Cuba")
.explain("extended")
Here's the output:
== Parsed Logical Plan ==
'Filter ('person_country = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Analyzed Logical Plan ==
person_name: string, person_country: string
Filter (person_country#116 = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Optimized Logical Plan ==
Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Physical Plan ==
*(1) Project [person_name#115, person_country#116]
+- *(1) Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
+- BatchScan[person_name#115, person_country#116] CSVScan Location: InMemoryFileIndex[file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/re..., ReadSchema: struct<person_name:string,person_country:string>
Is there any way to see the partition filters and pushed filters in Spark 3?
This looks like it was a bug that was fixed towards the end of April. The JIRA for the predicate pushdown is SPARK-30475 and for the partition pushdown is SPARK-30428.
Can you check if your version of Spark has this fix included in it?
I have a Spark SQL query that works somewhat like this (actual fields have been omitted):
SELECT
a1.fieldA,
a1.fieldB,
a1.fieldC,
a1.fieldD,
a1.joinType
FROM sample_a a1
WHERE a1.joinType != "test"
UNION
SELECT
a2.fieldA,
a2.fieldB,
a2.fieldC,
b.fieldD,
a2.joinType
FROM sample_a a2
INNER JOIN sample_b b ON b.joinField = a2.joinField
WHERE a2.joinType = "test"
This is working perfectly fine but Spark will read sample_a twice. (From cache or disk)
I'm trying to get rid of the union and came up with the following solution:
SELECT
a.fieldA,
a.fieldB,
a.fieldC,
a.joinType,
CASE WHEN a.joinType = "test" THEN b.fieldD ELSE a.fieldD END as fieldD
FROM sample_a a
LEFT JOIN sample_b b ON a.joinType = "test" AND a.joinField = b.joinField
WHERE a.joinType != "test" OR (a.joinType = "test" AND b.joinField IS NOT NULL)
This should basically do the same thing, but Spark behaves strangely with it. While the first query keeps the same number of partitions as sample_a (~1200), the second one drops to 200 partitions, which is what sample_b has. It also puts most of the data into a single partition (around 90% of the data ends up in one of the 200 partitions).
The input data is stored in parquet files and not partitioned in any way. While sample_a has a much bigger file size, the joinField values for our joinType = "test" part are a subset of the joinField values in sample_b.
Edit: The physical plans look like this.
First Query:
Union
:- *(1) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
: +- *(1) Filter (isnotnull(joinType#2) && NOT (joinType#2 = test))
: +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), Not(EqualTo(joinType,test))], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(6) Project [fieldA#0, fieldD#4, joinType#2, joinField#3]
+- *(6) SortMergeJoin [joinField#3], [joinField#5], Inner
:- *(3) Sort [joinField#3 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(joinField#3, 200)
: +- *(2) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
: +- *(2) Filter ((isnotnull(joinType#2) && (joinType#2 = test)) && isnotnull(joinField#3))
: +- *(2) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), EqualTo(joinType,test), IsNotNull(joinField)], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(5) Sort [joinField#5 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(joinField#5, 200)
+- *(4) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>
Second Query:
*(5) Project [fieldA#0, CASE WHEN (joinType#2 = test) THEN fieldD#4 ELSE fieldD#1 END AS fieldD#6, joinType#2, joinField#3]
+- *(5) Filter (NOT (joinType#2 = test) || ((joinType#2 = test) && isnotnull(joinField#5)))
+- SortMergeJoin [joinField#3], [joinField#5], LeftOuter, (joinType#2 = test)
:- *(2) Sort [joinField#3 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(joinField#3, 200)
: +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(4) Sort [joinField#5 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(joinField#5, 200)
+- *(3) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>
I have two spark dataframes, say df_core & df_dict:
df_core has more columns, but they are not relevant to the question.
df_core:
id
1_ghi
2_mno
3_xyz
4_abc
df_dict:
id_1 id_2 cost
1_ghi 1_ghi 12
2_mno 2_rst 86
3_def 3_xyz 105
I want to get the value from df_dict.cost by joining the 2 dfs.
Scenario: join on df_core.id == df_dict.id_1
If df_core.id has no match on the foreign key df_dict.id_1 (3_xyz in the example above), then the join should fall back to df_dict.id_2.
I am able to do the join on the first key, but I am not sure how to handle the fallback scenario:
final_df = df_core.alias("df_core_alias").join(df_dict, df_core.id == df_dict.id_1, 'left').select('df_core_alias.*', df_dict.cost)
The solution need not be a dataframe operation. I can create Temp Views out of the dataframes & then run SQL on it if that's easy and/or optimized.
I also have a SQL solution in mind (not tested):
SELECT
core.id,
dict.cost
FROM
df_core core LEFT JOIN df_dict dict
ON core.id = dict.id_1
OR core.id = dict.id_2
Expected df:
id cost
1_ghi 12
2_mno 86
3_xyz 105
4_abc
The physical plan is too big to fit in a comment, so I am adding it to the question here.
Below is the Spark plan for the isin join:
== Physical Plan ==
*(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
The Filter under the BroadcastNestedLoopJoin comes from earlier df_core transformations; because Spark evaluates lazily, it shows up in this plan.
Moreover, I just realized that final_df.show() works fine with every solution I try. What takes forever to process is the next transformation I run on final_df, which produces my actual expected_df. Here is that transformation:
expected_df = spark.sql("select region_type, cost, core_sector_value, count(core_id) from final_df_view group by region_type, cost, core_sector_value order by region_type, cost, core_sector_value")
And here is the plan for expected_df:
== Physical Plan ==
*(5) Sort [region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(region_type#26 ASC NULLS FIRST, cost#13 ASC NULLS FIRST, core_sector_value#21 ASC NULLS FIRST, 200)
+- *(4) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[count(core_id#22)])
+- Exchange hashpartitioning(region_type#26, cost#13, core_sector_value#21, 200)
+- *(3) HashAggregate(keys=[region_type#26, cost#13, core_sector_value#21], functions=[partial_count(core_id#22)])
+- *(3) Project [region_type#26, COST#13, CORE_SECTOR_VALUE#21, CORE_ID#22]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, CORE_ID#22 IN (DICT_ID_1#10,DICT_ID_2#11)
:- *(1) Project [CORE_SECTOR_VALUE#21, CORE_ID#22, region_type#26]
: +- *(1) Filter ((((isnotnull(response_value#23) && isnotnull(error_code#19L)) && (error_code#19L = 0)) && NOT (response_value#23 = )) && NOT response_value#23 IN (N.A.,N.D.,N.S.))
: +- *(1) FileScan parquet [ERROR_CODE#19L,CORE_SECTOR_VALUE#21,CORE_ID#22,RESPONSE_VALUE#23,source_system#24,fee_type#25,region_type#26,run_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/outfile/..., PartitionCount: 14, PartitionFilters: [isnotnull(run_date#27), (run_date#27 = 20190905)], PushedFilters: [IsNotNull(RESPONSE_VALUE), IsNotNull(ERROR_CODE), EqualTo(ERROR_CODE,0), Not(EqualTo(RESPONSE_VA..., ReadSchema: struct<ERROR_CODE:bigint,CORE_SECTOR_VALUE:string,CORE_ID:string,RESPONSE_VALUE:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(2) FileScan csv [DICT_ID_1#10,DICT_ID_2#11,COST#13] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/XXXXXX/datafiles/client..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DICT_ID_1:string,DICT_ID_2:string,COST:string>
Looking at the plan, I think the transformations are getting too heavy to run in memory on Spark local. Is it best practice to chain so many separate transformation steps, or should I try to come up with a single query that covers all the business logic?
Additionally, could you please point me to a resource for understanding the plans that explain() prints? Thanks.
This looks like a left_outer join with an isin condition:
# Final DF will have all columns from df1 and df2
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer')
final_df.show()
+-----+-----+-----+----+
| id| id_1| id_2|cost|
+-----+-----+-----+----+
|1_ghi|1_ghi|1_ghi| 12|
|2_mno|2_mno|2_rst| 86|
|3_xyz|3_def|3_xyz| 105|
|4_abc| null| null|null|
+-----+-----+-----+----+
# Select the required columns like id, cost etc.
final_df = df1.join(df2, df1.id.isin(df2.id_1, df2.id_2), 'left_outer').select('id','cost')
final_df.show()
+-----+----+
| id|cost|
+-----+----+
|1_ghi| 12|
|2_mno| 86|
|3_xyz| 105|
|4_abc|null|
+-----+----+
You can join twice and use coalesce:
import pyspark.sql.functions as F
# Join once per key, then prefer the id_1 match and fall back to the id_2 match
final_df = df_core\
.join(df_dict.select(F.col("id_1"), F.col("cost").alias("cost_1")), df_core.id == df_dict.id_1, 'left')\
.join(df_dict.select(F.col("id_2"), F.col("cost").alias("cost_2")), df_core.id == df_dict.id_2, 'left')\
.select(*[F.col(c) for c in df_core.columns], F.coalesce(F.col("cost_1"), F.col("cost_2")).alias("cost"))
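To make the intended fallback semantics concrete, here is a minimal plain-Python model of the two-join-plus-coalesce idea, using the sample rows from the question. It is only a sketch of the logic, not Spark code: the names cost_by_id_1, cost_by_id_2 and lookup_cost are made up for illustration, and it assumes each id appears at most once per key column in df_dict.

```python
from typing import Optional

# Sample rows of df_dict from the question: (id_1, id_2, cost).
dict_rows = [("1_ghi", "1_ghi", 12), ("2_mno", "2_rst", 86), ("3_def", "3_xyz", 105)]
core_ids = ["1_ghi", "2_mno", "3_xyz", "4_abc"]

# One lookup per key column, mirroring the two left joins.
cost_by_id_1 = {id_1: cost for id_1, _, cost in dict_rows}
cost_by_id_2 = {id_2: cost for _, id_2, cost in dict_rows}


def lookup_cost(core_id: str) -> Optional[int]:
    """Mirror coalesce(cost_1, cost_2): prefer the id_1 match, else the id_2 match."""
    first = cost_by_id_1.get(core_id)
    return first if first is not None else cost_by_id_2.get(core_id)


result = [(core_id, lookup_cost(core_id)) for core_id in core_ids]
print(result)
# [('1_ghi', 12), ('2_mno', 86), ('3_xyz', 105), ('4_abc', None)]
```

The output matches the expected df in the question, with None standing in for the empty cost of 4_abc.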
I have bucketed a DataFrame using bucketBy and saveAsTable.
If I load it with spark.read.parquet, I don't benefit from the optimization (avoiding the shuffle).
scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
+- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
+- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I need to load it with spark.table to benefit from optimization.
scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
+- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I don't understand why Spark does not automatically detect the bucketing in the first case, for example by using the file name, which is a bit different in this case: part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet
I don't understand why Spark does not automatically detect the bucketing in the first case
Simple: there is no support for bucketed DataFrames that are not loaded as bucketed tables through spark.table. The bucketing specification (the bucket columns and the number of buckets) is stored in the table metadata in the catalog, and a path-based read such as spark.read.parquet never consults it.
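For what the file name alone can tell you, here is a small plain-Python sketch that pulls the trailing number out of a name like the one in the question. It assumes that the underscore-prefixed digit run just before the file extension is the bucket id; the helper name bucket_id is made up for illustration. Even under that assumption, the name gives only the bucket number, not the bucketing column or the total number of buckets, which is consistent with a path-based read not being able to reconstruct the bucketing on its own.

```python
import re
from typing import Optional

# Assumption: the bucket id is the digit run that follows an underscore
# and sits immediately before the file extension (if any).
BUCKET_SUFFIX = re.compile(r"_(\d+)(?:\..*)?$")


def bucket_id(file_name: str) -> Optional[int]:
    """Return the assumed bucket id encoded in a file name, or None if absent."""
    match = BUCKET_SUFFIX.search(file_name)
    return int(match.group(1)) if match else None


bucketed = "part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet"
plain = "part-00000-abc.snappy.parquet"  # hypothetical non-bucketed file name

print(bucket_id(bucketed))  # 1
print(bucket_id(plain))     # None
```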