How to load a bucketed DataFrame so that bucketing is preserved? - apache-spark

I have bucketed a DataFrame, i.e. with bucketBy and saveAsTable.
If I load it with spark.read.parquet, I don't benefit from the bucketing optimization (shuffle avoidance):
scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
+- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
+- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I need to load it with spark.table to benefit from the optimization:
scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
+- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I don't understand why Spark does not automatically detect the bucketing in the first case, for example by using the file name, which is slightly different in this case: part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet

I don't understand why Spark does not automatically detect the bucketing in the first case
Simple: there is no support for bucketed DataFrames that are not loaded as bucketed tables using spark.table. The bucketing specification is stored as table metadata in the metastore, not in the Parquet files themselves, so spark.read.parquet has no way to know the data was bucketed.
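One way to see where that metadata lives is to ask the catalog directly. This is only a minimal sketch against the tab1 table from the question, using standard catalog/SQL calls (the exact rows and columns in the output vary by Spark version):
spark.sql("DESCRIBE EXTENDED tab1").show(100, false)  // look for "Num Buckets" / "Bucket Columns" in the detailed table information
spark.catalog.listColumns("tab1").show()              // the isBucket flag marks the bucketing column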

Related

Spark PushedFilters

When you do df.explain() you can see the PushedFilters for predicate pushdown in the physical plan, as a string. We can extract that with df.queryExecution.simpleString, but I want it as JSON so I can directly test whether something was put into PushedFilters. How do I extract this?
e.g. here is an example from Jacek Laskowski's website:
cities.where('name === "Warsaw").queryExecution.executedPlan
res21: org.apache.spark.sql.execution.SparkPlan =
*Project [id#128L, name#129]
+- *Filter (isnotnull(name#129) && (name#129 = Warsaw))
+- *FileScan parquet [id#128L,name#129] Batched: true,
Format: ParquetFormat,
InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet,
PartitionFilters: [],
PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)],
ReadSchema: struct<id:bigint,name:string>
I want to be able to extract the PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)] for some testing I'm doing.
Figured it out:
import org.apache.spark.sql.execution.FileSourceScanExec
df.queryExecution.sparkPlan.collectFirst { case p: FileSourceScanExec => p }.get.metadata("PushedFilters")
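If the plan might contain no file scan (or more than one), a slightly more defensive variant, still just a sketch built on the same metadata map, collects the entry from every scan instead of calling .get:
import org.apache.spark.sql.execution.FileSourceScanExec
// One "[IsNotNull(name), EqualTo(name,Warsaw)]"-style string per file scan; an empty Seq if the plan has none.
val pushedFilters: Seq[String] =
  df.queryExecution.sparkPlan.collect {
    case scan: FileSourceScanExec => scan.metadata.getOrElse("PushedFilters", "[]")
  }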

What is the difference in PySpark between reading the whole directory and then filtering, versus reading only part of the directory?

Suppose I have a data model that runs daily and the sample HDFS path is
data_model/sales_summary/grass_date=2021-04-01
If I want to read all the model outputs for February and March, what is the difference between the following two ways of reading them:
A:
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*')
B:
spark.read.parquet('data_model/sales_summary/').filter(col('grass_date').between('2021-02-01', '2021-03-30'))
Are these two reading methods equivalent? If not, under what circumstances which one can be more efficient?
Spark will apply a partition filter when reading the files, so the performance of the two methods should be similar. The query plans below show how the partition filters are used in the FileScan operation.
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*').explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary/grass_date=2021-02-21, file:/tmp/data_model/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
spark.read.parquet('data_model/sales_summary/').filter(F.col('grass_date').between('2021-02-01', '2021-03-30')).explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#24,grass_date#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary], PartitionFilters: [isnotnull(grass_date#25), (grass_date#25 >= 18659), (grass_date#25 <= 18716)], PushedFilters: [], ReadSchema: struct<id:int>
But note that the partitioning column will be missing from the dataframe if you use the first method to read the files, so you'd probably prefer the second method.
We should use the first one (A).
A: we are selecting specific folders, so we read only the required data.
B: we are reading all the data and then applying the filter (reading everything first is costly).

Spark: Weird partitioning on join

I have a Spark SQL query that works somewhat like this (actual fields have been omitted):
SELECT
a1.fieldA,
a1.fieldB,
a1.fieldC,
a1.fieldD,
a1.joinType
FROM sample_a a1
WHERE a1.joinType != "test"
UNION
SELECT
a2.fieldA,
a2.fieldB,
a2.fieldC,
b.fieldD,
a2.joinType
FROM sample_a a2
INNER JOIN sample_b b ON b.joinField = a2.joinField
WHERE a2.joinType = "test"
This works perfectly fine, but Spark will read sample_a twice (from cache or disk).
I'm trying to get rid of the union and came up with the following solution:
SELECT
a.fieldA,
a.fieldB,
a.fieldC,
a.joinType,
CASE WHEN a.joinType = "test" THEN b.fieldD ELSE a.fieldD END as fieldD
FROM sample_a a
LEFT JOIN sample_b b ON a.joinType = "test" AND a.joinField = b.joinField
WHERE a.joinType != "test" OR (a.joinType = "test" AND b.joinField IS NOT NULL)
This should basically do the same thing, but Spark behaves very strangely with it. While the first query keeps the same number of partitions as sample_a (~1200), the second one goes down to 200 partitions, which is what sample_b has. It also puts a lot of data into a single partition (around 90% of the data ends up in one of the 200 partitions).
The input data is stored in parquet files and not partitioned in any way. While sample_a has a much bigger file size, the joinField values for our joinType = "test" part are a subset of the joinField values in sample_b.
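(To confirm skew numbers like these yourself, a quick check with the built-in spark_partition_id function counts the rows that land in each output partition; result below is just a stand-in name for the DataFrame produced by the second query.)
import org.apache.spark.sql.functions.spark_partition_id
// Row count per output partition of the joined result, largest first.
result.groupBy(spark_partition_id().as("partition")).count().orderBy($"count".desc).show(10)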
Edit: The physical plans look like this.
First Query:
Union
:- *(1) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
: +- *(1) Filter (isnotnull(joinType#2) && NOT (joinType#2 = test))
: +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), Not(EqualTo(joinType,test))], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(6) Project [fieldA#0, fieldD#4, joinType#2, joinField#3]
+- *(6) SortMergeJoin [joinField#3], [joinField#5], Inner
:- *(3) Sort [joinField#3 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(joinField#3, 200)
: +- *(2) Project [fieldA#0, fieldD#1, joinType#2, joinField#3]
: +- *(2) Filter ((isnotnull(joinType#2) && (joinType#2 = test)) && isnotnull(joinField#3))
: +- *(2) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(joinType), EqualTo(joinType,test), IsNotNull(joinField)], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(5) Sort [joinField#5 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(joinField#5, 200)
+- *(4) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>
Second Query:
*(5) Project [fieldA#0, CASE WHEN (joinType#2 = test) THEN fieldD#4 ELSE fieldD#1 END AS fieldD#6, joinType#2, joinField#3]
+- *(5) Filter (NOT (joinType#2 = test) || ((joinType#2 = test) && isnotnull(joinField#5)))
+- SortMergeJoin [joinField#3], [joinField#5], LeftOuter, (joinType#2 = test)
:- *(2) Sort [joinField#3 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(joinField#3, 200)
: +- *(1) FileScan parquet [fieldA#0, fieldD#1, joinType#2, joinField#3] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldA:string,fieldD:string,joinType:string,joinField:string>
+- *(4) Sort [joinField#5 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(joinField#5, 200)
+- *(3) FileScan parquet [fieldD#4, joinField#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<fieldD:string,joinField:string>

Cache not preventing multiple filescans?

I have a question regarding the usage of the DataFrame API's cache. Consider the following query:
val dfA = spark.table(tablename)
.cache
val dfC = dfA
.join(dfA.groupBy($"day").count,Seq("day"),"left")
So dfA is used twice in this query, so I thought caching it would be beneficial. But I'm confused by the plan: the table is still scanned twice (FileScan appears twice):
dfC.explain
== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
:- *Sort [day#8232 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(day#8232, 200)
: +- InMemoryTableScan [day#8232, i#8233]
: +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
+- *Sort [day#8255 ASC NULLS FIRST], false, 0
+- *HashAggregate(keys=[day#8255], functions=[count(1)])
+- Exchange hashpartitioning(day#8255, 200)
+- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
+- InMemoryTableScan [day#8255]
+- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
Why isn't the table cached? I'm using Spark 2.1.1.
Try calling count() right after cache, so that you trigger an action and the caching is done before the plan of the second action is "calculated".
As far as I know, the first action will trigger the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because it won't cache the table until it executes that action).
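A minimal sketch of that suggestion, reusing the dfA/dfC names from the question (the extra count() only exists to populate the cache before the join runs):
val dfA = spark.table(tablename).cache
dfA.count()   // separate action: this is what actually populates the cache
val dfC = dfA.join(dfA.groupBy($"day").count, Seq("day"), "left")
dfC.show()    // by the time this runs, both uses of dfA can be served from the cache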
If the above doesn't work [and/or you are hitting the bug mentioned], it's probably related to the plan; you can also try converting the DataFrame to an RDD and then back to a DataFrame (this way the plan will be 100% exact).
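That round trip can be written as follows; spark.createDataFrame and the schema field are standard API, and the point is simply that the rebuilt DataFrame's plan starts from the existing RDD rather than from the original table:
// Rebuild the DataFrame from its RDD, keeping the same schema.
val dfA2 = spark.createDataFrame(dfA.rdd, dfA.schema)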

How to avoid a Spark shuffle with a union of 2 bucketed DataFrames

I have 2 DataFrames that were bucketed on the same column.
scala> (1 to 10).map(i => (i, "element"+i))
res21: scala.collection.immutable.IndexedSeq[(Int, String)] = Vector((1,element1), (2,element2), (3,element3), (4,element4), (5,element5), (6,element6), (7,element7), (8,element8), (9,element9), (10,element10))
scala> spark.createDataFrame(res21).toDF("a", "b")
res22: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> res22.write.bucketBy(2, "a").saveAsTable("tab1")
17/10/17 23:07:50 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
scala> res22.write.bucketBy(2, "a").saveAsTable("tab2")
17/10/17 23:07:54 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab2` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
If I perform a union of those DataFrames, Spark is no longer able to avoid the shuffle.
scala> spark.table("tab1").union(spark.table("tab2")).groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#166L])
+- Exchange hashpartitioning(a#149, 200)
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#172L])
+- Union
:- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
+- *FileScan parquet default.tab2[a#154] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
Is there a workaround?
