How to avoid a Spark shuffle with a union of 2 bucketed dataframes - apache-spark

I have 2 dataframes that were bucketed on the same column.
scala> (1 to 10).map(i => (i, "element"+i))
res21: scala.collection.immutable.IndexedSeq[(Int, String)] = Vector((1,element1), (2,element2), (3,element3), (4,element4), (5,element5), (6,element6), (7,element7), (8,element8), (9,element9), (10,element10))
scala> spark.createDataFrame(res21).toDF("a", "b")
res22: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> res22.write.bucketBy(2, "a").saveAsTable("tab1")
17/10/17 23:07:50 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
scala> res22.write.bucketBy(2, "a").saveAsTable("tab2")
17/10/17 23:07:54 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`tab2` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
If I perform a union of those dataframes, Spark is no longer able to avoid the shuffle.
scala> spark.table("tab1").union(spark.table("tab2")).groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#166L])
+- Exchange hashpartitioning(a#149, 200)
   +- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#172L])
      +- Union
         :- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
         +- *FileScan parquet default.tab2[a#154] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
Is there a workaround?
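One mitigation that is sometimes suggested (a sketch only, not a confirmed fix for this plan): run the aggregation on each bucketed table separately, where no Exchange is needed, and only then combine the two much smaller aggregated results. A shuffle may still appear for the final combine, but it moves pre-aggregated rows rather than the raw data.

import org.apache.spark.sql.functions.sum

// Each per-table groupBy can use the bucketing of tab1/tab2, so no Exchange is needed here.
val agg1 = spark.table("tab1").groupBy("a").count()
val agg2 = spark.table("tab2").groupBy("a").count()

// The final combine still shuffles, but only the already-aggregated rows.
val combined = agg1.union(agg2)
  .groupBy("a")
  .agg(sum("count").as("count"))
combined.explain(true)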

Related

Does Spark bring an entire Hive table into memory?

I am in the process of learning how Apache Spark works and have some basic questions. Let's say I have a Spark application running which connects to a Hive table.
My hive table is as follows:
Name | Age | Marks
A    | 50  | 100
B    | 50  | 100
C    | 75  | 200
When I run the following code snippets, which rows and columns will be loaded into memory during execution? Will the filtering of rows/columns be done only after the entire table is loaded into memory?
1. spark_session.sql("SELECT name, age from table").collect()
2. spark_session.sql("SELECT * from table WHERE age=50").collect()
3. spark_session.sql("SELECT * from table").select("name", "age").collect()
4. df = spark_session.sql("SELECT * from table"); df.filter(df.age == 50).collect()
If the data source supports predicate pushdown, then Spark will not load the entire dataset into memory while filtering the data.
Let's check the Spark plan for a Hive table with Parquet as the file format:
>>> df = spark.createDataFrame([('A', 25, 100),('B', 30, 100)], ['name', 'age', 'marks'])
>>> df.write.saveAsTable('table')
>>> spark.sql('select * from table where age=25').explain(True)
== Physical Plan ==
*(1) Filter (isnotnull(age#1389L) AND (age#1389L = 25))
+- *(1) ColumnarToRow
+- FileScan parquet default.table[name#1388,age#1389L,marks#1390L] Batched: true, DataFilters: [isnotnull(age#1389L), (age#1389L = 25)],
Format: Parquet, Location: InMemoryFileIndex[file:/Users/mohan/spark-warehouse/table],
PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,25)], ReadSchema: struct<name:string,age:bigint,marks:bigint>
You can verify whether the filter was pushed down to the underlying storage by looking at PushedFilters: [IsNotNull(age), EqualTo(age,25)].
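As a side note on the question's first and third snippets, column pruning shows up in the same plan: only the selected columns appear in ReadSchema of the FileScan node, so Parquet reads just those columns. A minimal sketch (shown in Scala, matching the Scala snippets elsewhere on this page; the PySpark call is the same):

// Only name and age should appear in ReadSchema, so Parquet reads just those two columns.
spark.table("table").select("name", "age").explain(true)
// expect something like:  ReadSchema: struct<name:string,age:bigint>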

Spark PushedFilters

When you do df.explain(), it is possible to see the PushedFilters for predicate pushdown in the physical plan as a string. We can extract this with df.queryExecution.simpleString, but I want it as JSON so I can directly test whether something was put into PushedFilters. How do I extract this?
e.g. an example from Jacek Laskowski's website:
cities.where('name === "Warsaw").queryExecution.executedPlan
res21: org.apache.spark.sql.execution.SparkPlan =
*Project [id#128L, name#129]
+- *Filter (isnotnull(name#129) && (name#129 = Warsaw))
+- *FileScan parquet [id#128L,name#129] Batched: true,
Format: ParquetFormat,
InputPaths: file:/Users/jacek/dev/oss/spark/cities.parquet,
PartitionFilters: [],
PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)],
ReadSchema: struct<id:bigint,name:string>
I want to be able to extract the PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)] for some testing I'm doing.
Figured it out:
import org.apache.spark.sql.execution.FileSourceScanExec
df.queryExecution.sparkPlan.collectFirst { case p: FileSourceScanExec => p }.get.metadata("PushedFilters")
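A slightly more defensive variant (a sketch, assuming a single file-based scan in the plan) wraps the extraction in a helper so a test can assert on it directly; metadata("PushedFilters") is a plain String such as "[IsNotNull(name), EqualTo(name,Warsaw)]":

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FileSourceScanExec

// Returns the PushedFilters string of the first file scan, or "[]" if there is none.
def pushedFilters(df: DataFrame): String =
  df.queryExecution.sparkPlan
    .collectFirst { case scan: FileSourceScanExec => scan }
    .map(_.metadata("PushedFilters"))
    .getOrElse("[]")

// e.g. in a test:
// assert(pushedFilters(cities.where('name === "Warsaw")).contains("EqualTo(name,Warsaw)"))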

What is the difference in PySpark between reading the whole directory and then filtering, versus reading only part of the directory?

Suppose I have a data model that runs daily and the sample HDFS path is
data_model/sales_summary/grass_date=2021-04-01
If I want to read all the model outputs for February and March, what is the difference between the following two ways of reading them?
A:
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*')
B:
spark.read.parquet('data_model/sales_summary/').filter(col('grass_date').between('2021-02-01', '2021-03-30'))
Are these two reading methods equivalent? If not, under what circumstances is one more efficient than the other?
Spark will apply a partition filter when reading the files, so the performance of the two methods should be similar. The query plans below show how the partition filters are used in the FileScan operation.
spark.read.parquet('data_model/sales_summary/grass_date=2021-0{2,3}*').explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary/grass_date=2021-02-21, file:/tmp/data_model/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
spark.read.parquet('data_model/sales_summary/').filter(F.col('grass_date').between('2021-02-01', '2021-03-30')).explain()
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#24,grass_date#25] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data_model/sales_summary], PartitionFilters: [isnotnull(grass_date#25), (grass_date#25 >= 18659), (grass_date#25 <= 18716)], PushedFilters: [], ReadSchema: struct<id:int>
But note that the partitioning column will be missing from the dataframe if you use the first method to read the files, so you'd probably prefer the second method.
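If you want the path-level pruning of the first method but also need grass_date as a column, one option worth knowing about is the basePath option (a sketch, shown in Scala to keep one language for the added snippets; PySpark accepts the same option): pointing basePath at the table root lets Spark still derive the partition column from the directory names.

// basePath tells partition discovery where the partitioned layout starts, so
// grass_date is recovered as a column even though only the matching
// subdirectories are listed and read.
val df = spark.read
  .option("basePath", "data_model/sales_summary/")
  .parquet("data_model/sales_summary/grass_date=2021-0{2,3}*")

df.printSchema()  // grass_date should now appear alongside the data columns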
The first one (A) is the right choice.
A: we are selecting specific folders, so we read only the required data.
B: we are reading all the data and then applying a filter (reading all the data is costly).

Why is my partitioned Parquet data slower to query than the non-partitioned one?

My understanding is that if I partition my data on a column I will query by, it should be faster. However, when I tried it, it seems to be slower instead. Why?
I have a dataframe which I tried partitioning by yearmonth, and also writing without partitioning.
So I have 1 dataset partitioned by creation_yearmonth.
questionsCleanedDf.repartition("creation_yearmonth") \
    .write.partitionBy('creation_yearmonth') \
    .parquet('wasb://.../parquet/questions.parquet')
I have another one that is not partitioned:
questionsCleanedDf \
    .write \
    .parquet('wasb://.../parquet/questions_nopartition.parquet')
Then I tried creating a dataframe from each of these two Parquet outputs and running the same query.
questionsDf = spark.read.parquet('wasb://.../parquet/questions.parquet')
and
questionsDf = spark.read.parquet('wasb://.../parquet/questions_nopartition.parquet')
The query
spark.sql("""
SELECT * FROM questions
WHERE creation_yearmonth = 201606
""")
It seems like the non-partitioned one is consistently faster or has similar times (~2-3 s), while the partitioned one is slightly slower (~3-4 s).
I tried to do an explain:
For the partitioned dataset:
== Physical Plan ==
*FileScan parquet [id#6404,title#6405,tags#6406,owner_user_id#6407,accepted_answer_id#6408,view_count#6409,answer_count#6410,comment_count#6411,creation_date#6412,favorite_count#6413,creation_yearmonth#6414] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 1, PartitionFilters: [isnotnull(creation_yearmonth#6414), (creation_yearmonth#6414 = 201606)], PushedFilters: [], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
It shows PartitionCount: 1, so since in this case it can just go directly to the partition, shouldn't it be faster?
For the non-partitioned one:
== Physical Plan ==
*Project [id#6440, title#6441, tags#6442, owner_user_id#6443, accepted_answer_id#6444, view_count#6445, answer_count#6446, comment_count#6447, creation_date#6448, favorite_count#6449, creation_yearmonth#6450]
+- *Filter (isnotnull(creation_yearmonth#6450) && (creation_yearmonth#6450 = 201606))
+- *FileScan parquet [id#6440,title#6441,tags#6442,owner_user_id#6443,accepted_answer_id#6444,view_count#6445,answer_count#6446,comment_count#6447,creation_date#6448,favorite_count#6449,creation_yearmonth#6450] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/parquet/questions_nopartition.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_yearmonth), EqualTo(creation_yearmonth,201606)], ReadSchema: struct<id:int,title:string,tags:array<string>,owner_user_id:int,accepted_answer_id:int,view_count...
Also very surprising: the original dataset has dates as strings, so I need to run a query like:
spark.sql("""
SELECT * FROM questions
WHERE CAST(creation_date AS date) BETWEEN '2017-06-01' AND '2017-07-01'
""").show(20, False)
I expected this to be even slower, but it turns out it performs the best (~1-2 s). Why is that? I thought that in this case it needs to cast each row?
The explain output here:
== Physical Plan ==
*Project [id#6521, title#6522, tags#6523, owner_user_id#6524, accepted_answer_id#6525, view_count#6526, answer_count#6527, comment_count#6528, creation_date#6529, favorite_count#6530]
+- *Filter ((isnotnull(creation_date#6529) && (cast(cast(creation_date#6529 as date) as string) >= 2017-06-01)) && (cast(cast(creation_date#6529 as date) as string) <= 2017-07-01))
+- *FileScan parquet [id#6521,title#6522,tags#6523,owner_user_id#6524,accepted_answer_id#6525,view_count#6526,answer_count#6527,comment_count#6528,creation_date#6529,favorite_count#6530] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data#cs4225.blob.core.windows.net/filtered/questions.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(creation_date)], ReadSchema: struct<id:string,title:string,tags:array<string>,owner_user_id:string,accepted_answer_id:string,v...
Overpartitioning can actually reduce performance:
If a column has only a few rows matching each value, the number of directories to process can become a limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting data in multi-megabyte blocks.
This excerpt was taken from the documentation of a different Hadoop component, Impala, but the argument it presents should be valid for all components of the Hadoop stack.
I think that regardless of the partitioning scheme used, the advantages of partitioning will not be apparent until the table grows well beyond 900 MB.
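To see whether a layout is over-partitioned in this sense, one quick check (a sketch using the Hadoop FileSystem API; the wasb path is the elided one from the question) is to count how many files each partition directory holds and how small they are:

import org.apache.hadoop.fs.Path

// Many directories that each contain a single tiny file is the classic
// over-partitioning symptom described in the quote above.
val root = new Path("wasb://.../parquet/questions.parquet")
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(root).filter(_.isDirectory).foreach { dir =>
  val files = fs.listStatus(dir.getPath).filter(_.getPath.getName.endsWith(".parquet"))
  println(s"${dir.getPath.getName}: ${files.length} file(s), ${files.map(_.getLen).sum} bytes")
}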

How to load a bucketed DataFrame so that bucketing is preserved?

I have bucketed a dataframe, i.e. written it with bucketBy and saveAsTable.
If I load it with spark.read.parquet, I don't benefit from the optimization (shuffle avoidance).
scala> spark.read.parquet("${spark-warehouse}/tab1").groupBy("a").count.explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#35117], functions=[count(1)], output=[a#35117, count#35126L])
+- Exchange hashpartitioning(a#35117, 200)
   +- *HashAggregate(keys=[a#35117], functions=[partial_count(1)], output=[a#35117, count#35132L])
      +- *FileScan parquet [a#35117] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I need to load it with spark.table to benefit from the optimization.
scala> spark.table("tab1").groupBy("a").count().explain(true)
== Physical Plan ==
*HashAggregate(keys=[a#149], functions=[count(1)], output=[a#149, count#35140L])
+- *HashAggregate(keys=[a#149], functions=[partial_count(1)], output=[a#149, count#35146L])
   +- *FileScan parquet default.tab1[a#149] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/yann.moisan/projects/teads/data/spark-warehouse/tab1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int>
I don't understand why Spark does not automatically detect the bucketing in the first case, for example by using the filename, which is slightly different in this case: part-00007-ca117fc2-2552-4693-b6f7-6b27c7c4bca7_00001.snappy.parquet?
I don't understand why Spark does not automatically detect the bucketing in the first case
Simple. There is no support for bucketed dataframes that are not loaded as bucketed tables using spark.table.
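The bucketing spec lives in the metastore, not in the Parquet files themselves, which is why spark.table can use it and spark.read.parquet cannot. A small sketch of how to confirm what the catalog knows about the table:

// "Num Buckets" / "Bucket Columns" show up in the extended description,
// and spark.catalog marks column "a" with isBucket = true.
spark.sql("DESCRIBE EXTENDED tab1").show(100, truncate = false)
spark.catalog.listColumns("tab1").show()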
