How do I enable partition pruning in spark - apache-spark

I am reading parquet data and I see that it is listing all the directories on driver side
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver
I have specified month=2014-12 in my where clause.
I have tried using spark sql and data frame API, and looks like both aren't pruning partitions.
Using Dataframe API
df.filter("month='2014-12'").show()
Using Spark SQL
sqlContext.sql("select name, price from products_parquet_151 where month = '2014-12'")
I have tried the above on versions 1.5.1, 1.6.1 and 2.0.0

Spark needs to load the partition metdata first in the driver to know whether the partition exists or not. Spark will query the directory to find existing partitions to know if it can prune the partition or not during the scanning of the data.
I've tested this on Spark 2.0 and you can see in the log messages.
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year on driver
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year/year=2007 on driver
This doesn't mean that we're scaning the files in each partition, but Spark will store the locations of the partitions for future queries on the table.
You can see the logs that it is actually passing in partition filters to prune the data:
16/10/14 17:23:48 TRACE ListingFileCatalog: Partition spec: PartitionSpec(StructType(StructField(year,IntegerType,true)),ArrayBuffer(PartitionDirectory([2012],s3a://mybucket/reddit_year/year=2012), PartitionDirectory([2010],s3a://mybucket/reddit_year/year=2010), ...PartitionDirectory([2015],s3a://mybucket/reddit_year/year=2015), PartitionDirectory([2011],s3a://mybucket/reddit_year/year=2011)))
16/10/14 17:23:48 INFO ListingFileCatalog: Selected 1 partitions out of 9, pruned 88.88888888888889% partitions.
You can see this in the logical plan if you run an explain(True) on your query:
spark.sql("select created_utc, score, name from reddit where year = '2014'").explain(True)
This will show you the plan and you can see that it is filtering at the bottom of the plan:
+- *BatchedScan parquet [created_utc#58,name#65,score#69L,year#74] Format: ParquetFormat, InputPaths: s3a://mybucket/reddit_year, PartitionFilters: [isnotnull(year#74), (cast(year#74 as double) = 2014.0)], PushedFilters: [], ReadSchema: struct<created_utc:string,name:string,score:bigint>

Spark has opportunities to improve its partition pruning when going via Hive; see SPARK-17179.
If you are just going direct to the object store, then the problem is that recursive directory operations against object stores are real performance killers. My colleagues and I have done work in the S3A client there HADOOP-11694 —and now need to follow it up with the changes to Spark to adopt the specific API calls we've been able to fix. For that though we need to make sure we are working with real datasets with real-word layouts, so don't optimise for specific examples/benchmarks.
For now, people should chose partition layouts which have shallow directory trees.

Related

How Pushed Filters work with Parquet files in databricks?

How does pushedFilters work while using parquet files ?
Below are the two queries that I submitted in databricks.
HighVolume = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
.where("originating_base_num in ('B02764','B02617')").count()
HighVolume_wofilter = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
.count()
Physical Plan: Clearly mentions PushedFilter is not null for HighVolume dataframe.
HighVolume :
PushedFilters: [In(originating_base_num, [B02617,B02764])]
HighVolume_wofilter:
PushedFilters: []
But while checking Spark UI, I observed that spark is reading all the rows in both the cases ( ignoring the filters).
snippet:
HighVolume :
HighVolume_wofilter:
Can someone please help me understand that why instead of having the filters in Physical plan, all the rows are being read. ?
Thanks!
When you working with parquet there are few types of optimizations:
Skipping reading the not necessary files when table is partitioned, and there is a condition on that partition. In the explain it will be visible as PartitionFilters: [p#503 IN (1,2)] (p is the partition column). In this case Spark will read only files related to the given partitions - it's most efficient way for Parquet.
Skipping some data inside the files - Parquet format has internal statistics, such as, min/max per column, etc. that allows to skip reading blocks inside Parquet that doesn't contain your data. These filters will be shown as PushedFilters: [In(p, [1,2])]. But this may not be efficient if your data is inside min/max range, so Spark needs to read all blocks and filter on the Spark level.
P.S. Please take into account that Delta Lake format allows to access data more efficiently because of the data skipping, bloom filters, Z-ordering, etc.

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines for Spark 3.x version and take advantage of all performance improvements on it. My company is using Spark 2.4.0 but we are targeting to use officially the 3.1.1 for all Spark SQL data pipelines but without AQE enabled yet. The primary goal is to keep everything the same but use the newest version. Later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and looking at Spark UI logs, we noticed a change in the query plan as follows:
Spark 2.4.0:
Spark 2.4.0 is using the default SortMergeJoin to do the join operation between the tbl_a and tbl_b, but when we look at query plan from new Spark 3.1.1:
We can notice that instead of SortMergeJoin it is using the BroadcastHashJoin to do the join between tbl_a and tbl_b. Not only this, but if I'm not wrong, the BroadcastExchange operation is occurring on the big table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10Mb
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you guys have any clue on why this is happening? My questions are:
Why Spark 3 has changed the join approach in this situation given that AQE is disabled and the spark.sql.autoBroadcastJoinThreshold is much smaller than the data set size?
Is this the expected behavior or could this represents a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into Spark code for some days, and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, than it looks for totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property presented a very small number (way lower than the autoBroadcastThreshold property) making Spark Optimizer think it was safe to broadcast the right relation, but when the actual broadcast operation happened, it showed a bigger size, approximately the same as in the picture for the right relation, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize now is zero and Spark 3 is using the totalSize (has a reasonable number) as the relation size and consequently, is not using BHJ for this situation.
Now the issue is figuring out why the rawDataSize is so small in the first place or even zero, given that the hive property hive.stats.autogather is true by default (auto calculates the statistics for every DML command) but it seems to be another problem.
Spark has made many improvements around joins.
One of them is :
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join

How Spark SQL reads Parquet partitioned files

I have a parquet file of around 1 GB. Each data record is a reading from an IOT device which captures the energy consumed by the device in the last one minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data residing in this parquet file using Spark SQL An example query finds out the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123).show();
The above code works well. I want to understand that how spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries which need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may be working with just one partition whereas the second query may need all the partitions, so a consolidated plan shall load the whole file from disk in memory (if memory limits allow this).
Will it make a difference in execution time if I cache df4 dataframe above?
Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all data files, but it might in general, access metadata of all files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
Does Spark load only the required partitions from HDFS as per the query?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - Caching dataframes while keeping partitions

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins(configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and if I increase the batch duration to maybe 10mins or so, then even I might end up getting only 2-3mb of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS
I have exactly the same situation as you. I solved it by:
Lets assume that your new coming data are stored in a dataset: dataset1
1- Partition the table with a good partition key, in my case I have found that I can partition using a combination of keys to have around 100MB per partition.
2- Save using spark core not using spark sql:
a- load the whole partition in you memory (inside a dataset: dataset2) when you want to save
b- Then apply dataset union function: dataset3 = dataset1.union(dataset2)
c- make sure that the resulted dataset is partitioned as you wish e.g: dataset3.repartition(1)
d - save the resulting dataset in "OverWrite" mode to replace the existing file
If you need more details about any step please reach out.

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response as this resolves to a single partition. In Hive and Presto this takes seconds, however in Spark it runs for hours.
The actual data is held in a S3 bucket, and when I submit the sql query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status list of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive, are there any settings that can get Spark to prune the partitions earlier - either during the call to the metadata store, or immediately afterwards?
Yes, spark supports partition pruning.
Spark does a listing of partitions directories (sequential or parallel listLeafFilesInParallel) to build a cache of all partitions first time around. The queries in the same application, that scan data takes advantage of this cache. So the slowness that you see could be because of this cache building. The subsequent queries that scan data make use of the cache to prune partitions.
These are the logs which shows partitions being listed to populate the cache.
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing pruning is happening.
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
Spark API documentation for HadoopFsRelation says,
( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html )
"...when reading from Hive style partitioned tables stored in file
systems, it's able to discover partitioning information from the paths
of input directories, and perform partition pruning before start
reading the data..."
So, i guess "listLeafFilesInParallel" could not be a problem.
A similar issue is already in spark jira: https://issues.apache.org/jira/browse/SPARK-10673
In spite of "spark.sql.hive.verifyPartitionPath" set to false and, there is no effect in performance, I suspect that the
issue might have been caused by unregistered partitions. Please list out the partitions of the table and verify if all
the partitions are registered. Else, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
Update:
I guess appropriate parquet block size and page size were set while writing the data.
Create a fresh hive table with partitions mentioned, and file-format as parquet, load it from non-partitioned table using dynamic partition approach.
( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions )
Run a plain hive query and then compare by running a spark program.
Disclaimer: I am not a spark/parquet expert. The problem sounded interesting, and hence responded.
similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will use the Hive parquet serde instead of the spark inbuilt parquet serde. Hive's Parquet serde will not do a listLeafFiles on all partitions, but only and directly read from the selected partitions. On tables with many partitions and files, this is much faster (and cheaper, too). Feel free to try it ou! :)

Resources