How Pushed Filters work with Parquet files in databricks? - apache-spark

How does pushedFilters work while using parquet files ?
Below are the two queries that I submitted in databricks.
HighVolume = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
.where("originating_base_num in ('B02764','B02617')").count()
HighVolume_wofilter = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
.count()
Physical Plan: Clearly mentions PushedFilter is not null for HighVolume dataframe.
HighVolume :
PushedFilters: [In(originating_base_num, [B02617,B02764])]
HighVolume_wofilter:
PushedFilters: []
But while checking Spark UI, I observed that spark is reading all the rows in both the cases ( ignoring the filters).
snippet:
HighVolume :
HighVolume_wofilter:
Can someone please help me understand that why instead of having the filters in Physical plan, all the rows are being read. ?
Thanks!

When you working with parquet there are few types of optimizations:
Skipping reading the not necessary files when table is partitioned, and there is a condition on that partition. In the explain it will be visible as PartitionFilters: [p#503 IN (1,2)] (p is the partition column). In this case Spark will read only files related to the given partitions - it's most efficient way for Parquet.
Skipping some data inside the files - Parquet format has internal statistics, such as, min/max per column, etc. that allows to skip reading blocks inside Parquet that doesn't contain your data. These filters will be shown as PushedFilters: [In(p, [1,2])]. But this may not be efficient if your data is inside min/max range, so Spark needs to read all blocks and filter on the Spark level.
P.S. Please take into account that Delta Lake format allows to access data more efficiently because of the data skipping, bloom filters, Z-ordering, etc.

Related

Can Spark in Foundry use Partition Pruning

We have a dataset which runs as an incremental build on our Foundry instance.
The dataset is a large time series dataset (56.5 billion rows, 10 columns, 965GB), with timestamps in 1 hour buckets. The dataset grows by around 10GB per day.
In order to optimise the dataset for analysis purposes, we have repartitioned the dataset on two attributes “measure_date” and “measuring_time”.
This reflects the access pattern - the data set is usually accessed by "measure_date". We sub-partition this by "measuring_time" to decrease the size of parquet files being produced, plus filtering on time is a common access pattern as well.
The code which creates the partition is as following:
if ctx.is_incremental:
return df.repartition(24, "measure_date", "measuring_time")
else:
return df.repartition(2200, "measure_date", "measuring_time")
Using the hash partion creates unbalanced file sizes, but this is topic of a different post.
I am now trying to find out, how to make Spark on Foundry utilize the partitions in filter criteria. From what I can see, this is NOT the case.
I created a code workbook and ran the following query on the telemetry data, saving the result to another data set.
SELECT *
FROM telemetry_data
where measure_date = '2022-06-05'
The pyhsical query plan of the build seems to indicate, that Spark is not utilizing any partition, with PartitionFilters being empty in the plan.
Batched: true, BucketedScan: false, DataFilters: [isnotnull(measure_date#170), (measure_date#170 = 19148)],
Format: Parquet, Location: InMemoryFileIndex[sparkfoundry://prodapp06.palantir:8101/datasets/ri.foundry.main.dataset.xxx...,
PartitionFilters: [],
PushedFilters: [IsNotNull(measure_date), EqualTo(measure_date,2022-06-05)],
ReadSchema: struct<xxx,measure_date:date,measuring_time_cet:timestamp,fxxx, ScanMode: RegularMode
How can I make Spark on Foundry use partition pruning?
I believe you need to use
transforms.api.IncrementalTransformOutput.write_dataframe()
with
partitionBy=['measure_date', 'measuring_time']
to achieve what you are looking for.
Check the foundry docs for more.

How parquet columns can be skipped when reading from hdfs?

We all know parquet is column-oriented so we can get only columns we desired and reduce IO.
But what if the parquet file is stored in HDFS, should we download the entire file first, and then apply column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Still we must download the entire parquet file, is that right?
Or maybe there is a way we can filter the columns just before the network transfer?
Actually "predicate pushdown" which is a feature of Spark SQL will try to use column filters to reduce the amount of information that is processed by spark. Technically the entire hdfs block is still read into memory, but it uses smart logic to only return relevant results. This is normally called out in the physical plan. You can read this by using .explain() on your query to see if the feature is being used. (Not all versions of hdfs support this.)

How does avro partition pruning work internally?

I have a daily job which converts avro to parquet.
Avro file per hour is 20G and is partitioned by year, month, day and hour
when I read the avro file like the way below,
spark.read.format("com.databricks.spark.avro").load(basePath).where($year=2020 and $month=9 and $day=1 and $hour=1).write.paritionBy(paritionCol).parquet(path) - the job runs for 1.5 hours
Note: The whole folder basePath has 36 TB of data in avro format
But, the below command runs for just 7 minutes for the same spark configuration(memory and instances etc.).
spark.read.format("com.databricks.spark.avro").option("basePath", basePath).load(basePath + "year=2020/month=9/day=1/hour=1/").write.paritionBy(paritionCol).parquet(path).
Why there is such a drastic reduction of time? How avro does partition pruning internally?
there are a huge difference.
In the first case you will read all file then filter, in the second case you will read only the selected file (the filter is already done by the partitioning).
you can inspect if the filter is predicate pushdown by using explain() function. In your FileScan avro you will see PushedFilters and PartitionFilters
In your case, your filter is not predicate pushdown.
You can find more informations here : https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html

Spark - Stream kafka to file that changes every day?

I have a kafka stream I will be processing in spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so everyday it will start writing to a new file. Can something like this be done? I want this to be left running and when a new day occurs, it will switch to write to a new file.
val streamInputDf = spark.readStream.format("kafka")
.option("kafka.bootstrapservers", "XXXX")
.option("subscribe", "XXXX")
.load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet)
.option("path", "xxx")
???
Adding partition from spark can be done with partitionBy provided in
DataFrameWriter for non-streamed or with DataStreamWriter for
streamed data.
Below are the signatures :
public DataFrameWriter partitionBy(scala.collection.Seq
colNames)
DataStreamWriter partitionBy(scala.collection.Seq colNames)
Partitions the output by the given columns on the file system.
DataStreamWriter partitionBy(String... colNames) Partitions the
output by the given columns on the file system.
Description :
partitionBy public DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If
specified, the output is laid out on the file system similar to Hive's
partitioning scheme. As an example, when we partition a dataset by
year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize
physical data layout. It provides a coarse-grained index for skipping
unnecessary data reads when queries have predicates on the partitioned
columns. In order for partitioning to work well, the number of
distinct values in each column should typically be less than tens of
thousands.
Parameters: colNames - (undocumented) Returns: (undocumented) Since:
2.0.0
so if you want to partition data by year and month spark will save the data to folder like:
year=2019/month=01/05
year=2019/month=02/05
Option 1 (Direct write):
You have mentioned parquet - you can use saving as a parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert in to hive using same partitionBy ):
You can also insert into hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(String tableName)
Getting all hive partitions:
Spark sql is based on hive query language so you can use SHOW PARTITIONS
To get list of partitions in the specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion :
I would suggest option 2 ... since Advantage is later you can query data based on partition (aka query on raw data to know what you have received) and underlying file can be parquet or orc.
Note :
Just make sure you have .enableHiveSupport() when you are creating session with SparkSessionBuilder and also make sure whether you have hive-conf.xml etc. configured properly.
Based on this answer spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. Have not tried it in spark streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()

How do I enable partition pruning in spark

I am reading parquet data and I see that it is listing all the directories on driver side
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver
I have specified month=2014-12 in my where clause.
I have tried using spark sql and data frame API, and looks like both aren't pruning partitions.
Using Dataframe API
df.filter("month='2014-12'").show()
Using Spark SQL
sqlContext.sql("select name, price from products_parquet_151 where month = '2014-12'")
I have tried the above on versions 1.5.1, 1.6.1 and 2.0.0
Spark needs to load the partition metdata first in the driver to know whether the partition exists or not. Spark will query the directory to find existing partitions to know if it can prune the partition or not during the scanning of the data.
I've tested this on Spark 2.0 and you can see in the log messages.
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year on driver
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year/year=2007 on driver
This doesn't mean that we're scaning the files in each partition, but Spark will store the locations of the partitions for future queries on the table.
You can see the logs that it is actually passing in partition filters to prune the data:
16/10/14 17:23:48 TRACE ListingFileCatalog: Partition spec: PartitionSpec(StructType(StructField(year,IntegerType,true)),ArrayBuffer(PartitionDirectory([2012],s3a://mybucket/reddit_year/year=2012), PartitionDirectory([2010],s3a://mybucket/reddit_year/year=2010), ...PartitionDirectory([2015],s3a://mybucket/reddit_year/year=2015), PartitionDirectory([2011],s3a://mybucket/reddit_year/year=2011)))
16/10/14 17:23:48 INFO ListingFileCatalog: Selected 1 partitions out of 9, pruned 88.88888888888889% partitions.
You can see this in the logical plan if you run an explain(True) on your query:
spark.sql("select created_utc, score, name from reddit where year = '2014'").explain(True)
This will show you the plan and you can see that it is filtering at the bottom of the plan:
+- *BatchedScan parquet [created_utc#58,name#65,score#69L,year#74] Format: ParquetFormat, InputPaths: s3a://mybucket/reddit_year, PartitionFilters: [isnotnull(year#74), (cast(year#74 as double) = 2014.0)], PushedFilters: [], ReadSchema: struct<created_utc:string,name:string,score:bigint>
Spark has opportunities to improve its partition pruning when going via Hive; see SPARK-17179.
If you are just going direct to the object store, then the problem is that recursive directory operations against object stores are real performance killers. My colleagues and I have done work in the S3A client there HADOOP-11694 —and now need to follow it up with the changes to Spark to adopt the specific API calls we've been able to fix. For that though we need to make sure we are working with real datasets with real-word layouts, so don't optimise for specific examples/benchmarks.
For now, people should chose partition layouts which have shallow directory trees.

Resources