We have a dataset which runs as an incremental build on our Foundry instance.
The dataset is a large time-series dataset (56.5 billion rows, 10 columns, 965 GB) with timestamps in 1-hour buckets. It grows by around 10 GB per day.
To optimise the dataset for analysis, we have repartitioned it on two attributes, "measure_date" and "measuring_time".
This reflects the access pattern: the dataset is usually accessed by "measure_date". We sub-partition by "measuring_time" to decrease the size of the Parquet files being produced, and because filtering on time is a common access pattern as well.
The code which creates the partitions is as follows:
if ctx.is_incremental:
    return df.repartition(24, "measure_date", "measuring_time")
else:
    return df.repartition(2200, "measure_date", "measuring_time")
Using the hash partitioning creates unbalanced file sizes, but that is the topic of a different post.
I am now trying to find out how to make Spark on Foundry use these partitions when evaluating filter criteria. From what I can see, this is NOT happening.
I created a code workbook and ran the following query on the telemetry data, saving the result to another dataset.
SELECT *
FROM telemetry_data
where measure_date = '2022-06-05'
The physical query plan of the build seems to indicate that Spark is not using any partitions, since PartitionFilters is empty in the plan:
Batched: true, BucketedScan: false, DataFilters: [isnotnull(measure_date#170), (measure_date#170 = 19148)],
Format: Parquet, Location: InMemoryFileIndex[sparkfoundry://prodapp06.palantir:8101/datasets/ri.foundry.main.dataset.xxx...,
PartitionFilters: [],
PushedFilters: [IsNotNull(measure_date), EqualTo(measure_date,2022-06-05)],
ReadSchema: struct<xxx,measure_date:date,measuring_time_cet:timestamp,fxxx, ScanMode: RegularMode
How can I make Spark on Foundry use partition pruning?
I believe you need to use transforms.api.IncrementalTransformOutput.write_dataframe() with partitionBy=['measure_date', 'measuring_time'] to achieve what you are looking for. Check the Foundry docs for more.
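A minimal sketch of what that transform could look like (the dataset paths are hypothetical; the partition-column argument is written here as partition_cols, which is how the Foundry transforms docs name it, whereas the answer above calls it partitionBy, so double-check against the API version you are on):

from transforms.api import transform, Input, Output


@transform(
    out=Output("/path/to/telemetry_partitioned"),   # hypothetical output dataset
    source=Input("/path/to/telemetry_data"),        # hypothetical input dataset
)
def compute(source, out):
    df = source.dataframe()
    # Writing with explicit partition columns produces a Hive-style directory
    # layout (measure_date=.../measuring_time=...), which is what Spark's
    # PartitionFilters can prune on when you later filter by measure_date.
    out.write_dataframe(df, partition_cols=["measure_date", "measuring_time"])

The layout produced is one directory per (measure_date, measuring_time) combination, so this only makes sense as long as the number of distinct combinations stays manageable.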
Related
I have a large dataset (hundreds of millions of rows) that I need to process heavily using Spark on Databricks. This dataset has tens of columns, typically integers, floats, or arrays of integers.
My question is: does it make any difference if I drop some columns that are not needed before processing the data? In terms of memory and/or processing speed?
It depends on what you are going to do with this dataset. Spark is smart enough to figure out which columns are really needed, but it's not always that easy. For example, when you use a UDF (user-defined function) that operates on a case class with all the columns defined, all of the columns are going to be selected from the source, because from Spark's perspective such a UDF is a black box.
You can check which columns are selected for your job via the Spark UI. For example, check out this blog post: https://medium.com/swlh/spark-ui-to-debug-queries-3ba43279efee
In your plan you can look for this line: PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string>
From ReadSchema you will be able to figure out which columns are read by Spark and whether they are really needed in your processing.
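To make the UDF point concrete, here is a small, hypothetical PySpark sketch (the path and column names are made up): selecting only the needed columns keeps ReadSchema small, while a UDF that is fed every column forces Spark to read them all.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")   # hypothetical wide dataset

# Built-in functions on two columns: ReadSchema in the physical plan
# contains only these columns, the rest are pruned at the source.
pruned = df.select("id", "amount").where(F.col("amount") > 100)
pruned.explain()

# A Python UDF applied to a struct of all columns is a black box to the
# optimizer, so every column ends up in ReadSchema even if the UDF
# only uses a couple of them.
row_size = F.udf(lambda row: len(row), "int")
unpruned = df.withColumn("row_size", row_size(F.struct(*df.columns)))
unpruned.explain()

So dropping (or simply not selecting) unneeded columns usually changes little when you stick to built-in functions, but it can matter a lot as soon as whole rows are passed into opaque code.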
How do PushedFilters work when using Parquet files?
Below are the two queries that I submitted in Databricks.
HighVolume = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
    .where("originating_base_num in ('B02764','B02617')").count()

HighVolume_wofilter = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
    .count()
Physical plan: it clearly shows a non-empty PushedFilters for the HighVolume DataFrame.
HighVolume :
PushedFilters: [In(originating_base_num, [B02617,B02764])]
HighVolume_wofilter:
PushedFilters: []
But while checking the Spark UI, I observed that Spark is reading all the rows in both cases (ignoring the filters).
Snippet: (Spark UI screenshots for HighVolume and HighVolume_wofilter not reproduced; both show the full row count being read.)
Can someone please help me understand why all the rows are being read, despite the filters being present in the physical plan?
Thanks!
When you are working with Parquet there are a few types of optimization:
Skipping unnecessary files when the table is partitioned and there is a condition on the partition column. In the explain output this is visible as PartitionFilters: [p#503 IN (1,2)] (p is the partition column). In this case Spark reads only the files belonging to the given partitions - this is the most efficient option for Parquet.
Skipping some data inside the files - the Parquet format keeps internal statistics, such as min/max per column, that allow Spark to skip reading blocks inside a Parquet file that cannot contain your data. These filters are shown as PushedFilters: [In(p, [1,2])]. But this may not be efficient if your data falls inside the min/max range, in which case Spark needs to read all blocks and filter at the Spark level.
P.S. Please take into account that the Delta Lake format allows data to be accessed more efficiently thanks to data skipping, bloom filters, Z-ordering, etc.
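Applied to the question above, a hedged sketch (the rewritten path is hypothetical): if the data is rewritten Hive-partitioned by originating_base_num - only sensible because that column has few distinct values - the same predicate becomes a PartitionFilter and whole directories are skipped, instead of relying on row-group statistics via PushedFilters.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet")

# Rewrite the data partitioned by the filter column (hypothetical output path).
src.write.partitionBy("originating_base_num") \
    .parquet("/FileStore/shared_uploads/highVolume_by_base")

# The same predicate now shows up under PartitionFilters in the plan,
# and only the matching directories are listed and read.
filtered = spark.read.parquet("/FileStore/shared_uploads/highVolume_by_base") \
    .where("originating_base_num in ('B02764','B02617')")
filtered.explain()
print(filtered.count())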
If I am understanding the documentation correctly, partitioning a DataFrame and partitioning a Hive or other on-disk table seem to be different things. For on-disk storage, partitioning by, say, date creates a set of partitions for each date which occurs in my dataset. This seems useful; if I query records for a given date, every node in my cluster processes only the partitions corresponding to the date I want.
Dataframe.repartition, on the other hand, creates one partition for each date which occurs in my dataset. If I search for records from a specific date, they will all be found in a single partition and thus all processed by a single node.
Is this right? If so, what is the use case? What is the way to get the speed advantage of on-disk partitioning schemes in the context of a dataframe?
For what it's worth, I need the advantage after I do an aggregation of on-disk data, so the on-disk partitioning doesn't necessarily help me even with delayed execution.
In your example, Spark will be able to retrieve all the records linked to that date very quickly. That's an improvement.
In the following piece of code, you can see that the filter has been categorized as a partition filter.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

inputRdd = sc.parallelize([("fish", 1), ("cats", 2), ("dogs", 3)])
schema = StructType([StructField("animals", StringType(), True),
                     StructField("ID", IntegerType(), True)])
my_dataframe = inputRdd.toDF(schema)
my_dataframe.write.partitionBy('animals').parquet("home")
sqlContext.read.parquet('home').filter(col('animals') == 'fish').explain()
== Physical Plan ==
*(1) FileScan parquet [ID#35,animals#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/home], PartitionCount: 1, PartitionFilters: [isnotnull(animals#36), (animals#36 = fish)], PushedFilters: [], ReadSchema: struct<ID:int>
For a deeper insight, you may want to have a look at this.
I am actually not sure about your other question. You are probably right; in my example df.rdd.getNumPartitions() gives 1, and with one partition performance is not so great (but you have already read from the disk at this point). For the following steps, calling repartition(n) will fix the problem, but it may be quite costly.
Another possible improvement applies when joining two DataFrames that share the same partitioning (with the join keys being the partition columns): you avoid a lot of shuffles in the join phase, as sketched below.
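A small sketch of that last point, with made-up DataFrames: when both sides are repartitioned by the join key, the explain output should show that the join itself does not introduce an additional exchange.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
animals = spark.createDataFrame(
    [("fish", 1), ("cats", 2), ("dogs", 3)], ["animals", "ID"])
diets = spark.createDataFrame(
    [("fish", "flakes"), ("cats", "kibble"), ("dogs", "kibble")],
    ["animals", "diet"])

# Repartition both sides by the join key; the join can then reuse this
# hash partitioning instead of shuffling again for the join itself.
left = animals.repartition("animals")
right = diets.repartition("animals")

left.join(right, "animals").explain()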
I am trying to see the difference between ways of doing limits in Spark / AWS Glue.
I tried using Spark SQL
spark.sql("SELECT * FROM flights LIMIT 10")
The explain looks something like:
CollectLimit 10
+- *FileScan parquet xxxxxx.flights[Id#31,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://xxxxxx/flights], PartitionCount: 14509, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
Then I tried using the AWS Glue Data Catalog to see if it's any faster.
gdf = glueContext.create_dynamic_frame.from_catalog(database = "xxxxxx", table_name = "xxxxxx")
df = gdf.toDF()
df = df.limit(10)
df.explain(True)
df.show(10)
The explain looks like:
GlobalLimit 10
+- LocalLimit 10
+- LogicalRDD [Id#70, ...]
The first runs in 5 minutes and the second in 4 minutes. Not that significant yet, but it appears that either querying the Data Catalog is faster, or doing a limit on the DataFrame is better than doing the limit in Spark SQL?
What's the difference between a collect vs. global vs. local limit? I am guessing a local limit means the limit is applied locally per partition and then the driver applies the global limit to give the final result. But why is Spark SQL not also doing this optimization?
Does Spark read all the underlying Parquet files before doing any limit? Is there a way to tell Spark to read only until it gets just 10 rows in this example case?
SQL way or programmatic Dataset creation - the control flow is the same in both cases; it goes through the Spark SQL Catalyst optimizer. In your case, when the query was run for the first time, it fetched metadata about the table from the metastore and cached it; in subsequent queries the cached metadata is reused, which might be the reason for the slowness of the first query.
There is no logical-plan node called CollectLimit; there is only the CollectLimitExec physical-plan node, and limit is implemented as LocalLimit followed by GlobalLimit (link to code).
Spark performs limit incrementally.
It tries to retrieve the given number of rows using one partition.
If the number of rows is not satisfied, Spark then queries the next 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), then 16, and so on until the limit is satisfied or the data is exhausted.
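A small sketch of how you could observe and tune that behaviour (the session setup and path are assumptions taken from the question; the config key is the one named above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Scale up more aggressively when a limit is not satisfied by the first
# partition: 1 partition, then 10, then 100, and so on.
spark.conf.set("spark.sql.limit.scaleUpFactor", "10")

df = spark.read.parquet("s3://xxxxxx/flights")   # hypothetical path from the question
df.limit(10).explain(True)   # GlobalLimit/LocalLimit in the logical plans, CollectLimit in the physical plan
df.limit(10).show()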
I am reading Parquet data and I see that it is listing all the directories on the driver side:
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2016-01 on driver
Listing s3://xxxx/defloc/warehouse/products_parquet_151/month=2014-12 on driver
I have specified month=2014-12 in my where clause.
I have tried using Spark SQL and the DataFrame API, and it looks like neither is pruning partitions.
Using Dataframe API
df.filter("month='2014-12'").show()
Using Spark SQL
sqlContext.sql("select name, price from products_parquet_151 where month = '2014-12'")
I have tried the above on versions 1.5.1, 1.6.1 and 2.0.0
Spark needs to load the partition metadata in the driver first, to know whether the partition exists or not. Spark queries the directory to find the existing partitions, so it knows whether it can prune partitions during the scan of the data.
I've tested this on Spark 2.0, and you can see it in the log messages:
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year on driver
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year/year=2007 on driver
This doesn't mean that Spark is scanning the files in each partition; rather, it stores the locations of the partitions for future queries on the table.
You can see in the logs that it is actually passing in partition filters to prune the data:
16/10/14 17:23:48 TRACE ListingFileCatalog: Partition spec: PartitionSpec(StructType(StructField(year,IntegerType,true)),ArrayBuffer(PartitionDirectory([2012],s3a://mybucket/reddit_year/year=2012), PartitionDirectory([2010],s3a://mybucket/reddit_year/year=2010), ...PartitionDirectory([2015],s3a://mybucket/reddit_year/year=2015), PartitionDirectory([2011],s3a://mybucket/reddit_year/year=2011)))
16/10/14 17:23:48 INFO ListingFileCatalog: Selected 1 partitions out of 9, pruned 88.88888888888889% partitions.
You can see this in the logical plan if you run an explain(True) on your query:
spark.sql("select created_utc, score, name from reddit where year = '2014'").explain(True)
This will show you the plan and you can see that it is filtering at the bottom of the plan:
+- *BatchedScan parquet [created_utc#58,name#65,score#69L,year#74] Format: ParquetFormat, InputPaths: s3a://mybucket/reddit_year, PartitionFilters: [isnotnull(year#74), (cast(year#74 as double) = 2014.0)], PushedFilters: [], ReadSchema: struct<created_utc:string,name:string,score:bigint>
Spark has opportunities to improve its partition pruning when going via Hive; see SPARK-17179.
If you are just going direct to the object store, then the problem is that recursive directory operations against object stores are real performance killers. My colleagues and I have done work in the S3A client there (HADOOP-11694), and now need to follow it up with changes to Spark to adopt the specific API calls we've been able to fix. For that, though, we need to make sure we are working with real datasets with real-world layouts, so we don't optimise for specific examples/benchmarks.
For now, people should choose partition layouts which have shallow directory trees.