I have created a bucketed internal Hive table through the DataFrameWriter saveAsTable API:
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1)
  .sortBy(col1)
  .saveAsTable(hiveTableName);
Now I trigger two select queries via Spark SQL, one on the bucketed column and the other on a non-bucketed column, but I don't see any difference in the execution time.
The queries are:
select * from t1 where col1='123' [t1 is bucketed by col1]
select * from t1 where col2='123' [col2 is not a bucketing column]
My questions are:
How can I ascertain during query execution whether a full table scan is happening or only the relevant buckets are being scanned?
Can I get any information from the DAG or the physical plan? I have looked at both, but I don't see any difference.
This is what I see in the physical plan:
== Physical Plan ==
*(1) Project [col1#0, col2#1, col3#2, col4#3, col5#4, col6#5, col7#6, col8#7, col9#8, col10#9, col11#10, col12#11]
+- *(1) Filter (isnotnull(col2#1) && (col2#1 = 123))
+- *(1) FileScan parquet default.uk_geocrosswalk[col1#0,col2#1,col3#2,col4#3,col5#4,col6#5,col7#6,col8#7,col9#8,LSOA_MSOA_WEIGHT#9,col11#10,col12#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://url/a.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)], ReadSchema: struct
In the physical plan, why is it doing a FileScan? Shouldn't it be doing a HiveTableScan, since the table was created as a Hive table?
Are there certain config parameters I can use to tune my queries when using Spark SQL?
I see that each time I run a query for the first time in Spark SQL it takes a considerably long time. Is there a way I can warm up the executors before the query gets executed?
Parquet is columnar and, in my experience, very fast. The columnar aspect may well explain the identical performance: whether the filter column is a bucketing key or not, the data format is physically columnar.
It is a Hive table, but one using Parquet and bucketing, and thus not accessible to Hive / Impala. Since the storage is Parquet, a HiveTableScan would not be appropriate; a Hive table can have many physical formats: text, Parquet, ORC, and so on.
You can see the filtering being applied: PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)].
There is no warmup as such. You could .cache things, but from my own tests and others I have seen, caching Parquet tables makes little difference; it depends on the test.
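To check whether only some buckets are scanned, here is a hedged sketch (it assumes Spark 2.4 or later, where filter-based bucket pruning is available and spark.sql.sources.bucketing.enabled is left at its default of true; older versions always scan all bucket files):

// Filter on the bucketing column: the FileScan node should include an entry
// like "SelectedBucketsCount: 1 out of <numBuckets>" when bucket pruning applies.
spark.sql("SELECT * FROM t1 WHERE col1 = '123'").explain()

// Filter on a non-bucketing column: no SelectedBucketsCount entry; every bucket
// file is scanned and only the pushed filter limits the rows returned.
spark.sql("SELECT * FROM t1 WHERE col2 = '123'").explain()

Independently of the plan, the scan node metrics in the Spark UI SQL tab show the number of files and bytes actually read for each query.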
Related
How do pushed filters work when using Parquet files?
Below are the two queries that I submitted in Databricks.
HighVolume = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
    .where("originating_base_num in ('B02764','B02617')").count()
HighVolume_wofilter = spark.read.parquet("/FileStore/shared_uploads/highVolume/*.parquet") \
    .count()
Physical plan: it clearly shows a non-empty PushedFilters entry for the HighVolume DataFrame.
HighVolume :
PushedFilters: [In(originating_base_num, [B02617,B02764])]
HighVolume_wofilter:
PushedFilters: []
But when checking the Spark UI, I observed that Spark is reading all the rows in both cases (apparently ignoring the filters).
[Spark UI screenshots for HighVolume and HighVolume_wofilter omitted.]
Can someone please help me understand why all the rows are being read even though the filters appear in the physical plan?
Thanks!
When you are working with Parquet there are a few types of optimizations:
Skipping unnecessary files when the table is partitioned and there is a condition on the partition column. In the explain output this is visible as PartitionFilters: [p#503 IN (1,2)] (p is the partition column). In this case Spark reads only the files belonging to the given partitions - this is the most efficient case for Parquet.
Skipping some data inside the files - the Parquet format keeps internal statistics, such as min/max per column, that allow skipping blocks inside a Parquet file that cannot contain your data. These filters are shown as PushedFilters: [In(p, [1,2])]. This may not be effective if your data falls inside the min/max range, in which case Spark still needs to read all blocks and filter at the Spark level. A sketch of both cases follows below.
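A minimal sketch contrasting the two cases, assuming a local spark-shell session and throwaway paths under /tmp (all names here are made up for illustration):

import spark.implicits._

val df = (1 to 1000).map(i => (i, i % 4)).toDF("value", "p")

// Case 1: partitioned layout - a filter on p becomes a PartitionFilter and
// whole directories (p=0, p=3) are skipped before any Parquet file is opened.
df.write.mode("overwrite").partitionBy("p").parquet("/tmp/partition_filter_demo")
spark.read.parquet("/tmp/partition_filter_demo").where("p in (1,2)").explain()

// Case 2: unpartitioned layout - the predicate only appears as a PushedFilter.
// Row groups whose min/max statistics rule out the predicate can be skipped;
// otherwise the rows are read and filtered on the Spark side.
df.write.mode("overwrite").parquet("/tmp/pushed_filter_demo")
spark.read.parquet("/tmp/pushed_filter_demo").where("value < 10").explain()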
P.S. Please take into account that the Delta Lake format allows accessing data more efficiently thanks to data skipping, bloom filters, Z-ordering, etc.
I've used the partitionColumn JDBC options to read a 300 million row table, hoping to achieve low memory/disk requirements for my ETL job (in Spark 3.0.1).
However, the explain plan shows at the start/leaf:
+- Exchange hashpartitioning(partitionCol#1, 200), true, [id=#201]
+- *(1) Scan JDBCRelation(table)[numPartitions=200] (partitionCol#1, time#2)...
I would have expected that shuffling was not necessary here, since the partitionCol was specified in the JDBC option.
There's a whole lot going on in the full plan, but every window operation partitions by partitionCol first and then other columns.
I've tried:
Ensuring my columns are declared not-null (since I saw Sort[partitionCol#1 ASC NULLS FIRST...] being injected and thought that might be an issue)
Checking dataframe partitioning: jdbcDF.rdd.partitioner is None (which seems to confirm it's not understood)
How to join two JDBC tables and avoid Exchange? leads to the Datasource v2 partitioning reporting interface (fixed in 2.3.1), but perhaps that doesn't extend to jdbc loading?
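For reference, a hedged reconstruction of the kind of JDBC read the question describes (the URL, bounds and table name are placeholders, not taken from the question):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection
  .option("dbtable", "table")
  .option("partitionColumn", "partitionCol")
  .option("lowerBound", "0")                         // placeholder bounds
  .option("upperBound", "1000000")
  .option("numPartitions", "200")
  .load()

// These options only control how the table is split into 200 parallel JDBC
// reads; the resulting relation does not report any output partitioning to
// Catalyst, which is consistent with jdbcDF.rdd.partitioner being None above.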
If I have a table created with multi-level partitions, i.e. comprising two columns (state, city), as follows:
state=CA,city=Anaheim
state=Texas,city=Houston
state=Texas,city=Dallas
state=Texas,city=Austin
state=CA,city=SanDiego
and if I run a select query like this:
select * from table_name where city='Houston'
i.e. where the second partition column is used, will it just scan the city=Houston partition within state=Texas? I am quite sure this is how Hive operates, but I am keen to confirm the behaviour in Spark. Also, will the behaviour be any different if it's executed in EMR's Spark?
If you are using Hive to store the table, then it will definitely be able to do partition pruning for both the outer and the inner partition columns. Hive keeps the partition metadata of a table separately, so when a query targets a particular partition it can apply the optimization.
You can actually test this behaviour using explain select * from table_name where city = 'Houston';
However, if you are using Spark to write the partitions into a nested directory structure, I am not so sure. If the query needs to traverse the whole directory structure, that will be expensive when the number of directories is huge.
Let's start with the case of loading data from a file path rather than from the metastore. In this case, Spark will first do a recursive file listing to discover the nested partition folders and the files within them. The partition folders are then defined as fields used for partition pruning. So, in your case, when you filter on any of the partition columns, Spark will select only the partitions that fulfill that predicate. You can confirm this by using the explain method on a query. Notice below that PartitionCount: 1:
scala> input1.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet [id#32,state#33,city#34] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 1, PartitionFilters: [isnotnull(city#34), (city#34 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
Compare that to a query plan without any filters, where PartitionCount: 5:
scala> input1.explain()
== Physical Plan ==
*(1) FileScan parquet [id#55,state#56,city#57] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 5, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
Now the second case is when you load a partitioned table. In this case the partitions are managed by Hive so it saves you the expensive recursive file listing. When you filter on a partition column, again Spark will select only the relevant partitions. Notice the explain plan below:
scala> input2.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet default.data[id#39,state#40,city#41] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/tmp/data/state=Texas/city=Houston], PartitionCount: 1, PartitionFilters: [isnotnull(city#41), (city#41 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
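For completeness, a sketch of how the two inputs above (input1 read from a path, input2 read from a metastore table) might be produced in spark-shell; the /tmp/data path matches the plans above, everything else is an assumption:

val df = Seq(
  (1, "CA", "Anaheim"), (2, "Texas", "Houston"), (3, "Texas", "Dallas"),
  (4, "Texas", "Austin"), (5, "CA", "SanDiego")
).toDF("id", "state", "city")

// Case 1: write a nested partition layout and read it back by path.
// Spark lists the directories and derives state/city as partition columns.
df.write.mode("overwrite").partitionBy("state", "city").parquet("/tmp/data")
val input1 = spark.read.parquet("/tmp/data")

// Case 2: register a metastore table over the same layout, so the metastore
// tracks the partitions and Spark can prune them without a recursive listing.
spark.sql("""CREATE TABLE data (id INT, state STRING, city STRING)
             USING parquet PARTITIONED BY (state, city) LOCATION '/tmp/data'""")
spark.sql("MSCK REPAIR TABLE data")   // discover the existing partition directories
val input2 = spark.table("data")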
I am trying to see the difference between doing limits in Spark vs AWS Glue.
I tried using Spark SQL:
spark.sql("SELECT * FROM flights LIMIT 10")
The explain looks something like:
CollectLimit 10
+- *FileScan parquet xxxxxx.flights[Id#31,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://xxxxxx/flights], PartitionCount: 14509, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
Then I tried using the AWS Glue Data Catalog to see if it's any faster:
gdf = glueContext.create_dynamic_frame.from_catalog(database = "xxxxxx", table_name = "xxxxxx")
df = gdf.toDF()
df = df.limit(10)
df.explain(True)
df.show(10)
The explain looks like:
GlobalLimit 10
+- LocalLimit 10
+- LogicalRDD [Id#70, ...]
The first runs in 5 minutes and the second in 4 minutes. That's not a significant difference yet, but it seems that either querying the Data Catalog is faster, or doing the limit on the DataFrame is better than doing the limit in Spark SQL?
What's the difference between a collect vs global vs local limit? I am guessing a local limit means each executor applies the limit locally and then the driver applies the global limit to give the final result. But why is Spark SQL not also doing this optimization?
Does Spark read all the underlying parquet files before doing any limit? Is there a way to tell Spark to read only until it gets just 10 rows, in this example case?
SQL way or programmatic Dataset creation - the control flow is the same in both cases; it goes through the Spark SQL Catalyst. In your case, when the query was run for the first time, it fetched metadata about the table from the metastore and cached it; in subsequent queries the cached metadata is reused, which might be the reason for the slowness of the first query.
There is no LogicalPlan node called CollectLimit; there is only the CollectLimitExec physical plan node, and a limit is implemented as a LocalLimit followed by a GlobalLimit (link to code).
Spark performs the limit incrementally:
It tries to retrieve the given number of rows using one partition.
If the number of rows is not satisfied, Spark then queries the next 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), then 16, and so on, until the limit is satisfied or the data is exhausted.
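A small sketch of how that incremental behaviour can be tuned (the dataset path and Spark session are assumptions; spark.sql.limit.scaleUpFactor is the config mentioned above):

val flights = spark.read.parquet("/tmp/flights")   // placeholder dataset path

// Grow the number of partitions scanned between attempts more aggressively:
// scan 1 partition first, then 8 times as many, then 64 times as many, and so
// on, until 10 rows have been found or the data is exhausted.
spark.conf.set("spark.sql.limit.scaleUpFactor", "8")

flights.limit(10).explain()   // typically shows CollectLimit 10, as in the SQL example above
flights.limit(10).show()      // executed incrementally, so only a few partitions are scanned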
I have a dataset with various columns and an id. The dataset is sorted by id (also verified with parquet-tools):
example:
file 1: ID 1-10
file 2: ID 10-12
file 3: ID 12-33
....
I also generated and wrote the _metadata and _common_metadata files. I tried querying the (very big) dataset using a filter:
val mydata=spark.read.parquet("s3a://.../mylocation")
val result = mydata.filter(mydata("id") === 11)
result.explain(true)
The explain showed me:
== Parsed Logical Plan ==
Filter (id#14L = 11)
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Analyzed Logical Plan ==
fieldA: int, fieldB: string, id: bigint
Filter (id#14L = 11)
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Optimized Logical Plan ==
Filter (isnotnull(id#14L) && (id#14L = 11))
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Physical Plan ==
*(1) Project [fieldA#12, fieldB#13,id#14L]
+- *(1) Filter (isnotnull(id#14L) && (id#14L = 11))
+- *(1) FileScan parquet [fieldA#12,fieldB#13,id#14L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://mybucket/path/to/data], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,11)], ReadSchema: struct<fieldA:int,fieldB:string,id:bigint>
I also enabled logging and could see that multiple files are read to obtain the per-file metadata. I have 10000 files in this "directory" in S3, so it takes a long time to retrieve all the metadata from the files.
Why is Spark not getting the metadata from the _metadata file? Is there an option to enable this? I have already tried the following options:
spark.conf.set("parquet.summary.metadata.level","ALL")
spark.conf.set("parquet.filter.statistics.enabled","true")
spark.conf.set("parquet.filter.dictionary.enabled","true")
spark.conf.set("spark.sql.parquet.filterPushdown","true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet","true")
spark.conf.set("spark.sql.parquet.respectSummaryFiles","true")
spark.conf.set("spark.sql.parquet.mergeSchema","false")
spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema","false")
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
Parquet summary files were deemed to be practically useless and write support for them was disabled in SPARK-15719. The reasoning mentioned in that JIRA suggests that summary files were only used for reading the schema and not other metadata like the min/max stats that could be useful for filtering. I can't confirm whether it's actually the case, but here is an excerpt from that reasoning:
Parquet summary files are not particularly useful nowadays since
when schema merging is disabled, we assume schema of all Parquet part-files are identical, thus we can read the footer from any part-files.
when schema merging is enabled, we need to read footers of all files anyway to do the merge.
According to this excerpt, the need to read every file footer may also be caused by schema merging being enabled, although if the summary files are really only used for the schema, then I think file footers have to be read anyway.
If querying by ID is a frequent operation for you, you may consider partitioning your table by ID to avoid reading files unnecessarily.
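A sketch of that suggestion (paths are placeholders based on the question; note that partitioning directly by a high-cardinality id creates one directory per value, so a derived bucketing column may be more practical - that caveat is my own assumption, not part of the answer):

// Rewrite the data partitioned by id so that a filter on id becomes a
// PartitionFilter and prunes directories instead of opening 10000 footers.
val mydata = spark.read.parquet("s3a://.../mylocation")

mydata.write
  .partitionBy("id")
  .parquet("s3a://.../mylocation_by_id")   // placeholder output path

spark.read.parquet("s3a://.../mylocation_by_id")
  .filter("id = 11")
  .explain()   // expect a PartitionFilter on id and only the matching directory to be read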