If I have a table created with multi-level partitions, i.e. comprising two partition columns (state, city), as follows:
state=CA,city=Anaheim
state=Texas,city=Houston
state=Texas,city=Dallas
state=Texas,city=Austin
state=CA,city=SanDiego
and if I run a select query like this:
select * from table_name where city = 'Houston'
i.e. where only the second partition column is used, will it just scan the city=Houston partition inside state=Texas? I am quite sure that this is how Hive operates, but keen to confirm the behavior in Spark. Also, will the behavior be any different if it's executed in EMR's Spark?
If you are using Hive to store the table, it will definitely be able to do partition pruning, for both the outer and the inner partition column. Hive keeps the partition metadata for a table separately in the metastore, so when a query comes in for a particular partition, it is able to do the optimization.
You can actually test this behaviour using explain select * from table_name where city ='Houston';
However, if you are using Spark to write the partitions into a nested directory structure without a metastore, I am not so sure. If the query needs to traverse the whole directory structure, that will be expensive when the number of directories is huge.
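For illustration, a table partitioned on (state, city) is typically laid out as one leaf directory per combination of values, so pruning on city=Houston only needs to list and read the matching leaves (the paths below are hypothetical):

/warehouse/table_name/state=CA/city=Anaheim/part-00000.parquet
/warehouse/table_name/state=Texas/city=Houston/part-00000.parquet
/warehouse/table_name/state=Texas/city=Dallas/part-00000.parquet
...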
Let's start with the case of loading data from a file path, as opposed to the metastore. In this case, Spark will first do a recursive file listing to discover the nested partition folders and the files within them. The partition folders are then treated as fields used for partition pruning. So, in your case, when you filter on any of the partition columns, Spark will select only the partitions that fulfill that predicate. You can confirm this by using the explain method on a query. Notice below that PartitionCount: 1:
scala> input1.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet [id#32,state#33,city#34] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 1, PartitionFilters: [isnotnull(city#34), (city#34 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
Compare that to a query plan without any filters, where PartitionCount: 5:
scala> input1.explain()
== Physical Plan ==
*(1) FileScan parquet [id#55,state#56,city#57] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 5, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
Now the second case is when you load a partitioned table from the metastore. In this case the partitions are managed by Hive, which saves you the expensive recursive file listing. When you filter on a partition column, again Spark will select only the relevant partitions. Notice the explain plan below:
scala> input2.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet default.data[id#39,state#40,city#41] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/tmp/data/state=Texas/city=Houston], PartitionCount: 1, PartitionFilters: [isnotnull(city#41), (city#41 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
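For reference, input1 and input2 above could have been created roughly like this (a minimal sketch; /tmp/data and the table name data are just placeholders matching the plans above):

scala> val df = Seq((1, "Texas", "Houston"), (2, "CA", "Anaheim")).toDF("id", "state", "city")
scala> df.write.partitionBy("state", "city").parquet("/tmp/data")   // path-based, nested partition folders
scala> val input1 = spark.read.parquet("/tmp/data")                 // discovered via recursive file listing
scala> df.write.partitionBy("state", "city").saveAsTable("data")    // metastore-managed table
scala> val input2 = spark.table("data")                             // partitions tracked by the metastore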
I am trying to understand hash shuffle in Spark. I am reading this article
Hash Shuffle:
Each mapper task creates separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With high amount of mappers and reducers this causes big problems, both with the output buffer size, amount of open files on the filesystem, speed of creating and dropping all these files.
The logic of this shuffler is pretty dumb: it calculates the amount of “reducers” as the amount of partitions on the “reduce” side
Can you help me understand the bolded part? How does it know the number of partitions on the reduce side, and what does "amount of partitions on the 'reduce' side" even mean? Is it equal to spark.sql.shuffle.partitions? If it is indeed equal to that, then what is there left to calculate? A very small example would be very helpful.
spark.sql.shuffle.partitions is just the default used when the number of partitions for a shuffle isn't set explicitly. So the "calculation", at a minimum, involves checking whether a specific number of partitions was requested, or whether Spark should fall back to the default.
Quick example:
scala> df.repartition(400,col("key")).groupBy("key").avg("value").explain()
== Physical Plan ==
*(2) HashAggregate(keys=[key#178], functions=[avg(value#164)])
+- *(2) HashAggregate(keys=[key#178], functions=[partial_avg(value#164)])
+- Exchange hashpartitioning(key#178, 400) <<<<< shuffle across increased number of partitions
+- *(1) Project [key#178, value#164]
+- *(1) FileScan parquet adb.atable[value#164,key#178,othercolumns...] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[hdfs://ns1/hive/adb/atable/key=123..., PartitionCount: 3393, PartitionFilters: [isnotnull(key#178), (cast(key#178 as string) > 100)], PushedFilters: [], ReadSchema: struct<value:double,othercolumns...>
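For contrast, without the explicit repartition(400, ...), the same aggregation would shuffle into the default number of partitions. A minimal sketch of what you would expect to see (200 is Spark's built-in default for spark.sql.shuffle.partitions):

scala> spark.conf.get("spark.sql.shuffle.partitions")
res0: String = 200

scala> df.groupBy("key").avg("value").explain()
== Physical Plan ==
...
+- Exchange hashpartitioning(key#178, 200)
...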
In Spark 3 and up, Adaptive Query Execution (AQE) could also step in and revise that number, in an attempt to optimize execution by coalescing, preserving (e.g. ENSURE_REQUIREMENTS) or increasing the number of partitions.
EDIT: A side note -- your article is quite old (2015 was ages ago :)) and talks about pre-SparkSQL/pre-dataframe times. I'd try to find something more relevant.
EDIT 2: ...But even there, in the comments section, author rightfully says: In fact, here the question is more general. For most of the transformations in Spark you can manually specify the desired amount of output partitions, and this would be your amount of “reducers”...
If I am understanding the documentation correctly, partitioning a DataFrame and partitioning a Hive or other on-disk table seem to be different things. For on-disk storage, partitioning by, say, date creates a partition for each date that occurs in my dataset. This seems useful; if I query records for a given date, every node in my cluster processes only the partitions corresponding to the date I want.
DataFrame.repartition, on the other hand, creates one partition for each date that occurs in my dataset. If I search for records from a specific date, they will all be found in a single partition and thus all processed by a single node.
Is this right? If so, what is the use case? What is the way to get the speed advantage of on-disk partitioning schemes in the context of a dataframe?
For what it's worth, I need the advantage after I do an aggregation of on-disk data, so the on-disk partitioning doesn't necessarily help me even with delayed execution.
In your example, Spark will be able to retrieve all the records linked to that date very quickly. That's an improvement.
In the following piece of code, you can see that the filter has been categorized as a partition filter.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

inputRdd = sc.parallelize([("fish", 1), ("cats", 2), ("dogs", 3)])
schema = StructType([StructField("animals", StringType(), True),
                     StructField("ID", IntegerType(), True)])
my_dataframe = inputRdd.toDF(schema)
my_dataframe.write.partitionBy('animals').parquet("home")
sqlContext.read.parquet('home').filter(col('animals') == 'fish').explain()
== Physical Plan ==
*(1) FileScan parquet [ID#35,animals#36] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/home], PartitionCount: 1, PartitionFilters: [isnotnull(animals#36), (animals#36 = fish)], PushedFilters: [], ReadSchema: struct<ID:int>
For a deeper insight, you may want to have a look at this.
I am actually not sure about your other question. You are probably right: in my example, df.rdd.getNumPartitions() gives 1, and with a single partition performance is not great (but at this point you have already read from disk). Calling repartition(n) for the following steps will fix the problem, but it may be quite costly.
Another possible improvement comes when joining two DataFrames that share the same partitioning (with the join keys being the partition columns): you will avoid a lot of shuffling in the join phase.
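As a Scala sketch of that last point (dfA and dfB are hypothetical DataFrames; pre-partitioning both sides on the join key into the same number of partitions means the join itself should not need an extra shuffle):

import org.apache.spark.sql.functions.col
val a = dfA.repartition(200, col("date"))
val b = dfB.repartition(200, col("date"))
a.join(b, Seq("date")).explain()
// expect Exchange nodes only below the repartition calls, not directly above the SortMergeJoin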
I am trying to see the difference between different ways of applying limits in Spark / AWS Glue.
I tried using Spark SQL
spark.sql("SELECT * FROM flights LIMIT 10")
The explain looks something like:
CollectLimit 10
+- *FileScan parquet xxxxxx.flights[Id#31,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://xxxxxx/flights], PartitionCount: 14509, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
Then I tried using the AWS Glue Data Catalog to see if it's any faster:
gdf = glueContext.create_dynamic_frame.from_catalog(database = "xxxxxx", table_name = "xxxxxx")
df = gdf.toDF()
df = df.limit(10)
df.explain(True)
df.show(10)
The explain looks like:
GlobalLimit 10
+- LocalLimit 10
+- LogicalRDD [Id#70, ...]
The first runs in 5 minutes and the second runs in 4 minutes. Not that significant yet, but it appears that either querying the Data Catalog is faster, or doing a limit on the DataFrame is better than doing the limit in Spark SQL?
What's the difference between a collect vs. global vs. local limit? I am guessing the local limit means each partition applies the limit locally, and then the driver applies the global limit to give the final result. But why is Spark SQL not also doing this optimization?
Does Spark read all the underlying Parquet files before doing any limit? Is there a way to tell Spark to read only until it gets just 10 rows, as in this example?
Whether you use the SQL way or programmatic Dataset creation, the control flow is the same in both cases: it goes through the Spark SQL Catalyst optimizer. In your case, when the query is run for the first time, it fetches metadata about the table from the metastore and caches it; in subsequent queries the cached metadata is reused, which might be the reason for the slowness of the first query.
There is no CollectLimit logical plan node; CollectLimitExec exists only as a physical plan node. A limit is implemented as a LocalLimit followed by a GlobalLimit (link to code).
Spark performs limit incrementally.
It tries to retrieve the given number of rows using one partition.
If the number of rows is not satisfied, Spark then queries the next 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), then 16, and so on, until the limit is satisfied or the data is exhausted.
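A minimal sketch of tuning this behavior (the config key spark.sql.limit.scaleUpFactor is standard; the value 8 is just an example):

spark.conf.set("spark.sql.limit.scaleUpFactor", "8")
spark.sql("SELECT * FROM flights LIMIT 10").show()
// Spark now grows the number of partitions scanned per attempt by 8x instead of the default 4x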
I have a dataset with several columns and an id. The dataset is sorted by id (also verified with parquet-tools):
example:
file 1: ID 1-10
file 2: ID 10-12
file 3: ID 12-33
....
I also generated and wrote the _metadata and _common_metadata files. I tried querying the (very big) dataset using a filter:
val mydata=spark.read.parquet("s3a://.../mylocation")
val result = mydata.filter(mydata("id") === 11)
result.explain(true)
the explain showed me:
== Parsed Logical Plan ==
Filter (id#14L = 11)
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Analyzed Logical Plan ==
fieldA: int, fieldB: string, id: bigint
Filter (id#14L = 11)
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Optimized Logical Plan ==
Filter (isnotnull(id#14L) && (id#14L = 11))
+- Relation[fieldA#12, fieldB#13,id#14L] parquet
== Physical Plan ==
*(1) Project [fieldA#12, fieldB#13,id#14L]
+- *(1) Filter (isnotnull(id#14L) && (id#14L = 11))
+- *(1) FileScan parquet [fieldA#12,fieldB#13,id#14L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://mybucket/path/to/data], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,11)], ReadSchema: struct<fieldA:int,fieldB:string,id:bigint>
I also enabled logging and could see that multiple files are read to get the per-file metadata. I have 10000 files in this "directory" in S3, so it takes a lot of time to retrieve all the metadata from the files.
Why is Spark not getting the metadata from the _metadata file? Is there an option to enable this? I have already tried the following options:
spark.conf.set("parquet.summary.metadata.level","ALL")
spark.conf.set("parquet.filter.statistics.enabled","true")
spark.conf.set("parquet.filter.dictionary.enabled","true")
spark.conf.set("spark.sql.parquet.filterPushdown","true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet","true")
spark.conf.set("spark.sql.parquet.respectSummaryFiles","true")
spark.conf.set("spark.sql.parquet.mergeSchema","false")
spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema","false")
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
Parquet summary files were deemed to be practically useless and write support for them was disabled in SPARK-15719. The reasoning mentioned in that JIRA suggests that summary files were only used for reading the schema and not other metadata like the min/max stats that could be useful for filtering. I can't confirm whether it's actually the case, but here is an excerpt from that reasoning:
Parquet summary files are not particular useful nowadays since
when schema merging is disabled, we assume schema of all Parquet part-files are identical, thus we can read the footer from any part-files.
when schema merging is enabled, we need to read footers of all files anyway to do the merge.
According to this excerpt, the need to read every file footer may also be caused by schema merging being enabled, although if the summary files are really only used for the schema, then I think file footers have to be read anyway.
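If schema merging is what triggers the footer reads in your case, you can also make sure it is disabled for this specific read (a minimal sketch; mergeSchema is a standard Parquet data source option):

val mydata = spark.read.option("mergeSchema", "false").parquet("s3a://.../mylocation")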
If querying by ID is a frequent operation for you, you may consider partitioning your table by ID to avoid reading files unnecessarily.
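For example (a minimal sketch; the output path is hypothetical, and this only makes sense if the number of distinct id values is manageable, since each value becomes its own directory):

mydata.write.partitionBy("id").parquet("s3a://.../mylocation_by_id")
spark.read.parquet("s3a://.../mylocation_by_id").filter("id = 11").explain()
// expect PartitionFilters: [isnotnull(id), (id = 11)] and PartitionCount: 1 in the FileScan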
I have created a bucketed internal Hive table through the DataFrameWriter saveAsTable API:
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1)
.sortBy(col1)
.saveAsTable(hiveTableName);
Now I trigger 2 select queries via Spark SQL, one on the bucketed column and the other on a non-bucketed column, but I don't see any difference in the execution time.
The queries are:
select * from t1 where col1='123' [t1 is bucketed by col1]
select * from t1 where col2='123' [col2 is not a bucketing column]
My questions are:
How can I ascertain whether, during query execution, a full table scan is happening or only the relevant buckets are being scanned?
Can I get any information from the DAG or the physical plan? I have looked at both, but I don't see any difference.
This is what I see in the physical plan:
== Physical Plan ==
*(1) Project [col1#0, col2#1, col3#2, col4#3, col5#4, col6#5, col7#6, col8#7, col9#8, col10#9, col11#10, col12#11]
+- *(1) Filter (isnotnull(col2#1) && (col2#1 = 123))
+- *(1) FileScan parquet default.uk_geocrosswalk[col1#0,col2#1,col3#2,col4#3,col5#4,col6#5,col7#6,col8#7,col9#8,LSOA_MSOA_WEIGHT#9,col11#10,col12#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://url/a.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)], ReadSchema: struct
In the physical plan, why is it doing a FileScan? Should it not be doing a HiveTableScan, since the table has been created as a Hive table?
Are there certain config parameters that I can use to tune my queries while using Spark SQL?
I see that each time I run a query for the first time in Spark SQL it takes a considerably long time. Is there a way I can warm up the executors before the query gets executed?
Parquet is columnar and, in my experience, very fast. The columnar aspect may well explain the similar performance: whether you filter on the bucketing key or not, the data format is physically columnar.
It is a Hive table, but it uses Parquet and Spark's bucketing, so it is not accessible to Hive / Impala. As it is Parquet, a HiveTableScan is not appropriate; Spark scans the Parquet files directly, hence the FileScan. A Hive table can have many physical formats: text, Parquet, ORC.
You can see the filtering in the plan: PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,123)].
There is no warm-up as such. You could .cache things, but I have run and seen tests where caching Parquet tables makes little difference; it depends on the test.
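To address the first question more directly, a minimal sketch (DESCRIBE EXTENDED is standard SQL in Spark; the SelectedBucketsCount entry appears in the FileScan only on Spark versions that support bucket pruning, roughly 2.4 and later):

spark.sql("DESCRIBE EXTENDED t1").show(100, false)
// look for "Num Buckets" and "Bucket Columns" in the output
spark.sql("select * from t1 where col1 = '123'").explain()
// with bucket pruning, the FileScan line includes something like
// "SelectedBucketsCount: 1 out of <numBuckets>"; without it, all bucket files are scanned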