Let's consider an ORC table on Hive, partitioned on a dt_month column, where each partition contains all rows for the days of that month (txn_dt).
Partition pruning works when I introduce a where clause directly on dt_month, like below.
df = spark.table("table")
df.where("dt_month = '2018-01-01'")
But is it possible to gather statistics at the partition level and prune partitions while filtering on txn_dt (the column that dt_month is derived from), given the transitive relationship it has with the partition column?
df = spark.table("table")
df.where("txn_dt = '2018-01-01'")
Can we make this query read only the 2018-01-01 partition and then use the ORC index within it, instead of running through the whole table and relying on ORC indexes alone?
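For reference, a minimal PySpark sketch (using the same table as above) to check whether a filter actually prunes partitions; the PartitionFilters entry in the physical plan shows which partitions survive:
df = spark.table("table")
# Filter on the partition column: the plan should show a non-empty PartitionFilters list.
df.where("dt_month = '2018-01-01'").explain(True)
# Filter only on the derived txn_dt column: by default the plan typically shows an empty
# PartitionFilters list, meaning every partition is scanned.
df.where("txn_dt = '2018-01-01'").explain(True)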
I am trying to understand the performance impact of the partitioning scheme when Spark is used to query a Hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so I know how to design the partitioning of larger tables that could have many more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. In case the details are important:
This is using S3, not HDFS.
The data in the table is very small, and there are not a large number of partitions.
The query takes ~2 minutes on the first table and ~10 seconds on the second.
Data is stored as Parquet.
Setting aside all the other factors you did not mention (storage type, configuration, cluster capacity, the number of files in each case), your partitioning schema does not correspond to the use case.
A partitioning schema should be chosen based on how the data will be selected, how it will be written, or both. In your case, partitioning by year, month, and day separately is over-partitioning. Partitions in Hive are hierarchical folders, and all levels must be traversed (even if only metadata is used) to determine the data path; with a single date partition, only one directory level is read. The two additional levels (year + month + day instead of date) do not help with partition pruning, because the columns are related and always used together in the WHERE clause.
Also, partition pruning probably does not work at all with 3 partition columns and a predicate like this: where date = concat(year, month, day)
Use EXPLAIN to check it, and compare with a predicate like this: where year='some year' and month='some month' and day='some day'
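For illustration (the table name below is a placeholder), the two predicate shapes compared with explain() in PySpark; only the second should show partition filters in the physical plan:
# Predicate on an expression over the partition columns: pruning is unlikely.
spark.sql("SELECT * FROM table1 WHERE concat(year, month, day) = '20210101'").explain(True)
# Predicate on each partition column directly: partition pruning can kick in.
spark.sql("SELECT * FROM table1 WHERE year = '2021' AND month = '01' AND day = '01'").explain(True)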
If most of your queries have one more column in the WHERE clause, say category, which does not correlate with date, and the data is big, then additionally partitioning by it makes sense; you will benefit from partition pruning then.
I have a 10 TB fact table (Parquet) that contains 100+ columns. I created another table with just 10 of those columns, and its size is 2 TB.
I was expecting the size to be a few GB, because I am storing just a few (10) columns.
My question is: when we have more columns, does the Parquet format store data more efficiently?
Parquet is column-based storage. Say I have a table with fields userId, name, address, state, phone no.
In row-based (non-Parquet) storage, if I do a select * where state = "TN", it will go through every record in my table (i.e. all the columns of each row) and output the records that match my where condition. In the Parquet format, however, all values of a column are stored together, so I don't need to go through all the other columns. The same select query will go directly to the 'state' column and output the records that match the where condition. Parquet is good for faster retrieval; it doesn't matter how many columns are present in total.
Parquet typically uses Snappy compression (the Spark default). Since the values of each column are stored together, compression is very effective.
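As a small illustration (the path and column names below are made up), selecting only the columns you need lets Parquet read just those column chunks, and the filter on state is pushed down to the reader; both show up as ReadSchema and PushedFilters in the physical plan:
df = spark.read.parquet("/tmp/users")  # hypothetical path
# Only the 'name' and 'state' column chunks are read from the Parquet files;
# the state predicate is pushed down to the Parquet reader.
df.select("name", "state").where("state = 'TN'").explain(True)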
If I am using df.write.partitionBy("col1").parquet(path),
the partition column is removed from the data files.
How do I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files, so that when you read the files back it is added as a column again and displayed to you in table form. If you check the schema of the table, or the schema of the DataFrame, you will still see it as a column.
Also, you are presumably partitioning your data because you know how that table is queried most frequently, and based on that information you decided to partition the data so that your reads become faster and more efficient.
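A quick sketch of this behaviour (the output path is hypothetical): write a partitioned Parquet dataset and read the directory back; the partition column reappears in the schema, reconstructed from the col1=... directory names:
df.write.partitionBy("col1").parquet("/tmp/out")  # col1 values end up in the directory names, not in the files
back = spark.read.parquet("/tmp/out")
back.printSchema()  # col1 shows up as a column again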
I have a table in Databricks Delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with the new partition column using PARTITIONED BY (view_date).
However, my attempt failed, since the actual files reside in S3, and even if I drop the Hive table the partitions remain the same.
Is there any way to change the partitioning of an existing Delta table? Or is the only solution to drop the actual data and reload it with the new partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one partition column, use partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")
We have a DataFrame with a Transaction Date column, which is a timestamp.
When we write the DataFrame as ORC files, we apply the partition logic on the Transaction Date value (only the date part, not the timestamp); we created a separate field just for partitioning on it.
If we read the ORC files again with a where condition on the Transaction Date (timestamp) value, will it prune the partitions?
No. You need to reference the "separate" field explicitly. It stands to reason, and it is a fundamental database rule with respect to partition pruning.
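In other words, a minimal PySpark sketch (the column and path names are assumptions, not taken from the question): filter on the separate partition field, optionally together with the timestamp, so the partition can be pruned and the ORC indexes only have to work within it:
df = spark.read.orc("/path/to/orc_table")  # hypothetical path
# txn_date is the separate date field used for partitioning; transaction_ts is the original timestamp.
# The txn_date predicate prunes partitions; the timestamp predicate is evaluated only inside them.
df.where("txn_date = '2018-01-01' AND transaction_ts >= '2018-01-01 00:00:00'")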