We have a DataFrame with a Transaction Date column, which is a timestamp.
When we write the DF as ORC files, we apply the partition logic on the Transaction Date value (only the date part, not the full timestamp); we created a separate field solely for partitioning on that value.
If we read the ORC files again with a where condition on the Transaction Date (timestamp) value, will it prune the partitions?
No. You need to reference the "separate" field explicitly in the filter. That stands to reason and is a fundamental database rule with respect to partition pruning.
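A minimal sketch of the difference, assuming the partition field was written as txn_date and the timestamp column is transaction_ts (both hypothetical names):

from pyspark.sql import functions as F

df = spark.read.orc("/path/to/orc")

# Filtering only on the timestamp column scans every partition.
not_pruned = df.where(F.col("transaction_ts") >= "2021-01-01 00:00:00")

# Filtering on the partition field itself lets Spark prune partitions.
pruned = df.where(F.col("txn_date") == "2021-01-01")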
I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no PARTITIONS. SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. This causes issues downstream when we add a partition and try to MSCK REPAIR TABLE my_db.my_table, and it says you can't run that on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we don't have the luxury of a schema, just the files underneath.
Why is Spark intermittently unable to determine partition columns from a Parquet dataset in this shape?
Unfortunately, with the Hive metastore you need to specify the schema, even though Parquet obviously carries it itself.
You need to add a PARTITIONED BY clause to the DDL.
Then use ALTER TABLE ... ADD PARTITION statements to register each partition separately with its location.
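A sketch of what that could look like in Spark SQL, assuming a single data column (data_col is a hypothetical placeholder) alongside the part_date values shown above:

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.my_table (data_col STRING, part_date STRING)
    USING PARQUET
    PARTITIONED BY (part_date)
    LOCATION '/path/to/partitioned/parq'
""")

# Register each partition explicitly with its location ...
spark.sql("""
    ALTER TABLE my_db.my_table ADD IF NOT EXISTS
    PARTITION (part_date = '2021_01_01_01_01_01')
    LOCATION '/path/to/partitioned/parq/part_date=2021_01_01_01_01_01'
""")

# ... or, once the table is declared as partitioned, let Spark discover
# the partitions from the directory layout.
spark.sql("MSCK REPAIR TABLE my_db.my_table")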
I have a fact table which is 10 TB (Parquet) and contains 100+ columns. I created another table with just 10 of those columns, and its size is 2 TB.
I was expecting the size to be in the GB range because I am storing just a few (10) columns.
My question is: when we have more columns, does the Parquet format store data more efficiently?
Parquet is a column-based storage format. Say I have a table with the fields userId, name, address, state, phone number.
In a row-based (non-Parquet) format, if I run select * where state = "TN", the query goes through every record in the table (i.e. all the columns of each row) and outputs the records that match the where condition. In Parquet, the values of each column are stored together, so the query doesn't need to read all the other columns: it can go directly to the 'state' column and output the records that match the condition. Parquet is good for faster retrieval, and it doesn't matter how many columns are present in total.
Parquet typically uses Snappy compression (Spark's default). Since the values of a column are stored together, compression is very effective.
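A small sketch of the column-pruning effect, using illustrative column names from above:

df = spark.read.parquet("/path/to/table")

# Only the column chunks for state, userId and name are read from disk;
# Parquet's columnar layout lets Spark skip every other column entirely.
result = df.where(df["state"] == "TN").select("userId", "name")
result.explain()  # the scan's ReadSchema is limited to the referenced columns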
If I am using df.write.partitionBy("col1").parquet(path),
the partition column col1 is removed from the data files.
How can I avoid that?
You can duplicate col1 before writing:
from pyspark.sql.functions import col
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file from a partitioned directory structure, Spark will automatically add the partition column back to the DataFrame.
Actually, Spark does not remove the column; it uses it to organize the files (as directory names), so that when you read the files back it is added as a column again and shown to you in the result. If you check the schema of the table or of the DataFrame, you will still see it as a column.
Also, you partition your data because you know how the table is queried most frequently; based on that information you decided to partition it so that your reads become faster and more efficient.
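A quick sketch of the round trip, using col1 from the question and a throwaway output path:

df.write.partitionBy("col1").parquet("/tmp/out")  # col1 becomes directory names, e.g. /tmp/out/col1=A/

df_back = spark.read.parquet("/tmp/out")
df_back.printSchema()  # col1 appears again, reconstructed from the directory names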
The Spark data-saving operation is quite slow when:
the DataFrame df is partitioned by date (year, month, day) and contains data from exactly one day, say 2019-02-14.
If I save the df by:
df.write.partitionBy("year", "month", "day").parquet("/path/")
It is slow because all the data belongs to one partition, which is (apparently?) processed by one task.
If I save df with an explicit partition path instead:
df.write.parquet("/path/year=2019/month=02/day=14/")
It works well, but it creates the _metadata, _common_metadata and _SUCCESS files in "/path/year=2019/month=02/day=14/"
instead of "/path/". Dropping the partition columns is required to keep the same fields as when using partitionBy.
So, how can I speed up saving data that has only one partition, without changing the location of the metadata files, which would otherwise be updated on each operation?
Is it safe to use an explicit partition path instead of partitionBy?
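A sketch of the explicit-path variant described above; the drop keeps the file schema identical to what partitionBy would have produced:

(df.drop("year", "month", "day")  # partitionBy would also leave these out of the data files
   .write
   .mode("overwrite")
   .parquet("/path/year=2019/month=02/day=14/"))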
Let's consider an ORC table in Hive partitioned on a dt_month column, where each partition contains all rows for the days of that month (txn_dt).
Partition pruning works when I introduce a where clause directly on dt_month, like below.
df = spark.table("table")
df.where("dt_month = '2018-01-01'")
But is it possible to gather statistics at the partition level and prune partitions while filtering on txn_dt (the column that dt_month is derived from), given the transitive relationship it has with the partition column?
df = spark.table("table")
df.where("txn_dt = '2018-01-01'")
Can we make this query avoid scanning the whole table, so that it reads only the 2018-01-01 partition and then uses the ORC indexes within it?
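Spark will not derive the dt_month predicate from the txn_dt filter on its own. As in the first answer above, a sketch of the usual workaround is to add the partition-column filter explicitly (assuming dt_month stores the first day of the month, e.g. '2018-01-01'):

from pyspark.sql import functions as F

df = spark.table("table")

# The explicit dt_month filter enables partition pruning; the txn_dt filter
# is then evaluated with ORC indexes inside that single partition.
pruned = df.where((F.col("dt_month") == "2018-01-01") & (F.col("txn_dt") == "2018-01-01"))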