How to query an Apache Spark table over ADLS with partition pruning - apache-spark

I have data in ADLS in the form of parquet tables "partitioned" by date:
gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/
Data is loaded into these partitions using:
file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(source_filesystem_name,
    data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)
df_new.coalesce(1).write.format("parquet").mode("overwrite").option("header", "true")\
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD").save(file_path_gold)
I created a Spark table on top of it using:
create table database.fact
using parquet
options (header true, inferSchema true, path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')
I was not able to add anything about partitions to the CREATE TABLE statement, such as
partition by (YYYY, YYYYMM, YYYYMMDD)
I was hoping that this would let me avoid reading all partitions when using Spark SQL.
However, I am not able to reference the columns YYYY, YYYYMM, YYYYMMDD in SQL.
I have another column, TxDate, that contains the date I am looking for:
select count(*) from database.fact
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
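
For reference, a common way to get partition pruning working in this situation (a hedged sketch, not something confirmed in the question: it assumes the files really live under the dbfs:/mnt/gold/database/fact.parquet/ root with the Hive-style YYYY=/YYYYMM=/YYYYMMDD= layout shown above) is to create the table over the root path without wildcards and let Spark's partition discovery register the partition columns:

# Sketch only: point the table at the root path, no wildcards, so partition
# discovery picks up the YYYY=/YYYYMM=/YYYYMMDD= directories itself.
spark.sql("""
    CREATE TABLE database.fact
    USING parquet
    LOCATION 'dbfs:/mnt/gold/database/fact.parquet/'
""")

# Data-source tables do not gather partition information at creation time;
# this registers the existing partition directories in the metastore.
spark.sql("MSCK REPAIR TABLE database.fact")

# YYYY / YYYYMM / YYYYMMDD are now ordinary columns, and filtering on them
# prunes the directories that get scanned.
spark.sql("""
    SELECT count(*) FROM database.fact
    WHERE TxDate = '2022-12-21' AND YYYYMMDD = 20221221
""").show()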

Related

Reading Spark Dataframe from Partitioned Parquet data

I have parquet data stored on S3 and an Athena table partitioned by id and date.
The parquet files are stored in
s3://bucket_name/table_name/id=x/date=y/
The parquet files themselves contain the partition columns (id, date), because of which I am not able to read them using AWS Glue.
I would like to read the data from only a few partitions, and hence I am making use of a partition predicate as follows:
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)
predicate = "date = date '" + str(yesterday) + "'"
df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name, push_down_predicate=predicate)
However, since the files already contain the partition columns, I am getting the below error:
AnalysisException: Found duplicate column(s) in the data schema and
the partition schema: id, date
Is there a way I can read data from only a few partitions like this? Can I somehow read the data by ignoring id and date columns?
Any sort of help is appreciated :)
Concerning your first question 'Is there a way I can read data from only a few partitions like this?':
You don't need to use a predicate, in my opinion: the beauty of having partitioned parquet files is that Spark will push any filter applied on those partition columns down to the file scanning phase, meaning that Spark can skip entire partitions (and row groups) just by reading the directory layout and the parquet metadata.
Have a look at the physical execution plan once you execute df = spark.read.parquet(...) and df.filter(col("date") == '2022-07-19').
You should find something along the lines of
+- FileScan parquet [idxx, ... PushedFilters: [IsNotNull(date), EqualTo(date, 2022-07-19)..
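To check this concretely, here is a minimal sketch (the path and column name are placeholders, as in the snippets below): read from the root of the partitioned layout, apply the filter, and print the plan with explain().

from pyspark.sql.functions import col

# Read from the root of the partitioned layout and filter on the partition
# column; the filter should surface in the FileScan node of the plan.
df = spark.read.parquet("file:///your/path/")
df.filter(col("date") == "2022-07-19").explain()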
Concerning whether you can read the data by ignoring the id and date columns: you can potentially pass multiple parquet paths to the read function at the lowest directory level, which would ignore the date/id partition directories altogether (though I don't know why you would do that if you need to filter on them):
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id=55/",
    "file:///your/path/date=2022-07-19/id=40/")
# Shorter solution (glob alternation must not contain spaces):
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id={55,40}/*")

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a dataframe output to a delta table which contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is populated as
2022-05-13T17:52:09.771+0000
I am using the function below to generate this DataFrame output:
val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts)
val tsUTCCol: Column = lit(tsUTCText)
// column name "ts_utc" added here so that withColumn is valid; the original snippet omitted it
val df = df2.withColumn("ts_utc", to_timestamp(tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
The DataFrame output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern.
But after writing it to the Delta table, I see the same value populated as 2022-05-13T17:52:09.771+0000.
I could not find any solution. Thanks in advance.
I have just run into the same behaviour on Databricks, and it differs from what the Databricks documentation describes. It seems that from some runtime version onwards Databricks shows the timezone by default, which is why you see the additional +0000. If you don't want that, you can use the date_format function when you populate the data. Also, I don't think you need the 'Z' in the format text, as it stands for the timezone.
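For example, here is a PySpark sketch (the df, event_ts and ts_display names are hypothetical, not taken from the question): date_format renders the stored timestamp in whatever display pattern you need, while the table keeps storing the value as an instant.

from pyspark.sql.functions import date_format, col

# Derive a display string in the desired pattern; the underlying timestamp
# column is unchanged.
display_df = df.withColumn(
    "ts_display",
    date_format(col("event_ts"), "yyyy-MM-dd HH:mm:ss.SSS"))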

How to read some partitions of delta table?

I have a partitioned delta table stored in ADLS (partitioned on a date column).
How do I read only the data from the past year, i.e. dates matching 2020-**-**?
You just need to read the table and filter out the data you need; Spark will perform predicate pushdown and will read only the data from the matching partitions. It's as simple as (assuming the column is called date):
df = spark.read.format("delta").load("your-table-root-path") \
    .filter("date >= '2020-01-01' and date <= '2020-12-31'")
Load the partitioned raw table based on the date:
raw_df = spark.read.format('delta').load(raw_data_path + 'tablename/partitionname=specifieddate')
This is the code I usually use to retrieve a certain partition from ADLS.

Same query resulting in different outputs in Hive vs Spark

Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query in both:
select count(*)
from TABLE_A a
left join TABLE_B b
  on a.key = b.key
  and b.date > '2021-01-01'
  and date_add(last_day(add_months(a.create_date, -1)), 1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
  and cast(a.TIMESTAMP as date) < '2021-03-01'
But I am getting 1B rows as output in Hive, while 1.01B in Spark SQL.
From some initial analysis, it seems that all the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains and Spark can still read it, but the metadata in Hive says the table doesn't exist, so Hive doesn't read it.
That could explain the difference you are seeing.
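If the metadata explanation does not pan out, one way to narrow it down (a hedged sketch reusing the table and column names from the question) is to bucket the rows by the casted date in both engines and compare where the extra ~10M rows land; the question's own analysis suggests they sit at 2021-02-28.

# Run the same per-day aggregation in Hive and in Spark SQL and diff the counts.
spark.sql("""
    SELECT cast(a.TIMESTAMP as date) AS d, count(*) AS cnt
    FROM TABLE_A a
    WHERE cast(a.TIMESTAMP as date) >= '2021-01-20'
      AND cast(a.TIMESTAMP as date) <  '2021-03-01'
    GROUP BY cast(a.TIMESTAMP as date)
    ORDER BY d
""").show(50, truncate=False)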

Write PySpark dataframe into Partitioned Hive table

I am learning Spark. I have a dataframe ts with the structure below.
ts.show()
+--------------------+--------------------+
| UTC| PST|
+--------------------+--------------------+
|2020-11-04 02:24:...|2020-11-03 18:24:...|
+--------------------+--------------------+
I need to insert ts into a partitioned Hive table with the structure below:
spark.sql(""" create table db.ts_part
(
UTC timestamp,
PST timestamp
)
PARTITIONED BY( bkup_dt DATE )
STORED AS ORC""")
How do I dynamically pass the system run date in the insert statement so that the data gets partitioned on bkup_dt based on that date?
I tried something like the code below, but it didn't work:
ts.write.partitionBy(current_date()).insertInto("db.ts_part", overwrite=False)
How should I do it? Can someone please help!
Try creating a new column with current_date() and then writing into the partitioned Hive table.
Example:
from pyspark.sql.functions import current_date

# insertInto resolves columns by position, so the new bkup_dt column must come
# last to line up with the table's partition column; partitionBy() is not used
# together with insertInto().
df.withColumn("bkup_dt", current_date()) \
    .write \
    .insertInto("db.ts_part", overwrite=False)
UPDATE:
Alternatively, create a temp view and then run an insert statement:
df.createOrReplaceTempView("tmp")
spark.sql("insert into table <table_name> partition (bkup_dt) select *, current_date() as bkup_dt from tmp")
