Reading Spark Dataframe from Partitioned Parquet data - apache-spark

I have parquet data stored on S3 and Athena table partitioned by id and date.
The parquet files are stored in
s3://bucket_name/table_name/id=x/date=y/
The parquet files also contain the partition columns (id, date) inside them, because of which I am not able to read them using AWS Glue.
I would like to read the data in only a few partitions and hence I am making use of partition predicate as follows:
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)
predicate = "date = date '" + str(yesterday) + "'"
df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name, push_down_predicate=predicate)
However, since the files already contain the partition columns, I am getting the below error:
AnalysisException: Found duplicate column(s) in the data schema and
the partition schema: id, date
Is there a way I can read data from only a few partitions like this? Can I somehow read the data by ignoring id and date columns?
Any sort of help is appreciated :)

Concerning your first question 'Is there a way I can read data from only a few partitions like this?':
You don't need to use a predicate, in my opinion - the beauty of having partitioned parquet files is that Spark pushes any filter applied along those partitions down to the file scanning phase, meaning that Spark can skip certain row groups by just reading the metadata of the parquet files.
Have a look at the physical execution plan once you execute df = spark.read.parquet(...) followed by df.filter(col("date") == '2022-07-19') and call df.explain().
You should find something along the lines of
+- FileScan parquet [idxx, ... PushedFilters: [IsNotNull(date), EqualTo(date, 2022-07-19)..
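For example, a minimal sketch (the S3 path and the date value are assumed from the question) to reproduce that check:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical path taken from the question; adjust to your bucket/table layout.
df = spark.read.parquet("s3://bucket_name/table_name/")
filtered = df.filter(F.col("date") == "2022-07-19")

# The physical plan should list the date filter among the partition/pushed filters.
filtered.explain()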
Concerning whether you can read the data by ignoring id and date columns: you can potentially pass multiple parquet paths at the lowest partition level to the read function, which would ignore the date/id columns altogether (I don't know why you would do that though if you need to filter on them):
df = spark.read.parquet(
"file:///your/path/date=2022-07-19/id=55/",
"file:///your/path/date=2022-07-19/id=40/")
# Shorter solution:
df = spark.read.parquet(
"file:///your/path/date=2022-07-19/id={55, 40}/*")

Related

How to query with partition pruning Apache Spark table over ADLS

I have data in ADLS in the form of parquet tables "partitioned" by date:
gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/
Data is loaded into these partitions using:
file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(source_filesystem_name,
data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)
df_new.coalesce(1).write.format("parquet").mode("overwrite").option("header","true")\
.partitionBy("YYYY","YYYYMM","YYYYMMDD").save(file_path_gold)
I created a Spark table on top of it using:
create table database.fact
using parquet
options (header true, inferSchema true, path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')
I was not able to add anything about partitions to the CREATE TABLE statement, like
partitioned by (YYYY, YYYYMM, YYYYMMDD)
I was hoping that I would be able to avoid reading all partitions when using Spark SQL.
However, I am not able to reference the columns YYYY, YYYYMM, YYYYMMDD in SQL.
I have another column, TxDate, that contains the date I am looking for:
select count(*) from database.fact
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
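Not part of the original post, but a hedged sketch of one possible direction, assuming a Spark/Databricks version whose CREATE TABLE ... USING parquet accepts PARTITIONED BY, and with the non-partition columns filled in from the real schema: declaring the partition columns in the DDL and repairing the table would make YYYY/YYYYMM/YYYYMMDD referenceable and prunable in SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DDL: the partition columns are declared in the schema and
# listed in PARTITIONED BY; remaining data columns are elided.
spark.sql("""
    CREATE TABLE database.fact (
        TxDate DATE,
        -- ... other data columns ...
        YYYY INT,
        YYYYMM INT,
        YYYYMMDD INT
    )
    USING parquet
    PARTITIONED BY (YYYY, YYYYMM, YYYYMMDD)
    LOCATION 'dbfs:/mnt/gold/database/fact.parquet'
""")

# Register the existing key=value partition directories with the metastore.
spark.sql("MSCK REPAIR TABLE database.fact")

# Filters on the partition columns should now prune directories.
spark.sql("""
    SELECT count(*) FROM database.fact
    WHERE TxDate = '2022-12-21' AND YYYYMMDD = 20221221
""").show()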

Try to avoid shuffle by manual control of table read per executor

I have:
a really huge (let's say 100s of Tb) Iceberg table B which is partitioned by main_col, truncate[N, stamp]
a small table S with columns main_col, stamp_as_key
I want to get a dataframe (actually table) with logic:
import pyspark.sql.functions as F

b = spark.read.table(B)
s = spark.read.table(S)
df = b.join(F.broadcast(s), (b.main_col == s.main_col) & (b.stamp >= s.stamp_as_key - W0) & (b.stamp <= s.stamp_as_key + W0))
df = df.groupby('main_col', 'stamp_as_key').agg(make_some_transformations)
I want to avoid a shuffle when reading table B. Iceberg has metadata tables describing all the parquet files in a table and their contents. What is possible to do (roughly sketched after the list):
read only metainfo table of B table
join it with S table
repartition by expected columns
collect s3 paths of real B data
read these files from executors independently.
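A hedged illustration of that plan (the catalog/table identifiers and the main_col field inside Iceberg's files metadata table are assumptions; reading the parquet files directly bypasses Iceberg's own planning, so this only shows the idea):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical names "cat.db.B" / "cat.db.S"; Iceberg exposes a ".files"
# metadata table listing every data file together with its partition values.
b_files = spark.table("cat.db.B.files").select("file_path", "partition")
s = spark.table("cat.db.S")

# Keep only the data files whose partition value matches a key present in S.
needed_files = b_files.join(
    F.broadcast(s),
    F.col("partition.main_col") == s["main_col"],
    "left_semi",
)

paths = [row.file_path for row in needed_files.select("file_path").distinct().collect()]

# Read just those parquet files; this trades Iceberg's metadata guarantees
# for manual control over which files are scanned.
df = spark.read.parquet(*paths)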
Is there a better way to make this work? Also, I can change the schema of table B if needed, but main_col should stay as the first partitioner.
One more question: suppose I have such a dataframe and I saved it as a table. I need to efficiently join such tables. Am I correct that this is also impossible to do without a shuffle using classic Spark code?

How to specify nested partitions in merge query while trying to merge incremental data with a base table?

I am trying to merge a dataframe that contains incremental data into my base table as per the Databricks documentation.
base_delta.alias('base') \
.merge(source=kafka_df.alias('inc'),
condition='base.key1=inc.key1 and base.key2=inc.key2') \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
The above operation works fine, but it takes a lot of time, as expected, since there are a lot of unwanted partitions being scanned.
I came across Databricks documentation here that shows a merge query with partitions specified in it.
Code from that link:
spark.sql(s"""
|MERGE INTO $targetTableName
|USING $updatesTableName
|ON $targetTableName.par IN (1,0) AND $targetTableName.id = $updatesTableName.id
|WHEN MATCHED THEN
| UPDATE SET $targetTableName.ts = $updatesTableName.ts
|WHEN NOT MATCHED THEN
| INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)
""".stripMargin)
The partitions are specified in the IN condition as 1,2,3... But in my case, the table is first partitioned on COUNTRY values (USA, UK, NL, FR, IND) and then every country has partitions on YYYY-MM, e.g. 2020-01, 2020-02, 2020-03.
How can I specify the partition values if I have a nested structure like the one I mentioned above?
Any help is massively appreciated.
Yes, you can do that, and it's really recommended, because otherwise Delta Lake needs to scan all the data that matches the ON condition. If you're using the Python API, you just need to use the correct SQL expression as the condition, and you can put restrictions on the partition columns into it, something like this in your case (date is the column derived from the update date):
base.country = 'country1' and base.date = inc.date and
base.key1=inc.key1 and base.key2=inc.key2
If you have multiple countries, then you can use IN ('country1', 'country2'), but it would be easier to have the country inside your update dataframe and match using base.country = inc.country.
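A hedged PySpark sketch of that advice (the table name, the month column standing in for the YYYY-MM partition, and the existing spark session and kafka_df dataframe are placeholders taken from or assumed by the question):

from delta.tables import DeltaTable

# Assumes a SparkSession `spark` configured for Delta, the incoming dataframe
# `kafka_df`, and a base table partitioned by country and month (placeholder names).
base_delta = DeltaTable.forName(spark, "db.base_table")

base_delta.alias("base") \
    .merge(
        source=kafka_df.alias("inc"),
        # Restricting the partition columns in the ON condition lets Delta
        # prune partitions instead of scanning the whole table for matches.
        condition="""
            base.country = inc.country
            AND base.month = inc.month
            AND base.key1 = inc.key1
            AND base.key2 = inc.key2
        """,
    ) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()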

How to read some partitions of delta table?

I have a partitioned delta table stored in ADLS (partitioned on the date column).
How do I read only the data from the past year, i.e. dates like 2020-**-**?
You just need to read the table and filter the data you need - Spark will perform predicate pushdown and read only data from the matching partitions. It's as simple as (assuming the column is called date):
df = spark.read.format("delta").load("your-table-root-path") \
.filter("date >= '2020-01-01' and date <= '2020-12-31'")
Load in partitioned raw table based on date
raw_df = spark.read.format('delta').load(raw_data_path + 'tablename/partitionname=specifieddate')
This is the code I usually use to retrieve a certain partition from ADLS.

Watermarking in Spark Structured Streaming 2.3.0

I read data from Kafka in Spark Structured Streaming 2.3.0. The data contains information about some teachers: there are teacherId, teacherName and teacherGroupsIds columns. teacherGroupsIds is an array column which contains the ids of the groups. In my task I have to map the column with group ids to a column containing the group names ([1,2,3] => [Suns,Books,Flowers]). The names and ids are stored in HBase and can change every day. Later I have to send the data to another Kafka topic.
So, I read data from two sources - Kafka and HBase. I read data from HBase using the shc library.
First I explode the array column (group ids), then I join with the data from HBase.
In the next step I would like to aggregate the data by teacherId, but this operation is not supported in Append mode, which I use.
I have tried watermarking, but at the moment it doesn't work. I added a new column with a timestamp and grouped by this column.
Dataset<Row> inputDataset = //reading from Kafka
Dataset<Row> explodedDataset = // explode function applied and join with HBase
Dataset<Row> outputDataset = explodedDataset
.withColumn("eventTime", lit(current_timestamp()))
.withWatermark("eventTime", "2 minutes")
.groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
.agg(collect_list(col("groupname")));
The actual results show an empty dataframe at the output; there are no rows at all.
The problem is current_timestamp().
current_timestamp returns the timestamp at the moment it is evaluated, so if you create a dataframe with this column and print the result, you print the current timestamp, but if you process the df later and print the same column, you print a new timestamp.
This solution works locally, but in a distributed system it sometimes fails, because by the time the workers receive the order to process the data, that data is already outside the timestamp range.
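A possible follow-up, not from the original answer: a PySpark sketch (for brevity, rather than the question's Java) that takes the event time from the Kafka record's own timestamp column instead of current_timestamp(), so the watermark compares against a value that does not change while the query runs. The broker, topic and the stand-in teacherId are placeholders, and the value parsing, explode and HBase join from the question are omitted; requires the spark-sql-kafka package.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder Kafka source settings.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "teachers")
    .load()
)

# The Kafka source already provides a per-record "timestamp" column; using it
# as the event time gives the watermark a stable value, unlike
# current_timestamp(), which is re-evaluated while the query runs.
aggregated = (
    kafka_df
    .withColumn("eventTime", F.col("timestamp"))
    .withColumn("teacherId", F.col("key").cast("string"))  # stand-in for the parsed id
    .withWatermark("eventTime", "2 minutes")
    .groupBy(F.window(F.col("eventTime"), "5 seconds"), F.col("teacherId"))
    .agg(F.count("*").alias("events"))
)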
