I have a partitioned Delta table stored in ADLS (partitioned on the date column).
How can I read only the data from the past one year, i.e. dates like 2020-**-**?
You just need to read the table and filter out the data you need - Spark will perform predicate pushdown and read only the data from the matching partitions. It's as simple as (assuming the column is called date):
df = spark.read.format("delta").load("your-table-root-path") \
    .filter("date >= '2020-01-01' and date <= '2020-12-31'")
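If you want the "past one year" window computed at run time instead of hard-coded, a minimal sketch (assuming the same date partition column and table root path as above, with dates stored in yyyy-MM-dd form):

from datetime import date, timedelta

# Compute the cutoff dynamically; the filter on the partition column is still pushed down
cutoff = date.today() - timedelta(days=365)

df = (spark.read.format("delta")
      .load("your-table-root-path")
      .filter(f"date >= '{cutoff.isoformat()}'"))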
Load in the partitioned raw table based on date:
raw_df = spark.read.format('delta').load(raw_data_path + 'tablename/partitionname=specifieddate')
This is the code I usually use to retrieve a specific partition from ADLS.
I have data in ADLS in the form of parquet tables "partitioned" by date:
gold/database/fact.parquet/YYYY=2022/YYYYMM=202209/YYYYMMDD=20220904/
Data is loaded into these partitions using:
file_path_gold = "/mnt/{}/{}/{}/{}/{}/{}_checkin.parquet".format(
    source_filesystem_name, data_entity_name, yyyy, yyyyMM, yyyyMMdd, yyyyMMdd)

df_new.coalesce(1).write.format("parquet").mode("overwrite").option("header", "true") \
    .partitionBy("YYYY", "YYYYMM", "YYYYMMDD").save(file_path_gold)
I created a Spark table on top of it using:
create table database.fact
using parquet
options (header true, inferSchema true, path 'dbfs:/mnt/gold/database/fact.parquet/YYYY=*/YYYYMM=*/YYYYMMDD=*')
I was not able to add anything about partitions to the CREATE TABLE statement, such as
partition by (YYYY, YYYYMM, YYYYMMDD)
I was hoping this would let me avoid reading all the partitions when using Spark SQL. However, I am not able to reference the columns YYYY, YYYYMM, YYYYMMDD in SQL.
I have another column, TxDate, that contains the date that I am looking for:
select count(*) from database.fact
where TxDate = '2022-12-21'
-- and YYYY = 2022
-- and YYYYMM = 202212
-- and YYYYMMDD = 20221221
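One approach that might work here (a sketch only, not verified against this exact layout, and with placeholder non-partition columns) is to declare the partition columns explicitly in the CREATE TABLE statement and then let Spark register the existing partition directories:

# Hedged sketch: TxDate and amount are placeholders for the real fact columns.
spark.sql("""
    CREATE TABLE database.fact (TxDate DATE, amount DOUBLE, YYYY INT, YYYYMM INT, YYYYMMDD INT)
    USING parquet
    PARTITIONED BY (YYYY, YYYYMM, YYYYMMDD)
    LOCATION 'dbfs:/mnt/gold/database/fact.parquet'
""")
# location is the table root, without the partition wildcards

# Pick up the existing YYYY=/YYYYMM=/YYYYMMDD= directories in the metastore
spark.sql("MSCK REPAIR TABLE database.fact")

With the partition columns declared this way, filters such as YYYY = 2022 and YYYYMMDD = 20221221 become referencable in SQL and should prune partitions.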
I have parquet data stored on S3 and an Athena table partitioned by id and date.
The parquet files are stored in
s3://bucket_name/table_name/id=x/date=y/
The parquet files also contain the partition columns (id, date) in the data itself, which is why I am not able to read them using AWS Glue.
I would like to read the data from only a few partitions, and hence I am making use of a push-down predicate as follows:
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)
predicate = "date = date '" + str(yesterday) + "'"
df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name,
                                                   push_down_predicate=predicate)
However, since the files already contain the partition columns, I am getting the below error:
AnalysisException: Found duplicate column(s) in the data schema and
the partition schema: id, date
Is there a way I can read data from only a few partitions like this? Can I somehow read the data by ignoring id and date columns?
Any sort of help is appreciated :)
Concerning your first question 'Is there a way I can read data from only a few partitions like this?':
You don't need to use a predicate, in my opinion - the beauty of having partitioned parquet files is that Spark will push any filter applied along those partitions down to the file-scanning phase. This means Spark can skip certain row groups just by reading the metadata of the parquet files.
Have a look at the physical execution plan once you execute df = spark.read.parquet(...) followed by df.filter(col("date") == '2022-07-19').
You should find something along the lines of
+- FileScan parquet [idxx, ... PushedFilters: [IsNotNull(date), EqualTo(date, 2022-07-19)..
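A minimal sketch of how to print that plan (the path is a placeholder, and it assumes a partitioned parquet dataset that reads cleanly, i.e. without the duplicate-column issue above):

from pyspark.sql.functions import col

df = spark.read.parquet("file:///your/path/")      # placeholder root path
filtered = df.filter(col("date") == "2022-07-19")

# Look for the PartitionFilters / PushedFilters entries on the FileScan node
filtered.explain(True)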
Concerning whether you can read the data by ignoring the id and date columns: you can potentially pass multiple parquet paths to the read function at the bottom level - which would ignore the date/id partition columns altogether (I don't know why you would do that, though, if you need to filter on them):
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id=55/",
    "file:///your/path/date=2022-07-19/id=40/")

# Shorter solution:
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id={55,40}/*")
I am trying to merge a dataframe that contains incremental data into my base table, as per the Databricks documentation.
base_delta.alias('base') \
    .merge(source=kafka_df.alias('inc'),
           condition='base.key1 = inc.key1 and base.key2 = inc.key2') \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
The above operation works fine, but it takes a lot of time, as expected, since a lot of unwanted partitions are being scanned.
I came across the Databricks documentation here, which shows a merge query with partitions specified in it.
Code from that link:
spark.sql(s"""
|MERGE INTO $targetTableName
|USING $updatesTableName
|ON $targetTableName.par IN (1,0) AND $targetTableName.id = $updatesTableName.id
|WHEN MATCHED THEN
| UPDATE SET $targetTableName.ts = $updatesTableName.ts
|WHEN NOT MATCHED THEN
| INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)
""".stripMargin)
In that example the partitions are specified in the IN condition as 1, 2, 3, ... But in my case, the table is first partitioned on COUNTRY values (USA, UK, NL, FR, IND) and then every country has a partition on YYYY-MM, e.g. 2020-01, 2020-02, 2020-03.
How can I specify the partition values if I have a nested structure like the one mentioned above?
Any help is massively appreciated.
Yes, you can do that, and it's really recommended, because otherwise Delta Lake needs to scan all the data that matches the ON condition. If you're using the Python API, you just need to use the correct SQL expression as the condition, and you can put restrictions on the partition columns into it, something like this in your case (where date is the column holding the update date):
base.country = 'country1' and base.date = inc.date and
base.key1 = inc.key1 and base.key2 = inc.key2
If you have multiple countries, then you can use IN ('country1', 'country2'), but it would be easier to have the country inside your update dataframe and match using base.country = inc.country (see the sketch below).
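A minimal sketch of how that could look with the Python API, assuming the table is partitioned by country and by a month column (both column names are stand-ins for your actual partition columns, and the path is a placeholder):

from delta.tables import DeltaTable

base_delta = DeltaTable.forPath(spark, "your-base-table-path")  # placeholder path

base_delta.alias('base') \
    .merge(source=kafka_df.alias('inc'),
           # Restricting the partition columns in the condition lets Delta prune
           # partitions instead of scanning the whole table.
           condition="base.country = inc.country and base.month = inc.month and "
                     "base.key1 = inc.key1 and base.key2 = inc.key2") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()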
I am trying to identify and insert only the delta records into the target Hive table from a PySpark program. I am using a left anti join on the ID columns, and it identifies the new records successfully. But I noticed that the total number of delta records is not the same as the difference between the table record count before the load and after the load.
delta_df = src_df.join(tgt_df, src_df.JOIN_HASH == tgt_df.JOIN_HASH, how='leftanti') \
    .select(src_df.columns).drop("JOIN_HASH")

delta_df.count()  # gives the correct delta count

delta_df.write.mode("append").format("hive") \
    .option("compression", "snappy").saveAsTable(hivetable)
But delta_df.count() is not the same as (count(*) from hivetable after writing the data) - (count(*) from hivetable before writing the data). The difference always comes out higher than the delta count.
I have a unique timestamp column for each load in the source, and to my surprise, the count of records in the target for the current load (grouping by the unique timestamp) is less than the delta count.
I am not able to identify the issue here; do I have to write the df.write in some other way?
It was a problem with the line delimiter. When the table is created with spark.write, no line.delim is specified in the SERDEPROPERTIES, and column values containing * were getting split into multiple rows.
Now I added the below SERDEPROPERTIES and it stores the data correctly.
'line.delim'='\n'
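For reference, a sketch of how that property could be set on an existing table (the table name is a placeholder, and the raw string keeps the \n escape exactly as written above):

# Hedged sketch: apply the line delimiter to the already-created Hive table.
spark.sql(r"ALTER TABLE hivetable SET SERDEPROPERTIES ('line.delim' = '\n')")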
I read data from Kafka in Spark Structured Streaming 2.3.0. The data contains information about teachers: there is teacherId, teacherName and teacherGroupsIds. teacherGroupsIds is an array column which contains the ids of the groups. In my task I have to map the column with group ids to a column containing the group names ([1,2,3] => [Suns,Books,Flowers]). The names and ids are stored in HBase and can change every day. Later I have to send the data to another Kafka topic.
So, I read data from two sources - Kafka and HBase. I read the data from HBase using the shc library.
First I explode the array column (group ids), and then I join with the data from HBase.
In the next step I would like to aggregate the data by teacherId. But this operation is not supported in the Append output mode, which I use.
I have tried watermarking, but at the moment it doesn't work. I added a new column with a timestamp and group by this column.
Dataset<Row> inputDataset = // reading from Kafka
Dataset<Row> explodedDataset = // explode function applied and join with HBase

Dataset<Row> outputDataset = explodedDataset
    .withColumn("eventTime", lit(current_timestamp()))
    .withWatermark("eventTime", "2 minutes")
    .groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
    .agg(collect_list(col("groupname")));
The actual result is an empty dataframe at the output; there are no rows.
The problem is current_timestamp().
current_timestamp returns the timestamp at that moment, so if you create a dataframe with this column and print the result, you print the current timestamp, but if you process the df later and print the same column, you print a new timestamp.
This approach may work locally, but in a distributed system it can fail because, by the time the workers receive the order to process the data, the data is already outside the watermark's timestamp range.
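One way around this (a PySpark sketch of the idea, not tested against this exact pipeline) is to use the per-record timestamp column that the Kafka source already attaches as the event time, instead of generating a timestamp at processing time. Here exploded_df stands in for the exploded and joined dataset from the question, with the Kafka timestamp column carried through the earlier transformations:

from pyspark.sql.functions import col, window, collect_list

output_df = (exploded_df
             .withColumn("eventTime", col("timestamp"))   # Kafka record timestamp
             .withWatermark("eventTime", "2 minutes")
             .groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
             .agg(collect_list(col("groupname"))))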