Databricks Delta Live Tables - Apply Changes from delta table - databricks

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data in the JSON files but in table format. This table is in delta format. No transformations are done, except from transforming the JSON structure into a tabular structure (I did an explode and then creating columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline and the upserts are correctly applied to the 'raw_table'
(However, each time a new parquet file with all the data is created in the delta folder, I would expect that only a parquet file with the inserted and updated rows was added? And that some information about the current version was kept in the delta logs? Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true')).
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
#dlt.table(name='V_raw_table', table_properties={delta.enableChangeDataFeed': 'True'})
def raw_table():
df = (spark.readStream
.format('delta')
.option('readChangeFeed', 'true')
.table('LIVE.raw_table'))
df = df.withColumn('ExtraCol', <Transformation>)
return df
ezeg
dlt.create_target_table('intermediate_table')
dlt.apply_changes(target='intermediate_table',
source='V_raw_table',
keys=['id'],
sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked in the 'ignoreChanges', but don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!

As readStream only works with appends, any change in the the source file will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on the settings like "optimized writes" or even without it, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".

When you changed the settings to include the option '.option('readChangeFeed', 'true')', you should start with a full refresh(there is dropdown near start). Doing this will resolve the error 'Detected data update xxx', and your code should work for the incremental update.

Related

Dealing with overwritten files in Databricks Autoloader

Main topic
I am facing a problem that I am struggling a lot to solve:
Ingest files that already have been captured by Autoloader but were
overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where every day a new file is posted. You can check the image example below:
Each day an automation post a file with new data. This file is named with a suffix meaning the Year and Month of the current period of the posting.
This naming convention results in a file that is overwritten each day with the accumulated data extraction of the current month. The number of files in the folder only increases when the current month is closed and a new month starts.
To deal with that I have implemented the following PySpark code using the Autoloader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col
# Define variables used in code below
checkpoint_directory = "abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = f"abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"
# Configure Auto Loader to ingest csv data to a Delta table
query = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", source_format)
.option("cloudFiles.schemaLocation", checkpoint_directory)
.option("header", "true")
.option("delimiter", ";")
.option("skipRows", 7)
.option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3") # To ingest files that have a modification timestamp after the provided timestamp.
.option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV") # A potential glob pattern to provide for choosing files.
.load(data_source)
.select(
"*",
current_timestamp().alias("_JOB_UPDATED_TIME"),
input_file_name().alias("_JOB_SOURCE_FILE"),
col("_metadata.file_modification_time").alias("_MODIFICATION_TIME")
)
.writeStream
.option("checkpointLocation", checkpoint_directory)
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable(table_name)
)
This code allows me to capture each new file and ingest it into a Raw Table.
The problem is that it works fine ONLY when a new file arrives. But if the desired file is overwritten in the landing folder the Autoloader does nothing because it assumes the file has already been ingested, even though the modification time of the file has chaged.
Failed tentative
I tried to use the option modifiedAfter in the code. But it appears to only serve as a filter to prevent files with a Timestamp to be ingested if it has the property before the threshold mentioned in the timestamp string. It dows not reingest files that have Timestamps before the modifiedAfter threshold.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does someone knows how to detect a file that was already ingested but has a different modified date and how to reprocess that to load in a table?
I have figured out a solution to this problem. In the Autoloader Options list in Databricks documentation is possible to see an option called cloudFiles.allowOverwrites. If you enable that in the streaming query then whenever a file is overwritten in the lake the query will ingest it into the target table. Please pay attention that this option will probably duplicate the data whenever a new file is overwritten. Therefore, downstream treatment will be necessary.

rescued_data column of schemaEvolutionMode is not showing in output of readStream in Pyspark(databricks)

I am new to databricks (pyspark), I have few queries regarding pyspark syntax:
Do we need to follow any specific order when using options in readStream and writeStream. For example:-
(dataframe.readStream
.format("cloudFile)
.option("cloudFiles.format": "avro")
.option("multiline", True)
.schema(schema)
.load(path))
Delta table creation w/ both tableName and location option, is that right?? If I use both only I can see the files like .parquet, _delta log, checkpoint in the specified path and if I use tableName only I can see the table in hive meta store/spark catalog.bronze of SQL editor in databricks
The syntax i use, is it ok to use both .tableName() and .location() option
(DeltaTable.createIfNotExists(spark)
.tableName("%s.%s_%s" % (layer, domain, deltaTable))
.addColumn("x", "INTEGER")
.location(path)
.execute())
1st question: Do we need to follow any specific order when using options in readStream and writeStream.
Answer: No order required you can use different options like format, option, schema, etc in any order
2nd Question: Delta table creation w/ both tableName and location option, is that right?? If I use both only I can see the files like .parquet, _delta log, checkpoint in the specified path and if I use tableName only I can see the table in hive meta store/spark catalog.bronze of SQL editor in databricks
Answer: If you specify the location explicitly, it is external table (this table is created under specified location and once table alone is deleted the data still persists under data store location like mounted ADLS) and if you don't specify location it is managed table (create under default location and this table can be deleted once the table is deleted)
And to the question in the heading/top of the question about rescued data column:
Answer: Even though we use option to see rescued data, we need to select that rescued data column so that we can see it in the resultant dataframe. If we don't select it in: df.select(), then we can't see it in the result.
Please correct me if I am wrong.

Spark: refresh Delta Table in S3

how can I run the refresh table command on a Delta Table in S3?
When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
Ok. It's really an incorrect function - the spark.catalog.refreshTable function (doc) is used to refresh table metadata inside the Spark. It has nothing to do with recovery of the Delta table.
To fix this on Delta you need to do something different. Unfortunately I'm not 100% sure about right way for open source Delta implementation - on Databricks we have FSCK REPAIR TABLE SQL command for that. I would try following (be careful, make a backup!):
If removed files were in the recent version, then you may try to use RESTORE command with spark.sql.files.ignoreMissingFiles set to true
If removed files were for the specific partition, then you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), leave data only for that partitions, and write data using overwrite mode with replaceWhere option (doc) that contains condition
Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in Overwrite mode - it will of course duplicate your data, but the old files will be removed by the VACUUM.

How to do append insertion in sparksql?

I have a api endpoint written by sparksql with the following sample code. Every time api accept a request it will run sparkSession.sql(sql_to_hive) which would create a single file in HDFS. Is there any way to do insert by appending data to existing file in HDFS ? Thanks.
sqlContext = SQLContext(sparkSession.sparkContext)
df = sqlContext.createDataFrame(ziped_tuple_list, schema=schema)
df.registerTempTable('TMP_TABLE')
sql_to_hive = 'insert into log.%(table_name)s partition%(partition)s select %(title_str)s from TMP_TABLE'%{
'table_name': table_name,
'partition': partition_day,
'title_str': title_str
}
sparkSession.sql(sql_to_hive)
I don't think this is possible case to append data to the existing file.
But you can work around this case by using either of these ways
Approach1
Using Spark, write to intermediate temporary table and then insert overwrite to final table:
existing_df=spark.table("existing_hive_table") //get the current data from hive
current_df //new dataframe
union_df=existing_df.union(current_df)
union_df.write.mode("overwrite").saveAsTable("temp_table") //write the data to temp table
temp_df=spark.table("temp_table") //get data from temp table
temp_df.repartition(<number>).write.mode("overwrite").saveAsTable("existing_hive_table") //overwrite to final table
Approach2:
Hive(not spark) offers overwriting and select same table .i.e
insert overwrite table default.t1 partition(partiton_column)
select * from default.t1; //overwrite and select from same t1 table
If you are following this way then there needs to be hive job triggered once your spark job finishes.
Hive will acquire lock while running overwrite/select the same table so if any job which is writing to table will wait.
In Addition: Orc format will offer alter table concatenate which will merge small ORC files to create a new larger file.
alter table <db_name>.<orc_table_name> [partition_column="val"] concatenate;
We can also use distributeby,sortby clauses to control number of files, refer this and this link for more details.
Another Approach3 is by using hadoop fs -getMerge to merge all small files into one (this method works for text files and i haven't tried for orc,avro ..etc formats).
When you write the resulted dataframe:
result_df = sparkSession.sql(sql_to_hive)
set it’s mode to append:
result_df.write.mode(SaveMode.Append).

Dataframe not able to write on S3

I am creating a dataframe from existing hive table.Table is partitioned on date and site column.Now, when i am trying to overwrite the data in this same table after some computation with previous day data.It is successfully getting loaded.
But when i am trying to write final dataframe at S3 bucket. I am getting error saying file not found.Now the file it is mentioning is previous day file which is now overwritten.
If i write dataframe first and then overwrite table then its running fine.
For writing at S3 location , what it has to do with table partition file?
Below is the error and code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')
Before last line fact_final is just a "lazy" dataframe object that contains definitions only. It does not contain any data. But it has pointer to exact data files, where data is stored actually.
When you try to perform actual operations (does not matter it's writing to S3, or executing query like fact_final.count()) you'll get the error as above. It looks like partition local_dt=2018-05-10 does not exists anymore (files/folder that sits behind it does not exists).
You can try to re-initialize dataframe once again, before final write (it's another lazy operation - all work is done in your case while you writing it on S3).

Resources