I am trying to implement Auto Loader with MERGE INTO on multiple tables, using the code below as described in the documentation:
from delta.tables import DeltaTable

def upsert_data(df, epoch_id):
    deltaTable = DeltaTable.forPath(spark, target_location)
    deltaTable.alias("t").merge(df.alias("s"),
            "t.xx = s.xx and t.xx1 = s.xx1") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

for i in range(len(list_of_files()[0])):
    schema = list_of_files()[2][i]
    raw_data = list_of_files()[1][i]
    checkpoint = list_of_files()[3][i]
    target_location = list_of_files()[4][i]

    dfSource = list_of_files(raw_data)
    dfMergedSchema = dfSource.where("1=0")
    dfMergedSchema.createOrReplaceGlobalTempView("test1")
    dfMergedSchema.write.option("mergeSchema", "true").mode("append").format("delta") \
        .save(target_location)

    stream = spark.readStream \
        .format("cloudFiles") \
        .option("cloudFiles.format", "parquet") \
        .option("header", "true") \
        .schema(schema) \
        .load(raw_data)

    stream.writeStream.format("delta") \
        .outputMode("append") \
        .foreachBatch(upsert_data) \
        .option("dataChange", "false") \
        .trigger(once=True) \
        .option("checkpointLocation", checkpoint) \
        .start()
My scenario:
We have a Landing Zone where Parquet files are appended into multiple folders for example as shown below:
Landing Zone ---|
|-----folder 0 ---|----parquet1
| |----parquet2
| |----parquet3
|
|
|-----folder 1 ---|----parquet1
| |----parquet2
| |----parquet3
Then I need Auto Loader to create the tables, with their checkpoints, as shown below:
Staging Zone ---|
|-----folder 0 ---|----checkpoint
| |----table
|
|
|
|-----folder 1 ---|----checkpoint
| |----table
|
I am noticing that without the foreachBatch option in the writeStream, but with trigger once, the code works as expected and inserts into the multiple tables shown above. The code also works when we have both foreachBatch and the trigger on an individual table without the for loop. However, when I enable both options (foreachBatch and trigger once) for multiple tables inside the for loop, Auto Loader merges all the table contents into one table. For folder 0 in the Staging Zone you get a checkpoint but no table contents, and for folder 1 you get a checkpoint plus the Delta files that make up the table contents of both folder 0 and folder 1 inside its table folder. It is merging both tables into one.
I also get a ConcurrentAppendException.
I read about ConcurrentAppendException in the documentation, and what I found is that you either use partitioning or put a disjoint condition in the upsert_data function passed into the foreachBatch option of the writeStream. I tried both and neither works.
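For reference, the disjoint-condition variant from the documentation would look roughly like this inside upsert_data (a sketch only; load_date is a hypothetical partition column of the target table):

# Sketch: constrain the merge to a single partition so concurrent merges
# touch disjoint sets of files ("load_date" is a hypothetical partition
# column; the literal would come from the batch being processed).
deltaTable.alias("t").merge(
        df.alias("s"),
        "t.load_date = '2023-01-01' AND t.xx = s.xx AND t.xx1 = s.xx1") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()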
How can one isolate the streams for the different folders in this scenario for the Staging Zone while using foreachBatch and trigger once in this for loop? There is definitely something I am missing with the foreachBatch option here, because without it Auto Loader is able to isolate the streams to folder 0 and folder 1, but with it, it is not.
I spoke with a Databricks Solution Architect today, and he mentioned that I needed to use a ThreadPoolExecutor, which is outside Auto Loader and Databricks itself and native to Python. It goes in a helper function that specifies the number of streams handling the tables in parallel with Auto Loader. That way a single Auto Loader notebook can serve multiple tables, which meets my use case. Thanks!
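For anyone who lands here, a minimal sketch of that approach (not the architect's exact code) could look like the following. It reuses list_of_files() from the code above and binds each table's target path to its own merge function instead of relying on the shared target_location variable, which may also be why foreachBatch was mixing tables in the loop version:

# Hedged sketch: one Auto Loader stream per table, run in parallel threads.
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

def make_upserter(target_location):
    # Returns a foreachBatch function bound to one table's target path.
    def upsert_data(df, epoch_id):
        deltaTable = DeltaTable.forPath(spark, target_location)
        (deltaTable.alias("t")
            .merge(df.alias("s"), "t.xx = s.xx and t.xx1 = s.xx1")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    return upsert_data

def run_stream(raw_data, schema, checkpoint, target_location):
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .schema(schema)
        .load(raw_data)
        .writeStream
        .foreachBatch(make_upserter(target_location))
        .trigger(once=True)
        .option("checkpointLocation", checkpoint)
        .start()
        .awaitTermination())

files = list_of_files()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(run_stream, files[1][i], files[2][i], files[3][i], files[4][i])
        for i in range(len(files[0]))
    ]
    [f.result() for f in futures]

max_workers controls how many tables are processed in parallel; each table still gets its own checkpoint and target path.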
Main topic
I am facing a problem that I am struggling a lot to solve: ingest files that have already been captured by Auto Loader but were overwritten with new data.
Detailed problem description
I have a landing folder in a data lake where a new file is posted every day.
Each day an automation posts a file with new data. The file is named with a suffix indicating the year and month of the current posting period. This naming convention means the file is overwritten every day with the accumulated data extracted for the current month; the number of files in the folder only grows when the current month closes and a new month starts.
To deal with that I have implemented the following PySpark code using the Auto Loader feature from Databricks:
# Import functions
from pyspark.sql.functions import input_file_name, current_timestamp, col
# Define variables used in code below
checkpoint_directory = "abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test/_checkpoint/sapex_ap_posted"
data_source = f"abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/Test"
source_format = "csv"
table_name = "prod_gbs_gpdi.bronze_data.sapex_ap_posted"
# Configure Auto Loader to ingest csv data to a Delta table
query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("cloudFiles.schemaLocation", checkpoint_directory)
    .option("header", "true")
    .option("delimiter", ";")
    .option("skipRows", 7)
    .option("modifiedAfter", "2022-10-15 11:34:00.000000 UTC-3")  # To ingest files that have a modification timestamp after the provided timestamp.
    .option("pathGlobFilter", "AP_SAPEX_KPI_001 - Posted Invoices in *.CSV")  # A potential glob pattern to provide for choosing files.
    .load(data_source)
    .select(
        "*",
        current_timestamp().alias("_JOB_UPDATED_TIME"),
        input_file_name().alias("_JOB_SOURCE_FILE"),
        col("_metadata.file_modification_time").alias("_MODIFICATION_TIME")
    )
    .writeStream
    .option("checkpointLocation", checkpoint_directory)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable(table_name)
)
This code allows me to capture each new file and ingest it into a raw table.
The problem is that it works fine ONLY when a new file arrives. If the desired file is overwritten in the landing folder, Auto Loader does nothing because it assumes the file has already been ingested, even though the file's modification time has changed.
Failed attempt
I tried to use the modifiedAfter option in the code, but it appears to serve only as a filter that prevents files with a modification timestamp earlier than the given threshold from being ingested. It does not re-ingest files that have already been captured.
.option("modifiedAfter", "2022-10-15 14:10:00.000000 UTC-3")
Question
Does anyone know how to detect a file that has already been ingested but has a different modification date, and how to reprocess it so it is loaded into the table?
I have figured out a solution to this problem. In the Auto Loader options list in the Databricks documentation there is an option called cloudFiles.allowOverwrites. If you enable it in the streaming query, then whenever a file is overwritten in the lake the query will ingest it into the target table. Be aware that this option will likely duplicate the data whenever a file is overwritten, so downstream treatment will be necessary.
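In my case that boils down to adding a single option on the read side; a sketch based on the query above:

query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", source_format)
    .option("cloudFiles.schemaLocation", checkpoint_directory)
    # Re-ingest files whose content changes even though the file name stays the same.
    .option("cloudFiles.allowOverwrites", "true")
    .option("header", "true")
    .option("delimiter", ";")
    .option("skipRows", 7)
    .load(data_source)
    .writeStream
    .option("checkpointLocation", checkpoint_directory)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable(table_name)
)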
I am working with Databricks Delta Live Tables, but I have some problems upserting some tables upstream. I know the text below is quite long, but I tried to describe my problem as clearly as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files, but in table format. This table is in delta format. No transformations are done, except for transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
import dlt

cloudfile = {"cloudFiles.format": "JSON",
             "cloudFiles.schemaLocation": sourceschemalocation,
             "cloudFiles.inferColumnTypes": True}

@dlt.view(name='landing_view')
def inc_view():
    df = (spark
          .readStream
          .format('cloudFiles')
          .options(**cloudfile)
          .load(filpath_to_landing))
    # <Some transformations to go from JSON to tabular (explode, ...)>
    return df

dlt.create_target_table('raw_table',
                        table_properties={'delta.enableChangeDataFeed': 'true'})

dlt.apply_changes(target='raw_table',
                  source='landing_view',
                  keys=['id'],
                  sequence_by='updated_at')
This code works as expected: I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to the 'raw_table'.
(However, each time a new parquet file with all the data is created in the delta folder. I would expect that only a parquet file with the inserted and updated rows would be added, and that some information about the current version would be kept in the delta logs. Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true, and the readStream for 'intermediate_table' then has option('readChangeFeed', 'true').)
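For completeness, the <Some transformations to go from JSON to tabular> placeholder in the first block above is essentially an explode plus selecting columns from the JSON keys; a rough sketch with hypothetical field names:

from pyspark.sql.functions import col, explode

# Hypothetical JSON layout: a top-level array "records" whose struct elements
# hold the actual fields; adjust the names to the real schema.
df = (df
      .withColumn("record", explode(col("records")))
      .select(
          col("record.id").alias("id"),
          col("record.updated_at").alias("updated_at"),
          col("record.payload").alias("payload")))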
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table', table_properties={'delta.enableChangeDataFeed': 'true'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df

dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I looked into 'ignoreChanges', but I don't think that is what I want. I would expect the autoloader to be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with appends, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder containing only the inserts and updates. This added parquet file would then be detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source files will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Depending on settings like "optimized writes" (and even without them), apply_changes can both add and remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
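Instead of opening the JSON commit files by hand, DESCRIBE HISTORY exposes the same metrics; a quick sketch (the table path is hypothetical):

# The same numbers the _delta_log JSON shows, via DESCRIBE HISTORY.
history = spark.sql("DESCRIBE HISTORY delta.`/path/to/raw_table`")
(history
    .selectExpr(
        "version",
        "operation",
        "operationMetrics['numTargetFilesAdded'] AS files_added",
        "operationMetrics['numTargetFilesRemoved'] AS files_removed")
    .show(truncate=False))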
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you change the settings to include '.option('readChangeFeed', 'true')', you should start with a full refresh (there is a dropdown near the Start button). Doing this will resolve the 'Detected a data update xxx' error, and your code should then work for incremental updates.
Version: DBR 8.4 | Spark 3.1.2
Spark allows me to create a bucketed hive table and save it to a location of my choosing.
df_data_bucketed = (df_data.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
I have verified that this saves the table data to my specified path (in my case, blob storage).
In the future, the table 'data_bucketed' might be wiped from my Spark catalog or mapped to something else, and I'll want to "recreate it" using the data that has previously been written to blob storage, but I can find no way to load a pre-existing, already-bucketed Spark table.
The only thing that appears to work is
df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
This seems nonsensical, because it essentially loads the data from disk and unnecessarily overwrites it with the exact same data just to take advantage of the buckets. (It is also very slow due to the size of the data.)
You can use Spark SQL to create that table in your catalog:
spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""")
Following this, you can tell Spark to rediscover the data by running spark.sql("MSCK REPAIR TABLE tbl").
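Applied to the table in the question, that could look roughly like this (a sketch: the column list is hypothetical, while the bucket spec matches the original bucketBy(9600, 'id').sortBy('id') write and bucketed_path comes from the question):

# Hedged sketch: re-register the already-bucketed data as a table without
# rewriting it. Column list is hypothetical; bucket spec matches the original write.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS data_bucketed (id BIGINT, value STRING)
    USING PARQUET
    CLUSTERED BY (id) SORTED BY (id) INTO 9600 BUCKETS
    LOCATION '{bucketed_path}'
""")

# Only needed if the table is also partitioned: rediscover the partitions.
spark.sql("MSCK REPAIR TABLE data_bucketed")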
I found the answer at https://www.programmerall.com/article/3196638561/:
Read from the saved Parquet file: if you want to use historically saved data, you can't use the above method, nor can you use spark.read.parquet() like reading regular files. The data read in this way does not carry bucket information. The correct way is to use the CREATE TABLE statement. For details, refer to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
Examples are as follows:
spark.sql(
"""
|CREATE TABLE bucketed
| (name string)
| USING PARQUET
| CLUSTERED BY (name) INTO 10 BUCKETS
| LOCATION '/path/to'
|""".stripMargin)
I am trying to write a Spark dataframe into an existing delta table.
I do have multiple scenarios where I could save data into different tables as shown below.
SCENARIO-01:
I have an existing delta table, and I have to write a dataframe into that table with the mergeSchema option since the schema may change with each load.
I am doing this with the command below by providing the delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing delta TABLE NAME instead of the delta PATH.
This has been resolved by updating the data write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way?
SCENARIO 02:
I have to update the existing table if the record already exists, and insert a new record if not.
For this I am currently doing the following:
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with the below script,
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but I am getting the below error (even with alias instead of as):
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using the delta table path as the destination. Is there any way we could provide the delta TABLE NAME instead of the TABLE PATH?
It would be better to provide the TABLE NAME instead of the TABLE PATH, so that changing the table path later will not affect the code.
I have not seen anywhere in the Databricks documentation that a table name can be provided along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of a path, you either need to have used saveAsTable from the beginning, or you can register the existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (the syntax may differ slightly depending on whether you're running on Databricks or OSS Spark, and on the Spark version):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable.
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the forName function from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...
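For completeness, a minimal PySpark sketch of the full merge using forName (the question's snippets are Scala; the table name here is hypothetical):

from delta.tables import DeltaTable

# Hedged sketch: resolve the target by name instead of by path.
(DeltaTable.forName(spark, "my_db.veh_master")   # hypothetical table name
    .alias("t")
    .merge(vehMasterDf.alias("s"), "t.id = s.id")
    .whenNotMatchedInsertAll()
    .execute())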
I have an API endpoint written with Spark SQL, with the following sample code. Every time the API accepts a request, it runs sparkSession.sql(sql_to_hive), which creates a single file in HDFS. Is there any way to do the insert by appending data to an existing file in HDFS? Thanks.
sqlContext = SQLContext(sparkSession.sparkContext)
df = sqlContext.createDataFrame(ziped_tuple_list, schema=schema)
df.registerTempTable('TMP_TABLE')
sql_to_hive = 'insert into log.%(table_name)s partition%(partition)s select %(title_str)s from TMP_TABLE'%{
'table_name': table_name,
'partition': partition_day,
'title_str': title_str
}
sparkSession.sql(sql_to_hive)
I don't think it is possible to append data to an existing file.
But you can work around this by using either of the following approaches.
Approach 1:
Using Spark, write to an intermediate temporary table and then insert overwrite into the final table:
existing_df = spark.table("existing_hive_table")  # get the current data from hive
current_df = ...  # new dataframe
union_df = existing_df.union(current_df)
union_df.write.mode("overwrite").saveAsTable("temp_table")  # write the data to a temp table
temp_df = spark.table("temp_table")  # get data from the temp table
temp_df.repartition(<number>).write.mode("overwrite").saveAsTable("existing_hive_table")  # overwrite the final table
Approach 2:
Hive (not Spark) offers overwriting a table from a select on that same table, i.e.:
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive acquires a lock while running the overwrite/select on the same table, so any job writing to that table will have to wait.
In addition, the ORC format offers ALTER TABLE ... CONCATENATE, which merges small ORC files to create a new, larger file.
alter table <db_name>.<orc_table_name> [partition_column="val"] concatenate;
We can also use DISTRIBUTE BY and SORT BY clauses to control the number of files; refer to this and this link for more details.
Approach 3 is to use hadoop fs -getmerge to merge all the small files into one (this method works for text files; I haven't tried it for ORC, Avro, etc. formats).
When you write the resulting dataframe:
result_df = sparkSession.sql(sql_to_hive)
set it’s mode to append:
result_df.write.mode(SaveMode.Append).
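In PySpark that would look roughly like this (a sketch: select_query is a hypothetical SELECT-only version of the statement, since running the INSERT itself returns an empty DataFrame). Note that this still writes new files to the partition rather than appending rows to an existing file:

# Hedged sketch: append the query result to the Hive table through the
# DataFrame writer instead of an INSERT statement.
result_df = sparkSession.sql(select_query)  # hypothetical SELECT-only query over TMP_TABLE
# Assumes the target Hive table already exists and the DataFrame columns line
# up with the table (including the partition column).
result_df.write.mode("append").insertInto("log." + table_name)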