Concurrent update to delta lake table via multiple jobs - apache-spark

I have a Delta table into which multiple Databricks jobs can merge/upsert data concurrently.
How can I prevent getting a ConcurrentAppendException?
I cannot use this solution, as the incoming changes can be part of any partition and I cannot filter on any particular partition.
Is there a way to check whether the Delta table is being appended/merged/updated/deleted, wait until that is completed, and only then acquire the lock and start the merge for the second job?
Just FYI, these are 2 independent Azure Data Factory jobs trying to update one Delta table.
Cheers!

You should handle concurrent appends to Delta as you would with any other data store that uses optimistic offline locking: add application-specific retry logic to your code for whenever that particular exception happens.
Here's a good video on the inner workings of Delta.
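As a rough illustration of that retry logic, here is a minimal sketch (not a definitive implementation: the function name, back-off and retry limit are made up, and on runtimes that don't expose delta.exceptions you would have to match the underlying Py4J error by class name instead):

import time
from delta.tables import DeltaTable
from delta.exceptions import ConcurrentAppendException

def merge_with_retry(spark, updates_df, table_path, merge_condition, max_retries=5):
    # Retry the MERGE whenever a concurrent writer wins the commit race.
    for attempt in range(max_retries):
        target = DeltaTable.forPath(spark, table_path)   # re-read the latest snapshot
        try:
            (target.alias("t")
                   .merge(updates_df.alias("u"), merge_condition)
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
            return
        except ConcurrentAppendException:
            # Another job committed first; back off and try again.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Merge still conflicting after {max_retries} attempts")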

Related

Costs Databricks Delta Live Tables

Will a Databricks Delta Live Tables pipeline generate costs regardless of whether it finds data to load? And would a solution in that case be to disable the job if you know no new data will be arriving at the source for a while?
A Delta Live Tables job incurs costs only while it's running. Jobs are usually set to be triggered regularly, so even if you don't have data, the cluster will be created and the DLT pipeline will be executed; if no data is found it will (most probably, though it depends on the job) just finish, as there is nothing to process. In that case the costs will be relatively small, unless you set the pipeline up in continuous mode.
If you know that no data is arriving, you can pause the job that triggers the DLT pipeline.

Best option for storage in spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently the number of records is around 55 million+ and will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, filtering it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thought is that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job, deployed onto the same EMR cluster, that will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third-party database table dump is only the initial goal. Later on there will quite possibly be tens or even hundreds of other sources that I would need to match against, which means trillions of records that are blocked, and those blocks need to be stored somewhere. Would this option still be viable at that stage?

Error writing a partitioned Delta Table from a multitasking job in azure databricks

I have a notebook that writes to a Delta table with a statement similar to the following:
from delta.tables import DeltaTable

# update_set, condition and values_set are defined earlier in the notebook
match = "current.country = updates.country and current.process_date = updates.process_date"

deltaTable = DeltaTable.forPath(spark, silver_path)
deltaTable.alias("current") \
    .merge(
        data.alias("updates"),
        match) \
    .whenMatchedUpdate(
        set=update_set,
        condition=condition) \
    .whenNotMatchedInsert(values=values_set) \
    .execute()
The multitask job has two tasks that are executed in parallel.
When executing the job the following error is displayed:
ConcurrentAppendException: Files were added to partition [country=Panamá,
process_date=2022-01-01 00:00:00] by a concurrent update. Please try the
operation again.
In each task I send a different country (Panama, Ecuador) and the same date as parameters, so each execution should only write the information corresponding to the country that was sent.
This delta table is partitioned by the country and process_date fields.
Any ideas what I'm doing wrong?
How should I specify the partition to be affected when using the "merge" statement?
I'd appreciate it if you could clarify how I should work with partitions in these cases, since this is new to me.
Update:
I made an adjustment to the condition to specify the country and process date, according to what is indicated here (ConcurrentAppendException).
Now I get the following error message:
ConcurrentAppendException: Files were added to the root of the table
by a concurrent update. Please try the operation again.
I can't think of what could be causing the error. I'll keep investigating.
Error – ConcurrentAppendException: Files were added to the root of
the table by a concurrent update. Please try the operation again.
This exception is often thrown during concurrent DELETE, UPDATE, or MERGE operations. While the concurrent operations may be physically updating different partition directories, one of them may read the same partition that the other one concurrently updates, thus causing a conflict.
You can avoid this by making the separation explicit in the operation condition.
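For illustration, one way to make the separation explicit is to pin the match condition to the partition values each task receives. This is a sketch reusing the variables from the notebook above (silver_path, data, update_set, condition and values_set are as defined there):

from delta.tables import DeltaTable

country = "Panamá"                       # parameter received by this task
process_date = "2022-01-01 00:00:00"     # parameter received by this task

# Pin the target side of the match to the partition this task owns,
# in addition to the join keys, so the two concurrent merges are disjoint.
match = (
    f"current.country = '{country}' AND current.process_date = '{process_date}' AND "
    "current.country = updates.country AND current.process_date = updates.process_date"
)

deltaTable = DeltaTable.forPath(spark, silver_path)
deltaTable.alias("current") \
    .merge(data.alias("updates"), match) \
    .whenMatchedUpdate(set=update_set, condition=condition) \
    .whenNotMatchedInsert(values=values_set) \
    .execute()

Note that the incoming data itself should also only contain rows for that country and date; otherwise rows inserted by whenNotMatchedInsert can still land in other partitions.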
An UPDATE query is executed against the Delta Lake target table when an 'Update Strategy' transformation is used in the mapping. When multiple Update Strategy transformations are used for the same target table, multiple UPDATE queries are executed in parallel and the target data becomes unpredictable.
Because of this unpredictable-data scenario for concurrent UPDATE queries against a Delta Lake target, using more than one 'Update Strategy' transformation per 'Databricks Delta Lake Table' in a mapping is not supported. Redesign the mapping so that there is one 'Update Strategy' transformation per Delta Lake table.
Solution -
When running a mapping with one 'Update Strategy' transformation per Databricks Delta Lake table, execution completes successfully.
Refer - https://docs.delta.io/latest/concurrency-control.html#avoid-conflicts-using-partitioning-and-disjoint-command-conditions
Initially, the affected table only had a date field as its partition, so I repartitioned it by the country and date fields.
This new partitioning created the country and date directories; however, the old directories from the date-only partitioning remained and were not deleted.
Apparently these directories were causing the conflict when they were read concurrently.
I created a new Delta table on another path with the correct partitions and then used it to replace the one on the original path. This allowed the old partition directories to be removed.
The only consequence of performing these actions was that I lost the change history of the table (time travel).
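Roughly, the rebuild described above can look like this (a sketch under the question's setup; silver_path_new is a hypothetical temporary location):

# Rewrite the existing data to a new location with the corrected partitioning
# (silver_path comes from the notebook above; silver_path_new is hypothetical).
df = spark.read.format("delta").load(silver_path)

df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("country", "process_date") \
    .save(silver_path_new)

# The original location was then replaced with the contents of the new one
# (e.g. dbutils.fs.rm(silver_path, recurse=True) followed by a move/copy),
# which is also why the table's change history (time travel) was lost.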

How to insert into Delta table in parallel

I have a process which, in short, runs 100+ copies of the same Databricks notebook in parallel on a pretty powerful cluster. Each notebook, at the end of its process, writes roughly 100 rows of data to the same Delta Lake table stored in an Azure Gen1 Data Lake. I am seeing extremely long insert times into Delta, which I can only assume is Delta doing some sort of locking of the table while an insert occurs and then freeing it up once a single notebook finishes, even though https://docs.databricks.com/delta/concurrency-control.html implies that there are no insert conflicts and that multiple writers across multiple clusters can simultaneously insert data.
This insertion of 100 rows per notebook, across the 100+ notebooks, takes over 3 hours. The current code that is causing the bottleneck is:
df.write.format("delta").mode("append").save("<path_>")
Currently there are no partitions on this table, which could be a possible fix, but before going down this route: is there something I am missing in terms of how you get conflict-free inserts in parallel?
You have to choose between two isolation levels for your table, and the weaker one is the default, so there is no getting away from isolation levels.
https://docs.databricks.com/delta/optimizations/isolation-level.html
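For reference, the isolation level is a Delta table property you can change (a minimal sketch; the table name below is just a placeholder):

# 'WriteSerializable' is the default; 'Serializable' is the stricter option.
spark.sql("""
    ALTER TABLE my_delta_table
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")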
Delta Lake uses OCC (optimistic concurrency control), which means that the data you want to write to your table is validated against all of the data that the other 99 processes want to write, so 100 * 100 = 10,000 validations are being made.
https://en.wikipedia.org/wiki/Optimistic_concurrency_control
Please also bear in mind that your data processing architecture only finishes when the last of the 100 notebooks finishes. Maybe one or more of the 100 notebooks takes 3 hours to complete and the insert is not to blame?
If long-running notebooks are not the cause, I would suggest you store the result data from each notebook in some sort of intermediate structure (e.g. one file per notebook, 100 files in total) and then batch insert the data from those files into the destination table.
The data processing will be parallel, the insert will not be parallel.
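A rough sketch of that pattern, assuming each notebook stages its ~100 rows as Parquet under a shared staging prefix and a single follow-up task does the one append (the staging path, run name and notebook_id are made up):

# In each of the 100+ notebooks: stage the ~100 result rows as Parquet instead
# of appending straight to the Delta table.
result_df.write.mode("overwrite").parquet(f"/mnt/staging/run1/{notebook_id}")

# In a single follow-up task, after all notebooks have finished: read everything
# that was staged and append it to the Delta table in one commit.
staged = spark.read.parquet("/mnt/staging/run1/*")
staged.write.format("delta").mode("append").save("<path_>")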

How to update or even reset rows in persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. It is to be used by other batch and streaming jobs, across different clusters; thus I want to save it as a persistent, shared table for all jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
What, then, is the best way (for the batch job that manages this data) to gracefully update the table contents (actually, overwrite it completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 Structured Streaming queries?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark, so that while a saveAsTable is in progress Spark would hold all writes until the update/reset has finished.
I think that the best way to deal with the requirement is to append new records (using insertInto) with a batch id that denotes the rows belonging to the "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
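A minimal sketch of that approach, assuming the exchangeRates table was (re)created with an extra batch_id column (column and variable names are illustrative):

from pyspark.sql import functions as F

# Weekly batch job: tag the fresh rates with the next batch id and append them.
last_id = spark.table("exchangeRates").agg(F.max("batch_id")).first()[0] or 0
allExchangeRatesDF \
    .withColumn("batch_id", F.lit(last_id + 1)) \
    .write.insertInto("exchangeRates")   # insertInto matches columns by position

# Consumers: treat only the rows of the latest batch as "the table".
latest = spark.table("exchangeRates").agg(F.max("batch_id")).first()[0]
currentRates = spark.table("exchangeRates").where(F.col("batch_id") == latest)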
