ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again - delta-lake

This is the error that occurs when processing concurrent merges into Delta Lake tables in Azure Databricks: ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again. What are the options to stop this error?
One option is to change the isolation level from the default WriteSerializable to Serializable:
ALTER TABLE <table_name> SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
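From a notebook, the same property can be set through spark.sql; a minimal sketch, where my_delta_table is a placeholder for your own table name:

# my_delta_table is a placeholder; substitute your own Delta table name.
spark.sql(
    "ALTER TABLE my_delta_table "
    "SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')"
)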

Related

Databricks - ConcurrentAppendException

I'm running about 20 notebooks concurrently and they all update the same Delta table (however, different rows). I'm getting the below exception if any two notebooks try to update the table at the same time.
Does setting 'delta.isolationLevel' = 'Serializable' for the Delta table fix the issue? Is there a better option?
ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again. Conflicting commit:

Error writing a partitioned Delta table from a multi-task job in Azure Databricks

I have a notebook that writes a delta table with a statement similar to the following:
match = "current.country = updates.country and current.process_date = updates.process_date"
deltaTable = DeltaTable.forPath(spark, silver_path)
deltaTable.alias("current")\
.merge(
data.alias("updates"),
match) \
.whenMatchedUpdate(
set = update_set,
condition = condition) \
.whenNotMatchedInsert(values = values_set)\
.execute()
The multi-task job has two tasks that run in parallel.
When the job runs, the following error is displayed:
ConcurrentAppendException: Files were added to partition [country=Panamá,
process_date=2022-01-01 00:00:00] by a concurrent update. Please try the
operation again.
Each task receives a different country (Panama, Ecuador) and the same date as parameters, so each run should write only the data for the country it was given.
This delta table is partitioned by the country and process_date fields.
Any ideas what I'm doing wrong?
How should I specify the partition to be affected when using the "merge" statement?
I would appreciate it if you could clarify how I should work with partitions in these cases, since this is new to me.
Update:
I adjusted the condition to specify the country and process date, following what is indicated here (ConcurrentAppendException).
Now I get the following error message:
ConcurrentAppendException: Files were added to the root of the table
by a concurrent update. Please try the operation again.
I can't think of what could be causing the error. I'll keep investigating.
Error – ConcurrentAppendException: Files were added to the root of
the table by a concurrent update. Please try the operation again.
This exception is often thrown during concurrent DELETE, UPDATE, or MERGE operations. While the concurrent operations may be physically updating different partition directories, one of them may read the same partition that the other one concurrently updates, thus causing a conflict.
You can avoid this by making the separation explicit in the operation condition.
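Applied to the merge in the question, that means pinning the merge condition to the partition each task writes; a sketch, where the literal country and date values are illustrative and would normally come from the task parameters:

# Pin the merge to the partition this task writes so that concurrent
# tasks touch disjoint files. The literal values are illustrative; in
# the job they would come from the task parameters.
match = (
    "current.country = updates.country "
    "AND current.process_date = updates.process_date "
    "AND current.country = 'Panamá' "
    "AND current.process_date = '2022-01-01 00:00:00'"
)

deltaTable.alias("current") \
    .merge(data.alias("updates"), match) \
    .whenMatchedUpdate(set=update_set, condition=condition) \
    .whenNotMatchedInsert(values=values_set) \
    .execute()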
An UPDATE query is executed against the Delta Lake target table when an 'Update Strategy' transformation is used in the mapping. When multiple 'Update Strategy' transformations are used for the same target table, multiple UPDATE queries run in parallel and, hence, the target data becomes unpredictable.
Because concurrent UPDATE queries make the Delta Lake target data unpredictable, using more than one 'Update Strategy' transformation per Databricks Delta Lake table in a mapping is not supported. Redesign the mapping so that there is one 'Update Strategy' transformation per Delta Lake table.
Solution:
When running a mapping with one 'Update Strategy' transformation per Databricks Delta Lake table, execution completes successfully.
Refer - https://docs.delta.io/latest/concurrency-control.html#avoid-conflicts-using-partitioning-and-disjoint-command-conditions
Initially, the affected table was partitioned only by the date field, so I repartitioned it by the country and date fields.
The new partitioning created the country/date directories; however, the old date-only partition directories remained and were not deleted.
Apparently these old directories were causing the conflict when they were read concurrently.
I created a new Delta table on another path with the correct partitioning and then replaced the table at the original path. This allowed the old partition directories to be removed.
The only consequence of performing these actions was that I lost the change history of the table (time travel).
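A minimal sketch of that rebuild, assuming the silver_path variable from the question, an illustrative rebuilt_path, and Databricks dbutils for the swap (physically replacing the directory is what removes the old folders and also what discards the history):

# Rewrite the table with the new partitioning on a temporary path,
# then swap it into the original location. rebuilt_path is illustrative.
spark.read.format("delta").load(silver_path) \
    .write.format("delta") \
    .partitionBy("country", "process_date") \
    .save(rebuilt_path)

# Replacing the directory removes the old date-only folders, but it also
# removes the previous transaction log, which is why time travel is lost.
dbutils.fs.rm(silver_path, True)
dbutils.fs.mv(rebuilt_path, silver_path, True)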

Databricks Delta cache contains a stale footer and stale page entries Error

I have been getting intermittent notebook failures related to querying a TEMPORARY VIEW that selects from a parquet file located on an ADLS Gen2 mount.
Delta cache contains a stale footer and stale page entries for the file dbfs:/mnt/container/folder/parquet.file, these will be removed (4 stale page cache entries). Fetched file stats (modificationTime: 1616064053000, fromCachedFile: false) do not match file stats of cached footer and entries (modificationTime: 1616063556000, fromCachedFile: true).
at com.databricks.sql.io.parquet.CachingParquetFileReader.checkForStaleness(CachingParquetFileReader.java:700)
at com.databricks.sql.io.parquet.CachingParquetFileReader.close(CachingParquetFileReader.java:511)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.close(SpecificParquetRecordReaderBase.java:327)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.close(VectorizedParquetRecordReader.java:164)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.close(DatabricksVectorizedParquetRecordReader.java:484)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.close(RecordReaderIterator.scala:70)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:45)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
A Data Factory Copy Data activity runs before the notebook command executes, copying from the source (an MSSQL table) to the sink (a Parquet file) with snappy compression. No other activities or pipelines write to this file. However, multiple notebooks perform selects against this same parquet file.
From what I can tell from the error message, the Delta cache is older than the parquet file itself. Is there a way to turn off caching for this particular file (it is a very small dataset) or to invalidate the cache prior to the Copy Data activity? I am aware of the CLEAR CACHE command, but that clears the cache for all tables rather than for a specific temp view.
We have a similar process and we have been having the exact same problem.
If you need to invalidate the cache for a specific file/folder you can use something like the following Spark-SQL command:
REFRESH {file_path}
where file_path is either the DBFS path or your mount path.
Note that if you specify a folder instead of a file, all files within that folder will be refreshed recursively.
This also may very well not solve your problem. It seems to have helped us, but that is more of a gut feeling as we have not been actively looking at the frequency of these failures.
See the documentation for details.
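For example, issued from a notebook against the path in the question (a sketch; adjust the path to your own mount):

# Invalidate cached entries for just this file before re-reading it.
# The path is taken from the question; quoting follows Spark SQL's
# REFRESH resource_path syntax.
spark.sql('REFRESH "dbfs:/mnt/container/folder/parquet.file"')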
Our Specs:
Azure
Databricks Runtime 7.4
Driver: Standard_L8s_v2
Workers: 24 Standard_L8s_v2

Concurrent update to delta lake table via multiple jobs

I have a delta table, where multiple jobs via databricks can merge/upsert data into the delta table concurrently.
How can I prevent from getting ConcurrentAppendException?
I cannot use this solution, as the incoming changes can be part of any partition, so I cannot filter on any particular partition.
Is there a way to check whether the Delta table is being appended to/merged/updated/deleted, wait until that operation completes, and only then acquire the lock and start the merge for the second job?
Just FYI, these are two independent Azure Data Factory jobs trying to update one Delta table.
Cheers!
You should handle concurrent appends to Delta as you would for any other data store with Optimistic Offline Locking: add application-specific retry logic to your code for whenever that particular exception happens, along the lines of the sketch below.
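A minimal retry sketch; the exception class assumes the open-source delta-spark package's delta.exceptions module is importable (on Databricks you may instead need to match on the error message), and do_merge stands in for your own merge function:

import random
import time

from delta.exceptions import ConcurrentAppendException

def merge_with_retry(do_merge, max_attempts=5):
    # Re-run the merge when a concurrent commit wins the race,
    # backing off with jitter between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            do_merge()
            return
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt + random.random())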
Here's a good video on the inner workings of Delta.

Spark SQL table lock during ALTER TABLE tbl PARTITION SET LOCATION

We're using Spark SQL 2.2.0 with Hive Metastore (on HDInsight).
We have external tables built on partitioned parquet files stored in Azure Blob Storage. The data is delivered to the blob in parquet format; we have no influence on this.
We need to accept partition-wise data updates (a.k.a. restatements) with minimal impact on:
the downstream systems that run queries on the data (avoid breaking queries and long waiting, etc.)
the data update process (avoid long waiting and complex logic whenever possible)
We are considering doing something like this as a way to perform the updates (nothing very special):
ALTER TABLE tbl PARTITION(YEAR=2018, MONTH=1, DAY=30)
SET LOCATION 'wasb:///mylocation/table/20180130/v2'
What table (or partition) locking mechanisms/logic can we expect? I've googled and the answer is unclear to me.
Are there any parameters in Hive/Spark we can use to control it, other than turning all concurrency on/off with hive.support.concurrency?
Any other ways to solve this kind of problem? We've tested directly overwriting the parquet files in particular partition folders, but it seems more cumbersome, as it requires running recoverPartitions and recreating DataFrames over and over.
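For reference, the overwrite-in-place variant mentioned above looks roughly like this (a sketch; new_data, the partition folder path, and the table name tbl are illustrative):

# Overwrite the parquet files for one partition in place
# (new_data and partition_folder are illustrative).
new_data.write.mode("overwrite").parquet(partition_folder)

# The metastore then has to re-discover the partitions, and any
# DataFrames over the table must be recreated.
spark.catalog.recoverPartitions("tbl")
df = spark.table("tbl")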
