Databricks Delta concurrency issue when using merge update on different partitions in parallel - apache-spark

I am getting the error below when running a merge update on different partitions of the same table in parallel. The data sets are independent, and I have made sure there is no data overlap between the partitions.
Error: com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [source_system=ABC] by a concurrent update. Please try the operation again.
Table Properties: USING DELTA LOCATION '${s3_location}' PARTITIONED BY (source_system) TBLPROPERTIES ( 'delta.autoOptimize.optimizeWrite'='true', 'delta.autoOptimize.autoCompact'='true' );
Any idea how I could fix this? Thanks.

Related

How to merge data in parallel into partitions of a Databricks Delta table using PySpark/Spark streaming?

I have a PySpark streaming pipeline that reads data from a Kafka topic; the data goes through various transformations and is finally merged into a Databricks Delta table.
In the beginning we loaded data into the Delta table using the merge statement given below.
The incoming dataframe inc_df had data for all partitions.
merge into main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
We were executing the above query at the table level.
I have given a very basic diagram of the process in the image below.
But my Delta table is partitioned on continent and year.
For example, this is what my partitioned Delta table looks like.
So I tried implementing the merge at the partition level and running the merge on multiple partitions in parallel,
i.e. I created separate pipelines with partition-level filters in the queries. Image can be seen below.
merge into main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent in ('AFRICA') AND main_dt.year in ('202301') AND
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
But I am seeing a concurrency error.
- com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [continent=AFRICA, year=2021] by a concurrent update. Please try the operation again.
I understand that the error is telling me that it cannot update files concurrently.
But I have a huge volume of data in production, and I don't want to perform the merge at the table level, where there are almost 1 billion records, without proper filters.
Trial 2:
As an alternative approach, I saved my incremental dataframe in an S3 bucket (as a staging directory) and ended my streaming pipeline there.
Then I have a separate PySpark job that reads data from that S3 staging directory and merges it into my main Delta table, once again at the partition level (I have specified the partitions in those jobs as filters).
But I am facing the same exception there as well.
Could anyone let me know how I can design and optimise my streaming pipeline to merge data into the Delta table at the partition level with multiple jobs running in parallel (each job on an individual partition)?
Trial 3:
I also tried a different approach, as described in the ConcurrentAppendException section of the linked page.
from delta.tables import DeltaTable

base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='Africa'")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
and
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='ASIA'")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
I ran the above merge operations in two separate pipelines.
But I am still facing the same issue.
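In your trial 3, you need to change the merge condition so that it names the target partition explicitly. With only main_dt.continent=df.continent, Delta may not be able to tell from the condition which partition the merge touches, so concurrent merges can still conflict.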
Instead of
condition="main_dt.continent=df.continent AND [...]"
it should be
condition="main_dt.continent='Africa' AND [...]"
You should also remove the trailing continent='Africa' from the end of the condition.
Here is the documentation for reference.
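As a minimal sketch of the corrected condition, the helper below builds a merge condition string that pins the target to literal partition values and then appends the key equalities. The function name build_partition_merge_condition and its parameters are hypothetical (they are not part of the Delta Lake API); the column names are taken from the question.

```python
def build_partition_merge_condition(partition_values, join_keys,
                                    target="main_dt", source="df"):
    """Return a MERGE condition with explicit partition literals.

    Hypothetical helper: it only builds a string and does not call Spark.
    """
    def quote(value):
        # Escape single quotes so the literal stays valid SQL.
        return "'" + str(value).replace("'", "''") + "'"

    # Literal predicates on the target's partition columns come first.
    partition_preds = [f"{target}.{col} = {quote(val)}"
                       for col, val in partition_values.items()]
    # Then the usual key equalities between target and source.
    key_preds = [f"{target}.{col} = {source}.{col}" for col in join_keys]
    return " AND ".join(partition_preds + key_preds)


condition = build_partition_merge_condition(
    {"continent": "Africa"},
    ["str_id", "rule_date", "rule_id", "rule_read_start", "company"],
)
print(condition)
```

The resulting string can be passed as the condition argument of the merge call from trial 3. Each parallel job would build its own condition with its own partition value, and it probably makes sense for each job's source dataframe to contain only rows for that partition.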

spark streaming and delta tables: java.lang.UnsupportedOperationException: Detected a data update

The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for
example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to
'true'. If you would like the data update to be reflected, please
restart this query with a fresh checkpoint directory.
Basically I'm not able to read the agg1 delta table via any kind of streaming. If I switch the sink of the last stream from delta to memory, I get the same error message. With the first stream I don't have any problems.
Notes:
Between aggregations I'm changing granularity: agg1 delta table (date truncated to minutes), agg2 delta table (date truncated to days).
If I turn off all other streams, the last one still doesn't work.
The agg2 delta table is a fresh table with no data.
How streaming works on the source table:
It reads the files that belong to the source table. It cannot handle changes to these files (updates, deletes); if anything like that happens, you get the error above. In other words, DML operations such as updates and deletes rewrite the underlying files. The only exception is inserts: new data arrives in new files unless configured otherwise.
To work around that, you would need to set the option ignoreChanges to true.
With this option you get all the records from the rewritten file again, that is, the same records as before plus the modified one.
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If we receive the same (unmodified) record again, we treat it as an update and increase the aggregation for its grouping key again.
Solution: we can't read the agg table to compute further aggregations. We need to read the raw table.
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so I'm using new shuffle merge by default.
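To illustrate why re-emitted records inflate a downstream aggregation, here is a toy model in plain Python. It does not use Spark or Delta at all: the two lists stand in for the contents of one data file before and after an update, and apply_batch is a hypothetical stand-in for a stateful streaming sum. The dates and values are made up for illustration.

```python
from collections import defaultdict


def apply_batch(running_totals, rows):
    """Add every (key, value) row to the running totals, like a stateful sum."""
    for key, value in rows:
        running_totals[key] += value
    return running_totals


# Contents of one data file when the stream first reads it.
file_v1 = [("2023-01-01", 5), ("2023-01-02", 3)]
# An update changes one row; the whole file is rewritten and, with
# ignoreChanges, every row in it is emitted again (changed or not).
file_v2 = [("2023-01-01", 5), ("2023-01-02", 4)]

running = defaultdict(int)
apply_batch(running, file_v1)
apply_batch(running, file_v2)
result = dict(running)
print(result)  # {'2023-01-01': 10, '2023-01-02': 7}
```

The correct totals after the update would be 5 and 4, but the unchanged row is counted twice and the changed row is counted once with its old value and once with its new one. This is the reason for aggregating from the raw (append-only) table instead of chaining aggregations.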

Alter table to add partition taking long time on Hive external table

I'm trying to execute a Spark job on an EMR cluster with 6 nodes (8 cores and 56 GB memory on each node). The Spark job does an incremental load on partitions of a Hive table and at the end runs a refresh table to update the metadata.
The refresh command takes as long as 3-6 hours to complete, which is too long.
Nature of data in Hive:
27 GB of data located on S3.
Stored in Parquet.
Partitioned on 2 columns (ex: s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).
Note: it's a date-wise partition and cannot be changed.
Spark config used:
Num-executors= 15
Executor-memory =16Gb
Executor-cores = 2
Driver-memory= 49Gb
Spark-shuffle-partitions=48
Hive.exec.dynamic.partition.mode=nonstrict
Spark.sql.sources.partitionOverwriteMode=dynamic.
Things tried:
Tuning the Spark cores/memory/executors, but no luck.
Refresh table command.
Alter table add partition command.
Hive CLI taking 3-4 hours to complete MSCK repair table tablename.
None of the above reduced the time needed to refresh the partitions on Hive.
Some assumptions:
Am I missing any tuning parameter, given that the data is stored in Amazon S3?
The number of partitions on the table is currently close to 10k; is this an issue?
Any help will be much appreciated.
If possible, partition on a single column. Multi-level (multi-column) partitioning hurts badly.
Use R-type instances. They provide more memory than M-type instances at the same price.
Use coalesce to merge the files in the source if there are many small files.
Check the number of mapper tasks. The more tasks there are, the lower the performance.
Use EMRFS rather than S3 to keep the metadata info.
Use the configuration below:
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "true"
  }
}
Follow some of the instructions from the link below.

HIVE partition & bucketing support in Spark not working as expected

When working with partitions in S3, Spark lists all the partitions one by one, which takes time. Instead, it should look up the partition in the metastore table and go to that partition directly.
I tried with an example of 125 partitions. When I compute the exact S3 location by appending the partition column value and access it directly, it executes within 5 sec. But if I let Spark figure out the partition, it lists all the partitions, which by itself takes more than 30 sec.
How can I let Spark figure out the partition from the metastore using predicate push-down?
You need to set up an external Hive metastore (it can be MySQL or Postgres), so that the definitions of tables/partitions are persisted there and survive across different Spark context lifespans.
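The direct-path access described in the question (computing the partition location by appending the partition column values) can be sketched as follows. The helper name partition_path, the bucket name and the column names are hypothetical placeholders, not taken from the asker's setup.

```python
def partition_path(table_root, partition_values):
    """Build a Hive-style partition directory path: root/col1=v1/col2=v2."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_root.rstrip("/")] + parts)


# Hypothetical table root and partition values, for illustration only.
path = partition_path(
    "s3a://bucket-name/table",
    {"partCol1": "1", "partCol2": "2020-10-12"},
)
print(path)  # s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12
```

Such a path could then be handed to spark.read.parquet(path). Note that when a partition directory is read directly, the partition columns are not part of the resulting dataframe unless the basePath option points at the table root.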

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset, that is partitioned by two columns - plant_name and tag_id. The second partition - tag_id has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast response as this resolves to a single partition. In Hive and Presto this takes seconds, however in Spark it runs for hours.
The actual data is held in a S3 bucket, and when I submit the sql query, Spark goes off and first gets all the partitions from the Hive metastore (200000 of them), and then calls refresh() to force a full status list of all these files in the S3 object store (actually calling listLeafFilesInParallel).
It is these two operations that are so expensive. Are there any settings that can get Spark to prune the partitions earlier, either during the call to the metadata store or immediately afterwards?
Yes, Spark supports partition pruning.
Spark lists the partition directories (sequentially or in parallel via listLeafFilesInParallel) to build a cache of all partitions the first time around. Queries in the same application that scan data take advantage of this cache, so the slowness you see could be due to building this cache. Subsequent queries that scan data use the cache to prune partitions.
These are the logs that show partitions being listed to populate the cache.
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver
These are the logs showing that pruning is happening.
App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
Refer to convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.
Just a thought:
Spark API documentation for HadoopFsRelation says,
( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html )
"...when reading from Hive style partitioned tables stored in file
systems, it's able to discover partitioning information from the paths
of input directories, and perform partition pruning before start
reading the data..."
So I guess "listLeafFilesInParallel" may not be the problem.
A similar issue is already in spark jira: https://issues.apache.org/jira/browse/SPARK-10673
Since setting "spark.sql.hive.verifyPartitionPath" to false has no effect on performance, I suspect that the issue might be caused by unregistered partitions. Please list the partitions of the table and verify that all of them are registered. Otherwise, recover your partitions as shown in this link:
Hive doesn't read partitioned parquet files generated by Spark
Update:
I assume an appropriate Parquet block size and page size were set when writing the data.
Create a fresh Hive table with the partitions mentioned and Parquet as the file format, and load it from a non-partitioned table using the dynamic partition approach.
( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions )
Run a plain Hive query and then compare by running a Spark program.
Disclaimer: I am not a Spark/Parquet expert. The problem sounded interesting, hence the response.
A similar question popped up here recently:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reads-all-leaf-directories-on-a-partitioned-Hive-table-td35997.html#a36007
This question is old but I thought I'd post the solution here as well.
spark.sql.hive.convertMetastoreParquet=false
will use the Hive Parquet SerDe instead of Spark's built-in Parquet support. Hive's Parquet SerDe will not do a listLeafFiles on all partitions, but reads only and directly from the selected partitions. On tables with many partitions and files, this is much faster (and cheaper, too). Feel free to try it out! :)
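For a PySpark session, applying this setting might look like the sketch below. It assumes an existing SparkSession with Hive support, passed in as spark; the function name use_hive_parquet_serde is hypothetical.

```python
def use_hive_parquet_serde(spark):
    """Turn off Spark's conversion of metastore Parquet tables for this session.

    `spark` is assumed to be an existing SparkSession with Hive support.
    Returns the value now stored in the session config.
    """
    key = "spark.sql.hive.convertMetastoreParquet"
    spark.conf.set(key, "false")
    return spark.conf.get(key)
```

The same key can also be supplied at submit time, for example through --conf on spark-submit, so that it applies before any table is read.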
