I was working with streaming on Delta tables with foreachBatch.
spark.readStream.format("delta") \
.option("readChangeFeed", "true").option("startingVersion","latest") \
.load("dbfs:/mnt/cg/file_to_up/table2") \
.writeStream.queryName("myquery") \
.foreachBatch(merge_deltas) \
.option("checkpointLocation", "dbfs:/mnt/cg/tmp/checkpoint") \
.trigger(processingTime='2 seconds') \
.start()
I would like to know if there is a way to do a merge between a Delta table (source) and a Synapse table (target). The section "Write to Azure Synapse Analytics using foreachBatch() in Python" on the page I found here is the closest thing to what I need, but I think the write modes it uses (append, overwrite) will not help me.
Also, I know that it could be simulated using inserts and deletes through the connector's preActions/postActions options, but if there is a way to do it with an actual merge that would be great.
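As a reference point, here is a minimal sketch of that insert/delete simulation inside foreachBatch, assuming the Azure Synapse connector from the linked page (com.databricks.spark.sqldw); the URL, tempDir, and table names below are placeholders, not values from the question:

def upsert_to_synapse(batch_df, batch_id):
    # Stage the micro-batch in Synapse, then emulate a merge through the
    # postActions option: delete matching rows, then insert the staged rows.
    post_sql = ("DELETE FROM dbo.target WHERE EXISTS "
                "(SELECT 1 FROM dbo.staging s WHERE s.id = dbo.target.id);"
                "INSERT INTO dbo.target SELECT * FROM dbo.staging")
    (batch_df.write
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
        .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "dbo.staging")
        .option("postActions", post_sql)
        .mode("overwrite")
        .save())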
You can use a combination of merge and foreachBatch (see foreachBatch for more information) to write complex upserts from a streaming query into a Delta table. For example:
• Write streaming aggregates in Update Mode: This is much more efficient than Complete Mode.
• Write a stream of database changes into a Delta table: The merge query for writing change data can be used in foreachBatch to continuously apply a stream of changes to a Delta table.
• Write a stream data into Delta table with deduplication: The insert-only merge query for deduplication can be used in foreachBatch to continuously write data (with duplicates) to a Delta table with automatic deduplication.
For more information, refer to the official documentation.
Note -
• Make sure that your merge statement inside foreachBatch is idempotent, as restarts of the streaming query can apply the operation on the same batch of data multiple times.
• When merge is used in foreachBatch, the input data rate of the streaming query (reported through StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at which data is generated at the source. This is because merge reads the input data multiple times, causing the input metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before the merge and uncache it after the merge.
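For the Delta-to-Delta case, a minimal sketch of the merge-inside-foreachBatch pattern described above, reusing the merge_deltas name from the question; the target path and join key are placeholders:

from delta.tables import DeltaTable

def merge_deltas(batch_df, batch_id):
    # Upsert each micro-batch into the target Delta table.
    target = DeltaTable.forPath(spark, "dbfs:/mnt/cg/target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())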
I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks Delta table.
In the beginning we were loading data into the delta table by using the merge function as given below.
This incoming dataframe inc_df had data for all partitions.
merge into main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
We were executing the above query on table level.
I have given a very basic diagram of the process in the image below.
But my delta table is partitioned on continent and year.
For example, this is how my partitioned delta table looks.
So I tried implementing the merge at partition level and running the merge activity on multiple partitions in parallel,
i.e. I created separate pipelines with partition-level filters in the queries. The image can be seen below.
merge into main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent in ('AFRICA') AND main_dt.year in ('202301') AND
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
But I am seeing an error with concurrency.
- com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [continent=AFRICA, year=2021] by a concurrent update. Please try the operation again.
I understand that the error is telling me that it cannot update files concurrently.
But I have a huge volume of data in production, and I don't want to perform the merge at table level, where there are almost 1 billion records, without proper filters.
Trial2:
As an alternate approach,
I saved my incremental dataframe in an S3 bucket (like a staging dir) and ended my streaming pipeline there.
Then I have a separate PySpark job that reads data from that S3 staging dir and performs the merge into my main delta table, once again at partition level (I have specified partitions as filters in those jobs).
But I am facing the same exception/error there as well.
Could anyone let me know how I can design and optimise my streaming pipeline to merge data into the delta table at partition level, with multiple jobs running in parallel (each job working on an individual partition)?
Trial3:
I also made another attempt with a different approach, as mentioned in the link and the ConcurrentAppendException section from that page.
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='Africa'") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
and
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='ASIA'") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
I ran the above merge operations in two separate pipelines.
But I am still facing the same issue.
In your trial 3, you need to change the merge condition.
Instead of
condition="main_dt.continent=df.continent AND [...]"
it should be
condition="main_dt.continent='Africa' AND [...]"
You should also delete the continent='Africa' from the end of the condition.
Here is the documentation for reference.
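Putting that together, a hedged sketch of the corrected Africa pipeline (column names come from the question; whether the literal partition predicate alone is enough to avoid the conflict also depends on how the incremental data is filtered):

from delta.tables import DeltaTable

base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt")
    .merge(
        source=final_incremental_df.alias("df"),
        # literal partition predicate instead of the equi-join on continent
        condition="main_dt.continent='Africa' AND main_dt.str_id=df.str_id AND "
                  "main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND "
                  "main_dt.rule_read_start=df.rule_read_start AND main_dt.company=df.company")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())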
The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for example partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet) in the source table at version 2217. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.
Basically I'm not able to read the agg1 delta table via any kind of streaming. If I switch the last stream's sink from delta to memory I get the same error message. With the first stream I don't have any problems.
Notes:
Between aggregations I'm changing granularity: agg1 delta table (date truncated to minutes), agg2 delta table (date truncated to days).
If I turn off all the other streams, the last one still doesn't work.
The agg2 delta table is a fresh new table with no data.
How the streaming works on the source table:
It reads the files that belong to our source table. It is not able to handle changes in these files (updates, deletes); if anything like that happens you will get the error above. In other words, update and delete operations rewrite the underlying files. The only exception is inserts: new data arrives in new files, unless configured differently.
To fix that you need to set the option ignoreChanges to true.
With this option you will get all the records from a modified file, so you will receive the same records as before again, plus the modified one.
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If we receive the same (unmodified) record again, we will count it again and increase the aggregation for its grouping key.
Solution: we can't read the agg table to build further aggregations; we need to read the raw table.
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
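For reference, a minimal sketch of where the option goes; note that ignoreChanges only suppresses the error and re-emits whole rewritten files, it does not deduplicate, which is why the answer reads from the raw table (the path below is a placeholder):

raw_stream = (spark.readStream
    .format("delta")
    .option("ignoreChanges", "true")      # tolerate rewritten files in the source
    .load("dbfs:/mnt/raw_table"))         # aggregate from the raw table, not agg1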
Note: I'm working on Databricks Runtime 10.4, so I'm using new shuffle merge by default.
Hello spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a delta table. The job runs once per night. We extract its data, save it in our own location as a delta table again, and then we write to an Azure SQL database table. Everything is partitioned on date like this: Date=2021-01-01, etc.
Things are about to change now, since our source delta table is about to be refreshed every 2-3 minutes, and the requirement is to change the ETL from nightly batch to streaming mode while still using the same source and target tables as well as the same SQL database table.
Right now this streaming challenge is imposing several questions:
Our delta source table is quite huge (30B+ rows). So far, in our nightly batch, we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. However, once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again?
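For that first point, the Delta streaming source supports startingVersion/startingTimestamp options; a minimal sketch, with a placeholder path and timestamp:

stream_df = (spark.readStream
    .format("delta")
    .option("startingTimestamp", "2023-01-01 00:00:00")  # don't replay older history
    .load("dbfs:/mnt/source/huge_delta_table"))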
How do you write a stream to a target delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do it in a micro-batch manner, but how can each new micro-batch check for existing/non-existing keys in our target delta table in a time-efficient manner (30B+ rows in the current target table)?
Not sure about this one, but is it possible for a streaming Spark job to write directly to a SQL database (Azure), and would that become a bottleneck and therefore not be advisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue with the current data size. I appreciate every opinion here!
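On the last question, one common pattern is to push each micro-batch to the database from foreachBatch; a minimal sketch, assuming placeholder JDBC connection details and that stream_df is the streaming DataFrame:

def write_to_sql(batch_df, batch_id):
    # Each micro-batch becomes an ordinary batch JDBC write.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
        .option("dbtable", "dbo.target_table")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())

stream_df.writeStream.foreachBatch(write_to_sql).option("checkpointLocation", "dbfs:/mnt/checkpoints/sql_sink").start()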
I am trying to do Structured Streaming (Spark 2.4.0) on Kafka source data where I am reading latest data and performing aggregations on a 10 minute window. I am using "update" mode while writing the data.
For example, the data schema is as below:
tx_id, cust_id, product, timestamp
My aim is to find customers who have bought more than 3 products in last 10 minutes. Let's say prod is the dataframe which is read from kafka, then windowed_df is defined as:
from pyspark.sql.functions import window, col

windowed_df_1 = prod.groupBy(window("timestamp", "10 minutes"), "cust_id").count()
windowed_df = windowed_df_1.filter(col("count") >= 3)
Then I am joining this with a master dataframe from hive table "customer_master" to get cust_name:
final_df = windowed_df.join(customer_master, "cust_id")
And finally, write this dataframe to Kafka sink (or console for simplicity)
query = final_df.writeStream.outputMode("update").format("console").option("truncate",False).trigger(processingTime='2 minutes').start()
query.awaitTermination()
Now, when this code runs every 2 minutes, in the subsequent runs, I want to discard all those customers who were already part of my output. I don't want them in my output even if they buy any product again.
Can I write the stream output temporarily somewhere (maybe a Hive table) and do an "anti-join" against it on each execution?
This way I can also have a history maintained in a Hive table.
I also read somewhere that we can write the output to a memory sink and then use df.write to save it in HDFS/Hive. But what if we terminate the job and re-run? The in-memory table will be lost in that case, I suppose.
Please help as I am new to Structured Streaming.
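A minimal sketch of that anti-join idea inside foreachBatch, assuming a pre-created history table and placeholder names:

def dedupe_and_write(df, epoch_id):
    # Keep only customers never emitted before, record them, then output them.
    history = spark.table("hist_db.emitted_customers")          # assumed to exist already
    new_customers = df.join(history, "cust_id", "left_anti")
    new_customers.select("cust_id").write.mode("append").saveAsTable("hist_db.emitted_customers")
    new_customers.show(truncate=False)                          # or write to the Kafka sink here

final_df.writeStream.outputMode("update").foreachBatch(dedupe_and_write).start()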
Update:
I also tried the code below to write the output to a Hive table as well as to the console (or a Kafka sink):
def write_to_hive(df, epoch_id):
    df.persist()
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    df.unpersist()
final_df.writeStream.outputMode("update").format("console").option("truncate", False).start()
final_df.writeStream.outputMode("update").foreachBatch(write_to_hive).start()
But this only performs the 1st action, i.e. write to Console.
If I put the "foreachBatch" write first, it saves to the Hive table but does not print to the console.
I want to write to 2 different sinks. Please help.
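One way to feed both sinks from a single query is to do both writes inside one foreachBatch; a minimal sketch reusing the names from the question above:

def write_to_both(df, epoch_id):
    df.persist()                                                  # reuse the micro-batch for both writes
    df.write.format("hive").mode("append").saveAsTable("hive_tab_name")
    df.show(truncate=False)                                       # console output; swap in a Kafka write if needed
    df.unpersist()

final_df.writeStream.outputMode("update").foreachBatch(write_to_both).start()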
I have a Spark streaming job with a batch interval of 2 mins (configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and if I increase the batch duration to maybe 10 mins or so, I might still end up getting only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post-processing step that merges all these small files and creates one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS
I have exactly the same situation as you. I solved it as follows.
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1 - Partition the table with a good partition key; in my case I found that I can partition using a combination of keys to get around 100 MB per partition.
2 - Save using Spark Core rather than Spark SQL:
a - Load the whole partition into memory (as a dataset: dataset2) when you want to save.
b - Then apply the dataset union function: dataset3 = dataset1.union(dataset2).
c - Make sure that the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1).
d - Save the resulting dataset in "overwrite" mode to replace the existing files.
If you need more details about any step please reach out.
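A minimal PySpark sketch of steps a-d above, assuming a date-partitioned Parquet table and that dataset1 holds the newly arrived rows for that partition (paths are placeholders):

part_path = "/warehouse/target_table/date=2021-01-01"

existing = spark.read.parquet(part_path)           # a - load the whole existing partition
combined = dataset1.union(existing)                # b - union new and existing data
compacted = combined.repartition(1)                # c - end up with a single file for the partition

# d - write in overwrite mode; writing to a staging path first avoids
#     overwriting a path that is still being read from, then swap the directories
compacted.write.mode("overwrite").parquet(part_path + "_staged")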