Will a Databricks Delta Live Tables pipeline generate costs even if it finds no data to load? And in that case, would a solution be to set the job to disabled if you know new data will not be arriving in the source for a while?
A Delta Live Tables job will incur costs only while it's running. Usually jobs are set to trigger on a regular schedule, so even if there is no new data, the cluster will still be created and the DLT pipeline will be executed; if no data is found, it will (most probably, though it depends on the job) simply finish because there is nothing to process. In that case the costs will be relatively small, unless you set the pipeline up in continuous mode.
If you know that no data will be arriving for a while, you can pause the job that triggers the DLT pipeline.
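If that job is a scheduled Databricks job, one way to pause it programmatically is through the Jobs API 2.1; a rough sketch, where the workspace URL, token, job id, and cron expression are placeholders for your own values:

```python
# Sketch: pause the schedule of the job that triggers the DLT pipeline via the Jobs API 2.1.
# Host, token, job_id and the cron expression below are placeholders, not real values.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
JOB_ID = 1234

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 2 * * ?",  # keep the job's existing cron
                "timezone_id": "UTC",
                "pause_status": "PAUSED",  # switch back to "UNPAUSED" when data resumes
            }
        },
    },
)
resp.raise_for_status()
```

The same pause toggle is also available in the job's schedule settings in the UI, so the API call is only needed if you want to automate it.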
I'm writing data to Kusto using azure-kusto-spark. The write is very small, and the spark-kusto connector uses batch streaming.
But I see this write has high latency, taking around 8 minutes.
From the logs I can see that the high latency is in the staging ingestion part: a temporary staging table is created and data is ingested into it via multiple jobs. Only when the staging ingestion is completed is the data finally merged and written to the Kusto table.
Can anyone give pointers as to why such a small write has so much latency?
It might be because of your Kusto table's ingestion batching policy, as described here: https://github.com/Azure/azure-kusto-spark/blob/master/docs/KustoSink.md#performance-considerations
You might want to modify the default ingestion batching policy and set the batching time to less than the default value, which is 5 minutes. See this doc for reference: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy#defaults-and-limits
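For illustration, a hedged sketch of lowering that batching time with the azure-kusto-data Python client; the cluster URL, database, table name, and the concrete limits are assumptions you would adapt to your own latency/throughput trade-off:

```python
# Sketch: shorten the ingestion batching window on the target table.
# Cluster URL, database and table names are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<your-cluster>.kusto.windows.net"
)
client = KustoClient(kcsb)

command = """
.alter table MyTable policy ingestionbatching
'{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'
"""
client.execute_mgmt("MyDatabase", command)  # management (control) command, not a query
```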
I have a Delta table into which multiple jobs, via Databricks, can merge/upsert data concurrently.
How can I prevent getting a ConcurrentAppendException?
I cannot use this solution, as the incoming changes can be part of any partition, so I cannot filter on a partition.
Is there a way to check whether the Delta table is being appended to/merged/updated/deleted, wait until that is completed, and only then acquire a lock and start the merge for the second job?
Just FYI, these are two independent Azure Data Factory jobs trying to update one Delta table.
Cheers!
You should handle concurrent appends to Delta as you would with any other data store that uses optimistic offline locking: by adding application-specific retry logic to your code whenever that particular exception happens.
Here's a good video on the inner workings of Delta.
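For illustration, a minimal retry sketch in PySpark; the table path, join key, and retry/backoff values are assumptions, not something prescribed by Delta itself:

```python
# Sketch: optimistic-locking style retry around a Delta MERGE.
# Path, join key and retry/backoff numbers below are illustrative only.
import time
from delta.tables import DeltaTable

def merge_with_retry(spark, updates_df, target_path, max_attempts=5):
    target = DeltaTable.forPath(spark, target_path)
    for attempt in range(1, max_attempts + 1):
        try:
            (target.alias("t")
                   .merge(updates_df.alias("s"), "t.id = s.id")  # assumed join key
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
            return
        except Exception as e:
            # The exception class name differs between Databricks and OSS Delta,
            # so match on the name in the message rather than the class itself.
            if "ConcurrentAppendException" in str(e) and attempt < max_attempts:
                time.sleep(2 ** attempt)  # simple exponential backoff before retrying
                continue
            raise
```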
I have a process which, in short, runs 100+ copies of the same Databricks notebook in parallel on a pretty powerful cluster. At the end of its run, each notebook writes roughly 100 rows of data to the same Delta Lake table stored in an Azure Gen1 Data Lake. I am seeing extremely long insert times into Delta, which I can only assume means Delta is taking some sort of lock on the table while an insert occurs and then freeing it up once a single notebook finishes. However, https://docs.databricks.com/delta/concurrency-control.html implies that there are no insert conflicts and that multiple writers across multiple clusters can insert data simultaneously.
This insertion of 100 rows per notebook across the 100+ notebooks takes over 3 hours. The current code that is causing the bottleneck is:
df.write.format("delta").mode("append").save("<path_>")
Currently there are no partitions on this table, which could be a possible fix, but before going down that route: is there something I am missing in terms of how you get conflict-free inserts in parallel?
You have to choose between two isolation levels for your table, and the weaker one (WriteSerializable) is the default, so there is no getting away from isolation levels.
https://docs.databricks.com/delta/optimizations/isolation-level.html
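In case it helps, the isolation level from that doc can be changed per table through a table property; a minimal sketch (the table path is a placeholder):

```python
# Sketch: set the table's isolation level explicitly (path is a placeholder).
# WriteSerializable is the default; Serializable is the stricter option.
spark.sql("""
    ALTER TABLE delta.`/mnt/datalake/my_table`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'WriteSerializable')
""")
```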
Delta Lake uses OCC (optimistic concurrency control), which means that the data you want to write to your table is validated against all of the data that the other 99 processes want to write. This means that roughly 100*100 = 10,000 validations are being made.
https://en.wikipedia.org/wiki/Optimistic_concurrency_control
Please also bear in mind that your data processing will only finish when the last of the 100 notebooks finishes. Maybe one or more of the 100 notebooks takes 3 hours to finish, and the insert is not to blame?
If long-running notebooks are not the cause, I would suggest you store the result data from each notebook in some sort of intermediate structure (e.g. one file per notebook, giving 100 files) and then batch insert the data from that structure (e.g. the files) into the destination table, as sketched below.
The data processing will be parallel; the insert will not be.
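A hedged sketch of that "stage first, append once" idea; the staging path and the notebook_id variable are hypothetical:

```python
# Inside each of the 100+ notebooks: write plain Parquet to a per-notebook folder,
# so there is no contention on the Delta transaction log.
(result_df
    .write.mode("overwrite")
    .parquet(f"/mnt/staging/run_x/{notebook_id}"))  # hypothetical staging path

# In a single follow-up notebook, after all the others have finished:
staged = spark.read.parquet("/mnt/staging/run_x/*")
(staged
    .write.format("delta")
    .mode("append")
    .save("<path_>"))  # one append commit to the destination Delta table
```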
So I have one Data Factory pipeline which runs every day. It selects around 80M records from an on-premises Oracle database and moves them to a Parquet file, which is taking around 2 hours. I want to speed up this process, and also the data flow that inserts and updates data in the database.
[screenshot: Parquet file settings]
The next step, from the Parquet file, calls the data flow which upserts the data into the database, but this is also taking too much time.
[screenshot: data flow settings]
Let me know which compute type to use for the data flow:
Memory Optimized
Compute Optimized
General Purpose
Update after switching to round robin partitioning:
[screenshot: sink time]
Can you open the detailed execution plan in the monitoring view for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. At the top of the view you should see how much time was spent setting up the compute environment and how much time was taken to read your source; also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M rows into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super fast! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink), can you try modifying the partition settings under the Optimize tab of your sink?
I faced a similar issue with the default partitioning setting, where the data load was taking 30+ minutes for just 1M records. After changing the partition strategy to round robin and setting the number of partitions to 5 (in my case), the load happens in less than a minute.
Try experimenting with both the source partition settings (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) and the sink partition settings to come up with the optimal strategy. That should improve the data load time.
I have 5 questions about the Data Factory V2 copy-data activity.
Question 1
Should I use a Parquet file or a SQL Server database with 500 DTU? I want to transfer the data quickly to a staging table or a staging Parquet file.
Question 2
For the copy data activity, should I use Auto or 32 Data Integration Units?
Question 3
What is the benefit of using degree of copy parallelism? Should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As I mentioned, we have 50 million rows every day, so we want to process the data in Data Flow as quickly as possible, and reasonably cheaply if we can.
Question 5
Is a bulk insert the better option in Data Factory and the Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQL DW) instance, then ADFDF will use PolyBase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either source or sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three; the bad news is that I have no way to know what the core count should be, so you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process runs many times a day, then you can include a "Time to live" value in the IR configuration. This will keep the cluster warm while it waits for another job, potentially reducing startup time (but incurring more run-time cost).