I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using Data Factory's copy activity, and in the initial run one file took 3.5 hours to load with the collection set to 10000 RU/s and Data Factory using default settings. I've now scaled the collection up to 50000 RU/s and set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if that improved the speed, and the same file now takes 2.5 hours to load. At that rate, loading thousands of files will still take far too long.
Is there a better way to do this?
You say you are inserting "millions" of JSON documents per 3 GB batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This implies about 300 bytes per JSON doc, which suggests quite a lot of fields per doc to index on each Cosmos DB insert.
If each insert costs 10 RUs, then at your budgeted 10,000 RU per second the doc insert rate would be 1,000 per second x 3,600 (seconds per hour) = 3.6 million doc inserts per hour.
So your observation of 3.5 hours to insert 3 GB of data, representing an assumed 10 million docs (about 2.8 hours at that rate), is broadly consistent with your purchased Cosmos DB throughput.
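The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 10 million docs per file and the 10 RUs per insert are the assumptions already made in this answer, not measured values, and the function name is made up for illustration.

```python
def hours_to_load(doc_count, ru_per_insert, provisioned_ru_per_sec):
    """Estimate load time in hours if provisioned RU/s is the only limit."""
    inserts_per_sec = provisioned_ru_per_sec / ru_per_insert
    return doc_count / inserts_per_sec / 3600

DOCS_PER_FILE = 10_000_000      # assumed, the question only implies "millions"
FILE_BYTES = 3_000_000_000      # approx. 3 GB per file, from the question
RU_PER_INSERT = 10              # assumed cost of one insert

print(f"bytes per doc: {FILE_BYTES / DOCS_PER_FILE:.0f}")
print(f"hours at 10,000 RU/s: {hours_to_load(DOCS_PER_FILE, RU_PER_INSERT, 10_000):.2f}")
print(f"hours at 50,000 RU/s: {hours_to_load(DOCS_PER_FILE, RU_PER_INSERT, 50_000):.2f}")
```

Under these assumptions the estimate at 10,000 RU/s is roughly 2.8 hours, close to the observed 3.5 hours. The same model gives well under an hour at 50,000 RU/s, so if the assumptions hold, the 2.5 hours seen in the second run would point to a limit other than provisioned throughput.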
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the Data Lake to Cosmos DB cloud sink performs poorly compared to other options. I guess the poor performance can be attributed to the default index-everything policy of Cosmos DB.
Does your application need everything indexed? Does the Cosmos DB cloud sink use less strict eventual consistency when performing bulk inserts?
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to Polybase Azure Data Warehouse is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger Cosmos DB throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that copying millions of JSON files will take time. If the data were organized as a few large, GB-sized batches you could get away with shorter batch transfers, but not with millions of separate files.
I don't know if you plan on transferring this type of file from Data Lake often, but a good strategy could be to write an application dedicated to doing that. Using the Microsoft.Azure.DocumentDB client library, you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host this app on a VM or App Service and never really have to think about it.
I have a data factory that runs every day. It selects around 80M records from an on-premises Oracle database and writes them to a Parquet file, which takes around 2 hours. I want to speed up this process, as well as the data flow that inserts and updates the data in the database.
parquet file setting
The next step reads the Parquet file and calls the data flow, which upserts the data into the database, but this is also taking too much time.
data flow Setting
Let me know which compute type to use for the data flow: Memory Optimized, Compute Optimized, or General Purpose.
After Round Robin Update
Sink Time
Can you open the monitoring detailed execution plan for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. You should see on the top of the view how much time was spent setting-up the compute environment, how much time was taken to read your source, and also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M files into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest time to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super-super-fast!! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink), can you try modifying the partition settings under the Optimize tab of your sink?
I faced a similar issue with the default partitioning setting, where the data load was taking 30+ minutes for just 1M records. After changing the partition strategy to round robin and setting the number of partitions to 5 (for my case), the load completes in less than a minute.
Try experimenting with both the source partition (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) and sink partition settings to come up with the optimum strategy. That should improve the data load time.
I have 5 questions about the Data Factory V2 copy data activity.
Question 1
Should I use a Parquet file or SQL Server with 500 DTUs? I want to transfer data quickly to a staging table or a staging Parquet file.
Question 2
For the copy data activity's Data Integration Units, should I use Auto or 32?
Question 3
What is the benefit of the degree of copy parallelism setting, and should I use Auto or 32? Again, I want to transfer everything as quickly as possible. I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As I mentioned, we have 50 million rows every day, so we want to process the data in Data Flow as quickly as possible, and cheaply if we can.
Question 5
Is a bulk insert better in the Data Factory and Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQLDW) instance, then ADFDF will use Polybase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure since it is just blob storage. ADFDF works just fine with Parquet, as either Source or Sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three. The bad news is I have no way to know what the core count should be; you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process is running many times a day, then you can include a "Time to live" value in the IR configuration. This will keep the cluster warm while it waits for another job, thus potentially reducing startup time (but incurring more run time cost).
I would like to find the best method of transferring 20 GB of SQL data from a SQL Server database installed on a customer's onsite server, Client, to our Azure SQL Server, Source, on an S4 with 200 DTUs of performance for $320 a month. When doing an initial setup, we set up an Azure Data Factory that copies over the 20 GB via multiple table copies, e.g., Client Table A's content to Source Table A, Client Table B's content to Source Table B, etc. Then we run many Extractor stored procedures that insert data into Stage tables by joining the Source tables together, e.g., Source A joined to Source B. After that come incremental copies, but the initial setup does take forever.
Currently the copying time on an S4 is around 12 hours, with the extracting time at 4 hours. Increasing the performance tier to an S9 with 1600 DTUs for $2400 a month will decrease the copying time to 6 hours and the extracting time to 2 hours, but that brings a higher cost with it.
I was wondering if there are other Azure methods. Would setting up an HDInsight cluster with Hadoop or Spark be more cost-efficient than scaling the Azure SQL DB up to an S9 or higher? An S9 at $2400 a month, over a 31-day month, is about $3.23 an hour. A Memory Optimized D14 v2 node in an Azure HDInsight cluster is $1.496 per hour, so it would be cheaper than an S9. However, how does it compare in terms of performance? Would the copying process be quicker, or would the extraction process be quicker?
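The hourly comparison can be checked with a short calculation. This is a rough sketch using only the prices quoted in this question. The node counts are hypothetical, since the question does not say how many D14 v2 nodes the cluster would have, and the helper names are made up for illustration.

```python
def hourly_rate(monthly_cost, days_in_month=31):
    """Convert a flat monthly price into an hourly rate."""
    return monthly_cost / (days_in_month * 24)

def cluster_hourly_rate(node_rate, node_count):
    """Hourly cost of a cluster billed per node (node_count is hypothetical)."""
    return node_rate * node_count

S9_MONTHLY = 2400.0        # from the question
D14_V2_PER_HOUR = 1.496    # from the question

s9_hourly = hourly_rate(S9_MONTHLY)
print(f"S9: ${s9_hourly:.2f} per hour")
for nodes in (1, 2, 3, 4):
    cost = cluster_hourly_rate(D14_V2_PER_HOUR, nodes)
    verdict = "cheaper" if cost < s9_hourly else "more expensive"
    print(f"{nodes} x D14 v2: ${cost:.2f} per hour ({verdict} than S9)")
```

With a 31-day month (744 hours) the S9 works out to about $3.23 per hour, so at the quoted node price the comparison only favors the cluster while it has one or two nodes.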
I am not used to Big Data methods yet. Thank you for all the help.
Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Copy Activity offers a highly optimized data loading experience that is easy to configure and set up.
You can see the performance reference table about Copy Activity:
The table shows the copy throughput number in MBps for the given source and sink pairs in a single copy activity run based on in-house testing.
If you want to transfer the data more quickly with Azure Data Factory Copy Activity, Azure provides three ways to achieve higher throughput:
Data integration units. A Data Integration Unit (DIU) (formerly known as Cloud Data Movement Unit or DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. You can achieve higher throughput by using more Data Integration Units (DIU). You are charged based on the total time of the copy operation. The total duration you are billed for data movement is the sum of duration across DIUs.
Parallel Copy. You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store.
Staged copy. When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store.
You can use these approaches to tune the performance of your Data Factory service with Copy Activity.
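As a rough illustration of where these three settings live, the sketch below builds a Copy Activity definition as a Python dict and prints it as JSON. The property names dataIntegrationUnits, parallelCopies, enableStaging, and stagingSettings are the Copy Activity properties; the helper function, the activity name, the linked service name, the staging path, and the values 32 and 8 are all made up for illustration, and the source and sink types are left as placeholders because they depend on your connectors.

```python
import json

def build_copy_activity(name, dius, parallel_copies,
                        staging_linked_service=None, staging_path=None):
    """Return a Copy Activity definition as a plain dict (hypothetical helper)."""
    type_properties = {
        # Source and sink types depend on the connectors in use; placeholders only.
        "source": {"type": "<your source type>"},
        "sink": {"type": "<your sink type>"},
        "dataIntegrationUnits": dius,        # more DIUs, more copy power
        "parallelCopies": parallel_copies,   # requested degree of parallelism
    }
    if staging_linked_service is not None:
        # Staged copy: route the data through an interim Blob storage location.
        type_properties["enableStaging"] = True
        type_properties["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": staging_linked_service,
                "type": "LinkedServiceReference",
            },
            "path": staging_path,
        }
    return {"name": name, "type": "Copy", "typeProperties": type_properties}

# All names and values below are illustrative, not recommendations.
activity = build_copy_activity(
    "CopyTableA", dius=32, parallel_copies=8,
    staging_linked_service="MyStagingBlob", staging_path="staging/tablea",
)
print(json.dumps(activity, indent=2))
```

Leaving out the two staging arguments produces a definition without staged copy, which makes it easy to compare runs with and without staging while tuning.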
For more details about Azure Data Factory Copy Activity performance, please see:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units
I've been looking into using the Azure Data Lake Analytics functionality to try and manipulate some Gzip'd XML data I have stored within Azure Blob Storage, but I'm running into an interesting issue. Essentially, when using U-SQL locally to process 500 of these XML files, the processing time is extremely quick, roughly 40 seconds using 1 AU locally (which appears to be the limit). However, when we run this same functionality from within Azure using 5 AUs, the processing takes 17+ minutes.
We are eventually wanting to scale this up to ~ 20,000 files and more but have reduced the set to try and measure the speed.
Each file contains a collection of 50 XML objects (with a varying amount of detail contained within child elements). The files are roughly 1 MB when Gzip'd and between 5 MB and 10 MB when not. 99% of the processing time is spent within the EXTRACT section of the U-SQL script.
Things tried:
Unzipped the files before processing. This took roughly the same time as the zipped version, certainly nowhere near the 40 seconds I was seeing locally.
Moved the data from Blob Storage to Azure Data Lake storage. This took exactly the same length of time.
Temporarily removed about half of the data from the files and re-ran. Surprisingly, this didn't take more than a minute off either.
Added more AUs to reduce the processing time. This worked extremely well but isn't a long-term solution due to the costs that would be incurred.
It seems to me as if there is a major bottleneck when getting the data from Azure Blob Storage/Azure Data Lake. Am I missing something obvious?
P.S. Let me know if you need any more information.
Thanks,
Nick.
See slide 31 of https://www.slideshare.net/MichaelRys/best-practices-and-performance-tuning-of-usql-in-azure-data-lake-sql-konferenz-2018. There is a preview option
SET @@FeaturePreviews = "InputFileGrouping:on";
which groups small files into limited vertices.
In one of my projects I receive customer order details in the middle of each month, as a file of about 14 billion lines. I need to upload them into my system (1 line per record) within 1 week so that users can then query them.
I decided to use table storage to store based on price and performance consideration. But I found the performance of table storage is "2000 entities per second per partition" and "20,000 entities per second per account". https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
This means if I was using 1 storage account I need about 1 month to upload them which is not acceptable.
Is there any way to speed this up so that the upload finishes within 1 week?
The simple answer to this is to use multiple storage accounts. If you partition the data and stripe it across multiple storage accounts you can get as much performance as you need from it. You just need another layer to aggregate the data afterwards.
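To get a feel for how many accounts the striping would need, here is a back-of-the-envelope sketch using the scalability targets quoted in the question. The utilization parameter is an assumption added for illustration (the fraction of the target rate you actually sustain), and the function names are made up. The quoted targets are theoretical peaks, so the results should be read as lower bounds.

```python
import math

SECONDS_PER_DAY = 86_400

def days_for_one_account(total_entities, entities_per_sec=20_000, utilization=1.0):
    """Days needed to upload everything through a single storage account."""
    return total_entities / (entities_per_sec * utilization) / SECONDS_PER_DAY

def accounts_needed(total_entities, deadline_days,
                    entities_per_sec_per_account=20_000, utilization=1.0):
    """Minimum number of storage accounts needed to meet the deadline."""
    required_rate = total_entities / (deadline_days * SECONDS_PER_DAY)
    per_account = entities_per_sec_per_account * utilization
    return math.ceil(required_rate / per_account)

TOTAL = 14_000_000_000   # lines in the monthly file, from the question

print(f"one account at peak rate: {days_for_one_account(TOTAL):.1f} days")
print(f"accounts for a 7-day deadline at peak rate: {accounts_needed(TOTAL, 7)}")
print(f"accounts at 25% of peak (assumed): {accounts_needed(TOTAL, 7, utilization=0.25)}")
```

The same required rate divided by the 2,000 entities per second per-partition target gives the minimum number of partitions the data must be spread over, which is why the partitioning scheme matters as much as the account count.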
You could potentially have a slower process that is creating one large master table in the background.
You may have found this already, but there is an excellent article about importing large datasets into Azure Tables