Been looking into using the Azure Data Lake Analytics functionality to try and manipulate some Gzip’d xml data I have stored within Azures Blob Storage but I’m running into an interesting issue. Essentially when using U-SQL locally to process 500 of these xml files the processing time is extremely quick , roughly 40 seconds using 1 AU locally (which appears to be the limit). However when we run this same functionality from within Azure using 5 AU’s the processing takes 17+ minutes.
We are eventually wanting to scale this up to ~ 20,000 files and more but have reduced the set to try and measure the speed.
Each file containing a collection of 50 xml objects (with varying amount of detail contained within child elements), the files are roughly 1 MB when Gzip’d and between 5MB and 10MB when not. 99% of the time processing time is spent within the EXTRACT section of the u-sql script.
Things tried,
Unzipped the files before processing, this took roughly the same time as the zipped version, certainly nowhere near the 40 seconds I was seeing locally.
Moved the data from Blob storage to Azure Data Lake storage, took exactly the same length of time.
Temporarily Removed about half of the data from the files and re-ran, surprisingly this didn’t take more than a minute off either.
Added more AU’s to increase the processing time, this worked extremely well but isn’t a long term solution due to the costs that would be incurred.
It seems to me as if there is a major bottleneck when getting the data from Azure Blob Storage/Azure Data Lake. Am I missing something obvious.
P.S. Let me know if you need any more information.
Thanks,
Nick.
See slide 31 of https://www.slideshare.net/MichaelRys/best-practices-and-performance-tuning-of-usql-in-azure-data-lake-sql-konferenz-2018. There is a preview option
SET ##FeaturePreviews="InputFileGrouping:on";
which groups small files into limited vertices.
Related
I'm after a bit of understanding, I'm not stuck on anything but I'm trying to understand something better.
When loading a data warehouse why is it always suggested that we load data into blob storage or a data lake first? I understand that it's very quick to pull data from there, however in my experience there are a couple of pitfalls. The first is that there is a file size limit and if you load too much data into 1 file as I've seen happen it causes the load to error at which point we have to switch the load to incremental. This brings me to my second issue, I always thought the point of loading into blob storage was to chuck all the data in there so you can access it in the future without stressing the front end systems, if I can't do that because of file limits then what's the point of even using blob storage, we might as well load data straight into staging tables. It just seems like an unnecessary step to me when I've ran data warehouses in the past without this part involved and to me they have worked better.
Anyway my understanding of this part is not as good as I'd like it to be, and I've tried finding articles that answer these specific questions but none have really explained the concept to me correctly. Any help or links to good articles I could read would be much appreciated.
One reason for placing the data in blob or data lake is so that multiple parallel readers can be used on the data at the same time. The goal of this is to read the data in a reasonable time. Not all data sources support such type of read operations. Given the size of your file, a single reader would take a long long time.
One such example could be SFTP. Not all SFTP servers support offset reads. Some may have further restrictions on concurrent connections. Moving the data first to Azure services provides a known set of capabilities / limitation.
In your case, I think what you need, is to partition the file, like what HDFS might do. If I knew what data source you are using, I could have a further suggestion.
Before I had time to get an ingestion strategy & process setup, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000 and a few others with 10,000 - 90,000.
The production collection process now writes these blobs to different containers in the YYYY-MM-DD/HH format, but that's only great going forward. This archived data I have is critical to get into my system and I'd like to just modify the inputs a bit for the existing production ASA job so I can leverage the same logic in the query, functions and other dependencies.
I know ASA doesn't like batches of more than a few hundred / thousand, so I'm trying to figure a way to stage my data in order to work well under ASA. This would be a one time run...
One idea was to write a script that looked at every blob, looked at the timestamp within the blob and re-create the YYYY-MM-DD/HH folder setup, but in my experience, the ASA job will fail when the blob's lastModified time doesn't match the folders it's in...
Any suggestions how to tackle this?
EDIT: Failed to mention (1) there are no folders in these containers... all blobs live at the root of the container and (2) my LastModifiedTime on the blobs is no longer useful or have meaning. The reason for the latter is these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.
Can you please try below?
Do this processing in two different jobs, one for the folders with date partitioning (say partitionedJob). Another for old blobs without any date partitioning (say RefillJob)
Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure that it only processes old events. Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm by looking at LastOutputProcessedTime or by looking at the input event count or by inspecting your output source. After this check, stop the job. This job is no longer needed.
Start the partitionedJob with timestamp > RefillJob. This assumes the folders for the timestamps exists.
I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in data lake and each JSON file is approx. 3 GB. I'm using data factory's copy activity and in the initial run, one file took 3.5 hours to load with the collection set to 10000 RU/s and data factory using default settings. Now I've scaled it up to 50000 RU/s, set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if it improved the speed, and the same file now takes 2.5 hours to load. Still the time to load thousands of files will take way to long time.
Is there some way to do this in a better way?
You say you are inserting "millions" of json documents per 3Gb batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This indicates 300 bytes per json doc which implies quite a lot of fields per doc to index on each CosmosDb insert.
If each insert costs 10 RUs then at your budgeted 10,000 RU per sec the doc insert rate would be 1000 x 3600 (seconds per hour) = 3.6 million doc inserts per hour.
So your observation of 3.5 hours to insert 3 Gb of data representing an assumed 10 million docs is highly consistent with your purchased CosmosDb throughput.
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the DataLake to CosmosDb Cloud Sink performs poorly compared to other options. I guess the poor performance can be attributed to the default index-everything policy of CosmosDb.
Does your application need everything indexed? Does the CommosDb Cloud Sink utilise less strict eventual consistency when performing bulk inserts?
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to Polybase Azure Data Warehouse is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger CosmosDb throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that trying to copy millions of Json files will take time. If it was organized GB of data you could get away with shorter time batch transfers but not with millions of different files.
I don't know if you plan on transferring this type of file from Data Lake often but a good strategy could be to write an application dedicated to do that. Using Microsoft.Azure.DocumentDB Client Library you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host this app on a vm or app service and never really have to think about it.
I am trying to load the data from my External Tables to SQL DW Internal tables. I have the data stores in a compressed format in BLOB Storage and External tables are pointed to the BLOB Storage Location.
I have around 24 files, which is around 22GB of size and trying to load the data from External table to a Internal table on 300 DWU with a largerc resource class service/user account.
My insert into statement ( which is very straight forward) is running for more than 10 hours.
insert into Trxdata.Details_data select * from Trxdata.Stage_External_Table_details_data;
I also tried with below statement, thats also running for more than 10 hours.
CREATE TABLE Trxdata.Details_data12
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM Trxdata.Stage_External_Table_details_data
;
I see - both the SQLs are running with ACTIVE status in "sys"."dm_pdw_exec_requests" [ I was thinking, it may be concurrency slot issue and it hasnt got concurrency slots to run, but its not the case]
and I was hoping , increasing/scaling up DWU - might improve the performance. but looking at the DWU usage in portal.azure.com - I am not convinced to increased the DWU because the DWU usage chart shows <50DWU for the last 12 hours
DWU USage chart
So, I am trying to understand- how can I find - what is taking such a long time, How can I improve the performance of my data load ?
I suspect your problem lies with the file(s) being compressed. Many azure documents state that you will only get one reader per compressed file. As a test I would suggest you decompress your data and try a load and see if decompressing/load is faster than then 10 hours loading compressed data you are currently seeing. I also have better luck with several files rather than 1 large file, if that is an option for your system.
Please have a look at the below blog from SQL CAT on data loading optimizations.
https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/
Based on the info provided, a couple things to consider are:
1) Locality of the blob files compared to the DW instance. Make sure they are in the same region.
2) Clustered Columnstore is on by default. If you are loading 22GB of data, a HEAP load may perform better (but not sure on row count either). So:
CREATE TABLE Trxdata.Details_data12
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;
If the problem still persists, please file a support ticket:
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/
You mention that the data is in a compressed format. How many compressed files does the data reside in? For compressed files, you'll achieve more parallelism and thus better performance when the data is spread across many files. Having the data in multiple files is not needed for uncompressed files in order to achieve better performance, so another way to test if this is your performance issue is to un-compress your files.
Rackspace cloud files uses a flat storage system using 'containers' to store files. According to Rackspace there is no limit to the number of files per container.
My question is whether there is a best/most efficient number of files per container to optimize write/fetch performance.
If I have tens of thousands of files to store, should they all go in a single giant container or partitioned into many smaller containers? And if so, what is the optimal container size?
FYI:
[Snippets taken from rackspace support]
long story short, the containers are databases, and the more rows in a table, the more time it takes to write them on standard hardware. When a write hasn't been committed to disk, it sits in a queue, and it subject to data loss. It's something we noticed with large containers, and the more objects, the more likely it was, so we instituted the limits to protect the data.
because of the rate limits, your data is safe, it just slows down the writes a bit
the limits starts as low as 50,000 objects, and at that level it limits you to 100 writes per second
by 1,000,000 objects in a container, it's 25 per second
and at 5 million and above, you're down to 4 writes per second
We apologize for the limitations, and will be updating our documentation to more clearly express this.
-This has recently hurt us quite badly. Thought I'd share until they get there API doc's upto date, so others can plan around this issue.
We recommend no more than 1 million objects per container. The system will return a maximum of 10,000 object names per list request by default.
Update 9/20/2013 from Cloud Files development: The 1 million object per container recommendation is no longer accurate since Cloud Files switched to all SSD container servers. Also, the list is limited to 10,000 containers at a time.