Is it possible to limit the file size or count when creating an external table in Azure Data Warehouse?

When creating external tables in Azure Data Warehouse,
files are generated in the location we set.
We noticed that the file size and count are determined by the scale level of the data warehouse.
Is it possible to set those values, for example to generate no more than 100 files,
or to keep each file no larger than 5 GB?
Thanks.

So far it is not possible to set those values. They are decided automatically by Azure DW.

Related

What is the max writeBatchSize for the REST as sink in Azure Data Factory

We are using Azure Data Factory to copy data from an on-premises SQL table to a REST endpoint, for example Google Cloud Storage. Our source table has more than 3 million rows. According to the documentation at https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#copy-activity-properties, the default value for writeBatchSize (the number of records written to the REST sink per batch) is 10000. I tried increasing it to 1,000,000 and to 5,000,000, and the final file size was the same in both cases, which suggests that not all 3 million records were written to GCS. Does anyone know the maximum value for writeBatchSize? Pagination seems to apply only when REST is used as the source. Is there a workaround for my case?

Azure Data Lake incremental load with file partition

I'm designing Data Factory pipelines to load data from Azure SQL DB to Azure Data Lake.
My initial load/POC used a small subset of data, and I was able to load it from the SQL tables to Azure DL.
Now there are many large tables (some with a billion rows or more) that I want to load from SQL DB to Azure DL using DF.
The MS docs mention two options: watermark columns and change tracking.
Let's say I have a "cust_transaction" table with millions of rows. If I load it to DL, it lands as a single file, "cust_transaction.txt".
Questions.
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (max possible insert or update time in the data) rather than just current time wherever possible (sometimes they are relatively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and timestamp in the file name by parameterizing the folder and file in the dataset.
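As a minimal sketch of the folder and file naming described above, the function below builds a Raw zone path from an entity name and the end of the incremental slice. The function name, the "Raw" root folder, and the example timestamp are illustrative assumptions. In Data Factory the same result would come from parameterizing the dataset's folder and file name.

```python
from datetime import datetime


def raw_zone_path(entity: str, slice_end: datetime, extension: str = "txt") -> str:
    """Build Raw/<entity>/<yyyy>/<mm>/<dd>/<entity>_<YYYYMMDDHHMMSS>.<ext>.

    slice_end is the end of the incremental slice (the latest possible
    insert or update time in the data), not the current wall-clock time.
    """
    folder = f"Raw/{entity}/{slice_end:%Y/%m/%d}"
    file_name = f"{entity}_{slice_end:%Y%m%d%H%M%S}.{extension}"
    return f"{folder}/{file_name}"


# Hypothetical slice end shared by every table in one batch.
batch_slice_end = datetime(2018, 3, 7, 23, 59, 59)
print(raw_zone_path("cust_transaction", batch_slice_end))
```

Passing the same slice end for every entity in a batch gives all tables a consistent timestamp in their file names.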
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different than mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need to have a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the RAW zone, instead using data from Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like their naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. They have Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline RunID. The Trigger ID is across all the packages executed in that trigger instance (batch) whereas the pipeline RunID is specific to that pipeline.
If you can't do the change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also updated or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
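The two-column concern can be sketched in plain Python. This assumes hypothetical columns named created_at and modified_at, where modified_at may be empty for rows that were inserted but never updated. It illustrates the selection logic only, not a Data Factory feature.

```python
from datetime import datetime


def latest_change(row: dict) -> datetime:
    """Latest of the created and modified timestamps (modified_at may be None)."""
    stamps = [row["created_at"], row["modified_at"]]
    return max(t for t in stamps if t is not None)


def changed_since(rows: list, last_watermark: datetime):
    """Return rows inserted or updated after last_watermark, plus the new watermark."""
    delta = [row for row in rows if latest_change(row) > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = max((latest_change(row) for row in delta), default=last_watermark)
    return delta, new_watermark


rows = [
    {"id": 1, "created_at": datetime(2018, 1, 1), "modified_at": None},
    {"id": 2, "created_at": datetime(2018, 1, 1), "modified_at": datetime(2018, 3, 2)},
    {"id": 3, "created_at": datetime(2018, 3, 3), "modified_at": None},
]
delta, watermark = changed_since(rows, datetime(2018, 3, 1))
print([row["id"] for row in delta], watermark)
```

Checking only modified_at here would miss row 3, which is the kind of gap to look for when you cannot use change tracking.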
To summarize:
Load incrementally and name your incremental files intelligently
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.

How to speed up copy from Azure Data Lake to Cosmos DB

I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using Data Factory's copy activity. In the initial run, one file took 3.5 hours to load with the collection set to 10000 RU/s and Data Factory using default settings. I have now scaled it up to 50000 RU/s and set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if that improves the speed, and the same file now takes 2.5 hours to load. At that rate, loading thousands of files will still take far too long.
Is there a better way to do this?
You say you are inserting "millions" of JSON documents per 3 GB batch file. That lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This implies about 300 bytes per JSON doc, which suggests quite a lot of fields per doc to index on each CosmosDb insert.
If each insert costs 10 RUs, then at your budgeted 10,000 RU per second the insert rate would be 1,000 docs per second, or 1,000 x 3,600 (seconds per hour) = 3.6 million doc inserts per hour.
At that rate, an assumed 10 million docs would take about 2.8 hours, so your observation of 3.5 hours to insert 3 GB of data is broadly consistent with your purchased CosmosDb throughput.
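The same back-of-the-envelope estimate can be written as a small function. The 10 million documents per file and 10 RUs per insert are the assumptions used above, not measured values.

```python
def estimated_load_hours(doc_count: int, ru_per_second: float, ru_per_insert: float) -> float:
    """Hours needed to insert doc_count documents if throughput is the only limit."""
    inserts_per_second = ru_per_second / ru_per_insert
    return doc_count / (inserts_per_second * 3600)


file_bytes = 3 * 10**9   # one 3 GB file
docs = 10_000_000        # assumed documents per file

print(file_bytes / docs)                                  # average bytes per document
print(round(estimated_load_hours(docs, 10_000, 10), 2))   # first run, 10,000 RU/s
print(round(estimated_load_hours(docs, 50_000, 10), 2))   # second run, 50,000 RU/s
```

Under these assumptions the 50,000 RU/s run should take well under an hour, so the observed 2.5 hours suggests that something other than provisioned throughput is the limit in the second test.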
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the DataLake to CosmosDb Cloud Sink performs poorly compared to other options. I guess the poor performance can be attributed to the default index-everything policy of CosmosDb.
Does your application need everything indexed? Does the CosmosDb Cloud Sink utilise less strict eventual consistency when performing bulk inserts?
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to Polybase Azure Data Warehouse is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger CosmosDb throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that copying millions of JSON files will take time. If the data were consolidated into a few large files you could get away with shorter batch transfers, but not with millions of separate files.
I don't know if you plan on transferring this type of file from Data Lake often, but a good strategy could be to write an application dedicated to doing that. Using the Microsoft.Azure.DocumentDB client library, you can create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host this app on a VM or App Service and never really have to think about it.

Azure SQL DW data loads taking long time

I am trying to load data from my external tables into SQL DW internal tables. The data is stored in a compressed format in Blob Storage, and the external tables point to the Blob Storage location.
I have around 24 files, about 22 GB in total, and I am trying to load the data from an external table into an internal table on 300 DWU with a service/user account in the largerc resource class.
My INSERT INTO statement (which is very straightforward) has been running for more than 10 hours.
insert into Trxdata.Details_data select * from Trxdata.Stage_External_Table_details_data;
I also tried the statement below, which has also been running for more than 10 hours.
CREATE TABLE Trxdata.Details_data12
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM Trxdata.Stage_External_Table_details_data
;
I see that both statements are running with ACTIVE status in "sys"."dm_pdw_exec_requests". (I thought it might be a concurrency slot issue, with the statements waiting for slots to run, but that is not the case.)
I was hoping that scaling up the DWU might improve performance, but the DWU usage chart in portal.azure.com shows less than 50 DWU for the last 12 hours, so I am not convinced that increasing the DWU would help.
(Image: DWU usage chart)
So I am trying to understand how I can find out what is taking so long, and how I can improve the performance of my data load.
I suspect your problem lies with the files being compressed. Many Azure documents state that you only get one reader per compressed file. As a test, I would suggest decompressing your data and trying a load, to see whether decompressing and loading is faster than the 10 hours you are currently seeing with compressed data. I also have better luck with several files rather than one large file, if that is an option for your system.
Please have a look at the below blog from SQL CAT on data loading optimizations.
https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/
Based on the info provided, a couple things to consider are:
1) Locality of the blob files compared to the DW instance. Make sure they are in the same region.
2) Clustered Columnstore is on by default. If you are loading 22GB of data, a HEAP load may perform better (but not sure on row count either). So:
CREATE TABLE Trxdata.Details_data12
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;
If the problem still persists, please file a support ticket:
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/
You mention that the data is in a compressed format. How many compressed files does the data reside in? For compressed files, you'll achieve more parallelism and thus better performance when the data is spread across many files. Having the data in multiple files is not needed for uncompressed files in order to achieve better performance, so another way to test if this is your performance issue is to un-compress your files.
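If the data has to stay compressed, one way to test the "more files, more parallelism" advice is to re-split a large gzip file into several smaller ones before loading. The sketch below assumes line-delimited records with no header row and no newlines embedded in fields. The function name and the part-file naming are illustrative, and uploading the parts to Blob Storage is left out.

```python
import gzip
from pathlib import Path


def split_gzip(source: Path, out_dir: Path, parts: int) -> list[Path]:
    """Distribute the lines of one gzip file round-robin across `parts` gzip files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{source.stem}_part{i:03d}.gz" for i in range(parts)]
    # newline="" keeps the original line endings untouched on read and write.
    writers = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        with gzip.open(source, "rt", encoding="utf-8", newline="") as reader:
            for line_number, line in enumerate(reader):
                writers[line_number % parts].write(line)
    finally:
        for writer in writers:
            writer.close()
    return paths
```

A call such as split_gzip(Path("details.csv.gz"), Path("split"), 60) would produce 60 part files of roughly equal size. The file name and the part count here are only examples; the useful number of files depends on your DWU level.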

Are there any limits on the number of Azure Storage Tables allowed in one account?

I'm currently trying to store a fairly large and dynamic data set.
My current design is tending towards a solution where I will create a new table every few minutes - this means every table will be quite compact, it will be easy for me to search my data (I don't need everything in one table) and it should make it easy for me to delete stale data.
I've looked and I can't see any documented limits - but I wanted to check:
Is there any limit on the number of tables allowed within one Azure storage account?
Or can I keep adding potentially thousands of tables without any concern?
There are no published limits on the number of tables, only the 500 TB limit on a given storage account. Combined with partition+row, it sounds like you'll have a direct link to your data without running into any table-scan issues.
This MSDN article explicitly calls out: "You can create any number of tables within a given storage account, as long as each table is uniquely named." Have fun!
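For the table-every-few-minutes design, the table name can be derived from the time bucket. The sketch below uses a hypothetical prefix and a five-minute bucket, and checks the result against the Table storage naming rules (alphanumeric only, not starting with a digit, 3 to 63 characters). It does not call the storage service.

```python
from datetime import datetime


def bucket_table_name(prefix: str, moment: datetime, minutes: int = 5) -> str:
    """Name of the table covering the `minutes`-wide bucket that contains `moment`."""
    if 60 % minutes != 0:
        raise ValueError("minutes must divide 60 evenly")
    floored = moment.replace(minute=moment.minute - moment.minute % minutes,
                             second=0, microsecond=0)
    name = f"{prefix}{floored:%Y%m%d%H%M}"
    valid = name.isascii() and name.isalnum() and name[0].isalpha() and 3 <= len(name) <= 63
    if not valid:
        raise ValueError(f"not a valid table name: {name!r}")
    return name


def stale_tables(names: list, prefix: str, cutoff: datetime, minutes: int = 5) -> list:
    """Tables whose bucket starts before the bucket containing `cutoff`."""
    # Names with the same prefix sort chronologically, so a string comparison is enough.
    limit = bucket_table_name(prefix, cutoff, minutes)
    return [n for n in names if n.startswith(prefix) and n < limit]


print(bucket_table_name("events", datetime(2015, 6, 1, 12, 37, 45)))
```

Because the timestamp is zero-padded, deleting stale data becomes a matter of dropping every table whose name sorts before the cutoff bucket.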
