Azure SQL Data Warehouse bandwidth limits?

Are there any bandwidth limits or throttling for Azure SQL Data Warehouse extracts? Are there any connection string settings that optimize how fast we can extract data via a SELECT query?
From SSIS on a VM in the same Azure region as SQL DW, if I run a SELECT * query to extract millions of rows via OLE DB with the default connection string (default packet size), I see it use about 55 Mbps of bandwidth. If I add Packet Size=32767, I see it use about 125 Mbps. Is there any way to make it go faster? Are there other connection string settings to be aware of?
Incidentally, I was able to get up to about 500 Mbps of bandwidth out of SQL DW by running multiple extracts in parallel. But I can't always break one query into several parallel queries; sometimes I just need one query to extract the data faster.
Of course Polybase CETAS (CREATE EXTERNAL TABLE AS SELECT) is much more efficient at extracting data. But that's not a good fit in all extract scenarios. For example, if I want to put Analysis Services on top of Azure SQL DW, I can't really involve a CETAS statement during cube processing, so Polybase doesn't help me there.
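(For reference, a minimal CETAS sketch looks roughly like the following; the external data source, file format, and table names are hypothetical placeholders.)
CREATE EXTERNAL TABLE ext.SalesExport
WITH (
    LOCATION = '/export/sales/',
    DATA_SOURCE = MyAzureBlobStorage,  -- hypothetical external data source
    FILE_FORMAT = MyParquetFormat      -- hypothetical external file format
)
AS
SELECT *
FROM dbo.FactSales;                    -- hypothetical source table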

At the moment your best option is to run multiple extracts in parallel, optimizing the packet size as you have described. For SSAS on top of SQLDW your best option would be to use parallel partition processing.
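As a rough sketch of what splitting one extract into parallel range queries can look like (hypothetical table and key names; assumes a reasonably even key distribution), each statement below would run in its own concurrent SSIS data flow or session:
-- four range-bounded extracts run side by side instead of one SELECT *
SELECT * FROM dbo.FactSales WHERE SalesKey % 4 = 0;
SELECT * FROM dbo.FactSales WHERE SalesKey % 4 = 1;
SELECT * FROM dbo.FactSales WHERE SalesKey % 4 = 2;
SELECT * FROM dbo.FactSales WHERE SalesKey % 4 = 3;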

Related

Slow bulk insert to Azure database

We are running an elastic pool in Azure with multiple databases, and one of our larger imports seems to take longer than we are used to. During these imports we ran at 6 cores as a test; all databases are allowed to use all cores.
On our local environment it inserts about 100k records per second; however, the same dataset on Azure does about 1k per second (our VM) to 4k per second (dev laptop).
During this insert, the database only uses 14% log IO, 5% CPU, and 0% data IO.
When setting up a new database using the DTU model at P2, we see the same behavior, so we are not even hitting the limits of the database.
The table contains about 36 columns, which are all required.
We have tried this using BulkInsert in the following way, using different batch sizes:
// BulkConfig and BulkInsertAsync come from an EF Core bulk-extensions library (e.g. EFCore.BulkExtensions)
BulkConfig b = new BulkConfig();
b.BatchSize = 100000;
await dbcontext.BulkInsertAsync(entities, b);
We also tried standard Entity Framework AddRange with smaller batches, and even went as far as manually written SqlBulkCopy code, all with no luck.
Now the question is mainly: is this a software issue? Are we running into limits of our Azure DB? Do we need to change the way we do bulk imports?
Edit:
We attempted to run the import using the TempDB setting in BulkInsert, but this also does not increase performance; log IO is still at 14%.
Iterate through the dataset on the application layer, invoking a stored procedure for each row that will perform an INSERT/UPDATE action based on the existence of a record with a certain key. If the number of records to upsert is limited, this strategy may work well; otherwise, roundtrips and log writes will have a major influence on speed.
To minimise roundtrips and log writes and increase throughput, use bulk insert approaches like the SqlBulkCopy class in ADO.NET to upload the full dataset to Azure SQL Database and then execute all the INSERT/UPDATE (or MERGE) operations in a single batch. Overall execution times may be reduced from hours to minutes/seconds using this method.
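A minimal sketch of that staged approach, assuming the full dataset has already been bulk-copied into a hypothetical staging table (all table and column names below are placeholders):
-- dbo.StagingTable is assumed to have been loaded via SqlBulkCopy
MERGE dbo.TargetTable AS t
USING dbo.StagingTable AS s
    ON t.BusinessKey = s.BusinessKey
WHEN MATCHED THEN
    UPDATE SET t.Col1 = s.Col1, t.Col2 = s.Col2
WHEN NOT MATCHED THEN
    INSERT (BusinessKey, Col1, Col2)
    VALUES (s.BusinessKey, s.Col1, s.Col2);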
Here is a discussion related to the same scenario: Optimize Azure SQL Database Bulk Upsert scenarios - link.

Temporarily increase limits for .set ingest query in Azure Data Explorer

We need to initially import and transform large amounts of data in Azure Data Explorer.
The transformation consists of several .set operations containing queries representing transformational steps.
When I run those queries, they exceed certain limits ADX imposes to protect the cluster.
The error message is:
aggregation over string column exceeded the memory budget of 8GB during evaluation
I know I can override those default limits for memory consumption per iterator and per node, but this seems to only work for queries without ingestion commands.
When I want to run
set max_memory_consumption_per_query_per_node=68719476736;
.set async DestinationTable <| SourceTable | ...
ADX complains with
The incomplete fragment is unexpected.
Is there a way to temporarily increase those query limits for .set-ingest operations as well?
The limit you're hitting isn't configurable using the options you included above.
The best approach would be optimizing the query/command you're running - you can start with query best practices and consider splitting the single .set-or-append command into multiple ones (see "Remarks" here).
If these still don't help, I would recommend that you include the full command text here for further advice.

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
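(For reference, the source query with that hook typically looks like the sketch below; the view name and any extra filter are hypothetical placeholders.)
SELECT *
FROM dbo.MyView
WHERE ?AdfDynamicRangePartitionCondition  -- ADF substitutes the partition-key range predicate here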
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with v as (
    select HASHBYTES('SHA2_256', field1) as [Key1], HASHBYTES('SHA2_256', field2) as [Key2]
    from dbo.[Table]
)
select *
from v
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior, most likely due to my lack of experience, is that the copy lasts much longer for the view that queries fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15 MB/s (!). Lastly, the writing operation to the blob storage is pretty fast in both cases, as opposed to the reading-from-source operation.
I don't expect anyone to provide an actual solution, as the information provided is limited, but I'd appreciate some hints on what could be affecting the copy performance so badly in the case where the view queries much (roughly an order of magnitude) fewer rows, taking into account that the tables under the views have a comparable number of fields and the same data types: both tables that the views query contain int, datetime, and varchar columns.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue: I found out, rather empirically, that the bottleneck was caused by the several key-hash computations in the view on the SQL database. Once I removed these and calculated them later on Azure Synapse Analytics (the data warehouse), I observed a massive performance boost in the copy operation.
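In other words (a hedged sketch in two separate steps, with hypothetical object names), the view was reduced to plain column selection and the hashes were computed only after the data landed in Synapse:
-- source view on the SQL database: raw columns only, no HASHBYTES
CREATE VIEW dbo.v_Export AS
SELECT field1, field2
FROM dbo.[Table];
-- later, on Azure Synapse Analytics, once the copy has landed the data:
SELECT HASHBYTES('SHA2_256', field1) AS [Key1],
       HASHBYTES('SHA2_256', field2) AS [Key2]
FROM stg.[Table];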
When there's a copy activity performance issue in ADF and the root cause is not obvious (unlike, say, a fast source with a throttled sink, where we know why), here's how I would go about it:
Start with the Integration Runtime (IR) (doc.). This might be a job-concurrency issue, a network throughput issue, or just an undersized VM (in the case of a self-hosted IR). In my experience, more than 80% of all issues in my production ETL are caused by IRs in one way or another.
Replicate the copy activity behavior on both source and sink: query the views from your local machine (ideally from a VM in the same environment as your IR), write the flat files to blob storage, etc. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of the copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but it can point you towards where to dig further.
Try searching the documentation (the doc. provided by @Leon is a good start). This should have been step #1; however, I find the ADF documentation somewhat lacking.
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

Data Factory V2 copy Data Activities and Data flow ETL

I have 5 questions about the Data Factory V2 Copy Data activity.
Question 1
Should I use a Parquet file or a SQL Server staging table (with 500 DTU)? I want to transfer data fast into a staging table or a staging Parquet file.
Question 2
For the Copy Data activity's Data Integration Units, should I use Auto or 32?
Question 3
What is the benefit of the degree of copy parallelism setting, and should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized or Memory Optimized? As mentioned, we have 50 million rows every day, so we want to process the data as quickly as possible and, if we can, cheaply in Data Flow.
Question 5
Is a bulk insert better in Data Factory and the Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQLDW) instance, then ADFDF will use Polybase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either source or sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three; the bad news is I have no way to know what the core count should be, so you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process runs many times a day, then you can include a "Time to live" value in the IR configuration. This keeps the cluster warm while it waits for another job, potentially reducing startup time (but incurring more run-time cost).

Azure Data Factory copy data is slow

Source database: PostgreSQL hosted on Azure VM D16s_v3
Destination database: SQL Server developer edition hosted on Azure VM D4s_v3
Source database is around 1TB in size
Destination database is empty with existing schema identical to source database
Throughput is only 1 MB/s and nothing helps (I've selected the max DIU). SQL Server doesn't have any keys or indexes at this point.
Batch size is 10000
I got nailed by something similar when using ADF to copy data from an on-premises Oracle source to an Azure SQL Database sink. The same exact job performed via SSIS was something like 5 times faster. We began to suspect that something was amiss with data types, because the problem disappeared if we cast all of our high-precision Oracle NUMBER columns to less precision, or to something like integer.
It got so bad that we opened a case with Microsoft about it, and our worst fears were confirmed.
The Azure Data Factory runtime decimal type has a maximum precision of 28. If a decimal/numeric value from the source has a higher precision, ADF will first cast it to a string. The performance of the string casting code is abysmal.
Check to see whether your source has any high-precision numeric data or, if you have not explicitly defined a schema, whether you're perhaps accidentally using string.
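As a hedged illustration of that workaround (hypothetical column names; the exact cast syntax depends on your source database), reducing precision in the source query keeps values inside ADF's 28-digit decimal type:
SELECT
    id,
    CAST(amount AS DECIMAL(28, 10)) AS amount,  -- keep precision at or below 28
    created_at
FROM source_table;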
Increase the batch size to 1000000.
If you are using the TableName option, then you should select that table in the dataset's dropdown box. If you are extracting using a SQL query, then check the dataset connection, click Edit, and remove the table name.
I had hit the same issue: if you select the query option and also provide a table name in the dataset, you are confusing Azure Data Factory and making it ambiguous which option to use.
