Azure Data Factory copy data is slow

Source database: PostgreSQL hosted on Azure VM D16s_v3
Destination database: SQL Server developer edition hosted on Azure VM D4s_v3
Source database is around 1TB in size
Destination database is empty with existing schema identical to source database
Throughput is only about 1 MB/s and nothing helps (I've already selected the maximum DIU). SQL Server doesn't have any keys or indexes at this point.
Batch size is 10000

I got nailed by something similar when using ADF to copy data from an on-premises Oracle source to an Azure SQL Database sink. The exact same job performed via SSIS was something like 5 times faster. We began to suspect that something was amiss with data types, because the problem disappeared if we cast all of our high-precision Oracle NUMBER columns to a lower precision, or to something like integer.
It got so bad that we opened a case with Microsoft about it, and our worst fears were confirmed.
The Azure Data Factory runtime decimal type has a maximum precision of 28. If a decimal/numeric value from the source has a higher precision, ADF will first cast it to a string. The performance of the string casting code is abysmal.
Check whether your source has any high-precision numeric data, or, if you have not explicitly defined a schema, whether you're perhaps accidentally mapping those columns to string.
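For instance, pushing an explicit cast into the copy activity's source query keeps the values within ADF's decimal range. A minimal sketch with hypothetical table and column names:

-- cast a high-precision numeric column down to a precision ADF's
-- decimal type can hold (max 28) so the runtime doesn't fall back
-- to its slow string conversion; pick a scale that still fits your data
SELECT
    id,
    CAST(amount AS NUMERIC(28, 10)) AS amount
FROM public.transactions;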

Increase the batch size to 1000000.

If you are using the Table option, the table should be selected in the dataset's dropdown box. If you are extracting with a SQL query, edit the dataset's connection and remove the table name.
I hit the same issue: if you select the query option but also provide a table name in the dataset, you make it ambiguous for Azure Data Factory to decide which option to use.

Related

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
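For context, the source query follows the shape suggested in that pattern, roughly like the sketch below (hypothetical view name); ADF replaces the hook with a range predicate on the chosen partition column for each parallel copy:

SELECT *
FROM dbo.MyView
WHERE ?AdfDynamicRangePartitionCondition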
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with v as (
    select HASHBYTES('SHA2_256', field1) as [Key1],
           HASHBYTES('SHA2_256', field2) as [Key2]
    from dbo.[Table]
)
select *
from v;
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior of the copy operation - most likely due to my lack of experience - is that it lasts much longer for the view that queries fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15 MB/s (!). Lastly, the write to blob storage is pretty fast in both cases, as opposed to the read from the source.
I don't expect anyone to provide an actual solution, as the information provided is limited, but I'd like some hints on what could be affecting the copy performance so badly in the case where the view queries roughly an order of magnitude fewer rows, taking into account that the tables under the views have a comparable number of fields and the same data types: both contain int, datetime, and varchar columns.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue, I managed to find out, rather empirically, that the bottleneck was being caused by the presence of several key-hash computations in the view on SQL DB. In fact, once I removed these - calculated later on Azure Synapse Analytics (data warehouse) - I observed a massive performance boost of the copy operation.
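A minimal sketch of that kind of change, with hypothetical object names: expose the raw columns in the extraction view and compute the key hashes after the copy, on the warehouse side:

-- extraction view without the expensive hash computations
CREATE VIEW dbo.v_SourceForExtract AS
SELECT field1, field2
FROM dbo.SourceTable;
GO

-- the keys are derived later, after loading into the warehouse
-- (field1/field2 assumed to be varchar; cast first if they are not)
SELECT
    HASHBYTES('SHA2_256', field1) AS [Key1],
    HASHBYTES('SHA2_256', field2) AS [Key2]
FROM stg.SourceTable;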
When there's a copy activity performance issue in ADF and the root cause is not obvious (as opposed to, say, a fast source with a sink that is throttled for a known reason), here's how I would go about it:
Start with the Integration Runtime (IR). This might be a job concurrency issue, a network throughput issue, or just an undersized VM (in the case of a self-hosted IR). Something like >80% of all issues in my prod ETL are caused by IRs, in one way or another.
Replicate copy activity behavior both on source & sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of the copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but it can point you toward where to dig further.
Try searching the documentation (the doc provided by @Leon is a good start). This should probably have been step #1; however, I find the ADF documentation somewhat lacking.
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

How do I store run-time data in Azure Data Factory between pipeline executions?

I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only btw). I'd rather just store that somewhere else in Azure. Maybe in the blob storage or whatever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU that specific watermark info was for.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
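A minimal sketch of that kind of watermark/config table, plus the stored procedure a Stored Procedure activity would call at the end of each run (hypothetical names):

-- one row per business unit and source table
CREATE TABLE dbo.WatermarkConfig
(
    BusinessUnit   VARCHAR(50)  NOT NULL,
    TableName      VARCHAR(128) NOT NULL,
    WatermarkValue DATETIME2    NOT NULL,
    CONSTRAINT PK_WatermarkConfig PRIMARY KEY (BusinessUnit, TableName)
);

-- called from a Stored Procedure activity once the copy succeeds
CREATE PROCEDURE dbo.usp_UpdateWatermark
    @BusinessUnit VARCHAR(50),
    @TableName    VARCHAR(128),
    @NewValue     DATETIME2
AS
BEGIN
    UPDATE dbo.WatermarkConfig
    SET    WatermarkValue = @NewValue
    WHERE  BusinessUnit = @BusinessUnit
      AND  TableName    = @TableName;
END;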
For reference: there is a way to achieve this with a Copy activity, but it is complicated to get the latest watermark back out in 'LookupOldWaterMarkActivity'.
The source and sink dataset are the same one. Change the expression in Additional columns to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
This way, you can save the watermark as a column in a .txt file, but it is difficult to get the latest watermark back with a Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:
{
    "count": 1,
    "value": [
        {
            "Prop_0": "11/24/2020 02:39:14",
            "Prop_1": "11/24/2020 08:31:42"
        }
    ]
}
The key names are generated by ADF. If you want to get "11/24/2020 08:31:42", you need to get the column count and then use an expression like @activity('LookupOldWaterMarkActivity').output.value[0]['Prop_<column count - 1>']
How to get the latest watermark:
use a Get Metadata activity to get the columnCount
use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]

Error- "The size (12000) given to the type 'VarChar' exceeds the maximum allowed (8000)" in Azure dataware house

I am trying to execute T-SQL in Azure Data Warehouse and it is not allowing me a data type larger than varchar(8000); can anyone please suggest an alternative?
(The same issue happened on table creation as well; it doesn't support blob or LOB data types even with bulk loading or PolyBase loading, so I ended up loading trimmed data.)
You can try VARCHAR(MAX), which supports up to 2 GB, but the page size in SQL is still limited to 8,000 bytes, so I'm not sure whether that will help. PolyBase is also limited to 1 MB per row.
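A minimal sketch of the kind of definition suggested above (hypothetical table name); whether VARCHAR(MAX) is accepted depends on the service version, and PolyBase loads remain capped at roughly 1 MB per row either way:

-- wide text column declared as VARCHAR(MAX) instead of VARCHAR(12000)
CREATE TABLE dbo.StagingDocuments
(
    DocumentId INT          NOT NULL,
    Payload    VARCHAR(MAX) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);  -- heap avoids columnstore limits on wide columns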

Impossible to process multiple tables with ODBC connection in SSAS Tabular 2017

I'm currently building a cube in SSAS Tabular with compatibility level 1400 (on an Azure workspace server) and here is my problem. I have an ODBC connection to source my cube, and I have to use a connection string and a SQL query for each table I need (the connection string is always the same and the SQL query is always different).
With my first table (and only one table), I can build, process and deploy easily without any problem. But when I add a new table, I can't process anymore. I get this kind of message for both tables: Failed to save modifications to the server. Error returned: 'Column' column does not exist in the rowset.
I think the problem comes from the connection string, which is the same for every table. I only have one data source in the end because I only have one connection string for all tables. In my opinion, that might be the cause of my problem, but I'm not sure. Any ideas?
I hope I made myself clear.
Thanks a lot.
I found the solution to my problem. It was not related to my data source but to the table properties of each table.
Indeed, it contained only the connection string and not the SQL query. I had to replace it with the correct M language query. It's still a bit strange, because I had to do the same "Get Data" in Power BI to get the right M query and then copy and paste it into the table properties in SSAS. There should be a way to do this automatically, I guess, but I didn't find how.

Azure SQL Data Warehouse bandwidth limits?

Are there any bandwidth limits or throttling for Azure SQL Data Warehouse extracts? Are there any connection string settings that optimize how fast we can extract data via a SELECT query?
From SSIS on a VM in the same Azure region as SQL DW, if I run a SELECT * query to extract millions of rows via OLE DB with the default connection string (default packet size), I see it use about 55 Mbps of bandwidth. If I add Packet Size=32767, I see it use about 125 Mbps. Is there any way to make it go faster? Are there other connection string settings to be aware of?
Incidentally, I was able to get up to about 500Mbps bandwidth coming from SQL DW if I run multiple extracts in parallel. But I can't always break one query into several parallel queries. Sometimes I just need one query to extract the data faster.
Of course Polybase CETAS (CREATE EXTERNAL TABLE AS SELECT) is much more efficient at extracting data. But that's not a good fit in all extract scenarios. For example, if I want to put Analysis Services on top of Azure SQL DW, I can't really involve a CETAS statement during cube processing so Polybase doesn't help me there.
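For reference, a CETAS export looks roughly like the sketch below (hypothetical names; it assumes an external data source and file format have already been created):

-- export the result of a query to external storage via PolyBase
CREATE EXTERNAL TABLE ext.FactSales_Extract
WITH (
    LOCATION    = '/extracts/factsales/',
    DATA_SOURCE = MyAzureBlobStorage,   -- existing EXTERNAL DATA SOURCE
    FILE_FORMAT = MyParquetFormat       -- existing EXTERNAL FILE FORMAT
)
AS
SELECT SaleId, SaleDate, Amount
FROM dbo.FactSales;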
At the moment your best option is to run multiple extracts in parallel, optimizing the packet size as you have described. For SSAS on top of SQL DW, parallel partition processing would be the way to go.
