We've got an Azure SQL table I'm moving from the traditional DB into an Azure Synapse Analytics (DW) table for long-term storage, so it can be removed from our production DB. This is a system table for a deprecated system we used to use (Salesforce). One column in this table is a varchar(max), and it's massive: MAX(LEN(FIELD)) is 1585521. I've tried using Data Factory to move the table into the DW, but it fails on that massive column. I modeled the DW table as a mirror of the production DB table, but it fails to load, and I have tried several times. I changed the failing DW column to nvarchar(max), but it's still failing (I thought non-Unicode might be causing the failure). Any ideas? It's confusing me because the data exists in our production DB, but it won't peacefully move to our DW.
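For context, that length figure came from a check along these lines (the table name below is anonymized; FIELD stands in for the real column name):

    -- Widest value in the problem column, in characters and in bytes.
    SELECT
        MAX(LEN(FIELD))        AS max_char_len,
        MAX(DATALENGTH(FIELD)) AS max_byte_len
    FROM dbo.SalesforceSystemTable;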
I've tried several times and have received these error messages (the second one after changing the DW column from varchar(max) to nvarchar(max)):
HadoopSqlException: Arithmetic overflow error converting expression to data type NVARCHAR.
HadoopExecutionException: Too long string in column [-1]: Actual len = [4977]. MaxLEN=[4000]
PolyBase currently has a 1 MB row limitation, and the length of your column is greater than that. The workaround is to use bulk insert in the Copy activity of ADF, or to chunk the source data into 8K columns and load it into a target staging table that also has 8K columns. Check this document for more details on this limitation.
If you're using PolyBase external tables to load your tables, the defined length of the table row can't exceed 1 MB. When a row with variable-length data exceeds 1 MB, you can load the row with BCP, but not with PolyBase.
It worked for me when I used the "bulk insert" option in the ADF pipeline.
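For reference, the chunking workaround mentioned above could look roughly like this on the source side (the table, key, and column names are placeholders); the chunks can be concatenated back together after loading:

    -- Hypothetical source query: split the oversized column into 8000-character
    -- chunks so that each staged column fits a regular varchar(8000).
    SELECT
        Id,
        SUBSTRING(BigField,     1, 8000) AS Chunk01,
        SUBSTRING(BigField,  8001, 8000) AS Chunk02,
        SUBSTRING(BigField, 16001, 8000) AS Chunk03
        -- ...continue as far as MAX(LEN(BigField)) requires
    FROM dbo.SalesforceSystemTable;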
Related
I am using Azure Data Factory to load data from an Azure SQL Server database into a Snowflake data warehouse. I am using a TRUNCATE TABLE script in the pre-script section of the sink to completely remove the data from the destination and then insert all of the data that's in the source table. It's fast when the data size is small, but once the data gets really big the entire Copy activity takes hours to complete the sync. What other alternatives can I use to copy data from my source to the destination?
I thought about using upsert instead of truncating and inserting, but since it's going to check each record I assumed it would be slower.
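For reference, the upsert I have in mind would be a MERGE keyed on the primary key, roughly like this (the table and column names are made up for illustration):

    -- Hypothetical upsert: update matching rows, insert new ones.
    MERGE INTO analytics.orders t
    USING staging.orders s
        ON t.order_id = s.order_id
    WHEN MATCHED THEN
        UPDATE SET t.amount = s.amount, t.modified_date = s.modified_date
    WHEN NOT MATCHED THEN
        INSERT (order_id, amount, modified_date)
        VALUES (s.order_id, s.amount, s.modified_date);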
There is a use case in my company to enable business users with no technical knowledge to use the data from the Azure cloud. Back in the SQL Server days this was easily solved through OLAP cubes: you could write a query for the data backing the cube, and business people could then just connect to the cube and pull the data down as a pivot table. The only problem with large datasets was compute (the larger the data, the slower the pivot table), not really a row limit.
With the current Azure Synapse setup it seems that Excel tries to download the entire data set and obviously always hits the 1M row limit. Is there any way to use the data directly in a pivot table without bringing it into Excel in full? All my tables are >1M rows.
UPD: You can load data directly into a pivot table, but it does load the data into RAM and the actual loading takes time. I am looking for a cube-like solution, where the pivot table is available immediately and the querying happens as you add fields and calculations to the pivot table.
Currently we have a strange issue in our data warehouse (Azure Synapse Analytics): one table in the production instance is 52 GB in size with 18 million records. I copied that table to our development instance (I exported the table to a CSV file in ADLS Gen2 and copied it into our development data warehouse using ADF) to check why this table is so large and is causing stored procedures to run slowly.
Strangely, the copied table is just 17 GB, while the tables in the two instances are identical in row count, contents, and DDL. The two data warehouses have the same DWU and other specifications. I do not have much permission to investigate the production instance, and I cannot replicate the same table size on the development instance.
Can someone help me troubleshoot this issue or guide me in the right direction to rectify it?
Kind regards,
Ken
Please recheck the process that you have followed, and if it gives the same result you can contact the MS support team.
But before that, I would suggest you try copying the table directly from one Synapse instance to the other using the Copy activity in ADF and then check the size of the table.
Please refer to this official Microsoft documentation for that.
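It may also help to compare how the space is actually allocated on both instances before opening a ticket. In a dedicated SQL pool, something along these lines (the table name is a placeholder) reports reserved, data, index, and unused space per distribution:

    -- Placeholder table name; compare the output between production and development.
    DBCC PDW_SHOWSPACEUSED ("dbo.MyLargeTable");

One possible explanation, worth verifying against that output, is that exporting to CSV and reloading effectively rebuilds the table, so overhead from deleted rows or poorly compressed columnstore row groups on production would not carry over to development.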
I have an ADF pipeline that executes a Data Flow.
The Data Flow has:
- a Source (table A) with around 1 million rows,
- a Filter with a query to select only yesterday's records from the source table,
- an Alter Row transformation that uses upsert, and
- a Sink, the archival table where the records are upserted.
This whole pipeline takes around 2 hours, which is not acceptable; only around 3,000 records are actually transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival for another table with around 100K records doesn't take more than 15 minutes.
I thought of creating a source that would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table in the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so you can pass in a value to select only yesterday's data from the database.
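For example, a source query along these lines (assuming a ModifiedDate column, which is a guess at your schema) pushes the "yesterday only" filter down to the database instead of filtering a million rows inside the Data Flow:

    -- Hypothetical source query: return only yesterday's rows.
    SELECT *
    FROM dbo.SourceTable
    WHERE ModifiedDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date))
      AND ModifiedDate <  CAST(GETDATE() AS date);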
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance by using a Copy activity to stage the data in an Azure Storage blob before using another Copy activity to pull from that blob into the destination. But for what you're describing here, that doesn't seem necessary.
How can we insert data into a table with computed columns using ADF V1? Currently we cannot insert the data into the Azure SQL Database.
Suppose the source system is a JDBC connection and the target system is an Azure SQL Database table that has a computed column (say, extracting the year from a date string like "20181022"). When ADF tries to load the table, it complains about a column mismatch with the target system.
Is there a way to get around this?
I have a similar situation and am also looking for a better solution.
I am copying data from a prod db and updating the QA db with it.
Here's how I have been handling it so far: I modify the hashbytes column in the target QA DB table so that it is a varchar. But this solution is less than ideal, because it has to be done manually every single time a code change is pushed to the affected table.
Your situation is different.
If it were me, I would do a straight copy of the data into the Azure SQL Database. No computed columns. No modifications. Then I would stage the data into another table that had the computed columns I was after.
This is an Extract, Load, Transform (ELT) approach, compared to the traditional Extract, Transform, Load (ETL) method.
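A minimal sketch of that ELT approach, assuming the year-from-date-string example above (all object names here are made up):

    -- Landing table: loaded by ADF as-is, no computed columns.
    CREATE TABLE stg.SourceData
    (
        RecordId   INT        NOT NULL,
        DateString VARCHAR(8) NOT NULL   -- e.g. '20181022'
    );

    -- Final table: the year is a computed column, so it is never inserted directly.
    CREATE TABLE dbo.TargetData
    (
        RecordId   INT        NOT NULL,
        DateString VARCHAR(8) NOT NULL,
        RecordYear AS CAST(LEFT(DateString, 4) AS INT)
    );

    -- After the copy lands in staging, populate only the non-computed columns.
    INSERT INTO dbo.TargetData (RecordId, DateString)
    SELECT RecordId, DateString
    FROM stg.SourceData;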
This question is really old, but hey, I found it just now; maybe others will, too.