How to perform Data Factory transformations on large datasets in Azure SQL Data Warehouse

We have data warehouse tables that we perform transformations on using ADF.
If I have a group of ADW tables and need to transform them and land the results back in ADW, should I stage the transformed data in Azure Blob Storage, or write directly into the target tables?
The ADW tables are in excess of 100 million records.
Is it acceptable practice to use Blob Storage as the middle piece?

I can think of two ways to do this (neither requires moving the data into Blob Storage):
Do the transformation within SQL DW using a stored procedure, and use ADF to orchestrate the stored procedure call (a sketch follows below).
Use ADF's Data Flow to read from SQL DW, apply the transformation, and write back to SQL DW.
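A minimal sketch of the stored-procedure approach, assuming a hypothetical source table dbo.SalesRaw and target table dbo.SalesClean; CTAS keeps the transformation fully parallel inside SQL DW:

CREATE PROCEDURE dbo.TransformSales
AS
BEGIN
    -- Rebuild the output table from scratch on each run
    IF OBJECT_ID('dbo.SalesClean') IS NOT NULL
        DROP TABLE dbo.SalesClean;

    -- CTAS materializes the transformed result as a new distributed table
    CREATE TABLE dbo.SalesClean
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT CustomerId,
           CAST(OrderDate AS date) AS OrderDate,
           SUM(Amount)             AS TotalAmount
    FROM dbo.SalesRaw
    GROUP BY CustomerId, CAST(OrderDate AS date);
END;

An ADF Stored Procedure activity can then call dbo.TransformSales on whatever schedule or trigger you need.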

Yes, using Blob Storage as the middle piece is acceptable practice.
You cannot copy tables from a SQL DW (source) to the same SQL DW (sink) directly. If you try, you will run into these problems:
Copy Data tool: errors in data mapping, and the data is copied into the same table rather than into new tables.
Copy activity: a target table is required for the Copy activity.
If you want to copy the data from SQL DW tables to new tables with Data Factory, you need at least two steps:
Copy the data from the SQL DW tables to Blob Storage (creating CSV files).
Load these CSV files into SQL DW and create the new tables.
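If you prefer to do the load step in T-SQL rather than with a second Copy activity, and your SQL DW / Synapse dedicated pool supports the COPY statement, a rough sketch looks like this (the storage URL, table name, and SAS token are placeholders, and the target table must already exist):

COPY INTO dbo.NewTable
FROM 'https://youraccount.blob.core.windows.net/staging/export/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>'),
    FIRSTROW = 2  -- skip the header row written during the export step
);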
Reference tutorials:
Copy and transform data in Azure Synapse Analytics (formerly Azure SQL Data Warehouse) by using Azure Data Factory
Copy and transform data in Azure Blob storage by using Azure Data Factory
Data Factory is good at transferring big data; see Copy performance of Data Factory. It may well be faster than SELECT - INTO Clause (Transact-SQL).
Hope this helps.

Related

Is it possible to access Databricks DBFS from Azure Data Factory?

I am trying to use the Copy Data activity to copy data from Databricks DBFS to another location on DBFS, but I am not sure if this is possible.
When I select Azure Delta Storage as a dataset source or sink, I can access the tables in the cluster and preview the data, but when validating it says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is DBFS itself, not the cluster tables. Is there an option for this?

How to bulk load Azure SQL DB from ADLS

I am aware that in ADF the Copy activity can be used to load data from ADLS to Azure SQL DB.
Is there any possibility of bulk loading?
For example, ADLS --> Synapse has the option of PolyBase for bulk loading.
Is there an efficient way to load a huge number of records from ADLS to Azure SQL DB?
Thanks
Madhan
You can use either BULK INSERT or OPENROWSET to get data from blob storage into Azure SQL Database. A simple example with OPENROWSET:
SELECT *
FROM OPENROWSET (
    BULK 'someFolder/somecsv.csv',
    DATA_SOURCE = 'yourDataSource',
    FORMAT = 'CSV',
    FORMATFILE = 'yourFormatFile.fmt',
    FORMATFILE_DATA_SOURCE = 'yourDataSource'
) AS yourFile;
A simple example with BULK INSERT:
BULK INSERT yourTable
FROM 'someFolder/somecsv.csv'
WITH (
    DATA_SOURCE = 'yourDataSource',
    FORMAT = 'CSV'
);
There is some setup to be done first, i.e. you have to use the CREATE EXTERNAL DATA SOURCE statement, but I find it a very effective way of getting data into Azure SQL DB without the overhead of setting up an ADF pipeline. It's especially good for ad hoc loads.
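The setup looks roughly like the following sketch, with placeholder names and a placeholder SAS token (run once per database):

-- Needed once per database before creating a scoped credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Credential holding a SAS token for the storage account (no leading '?')
CREATE DATABASE SCOPED CREDENTIAL yourBlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

-- External data source referenced by DATA_SOURCE in BULK INSERT / OPENROWSET
CREATE EXTERNAL DATA SOURCE yourDataSource
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://youraccount.blob.core.windows.net/yourcontainer',
    CREDENTIAL = yourBlobCredential
);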
This article talks the steps through in more detail:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
Data Factory offers good performance for transferring big data; see Copy performance and scalability achievable using ADF. You can follow that document to improve copy performance for a huge number of records in ADLS. It may work out better than BULK INSERT.
We cannot use BULK INSERT (Transact-SQL) directly in Data Factory, but we can use bulk copy from ADLS to an Azure SQL database. Data Factory provides a tutorial and example.
Ref here: Bulk copy from files to database:
This article describes a solution template that you can use to copy data in bulk from Azure Data Lake Storage Gen2 to Azure Synapse Analytics / Azure SQL Database.
Hope it's helpful.

Azure Data Lake - HDInsight vs Data Warehouse

I'm in a position where we're reading from our Azure Data Lake using external tables in Azure Data Warehouse.
This enables us to read from the data lake, using well known SQL.
However, another option is using Data Lake Analytics, or some variation of HDInsight.
Performance wise, I'm not seeing much difference. I assume Data Warehouse is running some form of distributed query in the background, converting to U-SQL(?), and so why would we use Data Lake Analytics with the slightly different syntax of U-SQL?
With Python scripting also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than the cost model (pay per batch job, rather than the constant uptime of a database).
If your main purpose is to query data stored in the Azure Data Warehouse (ADW), then there is no real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, such as JSON documents or CSV files, the benefit of ADLA becomes clear: U-SQL allows you to join your relational data stored in ADW with the (un)structured / NoSQL data stored in ADLS.
It also enables you to use U-SQL to prepare this other data for direct import into ADW, so Azure Data Factory is no longer required to get the data into your data warehouse. See this blog post for more information:
A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a loading ready format. From there, the high value data can be imported into Azure SQL DW via PolyBase.
..
You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.
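For reference, the CTAS-over-an-external-table pattern mentioned in that quote looks roughly like this; the table, column, external data source, and file format names below are placeholders and assumed to already exist:

-- External table over Parquet files sitting in ADLS
CREATE EXTERNAL TABLE dbo.SalesStaging (
    CustomerId int,
    OrderDate  date,
    Amount     decimal(18, 2)
)
WITH (
    LOCATION = '/curated/sales/',
    DATA_SOURCE = AdlsDataSource,
    FILE_FORMAT = ParquetFormat
);

-- Import the high value data into an internal SQL DW table with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.SalesStaging;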
Please note that the SQL statements in SQL Data Warehouse are currently NOT generating U-SQL behind the scenes. Also, the use cases for ADLA/U-SQL and SQL DW are different.
ADLA gives you a processing engine to do batch data preparation/cooking to generate the data for a data mart/warehouse that you can then read interactively with SQL DW. In your example above, you seem to be mainly doing the second part. Adding views on top of these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you are operating on big data (and not just a few hundred thousand rows).

Azure SQL DW to Azure SQL DW using Polybase

I know you can use PolyBase with external tables to load large volumes of data from Blob Storage into Azure SQL DW. But is there any possibility of importing data from one SQL DW into another SQL DW using PolyBase directly? Or is there some other way? There must be some way to avoid the control node in both SQL DWs.
You might be better off using Azure Data Factory to move data between two Azure SQL Data Warehouses. It would make light work of moving the data, but beware of any data movement costs, particularly when moving across regions. Start here. Check the 'Use PolyBase' checkbox.
If you do just want to use PolyBase and Blob Storage, then you would have to (a sketch follows below):
First, export the data from the source system's internal tables to blob store using CETAS.
In the target system, create external tables over the files in blob store.
In the target system, import the data from the external tables into the database using CTAS.
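A rough sketch of those steps; every object name here is a placeholder, and the external data source and file format are assumed to already exist on both systems:

-- Source SQL DW: export the internal table to blob storage with CETAS
CREATE EXTERNAL TABLE dbo.SalesExport
WITH (
    LOCATION = '/export/sales/',
    DATA_SOURCE = ExportBlobSource,
    FILE_FORMAT = DelimitedTextFormat
)
AS
SELECT * FROM dbo.Sales;

-- Target SQL DW: external table over the exported files, then import with CTAS
CREATE EXTERNAL TABLE dbo.SalesImport (
    CustomerId int,
    OrderDate  date,
    Amount     decimal(18, 2)
)
WITH (
    LOCATION = '/export/sales/',
    DATA_SOURCE = ExportBlobSource,
    FILE_FORMAT = DelimitedTextFormat
);

CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.SalesImport;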
As far as I know, you have to use PolyBase and either Blob Storage or Data Lake Store to get the maximum throughput (bypass the control node)
You can create a new SQL DW from a geo-backup which should be a complete copy of the SQL DW with a 24 hour SLA. First click on create new SQL DW and select backup as an option as opposed to blank or sample.

Azure Data Factory: Moving data from Table Storage to SQL Azure

While moving data from Table Storage to SQL Azure, is it possible to obtain only the Delta (The data that hasn't been already moved) using Azure Data Factory?
A more detailed explanation:
There is an Azure Storage Table, which contains some data, which will be updated periodically. And I want to create a Data Factory pipeline which moves this data to an SQL Azure Database. But during each move I only want the newly added data to be written to SQL DB. Is it possible with Azure Data Factory?
See more information on azureTableSourceQuery and the copy activity at this link: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-table-connector/#azure-table-copy-activity-type-properties.
Also see this link for invoking a stored procedure for the SQL sink: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-sql-connector/#invoking-stored-procedure-for-sql-sink
You can filter on the Timestamp in the source query on each run to achieve something similar to a delta copy, but this is not a true delta copy.
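If you do filter on Timestamp and rows may repeat across runs, a stored-procedure sink lets you de-duplicate on the SQL side. A minimal sketch with hypothetical table, type, and column names:

-- Table type ADF uses to pass the copied rows into the procedure
CREATE TYPE dbo.EntityRowType AS TABLE (
    PartitionKey    nvarchar(128),
    RowKey          nvarchar(128),
    Payload         nvarchar(max),
    EntityTimestamp datetime2
);
GO
-- Upsert so rows already moved in a previous run are updated, not duplicated
CREATE PROCEDURE dbo.UpsertEntities
    @rows dbo.EntityRowType READONLY
AS
BEGIN
    MERGE dbo.Entities AS target
    USING @rows AS source
        ON  target.PartitionKey = source.PartitionKey
        AND target.RowKey = source.RowKey
    WHEN MATCHED THEN
        UPDATE SET Payload = source.Payload,
                   EntityTimestamp = source.EntityTimestamp
    WHEN NOT MATCHED THEN
        INSERT (PartitionKey, RowKey, Payload, EntityTimestamp)
        VALUES (source.PartitionKey, source.RowKey, source.Payload, source.EntityTimestamp);
END;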
