I know you can use PolyBase with external tables to load large volumes of data from Blob Storage into Azure SQL DW. But is there any way to import data from one SQL DW into another SQL DW using PolyBase directly? Or is there some other way? There must be some way to avoid the control node in both SQL DWs.
You might be better off using Azure Data Factory to move data between two Azure SQL Data Warehouses. It would make light work of moving the data, but beware of data movement costs, particularly when moving across regions. Start here, and check the 'Use PolyBase' checkbox.
If you do just want to use Polybase and Blob Storage, then you would have to:
first export the data from the source system's internal tables to blob storage using CETAS;
in the target system, create external tables over the files in blob storage;
in the target system, import the data from the external tables into the database using CTAS.
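The three steps above can be sketched in T-SQL roughly as follows. The data source, file format, schema, and table names here are placeholders for illustration, and both warehouses are assumed to already have an external data source and file format defined:

    -- Step 1 (source DW): export an internal table to blob storage with CETAS.
    -- ExportDataSource and ParquetFormat are assumed to already exist.
    CREATE EXTERNAL TABLE ext.FactSales_Export
    WITH (
        LOCATION = '/export/factsales/',
        DATA_SOURCE = ExportDataSource,
        FILE_FORMAT = ParquetFormat
    )
    AS SELECT * FROM dbo.FactSales;

    -- Step 2 (target DW): create an external table over the exported files.
    CREATE EXTERNAL TABLE ext.FactSales_Import (
        SaleId INT,
        Amount DECIMAL(18, 2)
        -- ...remaining columns, matching the exported schema
    )
    WITH (
        LOCATION = '/export/factsales/',
        DATA_SOURCE = ImportDataSource,
        FILE_FORMAT = ParquetFormat
    );

    -- Step 3 (target DW): import into an internal table with CTAS.
    CREATE TABLE dbo.FactSales
    WITH (DISTRIBUTION = HASH(SaleId))
    AS SELECT * FROM ext.FactSales_Import;

Because both CETAS and CTAS run on the compute nodes, the data never funnels through either control node.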
As far as I know, you have to use PolyBase and either Blob Storage or Data Lake Store to get the maximum throughput (bypassing the control node).
You can create a new SQL DW from a geo-backup, which should be a complete copy of the SQL DW with a 24-hour SLA. Click to create a new SQL DW and select 'backup' as the source option, as opposed to 'blank' or 'sample'.
I am aware that in ADF a Copy activity can be used to load data from ADLS to Azure SQL DB.
Is there any possibility of bulk loading?
For example, ADLS --> Synapse has the option of PolyBase for bulk loading.
Is there an efficient way to load a huge number of records from ADLS to Azure SQL DB?
Thanks
Madhan
You can use either BULK INSERT or OPENROWSET to get data from blob storage into Azure SQL Database. A simple example with OPENROWSET:
    SELECT *
    FROM OPENROWSET (
        BULK 'someFolder/somecsv.csv',
        DATA_SOURCE = 'yourDataSource',
        FORMAT = 'CSV',
        FORMATFILE = 'yourFormatFile.fmt',
        FORMATFILE_DATA_SOURCE = 'MyAzureInvoices'
    ) AS yourFile;
A simple example with BULK INSERT:
    BULK INSERT yourTable
    FROM 'someFolder/somecsv.csv'
    WITH (
        DATA_SOURCE = 'yourDataSource',
        FORMAT = 'CSV'
    );
There is some setup to be done first, i.e. you have to use the CREATE EXTERNAL DATA SOURCE statement, but I find it a very effective way of getting data into Azure SQL DB without the overhead of setting up an ADF pipeline. It's especially good for ad hoc loads.
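As a rough sketch of that setup, assuming a SAS-based database scoped credential (the storage account, container, and token values are placeholders):

    -- A database master key may be needed first:
    -- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    -- Credential holding a SAS token (without the leading '?').
    CREATE DATABASE SCOPED CREDENTIAL MyBlobCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = 'sv=2019-12-12&ss=b...';

    -- External data source referenced by BULK INSERT / OPENROWSET.
    CREATE EXTERNAL DATA SOURCE yourDataSource
    WITH (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://yourstorageaccount.blob.core.windows.net/yourcontainer',
        CREDENTIAL = MyBlobCredential
    );

Once this exists, the OPENROWSET and BULK INSERT examples above can reference it by name via DATA_SOURCE.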
This article walks through the steps in more detail:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15
Data Factory has good performance for transferring big data; see Copy performance and scalability achievable using ADF. You can follow that document to improve copy performance for a huge number of records in ADLS. I think it may be better than BULK INSERT.
We cannot use BULK INSERT (Transact-SQL) directly in Data Factory, but we can use bulk copy from ADLS to an Azure SQL database. Data Factory provides a tutorial and example.
Ref here: Bulk copy from files to database:
This article describes a solution template that you can use to copy data in bulk from Azure Data Lake Storage Gen2 to Azure Synapse Analytics / Azure SQL Database.
Hope it's helpful.
We have data warehouse tables that we transform using ADF.
If I have a group of ADW tables and I need to perform transformations on them to land them back onto ADW, should I save the intermediate results into Azure Blob Storage, or go direct into the target table?
The ADW tables are in excess of 100 million records.
Is it an acceptable practice to use Blob Storage as the middle piece?
I can think of two ways to do this (neither requires moving the data into blob storage):
Do the transformation within SQL DW using a stored procedure, and use ADF to orchestrate the stored procedure call.
Use ADF's Data Flow to read from SQL DW, apply the transformation, and write back to SQL DW.
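The first option might look like the sketch below, with ADF's Stored Procedure activity calling the procedure. The table, column, and procedure names are invented for illustration; the pattern is CTAS into a staging table and then a swap via RENAME OBJECT, which keeps the transformation fully inside SQL DW:

    CREATE PROCEDURE dbo.TransformSales
    AS
    BEGIN
        -- Build the transformed result with CTAS (runs distributed, minimally logged).
        IF OBJECT_ID('dbo.SalesAgg_New') IS NOT NULL
            DROP TABLE dbo.SalesAgg_New;

        CREATE TABLE dbo.SalesAgg_New
        WITH (DISTRIBUTION = HASH(CustomerId))
        AS
        SELECT CustomerId, SUM(Amount) AS TotalAmount
        FROM dbo.FactSales
        GROUP BY CustomerId;

        -- Swap the new table in place of the old one.
        IF OBJECT_ID('dbo.SalesAgg') IS NOT NULL
            DROP TABLE dbo.SalesAgg;
        RENAME OBJECT dbo.SalesAgg_New TO SalesAgg;
    END;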
Yes, you'd be better off using Blob Storage as the middle piece.
You cannot copy tables from a SQL DW (source) to the same SQL DW (sink) directly. If you try, you will hit these problems:
Copy Data tool: errors in data mapping; it copies data into the same table rather than creating new tables.
Copy activity: a sink table is required for the Copy activity.
If you want to copy the data from SQL DW tables to new tables with Data Factory, you need at least two steps:
copy the data from the SQL DW tables to Blob storage (creating the csv files);
load these csv files back into SQL DW and create the new tables.
Reference tutorials:
Copy and transform data in Azure Synapse Analytics (formerly Azure SQL Data Warehouse) by using Azure Data Factory
Copy and transform data in Azure Blob storage by using Azure Data Factory
Data Factory is good at transferring big data; see Copy performance of Data Factory. I think it may be faster than a SELECT...INTO clause (Transact-SQL).
Hope this helps.
I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last only as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action. In the next step of your Data Factory job, you can then read the dumped data from the storage account and process it further.
Another option may be Databricks Delta, although I have not tried this yet...
If you register the table in the Databricks Hive metastore, then ADF can read from it using the ODBC source in ADF, though this would require an integration runtime (IR).
Alternatively, you could write the table to external storage such as blob or the lake. ADF can then read that file and push it to your SQL database.
I'm in a position where we're reading from our Azure Data Lake using external tables in Azure Data Warehouse.
This enables us to read from the data lake, using well known SQL.
However, another option is using Data Lake Analytics, or some variation of HDInsight.
Performance wise, I'm not seeing much difference. I assume Data Warehouse is running some form of distributed query in the background, converting to U-SQL(?), and so why would we use Data Lake Analytics with the slightly different syntax of U-SQL?
With Python scripts also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than the cost (pay per batch job, rather than the constant uptime of a database).
If your main purpose is to query data stored in the Azure SQL Data Warehouse (ADW), then there is no real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, like JSON documents or CSV files for example, the benefit of ADLA becomes clear: U-SQL allows you to join your relational data stored in ADW with the (un)structured / NoSQL data stored in ADLS.
Also, it enables you to use U-SQL to prepare this other data for direct import into ADW, so Azure Data Factory is no longer required to get the data into your data warehouse. See this blog post for more information:
A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a loading ready format. From there, the high value data can be imported into Azure SQL DW via PolyBase.
..
You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.
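That CTAS-over-external-table pattern might look like the following for delimited text files in the lake. The file format options, schema, and the AdlsDataSource name are assumptions for illustration:

    -- Describe the delimited text files.
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
    );

    -- External table over the prepared files (AdlsDataSource assumed to exist).
    CREATE EXTERNAL TABLE ext.RawEvents (
        EventId INT,
        EventTime DATETIME2,
        Payload NVARCHAR(4000)
    )
    WITH (
        LOCATION = '/curated/events/',
        DATA_SOURCE = AdlsDataSource,
        FILE_FORMAT = CsvFormat
    );

    -- Import into an internal SQL DW table.
    CREATE TABLE dbo.Events
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS SELECT * FROM ext.RawEvents;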
Please note that SQL statements in SQL Data Warehouse are currently NOT generating U-SQL behind the scenes. The use cases for ADLA/U-SQL and SQL DW are also different.
ADLA gives you a processing engine for batch data preparation/cooking, generating the data to build a data mart/warehouse that you can then read interactively with SQL DW. In your example above, you seem to be mainly doing the second part. Adding "views" on top of these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you are operating on big data (and not just a few 100k rows).
I have an on-prem data warehouse using SQL Server. What is the best way to load the data into SQL Data Warehouse?
The process of loading data depends on the amount of data. For very small data sets (<100 GB) you can simply use the bulk copy command line utility (bcp.exe) to export the data from SQL Server and then import it into Azure SQL Data Warehouse. For data sets greater than 100 GB, you can export your data using bcp.exe, move the data to Azure Blob Storage using a tool like AzCopy, create an external table (via T-SQL) and then pull the data in via a Create Table As Select (CTAS) statement.
Using the PolyBase/CTAS route will allow you to take advantage of multiple compute nodes and the parallel nature of data processing in Azure SQL Data Warehouse - an MPP based system. This will greatly improve the data ingestion performance as each compute node is able to process a block of data in parallel with the other nodes.
One consideration as well is to increase the amount of DWU (compute resources) available in SQL Data Warehouse at the time of the CTAS statement. This will increase the number of compute resources adding additional parallelism which will decrease the total ingestion time.
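That scale-up/scale-down step can be scripted around the load. The service objective names and database name below are examples only; the statements run from the logical server's master database:

    -- Scale the warehouse up before the large CTAS load...
    ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

    -- ...run the CTAS ingestion here...

    -- ...then scale back down to control cost.
    ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW200c');

Note that scaling is an asynchronous operation, so a pipeline should poll until the database state is back to ONLINE before starting the load.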
The SQL Database Migration Wizard is a helpful tool for migrating schema and data from an on-premises database to Azure SQL databases.
http://sqlazuremw.codeplex.com/