data factory copy data without explicitly creating a target table - azure

I am working on copying data from a source Oracle database to a target SQL Data Warehouse using Data Factory.
When using the copy function in Data Factory, we are asked to specify the destination location and a table to copy the data to. There are multiple tables that need to be copied, so creating a table for each one in the destination is time-consuming.
How can I set up Data Factory to copy data from the source to a destination so that it automatically creates the table at the destination, without me having to create each one manually?
TIA

I came across the same issue last year. I used pipeline().parameters for dynamic naming, plus a Data Factory stored procedure activity placed before the copy activity to create the empty table from a template first: https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-stored-procedure.
CREATE PROCEDURE create_sql_table_proc @WindowStartYear NVARCHAR(30), @WindowStartMonth NVARCHAR(30), @WindowStartDay NVARCHAR(30)
AS
BEGIN
declare @strsqlcreatetable as [NVARCHAR](255)
declare @strsqldroptable as [NVARCHAR](255)
declare @tablename as [NVARCHAR](255)
declare @strsqlsetpk as [NVARCHAR](255)
select @tablename = 'TABLE_NAME_' + @WindowStartYear + @WindowStartMonth + @WindowStartDay
select @strsqldroptable = 'DROP TABLE IF EXISTS ' + @tablename
select @strsqlcreatetable = 'SELECT * INTO ' + @tablename + ' FROM OUTPUT_TEMPLATE'
select @strsqlsetpk = 'ALTER TABLE ' + @tablename + ' ADD PRIMARY KEY (CustID)'
exec (@strsqldroptable)
exec (@strsqlcreatetable)
exec (@strsqlsetpk)
END
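For clarity, here is a minimal sketch (not part of the original answer) of how the procedure above would be invoked. In the Data Factory stored procedure activity the three values would come from pipeline parameters; the literal date parts below are only illustrative:
-- Hypothetical manual invocation of the procedure above
EXEC create_sql_table_proc
    @WindowStartYear = N'2019',
    @WindowStartMonth = N'07',
    @WindowStartDay = N'01'
-- Result: TABLE_NAME_20190701 is dropped if it exists, recreated from OUTPUT_TEMPLATE,
-- and given a primary key on CustID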
Since then, we have started pushing the tables to SQL from our PySpark scripts running on a cluster, where it is not necessary to create the empty table first: https://medium.com/@radek.strnad/tips-for-using-jdbc-in-apache-spark-sql-396ea7b2e3d3.

Related

Incremental load Azure Data Factory

I'm trying to do an incremental load using ADF, but I got the error message below. How do I solve it, and how do I do the incremental load the right way?
Note: the table name variable is defined through the stored procedure.
[Error message screenshot from ADF]
Stored Procedure:
ALTER PROCEDURE [INT].[usp_write_watermark]
@LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE [log].watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
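For context, here is a minimal sketch of the watermark table this procedure assumes, following the standard ADF incremental-copy (watermark) pattern; the column types are inferred from the procedure's parameters and are an assumption, not taken from the question:
-- Hypothetical watermark table matching the UPDATE in the procedure above
CREATE TABLE [log].[watermarktable]
(
    TableName varchar(50) NOT NULL,
    WatermarkValue datetime NOT NULL
)
-- Typical lookup the pipeline runs before the copy activity to fetch the last watermark
SELECT WatermarkValue FROM [log].[watermarktable] WHERE TableName = 'MyTable'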

Execute stored procedure in Azure Data Platform - Post SQL Scripts

Based on the documentation below,
https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database
There is a feature to run a post SQL script. Would it be possible to run a stored procedure from there?
I have tried it, but it does not seem to be working, and I am currently investigating.
Thanks in advance for your information.
I created a test to prove that the stored procedure can be called in the Post SQL scripts.
I created two tables:
CREATE TABLE [dbo].[emp](
id int IDENTITY(1,1),
[name] [nvarchar](max) NULL,
[age] [nvarchar](max) NULL
)
CREATE TABLE [dbo].[emp_stage](
id int,
[name] [nvarchar](max) NULL,
[age] [nvarchar](max) NULL
)
I created a stored procedure.
create PROCEDURE [dbo].[spMergeEmpData]
AS
BEGIN
SET IDENTITY_INSERT dbo.emp ON
MERGE [dbo].[emp] AS target
USING [dbo].[emp_stage] AS source
ON (target.[id] = source.[id])
WHEN MATCHED THEN
UPDATE SET name = source.name,
age = source.age
WHEN NOT matched THEN
INSERT (id, name, age)
VALUES (source.id, source.name, source.age);
TRUNCATE TABLE [dbo].[emp_stage]
END
I will copy the CSV file into my Azure SQL staging table [dbo].[emp_stage], then use the stored procedure [dbo].[spMergeEmpData] to transfer data from [dbo].[emp_stage] to [dbo].[emp].
Enter the stored procedure name exec [dbo].[spMergeEmpData] in the Post SQL scripts field.
The debug run succeeded.
I can see that all the data is in the table [dbo].[emp].

Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 returns zero rows

Why does an Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 return many rows for a single file source, but zero rows for the parent folder source?
I created:
Master Key (CREATE MASTER KEY;)
Credential (CREATE DATABASE SCOPED CREDENTIAL) - uses the ADLS Gen 2 account key
External data source (CREATE EXTERNAL DATA SOURCE)
File format (CREATE EXTERNAL FILE FORMAT)
External table (CREATE EXTERNAL TABLE)
Everything works fine when my external table points to a specific file, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/time001.txt',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Many rows returned, therefore I am confident all items mentioned above are configured correctly.
The Time folder in Azure Data Lake Gen 2 contains many files, not just time001.txt. When I change my external table to point at a folder, and not an individual file, the query returns zero rows, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Zero rows returned
I tried:
LOCATION = '/Time/',
LOCATION = '/Time',
LOCATION = 'Time/',
LOCATION = 'Time',
But always zero rows. I also followed the instructions at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store
I tested all the files within the folder and individually each returns many rows of data.
I also queried the same files from Blob storage instead of ADLS Gen2, and the "folder" query returned all rows as expected.
How do I query all files in a folder "as one" from Azure Data Lake Gen2 storage using Azure SQL Data Warehouse and Polybase?
I was facing exactly the same issue: the problem was the storage endpoint in the data source LOCATION.
Script with error:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://container@storage.blob.core.windows.net',
CREDENTIAL = credential_name
);
Script that solves the issue:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://container@storage.dfs.core.windows.net',
CREDENTIAL = credential_name
);
The only change needed was the LOCATION.
Thanks to the Microsoft Support Team for helping me on this.

update and insert into Azure data warehouse using Azure data factory pipelines

I'm trying to run an ADF copy pipeline with update and insert statements that are supposed to replace a MERGE statement, basically a statement like:
UPDATE TARGET
SET ProductName = SOURCE.ProductName,
TARGET.Rate = SOURCE.Rate
FROM Products AS TARGET
INNER JOIN UpdatedProducts AS SOURCE
ON TARGET.ProductID = SOURCE.ProductID
WHERE TARGET.ProductName <> SOURCE.ProductName
OR TARGET.Rate <> SOURCE.Rate
INSERT Products (ProductID, ProductName, Rate)
SELECT SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate
FROM UpdatedProducts AS SOURCE
WHERE NOT EXISTS
(
SELECT 1
FROM Products
WHERE ProductID = SOURCE.ProductID
)
If the target were an Azure SQL DB, I would do it this way: https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory
But if the target is an ADW, the stored procedure option doesn't exist. Any suggestion? Do I have to have a staging table first and then run the update and insert statements from stg_table to target_table, or is there any possibility to do it directly from ADF?
If you can't use a stored procedure, my suggestion would be to create a second copy data transform. Run the pre-copy script on the second transform and drop the table at the end, since it's a temp table that you created in the first.
BEGIN
MERGE Target AS target_sqldb
USING TempTable AS source_tblstg
ON (target_sqldb.Id= source_tblstg.Id)
WHEN MATCHED THEN
UPDATE SET
[Name] = source_tblstg.Name,
[State] = source_tblstg.State
WHEN NOT MATCHED THEN
INSERT([Name], [State])
VALUES (source_tblstg.Name, source_tblstg.State);
DROP TABLE TempTable;
END
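For completeness, a rough sketch of what the pre-copy script on the first copy activity could look like to create that temp/staging table; the column names mirror the MERGE above, while the distribution and heap options are assumptions for a dedicated SQL pool (ADW) target:
-- Hypothetical pre-copy script for the FIRST copy activity: (re)create the staging table
IF OBJECT_ID('dbo.TempTable') IS NOT NULL
    DROP TABLE dbo.TempTable;
CREATE TABLE dbo.TempTable
(
    Id int NOT NULL,
    [Name] nvarchar(100) NULL,
    [State] nvarchar(100) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)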

Extract data from Excel sheet to multiple SQL Server tables

I have an Excel file with a few columns (20) and some data that I need to upload into 4 SQL Server tables. The tables are related, and specific columns represent my ID for each table.
Is there an ETL tool that I can use to automate this process?
This query uses BULK INSERT to load the file into a #temptable
and then inserts the contents of this temp table into the table you want in the database. However, the file being imported has to be a .csv, so save your Excel file as CSV before doing this.
CREATE TABLE #temptable (col1 nvarchar(255), col2 nvarchar(255), col3 nvarchar(255)) -- adjust column types to match your file
BULK INSERT #temptable from 'C:\yourfilelocation\yourfile.csv'
WITH
(
FIRSTROW = 2,
fieldterminator = ',',
rowterminator = '0x0A'
)
INSERT INTO yourTableInDataBase (col1,col2,col3)
SELECT col1, col2, col3
FROM #temptable
To automate this, you can put this inside a stored procedure and call the stored procedure from a batch file. Edit the code below, put it inside a text file, and save it as a .cmd file:
set MYDB=yourDBname
set MYUSER=youruser
set MYPASSWORD=yourpassword
set MYSERVER=yourservername
sqlcmd -S %MYSERVER% -d %MYDB% -U %MYUSER% -P %MYPASSWORD% -h -1 -s "," -W -Q "exec yourstoredprocedure"
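As a sketch of the stored procedure mentioned above (the procedure name and column types are placeholders, not from the original answer):
-- Hypothetical wrapper procedure the .cmd file would call via sqlcmd,
-- e.g. -Q "exec dbo.usp_LoadCsvFile"
CREATE PROCEDURE dbo.usp_LoadCsvFile
AS
BEGIN
    CREATE TABLE #temptable (col1 nvarchar(255), col2 nvarchar(255), col3 nvarchar(255))
    BULK INSERT #temptable FROM 'C:\yourfilelocation\yourfile.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0A')
    INSERT INTO yourTableInDataBase (col1, col2, col3)
    SELECT col1, col2, col3
    FROM #temptable
END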
