I am a newbie to Azure Data Factory. I'm trying to load multiple files for various states from an FTP location into a single Azure SQL Server table. My requirement is to get the state name from the file name and load it into the table along with the actual data.
Currently, my source is FTP and my sink is an Azure SQL Server table. I have used a stored procedure to load the data. However, I'm unable to pass the file name as a parameter to the stored procedure so that I can write it into the table. Below is the Copy Data component -
I have defined a SourceFileName parameter in the stored procedure; however, I am unable to send it via the Copy Data activity.
Any help is appreciated.
We can conclude that the Additional columns option cannot be used here, because ADF will return a column (containing the file path), not a string. So we need to use a Get Metadata activity to get the file list, then loop over the file list with a ForEach activity and copy the files inside it.
I've created a simple test, it works well.
On my local FTP server, there are two text files. I need to copy them into an Azure SQL table.
In the Get Metadata activity, I use Child Items to get the file list.
In the ForEach activity, I use @activity('Get Metadata1').output.childItems to loop over the file list.
Inside the ForEach activity, I use the dynamic content @item().name to get the file name.
source setting:
sink setting:
So we can get the file name. The following are some operations I did on Azure SQL.
-- create a table
CREATE TABLE [dbo].[employee](
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL,
[filePath] [varchar](50) NULL
) ON [PRIMARY]
GO
-- create a table type
CREATE TYPE [dbo].[ct_employees_type] AS TABLE(
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL
)
GO
-- create a Stored procedure
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[spUpsertEmployees]
    @employees ct_employees_type READONLY,
    @filePath varchar(50)
AS
BEGIN
    SET @filePath = SUBSTRING(@filePath, 1, LEN(@filePath) - 4)
    MERGE [dbo].[employee] AS target_sqldb
    USING @employees AS source_tblstg
    ON (target_sqldb.firstName = source_tblstg.firstName)
    WHEN MATCHED THEN
        UPDATE SET
            firstName = source_tblstg.firstName,
            lastName = source_tblstg.lastName
    WHEN NOT MATCHED THEN
        INSERT (
            firstName,
            lastName,
            filePath
        )
        VALUES (
            source_tblstg.firstName,
            source_tblstg.lastName,
            @filePath
        );
END
GO
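If you want to verify the procedure on its own before wiring it into the Copy sink, here is a minimal T-SQL sketch (the sample rows and file name are hypothetical):
-- Declare a variable of the table type, load a couple of test rows, and call the procedure
DECLARE @emps [dbo].[ct_employees_type];
INSERT INTO @emps (firstName, lastName)
VALUES ('John', 'Doe'), ('Jane', 'Smith');
EXEC [dbo].[spUpsertEmployees] @employees = @emps, @filePath = 'NewYork.txt';
-- filePath should now hold 'NewYork' (the SUBSTRING strips the 4-character extension)
SELECT firstName, lastName, filePath FROM [dbo].[employee];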
After I run debug, the result is as follows:
Using Azure Synapse, dedicated SQL pool.
How can I structure my tables underneath a button that represents the schema?
This is a small issue that really makes a big impact when many tables and schemas will be used in the database, and users will need to navigate to the correct schema quickly.
I tried dragging the schema over the tables section, but nothing worked.
To structure the tables, first we need to create a schema in the SQL dedicated pool using the below command:
CREATE SCHEMA <schemaName>
Then we need to create the table using the schema created above, with the required columns and suitable data types for each column.
Table creation:
CREATE TABLE <schemaName>.<tableName> (col1 dataType, col2 dataType)
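For example, a minimal sketch with hypothetical names (the distribution and index options are just common defaults you can adjust):
CREATE SCHEMA sales;
GO
CREATE TABLE sales.orders
(
    OrderId INT NOT NULL,
    OrderDate DATE NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);
The table is then created under the sales schema rather than dbo.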
I created an external table in the dedicated SQL pool in Synapse following the below steps:
Schema creation:
Created external data source:
CREATE EXTERNAL DATA SOURCE [DATASOURCE] WITH
(
LOCATION = '<location>'
)
Created external file format:
CREATE EXTERNAL FILE FORMAT [FileFormat1] WITH
(
FORMAT_TYPE = DELIMITEDTEXT
)
Image for reference:
I created an external table using the above data source and file format with the below code:
CREATE EXTERNAL TABLE [wwi].[information2]
(
[Id] INT
)
WITH
(
LOCATION = '<folder/file>',
DATA_SOURCE = [DATASOURCE],
FILE_FORMAT = [FileFormat1]
)
In this way, we can structure tables under their schemas in the Synapse dedicated SQL pool.
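Once created, the external table can be queried like any other table in the dedicated pool, for example:
SELECT TOP 10 [Id]
FROM [wwi].[information2];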
I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command (see here). The files are located on a Data Lake gen 2 connected to the Azure Synapse workspace via a managed identity.
This is working fine when I am only querying a few files at a time; however, when I increase the number of files I am trying to query simultaneously, I get the following error:
Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)
Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view, I can download and view the file. Also, if I run the query only on a file mentioned previously in an error, it works fine.
The code I am using to query the data lake is below:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
Here the data source analytics is specified as follows:
CREATE EXTERNAL DATA SOURCE analytics
WITH
(
location = 'https://<url>.dfs.core.windows.net/analytics'
)
I have tried specifying a large number for the MAXERRORS parameter of OPENROWSET, as I don't mind if a few files are missed when executing this query. However, this only seems to apply to row-level errors, and these errors are at the file level.
The query is running here on the built-in serverless pool.
Any ideas on how to get around this issue would be appreciated.
Your code is doing pass through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage you will need a database scoped credential and will need to reference it in the external data source as in this example.
-- Optional: Create MASTER KEY if not exists in database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>'
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH ( LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
CREDENTIAL = SynapseIdentity
)
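As a quick check (the path and CSV options below are placeholders for your own data), you can then query through the credential-backed data source from serverless, for example:
SELECT TOP 10 *
FROM OPENROWSET
(
    BULK '<folder>/*.csv',       -- path relative to the data source location
    DATA_SOURCE = 'mysample',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS rows;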
Also see the sections in that article about the firewall on your storage account if you have that locked down.
Just adding a quick answer to this. After making the changes which Greg suggested, I was able to query a little more data, but was still getting hit with error code 12.
I spoke with Azure support and they advised me that the error code was actually 412 (but it was not visible to me), which implied the file was in use or being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
or for external tables:
TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
This made my final query:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics_master_key',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
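For the external table variant, here is a hedged sketch of where TABLE_OPTIONS goes (the table name, location, and file format are placeholders; the data source is the credential-backed one from above):
CREATE EXTERNAL TABLE [dbo].[analytics_logs]
(
    doc NVARCHAR(MAX)
)
WITH
(
    LOCATION = '<folder or file pattern>',
    DATA_SOURCE = analytics_master_key,
    FILE_FORMAT = <delimited_text_file_format>,
    TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);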
I'm trying to do a simple incremental update from an on-prem database (source) to an Azure SQL database, based on a varchar column called "RP" in the on-prem database that contains "date + static description", for example: "20210314MetroFactory".
1- I've created a Lookup activity called Lookup1 that uses a table created in the Azure SQL Database and this query:
"Select RP from SubsetwatermarkTable"
2- I've created a Copy data activity where the source settings use this query:
"Select * from SourceDevSubsetTable WHERE RP NOT IN '@{activity('Lookup1').output.value}'"
When debugging -- I'm getting the error:
Failure type: User configuration issue
Details: Failure happened on 'Source' side.
'Type=System.Data.SqlClient.SqlException,Message=Incorrect syntax near
'[{"RP":"20210307_1Plant
1KAO"},{"RP":"20210314MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"2'.,Source=.Net
SqlClient Data
Provider,SqlErrorNumber=102,Class=15,ErrorCode=-2146232060,State=1,Errors=[{Class=15,Number=102,State=1,Message=Incorrect
syntax near
'[{"RP":"20210311MetroFactory"},{"RP":"20210311MetroFactory"},{"RP":"202103140MetroFactory"},{"RP":"20210308MetroFactory"},{"RP":"2'.,},],'
Can anyone tell me what I am doing wrong and how to fix it, even if it requires creating more activities?
Note: There is no LastModifiedDate column in the table. Also I haven't yet created the StoredProcedure that will update the Lookup table when it is done with the incremental copy.
Steve is right as to why it is failing and about the query you need in the Copy Data.
As he says, you want a comma-separated list of quoted values to use in your IN clause.
You can get this more easily, though, directly from your Lookup using this query:
select stuff(
(
select ','''+rp+''''
from subsetwatermarktable
for xml path('')
)
, 1, 1, ''
) as in_clause
The sub-query gets the comma separated list with quotes around each rp-value, but has a spurious comma at the start - the outer query with stuff removes this.
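A self-contained illustration of what this produces, using a table variable with hypothetical RP values in place of subsetwatermarktable:
DECLARE @subsetwatermarktable TABLE (rp VARCHAR(50));
INSERT INTO @subsetwatermarktable (rp)
VALUES ('20210314MetroFactory'), ('20210312MetroFactory');

SELECT STUFF(
    (
        SELECT ',''' + rp + ''''
        FROM @subsetwatermarktable
        FOR XML PATH('')
    ), 1, 1, ''
) AS in_clause;
-- Returns: '20210314MetroFactory','20210312MetroFactory'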
Now tick the First Row Only box on the Lookup and change your Copy Data source query to:
select *
from SourceDevSubsetTable
where rp not in (@{activity('Lookup1').output.firstRow.in_clause})
The result of @activity('Lookup1').output.value is an array, as your error shows:
[{"RP":"20210307_1Plant 1KAO"},{"RP":"20210314MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"20210312MetroFactory"},...]
However, your SQL should look like this: Select * from SourceDevSubsetTable WHERE RP NOT IN ('20210307_1Plant 1KAO','20210314MetroFactory',...).
To achieve this in ADF, you need to do something like this:
1. Create three variables like the following screenshot:
2. Loop over the result of @activity('Lookup1').output.value with a ForEach activity and append each quoted RP value to arrayvalues:
expression: @activity('Lookup1').output.value
expression: @concat(variables('apostrophe'),item().RP,variables('apostrophe'))
3. Cast arrayvalues to a string and add parentheses with a Set variable activity:
expression: @concat('(',join(variables('arrayvalues'),','),')')
4. Copy to your Azure SQL database:
expression: Select * from SourceDevSubsetTable WHERE RP NOT IN @{variables('stringvalues')}
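To make the flow concrete, a worked trace with two RP values taken from the error message:
arrayvalues after the ForEach appends: '20210311MetroFactory' and '20210308MetroFactory'
stringvalues after the Set variable: ('20210311MetroFactory','20210308MetroFactory')
resolved source query: Select * from SourceDevSubsetTable WHERE RP NOT IN ('20210311MetroFactory','20210308MetroFactory')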
I am trying to execute this script, but user-defined table types (CREATE TYPE) are not supported in Azure SQL Data Warehouse, and I want to use them in stored procedures.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO
CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Is there any alternative way to do this?
You've got a few challenges there, because most of what you're trying to convert is not the way to do things on ASDW.
First, as you point out, CREATE TYPE is not supported, and there is no equivalent alternative.
Next, the code appears to be doing single inserts to a table. That's really bad on ASDW; performance will be dreadful.
Next, there's no MERGE statement (yet) for ASDW. That's because UPDATE is not the best way to handle changing data.
And last, stored procedures work a little differently on ASDW: they're not compiled, but interpreted each time the procedure is called. Stored procedures are great for big chunks of table-level logic, but not recommended for high-volume calls with single-row operations.
I'd need to know more about the use case to make specific recommendations, but in general you need to think in tables rather than rows. In particular, focus on the CREATE TABLE AS (CTAS) way of handling your ELT.
Here's a good link, it shows how the equivalent of a Merge/Upsert can be handled using a CTAS:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-ctas#replace-merge-statements
As you'll see, it processes two tables at a time, rather than one row. This means you'll need to review the logic that called your stored procedure example.
If you get your head around doing everything in CTAS, and separately around Distribution, you're well on your way to having a high performance data warehouse.
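To illustrate the pattern from that link, here is a hedged sketch of a CTAS-based upsert for the customer table from the question (the staging table name customer_stage is hypothetical, standing in for wherever your incoming rows land):
-- Build the new version of the table: staged rows win over existing rows with the same key
CREATE TABLE dbo.customer_table_upsert
WITH ( DISTRIBUTION = HASH(PersonID) )
AS
SELECT s.PersonID, s.Name, s.LastModifytime
FROM dbo.customer_stage AS s
UNION ALL
SELECT t.PersonID, t.Name, t.LastModifytime
FROM dbo.customer_table AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.customer_stage AS s2 WHERE s2.PersonID = t.PersonID);

-- Swap the new table in
RENAME OBJECT dbo.customer_table TO customer_table_old;
RENAME OBJECT dbo.customer_table_upsert TO customer_table;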
Temp tables in Azure SQL Data Warehouse have a slightly different behaviour to box product SQL Server or Azure SQL Database - they exist at the session level. So all you have to do is convert your CREATE TYPE statements to temp tables and split the MERGE out into separate INSERT / UPDATE / DELETE statements as required.
Example:
CREATE TABLE #DataTypeforCustomerTable (
PersonID INT,
Name VARCHAR(255),
LastModifytime DATETIME
)
WITH
(
DISTRIBUTION = HASH( PersonID ),
HEAP
)
GO
CREATE PROCEDURE usp_upsert_customer_table
AS
BEGIN
-- Add records which do not already exist
INSERT INTO customer_table ( PersonID, Name, LastModifytime )
SELECT PersonID, Name, LastModifytime
FROM #DataTypeforCustomerTable AS source
WHERE NOT EXISTS
(
SELECT *
FROM customer_table target
WHERE source.PersonID = target.PersonID
)
...
Simply load the temp table and execute the stored proc. See here for more details on temp table scope.
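For example, within a single session (hypothetical sample row; the temp table is the one defined above):
-- Load the session-scoped temp table, then call the procedure in the same session
INSERT INTO #DataTypeforCustomerTable (PersonID, Name, LastModifytime)
VALUES (1, 'John Doe', '2021-03-14');

EXEC usp_upsert_customer_table;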
If you are altering a large portion of the table then you should consider the CTAS approach to create a new table, then rename it as suggested by Ron.
I have an Azure Data Factory (ADF) pipeline that consists of a Copy activity. The Copy activity uses the HTTP connector as source to invoke a REST endpoint and returns a CSV stream that is sunk into an Azure SQL Database table.
The copy fails when the CSV contains strings (such as 40f52caf-e616-4321-8ea3-12ea3cbc54e9) that are mapped to a uniqueidentifier field in the target table, with the error message The given value of type String from the data source cannot be converted to type uniqueidentifier of the specified target column.
I have tried wrapping the source string with {}, such as {40f52caf-e616-4321-8ea3-12ea3cbc54e9}, with no success.
The Copy activity works if I change the target table field from uniqueidentifier to nvarchar(100).
I reproduced your issue on my side.
The reason is that the data types of source and sink are mismatched. You could check the data type mapping for SQL Server.
Your source data type is String, which is mapped to nvarchar or varchar, while a uniqueidentifier column in the SQL database needs the GUID type in Azure Data Factory.
So, as a workaround, please configure a SQL Server stored procedure in your SQL Server sink.
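The conversion itself is straightforward on the SQL side, which is why routing the string through a stored procedure works; for example, with the sample value from the question:
-- The string form converts cleanly to uniqueidentifier in T-SQL
SELECT CONVERT(uniqueidentifier, '40f52caf-e616-4321-8ea3-12ea3cbc54e9') AS converted_id;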
Please follow the steps from this doc:
Step 1: Configure your Sink dataset:
Step 2: Configure Sink section in copy activity as follows:
Step 3: In your database, define the table type with the same name as sqlWriterTableType. Notice that the schema of the table type should be the same as the schema returned by your input data.
CREATE TYPE [dbo].[CsvType] AS TABLE(
[ID] [varchar](256) NOT NULL
)
Step 4: In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It handles input data from your specified source and merges it into the output table. Notice that the parameter name of the stored procedure should be the same as the "tableName" defined in the dataset.
CREATE PROCEDURE convertCsv @ctest [dbo].[CsvType] READONLY
AS
BEGIN
MERGE [dbo].[adf] AS target
USING @ctest AS source
ON (1=1)
WHEN NOT MATCHED THEN
INSERT (id)
VALUES (convert(uniqueidentifier,source.ID));
END
Output:
Hope it helps you. Any concerns, please feel free to let me know.
There is a way to fix GUID conversion into the uniqueidentifier SQL column type properly via JSON configuration.
Edit the Copy activity via the Code {} button in the top-right toolbar.
Put:
"translator": {
"type": "TabularTranslator",
"typeConversion": true
}
into the typeProperties block of the Copy activity. This will also work if the mapping schema is unspecified / dynamic.