Azure Data Factory with SP Activity - Debug and Publish fails

I've created an Azure Data Factory pipeline with one simple Stored Procedure Activity which is supposed to fetch data from a Stored Procedure residing in Azure SQL DB. The Stored Procedure accepts one input parameter. I've published these changes already.
When I click on Validate, I get the below error, which gives me hardly any information:
{
"code": "BadRequest",
"message": null,
"target": "pipeline//runid/dcb92f70-0a4b-4be1-943b-5ggn68365tyc",
"details": null,
"error": null
}
When I click on Trigger now, it just says 'Failed to run pipeline' without any more details.
My pipeline JSON is given below:
{
"name": "GetPopulationRecordsForAnalysis",
"properties": {
"description": "Gets Population Records",
"activities": [
{
"name": "GetPopulationRecords",
"type": "SqlServerStoredProcedure",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"storedProcedureName": "[dbo].[usp_GetPopulationRecords]",
"storedProcedureParameters": {
"#countryID": {
"value": "48",
"type": "Int64"
}
}
},
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
}
}
],
"annotations": [],
"lastPublishTime": "2022-08-02T13:37:27Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
What am I doing wrong here?

I have figured out the issue now. My first mistake was giving the complete SP name with the schema, '[' characters and all; usp_GetPopulationRecords works just fine. The second was adding an extra '@' character before my input parameter, like we do when running it in SQL Server. That is not required here; countryID alone works fine. Hope my answer helps.

When we connect to the linked service in the Settings tab, the “Stored procedure name” dropdown will be populated with the names of the stored procedures present in the database.
We should select the required stored procedure; upon clicking Import, it will display each parameter's Name, Type and Value (we supply the Value).
We need not give '@' before the parameter name.
The corresponding JSON will reflect these settings, and the pipeline will then validate successfully in ADF.
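For reference, a minimal sketch of the corrected typeProperties block from the question's pipeline, with the schema brackets dropped from the procedure name and no '@' prefix on the parameter (values unchanged from the question):
"typeProperties": {
    "storedProcedureName": "usp_GetPopulationRecords",
    "storedProcedureParameters": {
        "countryID": {
            "value": "48",
            "type": "Int64"
        }
    }
}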


Azure Data Factory v2 using utcnow() as a pipeline parameter

For context, I currently have a Data Factory v2 pipeline with a ForEach Activity that calls a Copy Activity. The Copy Activity simply copies data from an FTP server to a blob storage container.
Here is the pipeline JSON file:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "ForEach1",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.InputParams",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "FtpDataset",
"type": "DatasetReference",
"parameters": {
"FtpFileName": "#item().FtpFileName",
"FtpFolderPath": "#item().FtpFolderPath"
}
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference",
"parameters": {
"BlobFileName": "#item().BlobFileName",
"BlobFolderPath": "#item().BlobFolderPath"
}
}
]
}
]
}
}
],
"parameters": {
"InputParams": {
"type": "Array",
"defaultValue": [
{
"FtpFolderPath": "/Folder1/",
"FtpFileName": "#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
"BlobFolderPath": "blobfolderpath",
"BlobFileName": "blobfile1"
},
{
"FtpFolderPath": "/Folder2/",
"FtpFileName": "#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
"BlobFolderPath": "blobfolderpath",
"BlobFileName": "blobfile2"
}
]
}
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
The issue I am having is that when specifying pipeline parameters, it seems I cannot use system variables and functions the same way I can when, for example, specifying folder paths for a blob storage dataset.
The consequence of this is that formatDateTime(utcnow(), 'yyyyMMdd') is not being interpreted as a function call but rather as the literal string formatDateTime(utcnow(), 'yyyyMMdd').
To counter this, I am guessing I should be using a trigger to execute my pipeline and pass the trigger's execution time as a parameter to the pipeline, like trigger().startTime, but is this the only way? Am I simply doing something wrong in my pipeline's JSON?
This should work:
File_@{formatDateTime(utcnow(), 'yyyyMMdd')}
Or complex paths as well:
rootfolder/subfolder/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}/@{formatDateTime(utcnow(),'HH')}
You can't put a dynamic expression in the default value. You should define the expression either when you create a trigger, or when you define the dataset parameters in the copy activity's source/sink.
So you could either create the dataset property FtpFileName with some default value in the source dataset, and then specify the dynamic expression for it in the copy activity's source settings;
or define a pipeline parameter and add the dynamic expression to that pipeline parameter when you are defining a trigger, as in the sketch below. Hope this is a clear answer to you. :)
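A rough sketch of that trigger approach, assuming the pipeline is given an ordinary string parameter (here called windowStart, a made-up name) instead of baking the date into a default value; the trigger definition below is illustrative only:
{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "pipeline1",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "windowStart": "@trigger().startTime"
                }
            }
        ]
    }
}
Inside the pipeline, the file name would then be built with something like formatDateTime(pipeline().parameters.windowStart, 'yyyyMMdd') instead of calling utcnow() in a default value.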
Default values of parameters cannot be expressions; they must be literal strings.
You could use a trigger to achieve this. Or you could extract the common part of your expressions and just put literal values into the ForEach items, as in the example below.
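For example, a sketch built from the question's pipeline in which the ForEach items keep only literal folder and file names, and the date expression moves into the Copy activity's dataset reference, where expressions are evaluated:
"inputs": [
    {
        "referenceName": "FtpDataset",
        "type": "DatasetReference",
        "parameters": {
            "FtpFolderPath": "@item().FtpFolderPath",
            "FtpFileName": "@concat('File_', formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')"
        }
    }
]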

Azure data factory - intermittent 400 errors when writing to blob storage

I'm using data factory with blob storage.
I sometimes get the below error intermittently - it can occur on different pipelines/data sources. However, I always get the same error regardless of which task fails - 400 The specified block list is invalid.
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorBlobUploadFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload blob 'https://blob.core.windows.net/', detailed message: The remote server returned an error: (400) Bad Request.,Source=,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.
Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage
This seems to be most common if there is more than one task running at a time that is writing data to the storage. Is there anything I can do to make this process more reliable? Is it possible something has been misconfigured? It's causing slices to fail in data factory, so I'd really love to know what I should be investigating.
A sample pipeline that has suffered from this issue:
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "Pipeline",
"properties": {
"description": "Pipeline to copy Processed CSV from Data Lake to blob storage",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [ { "name": "DataLake" } ],
"outputs": [ { "name": "Blob" } ],
"policy": {
"concurrency": 10,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyActivity"
}
],
"start": "2016-02-28",
"end": "2016-02-29",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
I'm only using LRS standard storage, but I still wouldn't expect it to intermittently throw errors.
EDIT: adding linked service json
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
"name": "Ls-Staging-Storage",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=;AccountKey="
}
}
}
Such errors are mostly caused by race conditions, e.g. multiple concurrent activity runs writing to the same blob file.
Could you check your pipeline settings to see whether that is the case? If so, please avoid that configuration.
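If concurrent runs do turn out to be writing to the same blob, one quick test is to serialize the copies. A sketch of the question's policy block with concurrency reduced to 1 and a couple of retries added (values are illustrative, not a recommended setting):
"policy": {
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00"
}
Alternatively, make sure each slice writes to a distinct blob path so that parallel runs never target the same file.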

How to create a periodic copy of data from OData to Azure DocumentDB

I am trying to make a periodic copy of all the data returning from an OData query into a documentDB collection, on a daily basis.
The copy works fine using the copy wizard, which is A REALLY GREAT option for simple tasks.  Thanks for that.
What isn't working for me, though: the copy just adds data each time, and I can see no way with a DocumentDB sink to "pre-delete" the data in the collection (compare to the SQL sink, which has sqlWriterCleanupScript, which I could set to something like Delete * from 'table').
I know I can create an Azure Batch and do what I need, but at this point, I'm not sure that it isn't better to do a function and forego the Azure Data Factory (ADF) for this move.  I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point, I'd like to just use DocumentDB but I don't see a way to do it given the way my data works.
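For reference, this is roughly the SQL-side behaviour the question is comparing against: the SqlSink accepts a sqlWriterCleanupScript that is executed to clean up a slice's data before the copy writes it, and the DocumentDB sink exposes no equivalent. A sketch (the table name is hypothetical):
"sink": {
    "type": "SqlSink",
    "sqlWriterCleanupScript": "DELETE FROM dbo.MyTable",
    "writeBatchSize": 0,
    "writeBatchTimeout": "00:00:00"
}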
Here's a look at my pipeline:
{
"name": "R-------ProjectToDocDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": " "
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
/// this is where a cleanup script would be great.
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "ProjectId:ProjectId,.....:CostClassification"
}
},
"inputs": [
{
"name": "InputDataset-shc"
}
],
"outputs": [
{
"name": "OutputDataset-shc"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-_Custom query_->---Project"
}
],
"start": "2017-04-26T20:13:27.683Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "r-----datafactory01_hub",
"pipelineMode": "Scheduled"
}
}
Perhaps there's an update in the pipeline that creates parity between the SQL output and DocumentDB?
Azure Data Factory does not support a cleanup script for DocumentDB today. It's something in our backlog. If you can describe the E2E scenario a bit more, it could help us prioritize. For example, why does appending to the same collection not work? Is that because there's no way to identify the incremental records after each run? For the cleanup requirement, will it always be delete *, or might it be based on a timestamp, etc.? Thanks. Until support for a cleanup script is available, a custom activity is the only workaround for now, sorry.
You could use a Logic App that runs on a Timer Trigger.

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting the following error while running a U-SQL Activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script which I am trying to execute in the pipeline.
OutputDataset:
{
"name": "OutputDataLakeTable",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "LinkedServiceDestination",
"typeProperties": {
"folderPath": "scripts/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"script": "SearchLogProcessing.txt",
"scriptPath": "scripts\\",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
},
"inputs": [
{
"name": "InputDataLakeTable"
}
],
"outputs": [
{
"name": "OutputDataLakeTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "CopybyU-SQL",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-01-03T12:01:05.53Z",
"end": "2017-01-03T13:01:05.53Z",
"isPaused": false,
"hubName": "denojaidbfactory_hub",
"pipelineMode": "Scheduled"
}
}
Here is my U-SQL script which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
"name": "StorageLinkedService",
"properties": {
"description": "",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
}
}
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "adlascripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/input/SearchLog.tsv",
"out": "/output/Result.tsv"
}
},
...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo, and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? I found them by googling, but there should be links on the page. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole problem, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
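In the activity's typeProperties this replaces the scriptPath/scriptLinkedService pair, along the lines of the sketch below (degreeOfParallelism and priority kept from the question's pipeline; how the in/out values are then supplied is up to the procedure itself and is not covered by this answer):
"typeProperties": {
    "script": "master.dbo.sp_test()",
    "degreeOfParallelism": 3,
    "priority": 100
}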

Call stored procedure using ADF

I am loading a SQL Server table using ADF, and after the insertion is over I have to do a little manipulation, using the approaches below:
Trigger (AFTER INSERT) - failed; SQL Server is not able to detect the inserted records that I push using ADF. Seems to be a bug.
Stored procedure using a user-defined table type - getting the error:
Error Number '156'. Error message from database execution : Incorrect
syntax near the keyword 'select'. Must declare the table variable
"#a".
I have created the below pipeline:
{
"name": "CopyPipeline-xxx",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "sp_xxx",
"storedProcedureParameters": {
"stringProductData": {
"value": "str1"
}
},
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "col1:col1,col2:col2"
}
},
"inputs": [
{
"name": "InputDataset-3jg"
}
],
"outputs": [
{
"name": "OutputDataset-3jg"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 8
},
"name": "Activity-0-xxx_csv->[dbo]_[xxx_staging]"
}
],
"start": "2017-01-09T21:48:53.348Z",
"end": "2099-12-30T18:30:00Z",
"isPaused": false,
"hubName": "hub",
"pipelineMode": "Scheduled"
}
}
and am using the below stored procedure:
create procedure [dbo].[sp_xxx] @xxx1 [dbo].[ut_xxx] READONLY, @str1 varchar(100) AS
MERGE xxx_dummy AS a
USING @xxx1 AS b
ON (a.col1 = b.col1)
WHEN NOT MATCHED
THEN INSERT(col1, col2)
VALUES(b.col1, b.col2)
WHEN MATCHED
THEN UPDATE SET a.col2 = b.col2;
Please help me to resolve the issue.
I can reproduce your first error. Inserting into a SQL Server table with Azure Data Factory (ADF) appears to use a bulk insert method (similar to BULK INSERT, bcp, SSIS, etc.), and by default these methods do not fire triggers:
insert bulk [dbo].[testADF] ([col1] Int, [col2] Int, [col3] Int, [col4] Int)
with (TABLOCK, CHECK_CONSTRAINTS)
With bcp and BULK INSERT there is a flag you can set to say 'fire triggers', but it appears there is no way to change this setting for ADF. As a workaround, move the logic from your trigger into the stored proc.
If you believe this flag is important, consider creating a feedback item.
