Building this pipeline on Azure Data Factory V2

I am currently trying to set up this pipeline in Azure Data Factory V2 (as shown in the attached picture). In summary, an ERP system will export this report (a CSV file with actual and forecast data) on a monthly basis, and the file will be saved in a blob container. As soon as the CSV file is saved, an event trigger should start a stored procedure that erases all actual data from my fact table in Azure SQL, since that data gets replaced every month.
Once the actual data is deleted, a subsequent copy activity in the pipeline would copy the CSV report (actuals + forecast) to the same fact table in Azure SQL. Once the copy activity is finished, an HTTP Logic App would delete the new CSV file from the blob container. This workflow would recur every month.
So far I have been able to run these three activities independently. However, when I join them in the same pipeline, I get parameter errors when trying to "Publish all". Do I need to have the same parameters for each activity in the pipeline?
The JSON code for my pipeline is the following:
{
"name": "TM1_pipeline",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"dependsOn": [
{
"activity": "Stored Procedure1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_e7y",
"type": "DatasetReference",
"parameters": {
"copyFolder": {
"value": "#pipeline().parameters.sourceFolder",
"type": "Expression"
},
"copyFile": {
"value": "#pipeline().parameters.sourceFile",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_e7y",
"type": "DatasetReference"
}
]
},
{
"name": "Stored Procedure1",
"type": "SqlServerStoredProcedure",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[dbo].[test_sp]"
},
"linkedServiceName": {
"referenceName": "AzureSqlDatabase",
"type": "LinkedServiceReference"
}
},
{
"name": "Web1",
"type": "WebActivity",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"url": "...",
"method": "POST",
"body": {
"value": "#pipeline().parameters.BlobName",
"type": "Expression"
}
}
}
],
"parameters": {
"sourceFolder": {
"type": "String",
"defaultValue": "#pipeline().parameters.sourceFolder"
},
"sourceFile": {
"type": "String",
"defaultValue": "#pipeline().parameters.sourceFile"
},
"BlobName": {
"type": "String",
"defaultValue": {
"blobname": "source-csv/test.csv"
}
}
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Please follow this doc to configure your blob event trigger and pass the right values to your parameters.
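For reference, a minimal sketch of such a blob event trigger is below. It passes the folder and file name from the trigger body into the pipeline parameters; the storage account scope is a placeholder, and the container/path values are assumptions based on the defaults shown above:
{
    "name": "TM1_blob_trigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/source-csv/blobs/",
            "blobPathEndsWith": ".csv",
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
            "events": [
                "Microsoft.Storage.BlobCreated"
            ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "TM1_pipeline",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "sourceFolder": "@triggerBody().folderPath",
                    "sourceFile": "@triggerBody().fileName",
                    "BlobName": "@triggerBody().fileName"
                }
            }
        ]
    }
}
With the trigger supplying these values, the pipeline parameters sourceFolder, sourceFile and BlobName do not need defaults that reference themselves; plain string defaults (or none at all) are enough.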

Related

Run a Pipeline when another Pipeline completes on another Data factory

I have two separate Data Factories in my Azure subscription; let's call them DF-A and DF-B.
In Data Factory DF-A I have a pipeline, and when it has completed I would like the pipeline in DF-B to run. How would I achieve this?
Thanks
In the Logic Apps designer, you can create two pipeline-run steps to trigger the pipelines in the different Data Factories.
It is easier to achieve this using a Logic App: create a Recurrence trigger to schedule the executions and two Azure Data Factory operations to trigger the pipeline runs.
In the Azure Data Factory operations, select the Create a pipeline run action.
While it's possible, it's much more complicated than one pipeline executing another from within the same Azure Data Factory.
In DF-A, create a pipeline called ExecuteExternalPipeline by copying the following JSON into the Code tab:
{
"name": "ExecuteExternalPipeline",
"properties": {
"description": "Executes an ADF pipeline in a different ADF",
"activities": [
{
"name": "StartPipelineThenWait",
"description": "Calls the ADF REST API to start a pipeline in another ADF running using the MSI of this current ADF. Then it waits on a webhook callback",
"type": "WebHook",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"url": {
"value": "#concat(\n 'https://management.azure.com/subscriptions/',\n pipeline().parameters.SubscriptionID,\n '/resourceGroups/',pipeline().parameters.ResourceGroup,\n '/providers/Microsoft.DataFactory/factories/',\n pipeline().parameters.DataFactory,\n '/pipelines/',\n pipeline().parameters.Pipeline,\n '/createRun?api-version=2018-06-01'\n)",
"type": "Expression"
},
"method": "POST",
"body": {
"value": "#json(\n concat(\n '{\n \"InputFileName\": \"', pipeline().parameters.InputFileName, '\"\n }'\n )\n)\n",
"type": "Expression"
},
"timeout": "20:00:00",
"authentication": {
"type": "MSI",
"resource": "https://management.azure.com"
}
}
},
{
"name": "ThrowErrorIfFailure",
"type": "IfCondition",
"dependsOn": [
{
"activity": "StartPipelineThenWait",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "#if(equals(activity('StartPipelineThenWait').status,'success'),true,json('throw an error!'))",
"type": "Expression"
}
}
}
],
"parameters": {
"SubscriptionID": {
"type": "string",
"defaultValue": "12345abcd-468e-472a-9761-9da416b14c0d"
},
"ResourceGroup": {
"type": "string",
"defaultValue": "DF-B-RG"
},
"DataFactory": {
"type": "string",
"defaultValue": "DF-B"
},
"Pipeline": {
"type": "string",
"defaultValue": "ChildPipeline"
},
"InputFileName": {
"type": "string",
"defaultValue": "File1.txt"
}
},
"annotations": []
}
}
Then create ChildPipeline in DF-B with the following code:
{
"name": "ChildPipeline",
"properties": {
"activities": [
{
"name": "DoYourLogicHere",
"description": "",
"type": "WebActivity",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "https://google.com",
"type": "Expression"
},
"method": "GET"
}
},
{
"name": "CallbackSuccess",
"description": "Do not remove this activity. It notifies the process which executed this pipeline that the pipeline is complete.",
"type": "WebActivity",
"dependsOn": [
{
"activity": "DoYourLogicHere",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "#pipeline().parameters.callBackUri",
"type": "Expression"
},
"method": "POST",
"body": {
"value": "#json(concat('{\"status\": \"success\", \"pipelineRunId\": \"',pipeline().RunId,'\"}'))",
"type": "Expression"
}
}
},
{
"name": "CallbackFail",
"description": "Do not remove this activity. It notifies the process which executed this pipeline that the pipeline is complete.",
"type": "WebActivity",
"dependsOn": [
{
"activity": "DoYourLogicHere",
"dependencyConditions": [
"Failed"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "#pipeline().parameters.callBackUri",
"type": "Expression"
},
"method": "POST",
"body": {
"value": "#json(concat('{\"status\": \"failure\", \"pipelineRunId\": \"',pipeline().RunId,'\"}'))",
"type": "Expression"
}
}
}
],
"parameters": {
"callBackUri": {
"type": "string",
"defaultValue": "https://google.com"
},
"InputFileName": {
"type": "string",
"defaultValue": "File1.txt"
}
},
"annotations": []
}
}
Replace the DoYourLogicHere activity with your own activities but leave the two callback activities.
Then you need to find the managed identity (MSI) of DF-A (see the Properties tab of DF-A in the Azure portal) and grant it the Data Factory Contributor role on DF-B so that it can execute the pipeline in the other ADF.

Using ADF REST connector to read and transform FHIR data

I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:
{
"resourceType": "Bundle",
"id": "som-id",
"type": "searchset",
"link": [
{
"relation": "next",
"url": "https://fhirserver/?ct=token"
},
{
"relation": "self",
"url": "https://fhirserver/"
}
],
"entry": [
{
"fullUrl": "https://fhirserver/Organization/1234",
"resource": {
"resourceType": "Organization",
"id": "1234",
// More fields
}
},
{
"fullUrl": "https://fhirserver/Organization/456",
"resource": {
"resourceType": "Organization",
"id": "456",
// More fields
}
}
// More resources
]
}
Basically, a bundle of resources. I would like to transform that into a newline-delimited JSON (ndjson) file where each line is just the JSON for a resource:
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:
{
"name": "AzureBlob1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": "myout.json",
"folderPath": "outfhirfromadf"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
And configure a copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"resource": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
But in the end (in spite of configuring the schema mapping), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.
Any suggestions would be much appreciated.
So I sort of found a solution. If I keep the original step, where the bundles are simply dumped into the JSON file, and then do another conversion from that JSON file to what pretends to be a text file in another blob, I can get the ndjson file created.
Basically, define another blob dataset:
{
"name": "AzureBlob2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"structure": [
{
"name": "Prop_0",
"type": "String"
}
],
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "",
"quoteChar": "",
"nullValue": "\\N",
"encodingName": null,
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": "myout.json",
"folderPath": "adfjsonout2"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Note that this one uses TextFormat and also that the quoteChar is blank. If I then add another Copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['resource']": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
},
{
"name": "Copy Data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"resource": "Prop_0"
}
}
},
"inputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob2",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
I would still love to hear about it if somebody has a one-step solution.
As briefly discussed in the comments, the Copy activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy activity performs the following operations:
Reads data from a source data store.
Performs serialization/deserialization, compression/decompression, column mapping, etc., based on the configuration of the input dataset, output dataset, and Copy activity.
Writes data to the sink/destination data store.
It does not look like the Copy activity does anything else aside from efficiently copying data around.
What I found to work was to use Databricks.
Here are the steps:
Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala, Python or .NET, which was recently announced).
The script would do the following:
Read the data from the Blob storage;
Filter and transform the data as needed;
Write the data back to a Blob storage;
You can test your script from there and, once ready, go back to your pipeline and create a Notebook activity that points to the notebook containing the script.
I struggled coding in Scala but it was worth it :)
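If it helps, here is a rough sketch of the Notebook activity that would sit in the pipeline; the linked service name, notebook path and parameter names are assumptions for illustration only:
{
    "name": "TransformBundlesToNdjson",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricks1",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/fhir-bundles-to-ndjson",
        "baseParameters": {
            "inputFolder": "outfhirfromadf",
            "outputFolder": "fhir-ndjson"
        }
    }
}
The notebook itself would then read the bundles the copy activity dumped into the blob folder, flatten the entry array, and write one resource per line to the output folder.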
For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you have to have a storage account linked to your FHIR server.
https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
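If you wanted to kick the export off from Data Factory itself, the system-level $export call is just an HTTP GET with the bulk-data headers, so a Web activity along these lines could start it. This is only a sketch - the FHIR base URL and MSI authentication are assumptions, and you would still need to poll the status URL returned in the Content-Location header:
{
    "name": "StartBulkExport",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://<your-fhir-server>/$export",
        "method": "GET",
        "headers": {
            "Accept": "application/fhir+json",
            "Prefer": "respond-async"
        },
        "authentication": {
            "type": "MSI",
            "resource": "https://<your-fhir-server>"
        }
    }
}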

Loop over each file in a folder directory and check its date in Azure Data Factory V2 - wrong code

I want to loop over each file in an SFTP folder, check whether it is new or not, and then copy the new files to a Data Lake.
Right now I have the code below, but I don't think it is correct. There is no usage of @item() in the second GetMetadata activity (GetLastModifyfromFile) to refer to the current item's last modified date in the loop; instead it refers to a completely different dataset called SrcLocalFile.
{
"name": "IncrementalloadfromSingleFolder",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalDir",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileList').output.childItems",
"type": "Expression"
},
"activities": [
{
"name": "GetLastModifyfromFile",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
},
"fieldList": [
"lastModified"
]
}
},
{
"name": "IfNewFile",
"type": "IfCondition",
"dependsOn": [
{
"activity": "GetLastModifyfromFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "#and(less(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.current_time), greaterOrEquals(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.last_time))",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyNewFiles",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "TgtBooksBlob",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
],
"parameters": {
"current_time": {
"type": "String",
"defaultValue": "2018-04-01T00:00:00Z"
},
"last_time": {
"type": "String",
"defaultValue": "2018-03-01T00:00:00Z"
}
},
"folder": {
"name": "IncrementalLoadSingleFolder"
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Just a thought - I don't see your dataset definition, but...
Should you pass the path and file name into the dataset as parameters?
i.e. add 2 parameters to the dataset definition for path and file (say pathparam and fileparam). Use those parameters in the dataset's fileName and folderPath settings as @dataset().pathparam and @dataset().fileparam.
In the code above, pass values for them in a new "parameters" section of the dataset input, with pathparam and fileparam set to the folder and child item you retrieved from the earlier activity (see the sketch below).
Note - there was a bug where the dataset name could not have spaces in it.
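For illustration, a sketch of that parameterized dataset; the linked service name and SFTP folder value here are assumptions, not taken from the question:
{
    "name": "SrcLocalFile",
    "properties": {
        "linkedServiceName": {
            "referenceName": "SftpLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "FileShare",
        "parameters": {
            "pathparam": {
                "type": "String"
            },
            "fileparam": {
                "type": "String"
            }
        },
        "typeProperties": {
            "folderPath": {
                "value": "@dataset().pathparam",
                "type": "Expression"
            },
            "fileName": {
                "value": "@dataset().fileparam",
                "type": "Expression"
            }
        }
    }
}
Inside the ForEach, the GetLastModifyfromFile (and CopyNewFiles) dataset references would then pass the current item into those parameters, for example:
"dataset": {
    "referenceName": "SrcLocalFile",
    "type": "DatasetReference",
    "parameters": {
        "pathparam": "<your sftp folder>",
        "fileparam": {
            "value": "@item().name",
            "type": "Expression"
        }
    }
}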

ADF V2 - Parameterize a data copy pipeline based on a table column

With Azure Data Factory V2, through the portal (https://adf.azure.com), I created a pipeline for incrementally copying data from multiple tables, from one Azure SQL database to another Azure SQL database.
To create it, I adapted the following example to my needs:
Incrementally load data from multiple tables
The following is the JSON for the pipeline I created:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * \nfrom watermarktable \nwhere TableName = '#{item().TABLE_NAME}'",
"type": "Expression"
}
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select MAX(#{item().WaterMark_Column}) as NewWatermarkvalue \nfrom #{item().TABLE_NAME}",
"type": "Expression"
}
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * from #{item().TABLE_NAME} \nwhere #{item().WaterMark_Column} > '#{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and #{item().WaterMark_Column} <= '#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'",
"type": "Expression"
}
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"sqlWriterStoredProcedureName": {
"value": "#{item().StoredProcedureNameForMergeOperation}",
"type": "Expression"
},
"sqlWriterTableType": {
"value": "#{item().TableType}",
"type": "Expression"
}
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "#{item().TABLE_NAME}"
}
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[dbo].[sp_write_watermark]",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": {
"value": "#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "Expression"
},
"type": "DateTime"
},
"TableName": {
"value": {
"value": "#{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "Expression"
},
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService_dest",
"type": "LinkedServiceReference"
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object",
"defaultValue": [
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_project_table"
}
]
}
}
}
}
In my table I have a column that distinguishes between different companies and so I would like to add another parameter to this pipeline. I have a table like this:
NAME LASTMODIFY COMPANY
John 2015-01-01 00:00:00.000 1
Mike 2016-02-02 01:23:00.000 2
Andy 2017-03-04 05:16:00.000 3
Annie 2018-09-08 00:00:00.000 1
Would someone know how to insert a parameter into the pipeline in order to specify which company to copy and which one not to copy?
Any suggestions? Thanks in advance to everyone!
Not exactly clear on what you're asking, so apologies if I am missing the mark, but:
Copy allows for a stored procedure that you can use to potentially solve your problem. Take a look at this example: https://learn.microsoft.com/en-us/azure/data-factory/connector-sql-server#invoking-stored-procedure-for-sql-sink
It uses a stored procedure to MERGE, performing an UPDATE or INSERT depending on whether the JOIN matches. It also allows for parameters to be passed.
So if you are trying to copy only certain cases based on a parameter, the MERGE join may help.
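Another option, since the source query is already built from expressions, is to add a company parameter to the pipeline and filter on it in the source query. A rough sketch (the parameter name companyId and the COMPANY column are assumptions based on the table shown above):
"parameters": {
    "tableList": {
        "type": "Object"
    },
    "companyId": {
        "type": "String",
        "defaultValue": "1"
    }
}
and in the IncrementalCopyActivity source:
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": {
        "value": "select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and @{item().WaterMark_Column} <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}' and COMPANY = '@{pipeline().parameters.companyId}'",
        "type": "Expression"
    }
}
You could then pass a different companyId per run (or per trigger) to control which company's rows get copied.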

Dynamic Azure Data Factory v2 pipelines

So we've got a factory with ~400 datasets and ~200 pipelines and it's getting unwieldy. We're focusing on copying from a SQL source to a blob sink. Since we are copying to blob, the schema has no impact. I'd like to have one dataset for each source, one dataset for each blob account, and one pipeline for each combination of source/blob account, dynamically feeding it the config from a lookup.
We've successfully developed a pipeline that uses dummy datasets for source and sink. It works if you feed it a query, container name and folder name.
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "DynamicCopy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select 1 a"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": "raw-test",
"folder": "test"
}
}
]
}
]
}
}
When we put a Lookup before it and wrap it in a ForEach, it stops working, with the not-so-helpful error:
"errorCode": "400",
"message": "Activity failed because an inner activity failed",
"failureType": "UserError",
"target": "ForEach"
The lookup stored procedure [dbo].[adfdynamic] is not actually referred to in the foreach yet:
create proc adfdynamic as
select 'raw-test' container, 'test_a' folder, 'select 1 a, 2 b'
UNION ALL
select 'raw-test' container, 'test_b' folder, 'select 3 c, 2 d'
So the desired behaviour is:
one blob in raw-test@..myblob.../test_a/out.dsv with content {'a,b','1,2'}
one blob in raw-test@..myblob.../test_b/out.dsv with content {'c,d','3,2'}
sql dataset:
{
"name": "AzureSql",
"properties": {
"linkedServiceName": {
"referenceName": "Dest",
"type": "LinkedServiceReference"
},
"type": "AzureSqlTable",
"structure": [
{
"name": "CustomerKey",
"type": "Int32"
},
{
"name": "Name",
"type": "String"
}
],
"typeProperties": {
"tableName": "[dbo].[DimCustomer]"
}
}
}
blob dataset:
{
"name": "AzureBlob",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"container": {
"type": "String"
},
"folder": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"treatEmptyAsNull": false,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": {
"value": "#{dataset().folder}/out.dsv",
"type": "Expression"
},
"folderPath": {
"value": "#dataset().container",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
and the non-working dynamic pipeline:
{
"name": "Copy",
"properties": {
"activities": [
{
"name": "ForEach",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('Lookup').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select 1 a, 2 b from dest",
"type": "Expression"
}
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": {
"value": "raw-test",
"type": "Expression"
},
"folder": {
"value": "folder",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Lookup",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
}
}
]
}
}
Apologies about the formatting - too much code in one message?
In your Lookup activity, please check your firstRowOnly property. Is it false or true? By default, it is true.
In the UI, you can set a breakpoint to debug your Lookup activity. Then you can see whether the output is what you want.
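For reference, a sketch of the Lookup with the stored procedure as its source and firstRowOnly disabled, so the ForEach receives the whole array (the dataset name here is an assumption - any dataset on the right SQL linked service will do):
{
    "name": "Lookup",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "exec [dbo].[adfdynamic]"
        },
        "dataset": {
            "referenceName": "AzureSqlLookupDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}
Inside the ForEach, the copy activity's output parameters should then be expressions over the current item rather than literals, e.g.:
"parameters": {
    "container": {
        "value": "@item().container",
        "type": "Expression"
    },
    "folder": {
        "value": "@item().folder",
        "type": "Expression"
    }
}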
Not exactly an answer to your question, but something I did to make life simpler was to create a Dataset called GenericBlob. This had 2 parameters, container and path.
This may help simplify what you're doing. I too used to have 20 blob datasets, now I have one ... (this is assuming the blobs are in the same storage account).
