Using ADF REST connector to read and transform FHIR data

I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:
{
"resourceType": "Bundle",
"id": "som-id",
"type": "searchset",
"link": [
{
"relation": "next",
"url": "https://fhirserver/?ct=token"
},
{
"relation": "self",
"url": "https://fhirserver/"
}
],
"entry": [
{
"fullUrl": "https://fhirserver/Organization/1234",
"resource": {
"resourceType": "Organization",
"id": "1234",
// More fields
}
},
{
"fullUrl": "https://fhirserver/Organization/456",
"resource": {
"resourceType": "Organization",
"id": "456",
// More fields
}
},
// More resources
]
}
Basically a bundle of resources. I would like to transform that into a newline delimited (aka ndjson) file where each line is just the json for a resource:
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
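For reference, the transformation being asked for can be sketched outside ADF in a few lines of Python. This only illustrates the target format; it is not something the Copy activity runs, and the function name `bundle_to_ndjson` and the sample bundle are made up for this example.

```python
import json


def bundle_to_ndjson(bundle: dict) -> str:
    """Return one compact JSON document per line, one line per entry resource."""
    lines = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource")
        if resource is not None:  # skip entries that carry no resource
            lines.append(json.dumps(resource, separators=(",", ":")))
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical sample data shaped like the bundle in the question.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"fullUrl": "https://fhirserver/Organization/1234",
         "resource": {"resourceType": "Organization", "id": "1234"}},
        {"fullUrl": "https://fhirserver/Organization/456",
         "resource": {"resourceType": "Organization", "id": "456"}},
    ],
}

ndjson_text = bundle_to_ndjson(sample_bundle)
print(ndjson_text, end="")
```

Each output line is a complete JSON document, which is the property that makes the file ndjson rather than a JSON array.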
I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:
{
"name": "AzureBlob1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": "myout.json",
"folderPath": "outfhirfromadf"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
And configure a copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"resource": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
But in the end (in spite of configuring the schema mapping), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.
Any suggestions would be much appreciated.

So I sort of found a solution. If I do the original conversion step, where the bundles are simply dumped into the JSON file, and then do another conversion from that JSON file into another blob that I pretend is a text file, I can get the ndjson file created.
Basically, define another blob dataset:
{
"name": "AzureBlob2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"structure": [
{
"name": "Prop_0",
"type": "String"
}
],
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "",
"quoteChar": "",
"nullValue": "\\N",
"encodingName": null,
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": "myout.json",
"folderPath": "adfjsonout2"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Note that this one uses TextFormat, and also note that the quoteChar is blank. If I then add another Copy Activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['resource']": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
},
{
"name": "Copy Data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"resource": "Prop_0"
}
}
},
"inputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob2",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
I would still love to hear about it if somebody has a one-step solution.

As briefly discussed in the comment, the Copy Activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy activity does the following operations:
Reads data from a source data store.
Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
Writes data to the sink/destination data store.
The Copy Activity does not appear to do anything beyond efficiently copying data around.
What worked for me was to use Databricks.
Here are the steps:
Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala or Python; .NET support was recently announced).
The script would do the following:
Read the data from the Blob storage;
Filter out & transform the data as needed;
Write the data back to a Blob storage;
You can test your script from there and, once ready, you can go back to your pipeline and create a Notebook activity that will point to your notebook containing the script.
I struggled coding in Scala but it was worth it :)
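The three script steps above can be sketched roughly as follows. This is plain Python against local directories rather than a real Databricks notebook: the mounted storage paths, the file names and the function `bundles_to_ndjson` are all assumptions for illustration, and a Spark-based version would look different.

```python
import json
import tempfile
from pathlib import Path


def bundles_to_ndjson(input_dir: Path, output_file: Path) -> int:
    """Read every *.json bundle in input_dir and write its resources as NDJSON.

    Returns the number of resources written.
    """
    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        # Step 1: read the data (sorted for a deterministic output order).
        for bundle_path in sorted(input_dir.glob("*.json")):
            bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
            # Step 2: filter and transform; keep only the resource of each entry.
            for entry in bundle.get("entry", []):
                resource = entry.get("resource")
                if resource is None:
                    continue
                # Step 3: write one JSON document per line.
                out.write(json.dumps(resource) + "\n")
                count += 1
    return count


# Small local demonstration with two fake bundle files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "bundles"
    source_dir.mkdir()
    for name, ids in (("page1.json", ["1234"]), ("page2.json", ["456", "789"])):
        bundle = {
            "resourceType": "Bundle",
            "entry": [
                {"resource": {"resourceType": "Organization", "id": i}} for i in ids
            ],
        }
        (source_dir / name).write_text(json.dumps(bundle), encoding="utf-8")
    target = Path(tmp) / "out.ndjson"
    written = bundles_to_ndjson(source_dir, target)
    demo_lines = target.read_text(encoding="utf-8").splitlines()

print(written, demo_lines)
```

In a notebook, the two paths would point at wherever the blob containers are mounted, and the Notebook activity in the pipeline would run it after the Copy activity has dumped the bundles.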

For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you have to have a storage account linked to your FHIR server.
https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
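As a rough sketch of what the kick-off request looks like, going by my reading of the linked Bulk Data specification: the client sends a GET to `[base]/$export` with `Accept: application/fhir+json` and `Prefer: respond-async`, and the server is expected to answer asynchronously with a status location to poll. The helper name `build_export_request` is made up, `https://fhirserver` is the placeholder from the question, and authentication is left out entirely.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_export_request(base_url: str, resource_types=None):
    """Return (url, headers) for a system-level bulk export kick-off request."""
    params = {"_outputFormat": "application/fhir+ndjson"}
    if resource_types:
        # Optional filter: comma-separated list of resource types to export.
        params["_type"] = ",".join(resource_types)
    url = base_url.rstrip("/") + "/$export?" + urlencode(params)
    headers = {
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    }
    return url, headers


url, headers = build_export_request("https://fhirserver", ["Organization"])
print(url)
print(headers)
```

The function only builds the request; sending it (for example with an ADF Web activity or any HTTP client) and polling the returned status URL are separate steps not shown here.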

Related

Azure data-factory can't load data successfully through PolyBase if the source data in the last column of the first row is null

I am trying to use Azure Data Factory to load data from Azure Blob Storage into Azure Data Warehouse.
The relevant data is like below:
source csv:
1,james,
2,john,usa
sink table:
CREATE TABLE test_null (
id int NOT NULL,
name nvarchar(128) NULL,
address nvarchar(128) NULL
)
source dataset:
{
"name": "test_null_input",
"properties": {
"linkedServiceName": {
"referenceName": "StagingBlobStorage",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "1.csv",
"folderPath": "test_null",
"container": "adf"
},
"columnDelimiter": ",",
"escapeChar": "",
"firstRowAsHeader": false,
"quoteChar": ""
},
"schema": []
}
}
sink dataset:
{
"name": "test_null_output",
"properties": {
"linkedServiceName": {
"referenceName": "StagingAzureSqlDW",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "AzureSqlDWTable",
"schema": [
{
"name": "id",
"type": "int",
"precision": 10
},
{
"name": "name",
"type": "nvarchar"
},
{
"name": "address",
"type": "nvarchar"
}
],
"typeProperties": {
"schema": "dbo",
"table": "test_null"
}
}
}
pipeline
{
"name": "test_input",
"properties": {
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectValue": 0,
"rejectType": "value",
"useTypeDefault": false,
"treatEmptyAsNull": true
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"ordinal": 1
},
"sink": {
"name": "id"
}
},
{
"source": {
"ordinal": 2
},
"sink": {
"name": "name"
}
},
{
"source": {
"ordinal": 3
},
"sink": {
"name": "address"
}
}
]
}
},
"inputs": [
{
"referenceName": "test_null_input",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "test_null_output",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
The last column of the first row is null, so when I run the pipeline it produces the error below:
ErrorCode=UserErrorInvalidColumnMappingColumnNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{"Prop_0":"id","Prop_1":"name","Prop_2":"address"}', Detailed message: Column 'Prop_2' defined in column mapping cannot be found in Source structure.. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
I tried setting treatEmptyAsNull to true, but got the same error. Setting skipLineCount to 1 works, so it seems the null last column in the first row affects loading of the entire file. Stranger still, it also works when staging is enabled, even without setting treatEmptyAsNull or skipLineCount. In my scenario staging should be unnecessary, since the data already goes from blob to data warehouse; with staging enabled it is copied from blob to blob and then from blob to data warehouse, which brings additional data movement charges. Why does setting treatEmptyAsNull not work, and why does enabling staging work?
I reproduced this with your pipeline JSON and got the same error.
The error occurs because of the copy data mapping between source and sink defined in your JSON.
According to that mapping, the source should have Prop_0, Prop_1 and Prop_2 as column names.
Because First row as header is not checked for your source file, only Prop_0 and Prop_1 are taken as columns. Since the last value in your first row is null, there is no Prop_2 column, and that is why the error is raised for that column.
To resolve it, give your CSV file a proper header row.
Then check First row as header in the source dataset.
Import the schema again so that the mapping uses the header names.
The pipeline should then execute successfully, as it did for me.
Result:
You can see that the empty value is loaded as NULL in the target table.
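If the file is produced by something you control, adding the header row can be scripted. The following is a minimal sketch, assuming the column names from the sink table in the question; the helper `add_header` is hypothetical, and the empty-string-to-None step merely imitates what treatEmptyAsNull is meant to do rather than reproducing ADF's actual behaviour.

```python
import csv
import io


def add_header(csv_text: str, columns: list) -> str:
    """Prepend a header row so every column is named explicitly."""
    header = io.StringIO()
    csv.writer(header, lineterminator="\n").writerow(columns)
    return header.getvalue() + csv_text


source = "1,james,\n2,john,usa\n"
with_header = add_header(source, ["id", "name", "address"])

# Parse it back the way a header-aware reader would. Empty strings become
# None here to imitate treatEmptyAsNull (illustration only).
rows = [
    {key: (value if value != "" else None) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(with_header))
]
print(with_header, end="")
print(rows)
```

With the header in place, the column set no longer depends on what the first data row happens to contain.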

Azure Data Factory, Passing REST GET response to stored procedure in Azure SQL Database

I'm trying to build a (I think) very simple pipeline:
Get the textual body of a GET operation.
Pass the (json) output as-is (= no transformations needed in ADF) to a "Json" parameter of a stored procedure in an Azure SQL Server database. The stored procedure handles the (complex) parsing/mapping.
I thought that this can be done with just 1 Copy activity, but now I think I'm wrong.
In the Copy activity, the Sink configuration looks like this:
"sink": {
"type": "AzureSqlSink",
"sqlWriterStoredProcedureName": "[dbo].[spParseJson]",
"sqlWriterTableType": "What to enter here?",
"storedProcedureTableTypeParameterName": "What to enter here?",
"storedProcedureParameters": {
"Json": {
"type": "String",
"value": "<output of Source>"
}
}
}
I really tried to read and understand the documentation, but IMHO it either doesn't explain much or explains it vaguely.
The "output of Source" should be the output from Source. But what function or variable to use for that?
What should I enter for "sqlWriterTableType" / "storedProcedureTableTypeParameterName"? After some digging I understand that ADF will create temp tables and such, but that isn't what I want.
I've also tried another approach:
Use the Web activity to just download the Json.
Execute the SP with the input: @activity('WebactivityName').output.
But then I found out that the Web activity is limited to 1 MB, and the Json is about 1.5 MB. Without that limit I would have a solution. Argh.
FYI:
The content of the Json has a dynamically changing schema and is not well structured, so there's really no way that I can use the standard mapping capabilities in ADF.
Any help or guidance is appreciated. If you know of some documentation that is informative then that would also help.
I have updated this answer to split it into 2 parts. The first part deals with a simple implementation, which limits the Json response size to ~1MB. The second part deals with more complex implementation that does not impose this limit on Json response size.
Part 1
What you want to do is chain Web activity with Stored procedure Activity.
This will allow you to pass the output from GetJson Web Activity onto the Stored procedure Activity.
Next you will want to add a parameter to your Stored procedure Activity so it can dynamically receive the chained output from the first step.
This should enable you to pass the information successfully.
Here is a Json representation of the Pipeline in question:
{
"name": "get-request-output-to-mssql-stored-procedure",
"properties": {
"activities": [
{
"name": "GetJson",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://jsonplaceholder.typicode.com/posts/1",
"method": "GET"
}
},
{
"name": "Exec stored proc",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "GetJson",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"storedProcedureParameters": {
"Json": {
"value": {
"value": "@activity('GetJson').output",
"type": "Expression"
},
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "your-server-def",
"type": "LinkedServiceReference"
}
}
],
"annotations": []
}
}
Part 2
Instead of creating a Web Activity, we will use Lookup activity in combination with a Linked Service Definition and a Json Dataset definition.
You will need to create a Linked Service of type HTTP.
And configure it to use the URL you would like to get the Json response from:
You can then create a new Dataset (of type HTTP) which will use this Linked Service to get data.
And choose Json as its Format type:
You can then set the request URL, and set the schema (unless you need it) to None.
You can then create a Lookup activity which uses the Json dataset as the Source dataset, and set the Request method to GET.
Here is a Json representation of the Pipeline in question:
{
"name": "get-request-output-to-mssql-stored-procedure-2",
"properties": {
"activities": [
{
"name": "Exec stored proc",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "RetrieveJson",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"storedProcedureParameters": {
"Json": {
"value": {
"value": "@activity('RetrieveJson').output",
"type": "Expression"
},
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "ASD_SINGLE",
"type": "LinkedServiceReference"
}
},
{
"name": "RetrieveJson",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "JsonSource",
"storeSettings": {
"type": "HttpReadSettings",
"requestMethod": "GET"
},
"formatSettings": {
"type": "JsonReadSettings"
}
},
"dataset": {
"referenceName": "JsonDataset1",
"type": "DatasetReference"
},
"firstRowOnly": false
}
}
],
"annotations": []
}
}

How to get Azure Data Factory to Loop Through Files in a Folder

I am looking at the link below.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
We are supposed to have the ability to use wildcard characters in folder paths and file names. If we click on the 'Activity' and click 'Source', we see this view.
I would like to loop through months and days, so it should be something like this view.
Of course that doesn't actually work. I'm getting errors that read: ErrorCode: 'PathNotFound'. Message: 'The specified path does not exist.'. How can I get the tool to recursively iterate through all files in all folders, given a specific pattern of strings in a file path and file name? Thanks.
I would like to loop through months and days
To do this, you can pass two parameters to the activity from your pipeline so that the path can be built dynamically based on those parameters. ADF V2 allows you to pass parameters.
Let's start the process one by one:
1. Create a pipeline and pass two parameters in it for your month and day.
Note: These parameters can also be passed from the output of other activities if needed. Reference: Parameters in ADF
2. Create two datasets.
2.1 Sink dataset: Blob Storage here. Link it with your Linked Service and provide the container name (make sure it exists). If needed, it can also be passed as a parameter.
2.2 Source dataset: Blob Storage here again, or whatever suits your need. Link it with your Linked Service and provide the container name (make sure it exists). If needed, it can also be passed as a parameter.
Note:
1. The folder path decides where the data is copied to. If the container does not exist, the activity will create it for you, and if the file already exists it will be overwritten by default.
2. Pass the parameters in the dataset if you want to build the output path dynamically. Here I have created two dataset parameters named monthcopy and datacopy.
3. Create Copy Activity in the pipeline.
Wildcard Folder Path:
@{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}
where:
The path resolves to yyyy/month-passed/day-passed/* (yyyy is the year of yesterday's UTC date, and the * matches any folder one level down)
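To make the expression easier to reason about, here is the same path computation written in Python. It is only a mirror of the ADF expression for checking what it evaluates to; the function name `wildcard_folder_path` and the sample dates are made up.

```python
from datetime import datetime, timedelta, timezone


def wildcard_folder_path(month: str, day: str, now: datetime) -> str:
    """Mirror concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',month,'/',day,'/*')."""
    # adddays(utcnow(), -1) followed by formatDateTime(..., 'yyyy')
    year = (now - timedelta(days=1)).strftime("%Y")
    return f"{year}/{month}/{day}/*"


mid_year = wildcard_folder_path("06", "15", datetime(2020, 6, 16, tzinfo=timezone.utc))
# On 1 January the year comes from "yesterday", so it is still the previous year.
new_year = wildcard_folder_path("12", "31", datetime(2021, 1, 1, tzinfo=timezone.utc))
print(mid_year)
print(new_year)
```

Note that only the year is derived from the clock; the month and day segments are whatever strings the pipeline parameters carry, so they must match the folder names exactly (for example "06" versus "6").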
Test Result:
JSON Template for the pipeline:
{
"name": "pipeline2",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFolderPath": {
"value": "@{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}",
"type": "Expression"
},
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".csv"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "DelimitedText1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "DelimitedText2",
"type": "DatasetReference",
"parameters": {
"monthcopy": {
"value": "@pipeline().parameters.month",
"type": "Expression"
},
"datacopy": {
"value": "@pipeline().parameters.day",
"type": "Expression"
}
}
}
]
}
],
"parameters": {
"month": {
"type": "string"
},
"day": {
"type": "string"
}
},
"annotations": []
}
}
JSON template for the source dataset (DelimitedText1, the Copy activity's input):
{
"name": "DelimitedText1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "corpdata"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}
JSON template for the sink dataset (DelimitedText2, the Copy activity's output):
{
"name": "DelimitedText2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"monthcopy": {
"type": "string"
},
"datacopy": {
"type": "string"
}
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"folderPath": {
"value": "@concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),dataset().monthcopy,'/',dataset().datacopy)",
"type": "Expression"
},
"container": "copycorpdata"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}

Building this pipeline on Azure Data Factory V2

I am currently trying to set up this pipeline on Azure Data Factory V2 (as you can see in the attached picture). In summary, the ERP system exports this report (a CSV file with actual and forecast data) on a monthly basis and saves it in a blob container. As soon as the CSV file is saved, an event trigger should activate a stored procedure that erases all actual data from my fact table in Azure SQL, as this gets replaced every month.
Once the actual data is deleted, a copy activity in the pipeline would copy the CSV report (actuals + forecast) to the same fact table in Azure SQL. Once the copy activity is finished, the HTTP Logic App would delete the new CSV file from the blob container. This workflow would recur every month.
So far I have been able to run these three activities independently. However, when I join them in the same pipeline, I get parameter errors when trying to "publish all". I am therefore not sure whether I need to have the same parameters for each activity in the pipeline.
The JSON code for my pipeline is the following:
{
"name": "TM1_pipeline",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"dependsOn": [
{
"activity": "Stored Procedure1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_e7y",
"type": "DatasetReference",
"parameters": {
"copyFolder": {
"value": "@pipeline().parameters.sourceFolder",
"type": "Expression"
},
"copyFile": {
"value": "@pipeline().parameters.sourceFile",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_e7y",
"type": "DatasetReference"
}
]
},
{
"name": "Stored Procedure1",
"type": "SqlServerStoredProcedure",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[dbo].[test_sp]"
},
"linkedServiceName": {
"referenceName": "AzureSqlDatabase",
"type": "LinkedServiceReference"
}
},
{
"name": "Web1",
"type": "WebActivity",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"url": "...",
"method": "POST",
"body": {
"value": "@pipeline().parameters.BlobName",
"type": "Expression"
}
}
}
],
"parameters": {
"sourceFolder": {
"type": "String",
"defaultValue": "@pipeline().parameters.sourceFolder"
},
"sourceFile": {
"type": "String",
"defaultValue": "@pipeline().parameters.sourceFile"
},
"BlobName": {
"type": "String",
"defaultValue": {
"blobname": "source-csv/test.csv"
}
}
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Please follow this doc to configure your blob event trigger and pass the right values to your parameters.

Dynamic Azure Data Factory v2 pipelines

So we've got a factory with ~400 datasets and ~200 pipelines and it's getting unwieldy. Focusing on copying from sql source to blob sink. Since we are copying to blob the schema has no impact. I'd like to have one dataset for each source, one dataset for each blob account and one pipeline for each combination of source/blob account, dynamically feeding it the config from a lookup.
We've successfully developed a pipeline that uses dummy datasets for source and sink. It works if you feed it a query, container name and folder name.
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "DynamicCopy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select 1 a"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": "raw-test",
"folder": "test"
}
}
]
}
]
}
}
When we put a lookup before it and wrap it in a foreach, it stops working. With the not so helpful
"errorCode": "400",
"message": "Activity failed because an inner activity failed",
"failureType": "UserError",
"target": "ForEach"
The lookup stored procedure [dbo].[adfdynamic] is not actually referred to in the foreach yet:
create proc adfdynamic as
select 'raw-test' container, 'test_a' folder, 'select 1 a, 2 b'
UNION ALL
select 'raw-test' container, 'test_b' folder, 'select 3 c, 2 d'
So the desired behaviour is:
one blob in raw-test@..myblob.../test_a/out.dsv with content {'a,b','1,2'}
one blob in raw-test@..myblob.../test_b/out.dsv with content {'c,d','3,2'}
sql dataset:
{
"name": "AzureSql",
"properties": {
"linkedServiceName": {
"referenceName": "Dest",
"type": "LinkedServiceReference"
},
"type": "AzureSqlTable",
"structure": [
{
"name": "CustomerKey",
"type": "Int32"
},
{
"name": "Name",
"type": "String"
}
],
"typeProperties": {
"tableName": "[dbo].[DimCustomer]"
}
}
}
blob dataset:
{
"name": "AzureBlob",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"container": {
"type": "String"
},
"folder": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"treatEmptyAsNull": false,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": {
"value": "@{dataset().folder}/out.dsv",
"type": "Expression"
},
"folderPath": {
"value": "@dataset().container",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
and the non-working dynamic pipeline:
{
"name": "Copy",
"properties": {
"activities": [
{
"name": "ForEach",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('Lookup').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select 1 a, 2 b from dest",
"type": "Expression"
}
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": {
"value": "raw-test",
"type": "Expression"
},
"folder": {
"value": "folder",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Lookup",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
}
}
]
}
}
Apologies about the formatting. Too much code in one message?
In your Lookup activity, please check the firstRowOnly property. Is it false or true? By default, it is true.
In the UI, you could set a breakpoint to debug your lookup activity. Then you could see whether the output is what you want.
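The reason firstRowOnly matters for the ForEach is the shape of the Lookup output. As far as I know, with firstRowOnly set to true the output has a `firstRow` object and no `value` array, while with false it has `count` and `value`. The sketch below models that difference in Python; the function `foreach_items` and the sample rows are made up, with column names borrowed from the stored procedure in the question.

```python
def foreach_items(lookup_output: dict) -> list:
    """Return the rows a ForEach would iterate over, or fail with a clear message."""
    if "value" in lookup_output:
        # firstRowOnly = false: {"count": n, "value": [row, ...]}
        return lookup_output["value"]
    if "firstRow" in lookup_output:
        # firstRowOnly = true: {"firstRow": row}; there is no "value" array.
        raise KeyError(
            "Lookup returned 'firstRow' only; set firstRowOnly to false "
            "to get 'output.value'"
        )
    raise KeyError("unrecognised Lookup output")


# Hypothetical outputs for the two settings.
all_rows = {"count": 2, "value": [
    {"container": "raw-test", "folder": "test_a"},
    {"container": "raw-test", "folder": "test_b"},
]}
first_only = {"firstRow": {"container": "raw-test", "folder": "test_a"}}

for row in foreach_items(all_rows):
    print(f"{row['container']}/{row['folder']}/out.dsv")
```

So an items expression that reads `output.value` can only work when firstRowOnly is false, which is worth confirming in the debug output before looking further into the inner Copy activity.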
Not exactly an answer to your question, but something I did to make life simpler was to create a Dataset called GenericBlob. This had 2 parameters container and path.
This may help simplify what you're doing. I too used to have 20 blob datasets, now I have one ... (this is assuming the blobs are in the same storage account).
