Azure Data Factory parameter - Azure

I am trying to load data from an on-premises SQL Server to a SQL Server on a VM. I need to do this every day, so I have created a trigger. The trigger is inserting the data properly, but now I need to record the trigger ID in a destination column for every run.
I don't know what mistake I am making. I found many blogs on the topic, but they all describe extracting data from a blob, not from SQL Server.
I tried to insert the value like this, but it gives an error:
"Activity Copy Data1 failed: Please choose only one of the three property "name", "path" and "ordinal" to reference columns for "source" and "sink" under "mappings" property. "
Pipeline details are below. Please suggest a fix.
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "SqlServerSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Name",
"type": "String"
},
"sink": {
"name": "Name",
"type": "String"
}
},
{
"source": {
"type": "String",
"name": "#pipeline().parameters.triggerIDVal"
},
"sink": {
"name": "TriggerID",
"type": "String"
}
}
]
}
},
"inputs": [
{
"referenceName": "AzureSqlTable1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SqlServerSQLDEV02",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"triggerIDVal": {
"type": "string"
}
},
"annotations": []
}
}
I want the trigger ID to be populated into the destination column TriggerID each time the trigger executes.

First, please see the limitations of column mapping in the copy activity:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of the sink dataset than specified in the mapping.
Duplicate mapping.
So I don't think you can copy the data plus a trigger ID that is not contained in the source columns. My idea is:
1. First, use a Set Variable activity to get the trigger ID value.
2. Then connect it to the copy activity and pass the value as a parameter.
3. In the sink of the copy activity, you can invoke a stored procedure that combines the trigger ID with the other columns before each row is inserted into the table (see the sketch after this list). For more details, please see this document.
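To make step 3 concrete, here is a rough sketch only: the procedure name, table type name, and parameter name are hypothetical, and it assumes the trigger run ID is fed into the pipeline parameter triggerIDVal (for example from the @pipeline().TriggerId system variable when the trigger fires). The sink hands each batch of rows plus the trigger ID to a stored procedure, which inserts them together into the destination table, so TriggerID never has to appear in the copy activity mapping.
"sink": {
    "type": "SqlServerSink",
    "sqlWriterStoredProcedureName": "spInsertWithTriggerId", // hypothetical procedure name
    "sqlWriterTableType": "NameTableType", // hypothetical table type
    "storedProcedureTableTypeParameterName": "rows",
    "storedProcedureParameters": {
        "triggerId": {
            "value": "@pipeline().parameters.triggerIDVal",
            "type": "String"
        }
    }
}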

Related

Azure Data Factory can't load data successfully through PolyBase if the source data in the last column of the first row is null

I am trying to use Azure Data Factory to load data from Azure Blob Storage to Azure SQL Data Warehouse.
The relevant data is like below:
source csv:
1,james,
2,john,usa
sink table:
CREATE TABLE test_null (
id int NOT NULL,
name nvarchar(128) NULL,
address nvarchar(128) NULL
)
source dataset:
{
"name": "test_null_input",
"properties": {
"linkedServiceName": {
"referenceName": "StagingBlobStorage",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "1.csv",
"folderPath": "test_null",
"container": "adf"
},
"columnDelimiter": ",",
"escapeChar": "",
"firstRowAsHeader": false,
"quoteChar": ""
},
"schema": []
}
}
sink dataset:
{
"name": "test_null_output",
"properties": {
"linkedServiceName": {
"referenceName": "StagingAzureSqlDW",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "AzureSqlDWTable",
"schema": [
{
"name": "id",
"type": "int",
"precision": 10
},
{
"name": "name",
"type": "nvarchar"
},
{
"name": "address",
"type": "nvarchar"
}
],
"typeProperties": {
"schema": "dbo",
"table": "test_null"
}
}
}
pipeline
{
"name": "test_input",
"properties": {
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectValue": 0,
"rejectType": "value",
"useTypeDefault": false,
"treatEmptyAsNull": true
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"ordinal": 1
},
"sink": {
"name": "id"
}
},
{
"source": {
"ordinal": 2
},
"sink": {
"name": "name"
}
},
{
"source": {
"ordinal": 3
},
"sink": {
"name": "address"
}
}
]
}
},
"inputs": [
{
"referenceName": "test_null_input",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "test_null_output",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
The last column of the first row is null, so when I run the pipeline it throws the error below:
ErrorCode=UserErrorInvalidColumnMappingColumnNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{"Prop_0":"id","Prop_1":"name","Prop_2":"address"}', Detailed message: Column 'Prop_2' defined in column mapping cannot be found in Source structure.. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
I tried setting treatEmptyAsNull to true, but I still get the same error. Setting skipLineCount to 1 works, so it seems the null in the last column of the first row affects the loading of the entire file. The stranger thing is that it also works if I enable staging, even without setting treatEmptyAsNull or skipLineCount. In my scenario staging should be unnecessary, since the copy is already from blob to data warehouse; going from blob to blob and then from blob to data warehouse seems unreasonable and brings additional data movement charges. Why doesn't setting treatEmptyAsNull work, and why does enabling staging work? This seems to make no sense.
I have reproduced the above with your pipeline JSON and got the same error.
This error occurs because, as the error message shows, the ordinal mapping in your JSON is resolved against the source as {"Prop_0":"id","Prop_1":"name","Prop_2":"address"}.
For that mapping to work, the source must expose Prop_0, Prop_1 and Prop_2 as column names.
Since you did not check First row as header in your source dataset, ADF derives the Prop_0, Prop_1, ... names from the first data row. Because the last value of that row is empty, only Prop_0 and Prop_1 are detected; there is no Prop_2 column, which is why the error is raised for that column.
To resolve it, give your CSV file a proper header row (for example id,name,address).
Then check First row as header in the source dataset.
When you import schemas, the mapping will then use those header names.
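For instance, once the header row exists and First row as header is checked, the translator can reference the columns by name instead of ordinal. A sketch, assuming the id,name,address headers suggested above:
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "id" }, "sink": { "name": "id" } },
        { "source": { "name": "name" }, "sink": { "name": "name" } },
        { "source": { "name": "address" }, "sink": { "name": "address" } }
    ]
}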
Now it executes successfully.
Result:
You can see that the empty value is taken as NULL in the target table.

How to get Azure Data Factory to Loop Through Files in a Folder

I am looking at the link below.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
We are supposed to have the ability to use wildcard characters in folder paths and file names. If we click on the 'Activity' and click 'Source', we see this view.
I would like to loop through months and days, so it should be something like this view.
Of course, that doesn't actually work. I'm getting errors that read: ErrorCode: 'PathNotFound'. Message: 'The specified path does not exist.'. How can I get the tool to recursively iterate through all files in all folders, given a specific pattern of strings in the file path and file name? Thanks.
I would like to loop through months and days
In order to do this, you can pass two parameters to the activity from your pipeline so that the path can be built dynamically based on those parameters. ADF V2 allows you to pass parameters.
Let's start the process one by one:
1. Create a pipeline and pass two parameters to it for your month and day.
Note: these parameters can also be passed from the output of other activities if needed. Reference: Parameters in ADF
2. Create two datasets.
2.1 Sink dataset - Blob Storage here. Link it with your linked service and provide the container name (make sure it exists). Again, if needed, this can be passed as a parameter.
2.2 Source dataset - Blob Storage here again, or whatever suits your need. Link it with your linked service and provide the container name (make sure it exists). Again, if needed, this can be passed as a parameter.
Note:
1. The folder path decides where the data is copied to. If the container does not exist, the activity will create it for you, and if the file already exists it will be overwritten by default.
2. Pass parameters into the dataset if you want to build the output path dynamically. Here I have created two dataset parameters named monthcopy and datacopy.
3. Create Copy Activity in the pipeline.
Wildcard Folder Path:
@{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}
where the path becomes current-yyyy/month-passed/day-passed/* (the * matches any single folder at that level).
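To make that concrete, suppose the pipeline ran on 2024-06-15 (UTC) with the run parameters below; the wildcard folder path above would then resolve to 2024/05/20/* (the year comes from adddays(utcnow(),-1), i.e. yesterday; the date and parameter values are only an example, not from the original post).
{
    "month": "05", // example value passed at run time
    "day": "20"    // example value passed at run time
}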
Test Result:
JSON Template for the pipeline:
{
"name": "pipeline2",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFolderPath": {
"value": "#{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}",
"type": "Expression"
},
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".csv"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "DelimitedText1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "DelimitedText2",
"type": "DatasetReference",
"parameters": {
"monthcopy": {
"value": "#pipeline().parameters.month",
"type": "Expression"
},
"datacopy": {
"value": "#pipeline().parameters.day",
"type": "Expression"
}
}
}
]
}
],
"parameters": {
"month": {
"type": "string"
},
"day": {
"type": "string"
}
},
"annotations": []
}
}
JSON Template for the Source dataset (used as the copy activity input):
{
"name": "DelimitedText1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "corpdata"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}
JSON Template for the Sink dataset (used as the copy activity output):
{
"name": "DelimitedText2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"monthcopy": {
"type": "string"
},
"datacopy": {
"type": "string"
}
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"folderPath": {
"value": "#concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),dataset().monthcopy,'/',dataset().datacopy)",
"type": "Expression"
},
"container": "copycorpdata"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}

How do I map my row key in source to row key in sink? Not working as expected

I am using Azure Data Factory to transform a blob and then copy it into Azure Table Storage using the copy activity. The pipeline consists of the data flow, then the copy activity.
Everything is working well except for mapping my row key. I used the transform activity to generate a row key based on product SKUs. When I go to the sink and select "use source column", it shows a "select option" under "row key column", but my only option is "Add dynamic content" and I don't get any column options.
My row key in the source is mapped via the mapping function, but it appears that without specifying dynamic content it is overridden with a unique identifier by default. Since these are product SKUs and I want to merge data, I absolutely need my own row keys.
I am new to ADF. I've tried referencing my row key using dynamic content, but it only generates not-found errors. I'm not sure if I'm referencing the source correctly when writing an expression.
I've contacted Azure support and I'm awaiting a response, but we'll see.
{
"name": "BlobTransformsToTable",
"properties": {
"activities": [
{
"name": "TransformThenCopyToTable",
"type": "Copy",
"dependsOn": [
{
"activity": "MyVendorCatalog",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSetting",
"wildcardFileName": "*.*"
},
"formatSettings": {
"type": "DelimitedTextReadSetting"
}
},
"sink": {
"type": "AzureTableSink",
"azureTableInsertType": "merge",
"azureTableDefaultPartitionKeyValue": "CatalogItem",
"writeBatchSize": 10000
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "PartitionKey",
"type": "String"
},
"sink": {
"name": "PartitionKey"
}
},
{
"source": {
"name": "RowKey",
"type": "String"
},
"sink": {
"name": "RowKey"
}
},
Screenshot1
Screenshot2
I am expecting it to map the row key in the source to the row key in the sink, but it is still generating unique values. I'm open to working around the problem by using dynamic content, but I'm not sure how to get it working.
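Not a confirmed fix, just a sketch that may be worth trying: the Azure Table sink also has azureTablePartitionKeyName and azureTableRowKeyName properties that point the keys at source columns (the column names below are taken from the mapping above), which is meant to stop the sink from generating its own row key values:
"sink": {
    "type": "AzureTableSink",
    "azureTableInsertType": "merge",
    "azureTableDefaultPartitionKeyValue": "CatalogItem",
    "azureTablePartitionKeyName": "PartitionKey", // source column to use as the partition key
    "azureTableRowKeyName": "RowKey",             // source column to use as the row key
    "writeBatchSize": 10000
}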

Creating a container in Azure storage before uploading the data using ADF v2

Thanks in advance; I'm new to ADF and have created a pipeline from the ADF portal. The source is an on-premises server folder and the destination dataset is Azure Blob storage. I'm using a tumbling window trigger which passes the window start time and end time and only uploads the latest data using the last-modified datetime.
Query: if I want to create sub-containers on the fly in Azure storage, I use /container/$monthvariable and it automatically creates a sub-container based on the month variable.
For example, here my source is
dfac/
$monthvariable = 5
If I put
dfac/$monthvariable
then all the files will be uploaded under dfac/5/ and will look like below
dfac/5/file1
dfac/5/file2
dfac/5/file3
Here in ADF I want to get the month of the pipeline run and use it in the destination path. Is that something I can do, and where can I define the variable?
{
"name": "Destination",
"value": "dfac/$monthvariable"// does it work and is this the right way to do this stuff
}
My actual code looks like below.
{
"name": "Copy_ayy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 2,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "/*"
},
{
"name": "Destination",
"value": "dfac/"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "SourceDataset_ayy",
"type": "DatasetReference",
"parameters": {
"cw_modifiedDatetimeStart": "#pipeline().parameters.windowStart",
"cw_modifiedDatetimeEnd": "#pipeline().parameters.windowEnd"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_ayy",
"type": "DatasetReference"
}
]
}
I believe you are using the Copy Data tool. It can also help you with the destination path part: it will create the parameters for you.
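For illustration only, here is one way the month could be derived from the tumbling window start time and handed to the output dataset; the parameter name monthFolder and the assumption that DestinationDataset_ayy exposes a folder-path parameter are mine, not from the original post:
"outputs": [
    {
        "referenceName": "DestinationDataset_ayy",
        "type": "DatasetReference",
        "parameters": {
            "monthFolder": {
                "value": "@{formatDateTime(pipeline().parameters.windowStart, 'MM')}", // e.g. "05"
                "type": "Expression"
            }
        }
    }
]
Inside the dataset, the folder path would then be an expression such as @concat('dfac/', dataset().monthFolder), and the blob sink creates that folder on the fly when it writes the files.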

Using ADF REST connector to read and transform FHIR data

I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:
{
"resourceType": "Bundle",
"id": "som-id",
"type": "searchset",
"link": [
{
"relation": "next",
"url": "https://fhirserver/?ct=token"
},
{
"relation": "self",
"url": "https://fhirserver/"
}
],
"entry": [
{
"fullUrl": "https://fhirserver/Organization/1234",
"resource": {
"resourceType": "Organization",
"id": "1234",
// More fields
},
{
"fullUrl": "https://fhirserver/Organization/456",
"resource": {
"resourceType": "Organization",
"id": "456",
// More fields
},
// More resources
]
}
Basically, it is a bundle of resources. I would like to transform that into a newline-delimited JSON (aka ndjson) file where each line is just the JSON for a resource:
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:
{
"name": "AzureBlob1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": "myout.json",
"folderPath": "outfhirfromadf"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
And configure a copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"resource": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
But in the end (in spite of configuring the schema mapping), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.
Any suggestions would be much appreciated.
So I sort of found a solution. If I keep the original step, where the bundles are simply dumped into the JSON file, and then do another conversion from that JSON file to what I pretend is a text file in another blob, I can get the ndjson file created.
Basically, define another blob dataset:
{
"name": "AzureBlob2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"structure": [
{
"name": "Prop_0",
"type": "String"
}
],
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "",
"quoteChar": "",
"nullValue": "\\N",
"encodingName": null,
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": "myout.json",
"folderPath": "adfjsonout2"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Note that this one uses TextFormat and also that the quoteChar is blank. If I then add another copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['resource']": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
},
{
"name": "Copy Data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"resource": "Prop_0"
}
}
},
"inputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob2",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
I would still love to hear about it if somebody has a one-step solution.
As briefly discussed in the comments, the Copy Activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy Activity performs the following operations:
Reads data from a source data store.
Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
Writes data to the sink/destination data store.
It does not look like the Copy Activity does anything else aside from efficiently copying stuff around.
What I found to work was to use Databricks.
Here are the steps:
Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala, Python, or .NET, which was recently announced).
The script would do the following:
Read the data from the Blob storage;
Filter out & transform the data as needed;
Write the data back to a Blob storage;
You can test your script from there and, once ready, you can go back to your pipeline and create a Notebook activity that will point to your notebook containing the script.
I struggled coding in Scala but it was worth it :)
For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you have to have a storage account linked to your FHIR server.
https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
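If you want to kick the export off from Data Factory itself, one option is a Web activity that calls the system-level $export endpoint. A minimal sketch, reusing the placeholder server URL from above and assuming authentication is handled separately; the Accept and Prefer headers are the ones the bulk data spec requires:
{
    "name": "KickOffFhirExport",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://fhirserver/$export", // placeholder base URL
        "method": "GET",
        "headers": {
            "Accept": "application/fhir+json",
            "Prefer": "respond-async"
        }
    }
}
The server replies asynchronously (HTTP 202 with a status location to poll), and the exported ndjson files end up in the storage account linked to the FHIR server.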
