Problems with rule-based mapping in Azure Dataflow

Problems with rule-based mapping in Azure Dataflow - azure

I have multiple XMLs in an Azure Blob. I want to extract certain datapoints and sort them into an Azure SQL db.
For this I use Data Flow. The name of certain elements changes sometimes [random name] so i would like to set up a rule-based mapping, that fetches the right values every time .
I want to retrieve IMPORTANT INFORMATION, which is always located in the first randomly named child of category_a.
Apart from the randomly named object the rest of the structure always stays the same.
This is about the structure:
<title>
<category_a>
<random_name_1>
<object_a>
<subobject_object_a>
<p>IMPORTANT INFORMATION</p>
</subobject_object_a>
</object_a>
</random_name_1>
<random_name_2>
<object_a>
<subobject_object_a>
<p>IRRELEVANT INFORMATION</p>
</subobject_object_a>
</object_a>
</random_name_2>
</category_a>
<category_b></category_b>
How do I need to write the rule based mapping so that I always fetch this value no matter the random name in the middle of the path?
Thanks for your help

There might be no option to find or use rule-based mapping to retrieve the important information. As a work around, I have used lookup and string manipulation activities to get the result. The following is the pipeline JSON:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "XmlSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "XmlReadSettings",
"validationMode": "none",
"namespaces": true
}
},
"dataset": {
"referenceName": "Xml1",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "tp",
"value": {
"value": "#replace(replace(split(string(activity('Lookup1').output.value[0].title.category_a),':')[0],'\"',''),'{','')",
"type": "Expression"
}
}
},
{
"name": "Set variable2",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Set variable1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "tp2",
"value": {
"value": "#replace(replace(split(string(activity('Lookup1').output.value[0].title.category_a[variables('tp')]),':')[0],'\"',''),'{','')",
"type": "Expression"
}
}
},
{
"name": "Set variable3",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Set variable2",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "tp3",
"value": {
"value": "#replace(replace(split(string(activity('Lookup1').output.value[0].title.category_a[variables('tp')][variables('tp2')]),':')[0],'\"',''),'{','')",
"type": "Expression"
}
}
},
{
"name": "Set variable4",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Set variable3",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "tp4",
"value": {
"value": "#replace(replace(split(string(activity('Lookup1').output.value[0].title.category_a[variables('tp')][variables('tp2')][variables('tp3')]),':')[0],'\"',''),'{','')",
"type": "Expression"
}
}
},
{
"name": "Set variable5",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Set variable4",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "final",
"value": {
"value": "#activity('Lookup1').output.value[0].title.category_a[variables('tp')][variables('tp2')][variables('tp3')][variables('tp4')]",
"type": "Expression"
}
}
}
],
"variables": {
"tp": {
"type": "String"
},
"tp2": {
"type": "String"
},
"tp3": {
"type": "String"
},
"tp4": {
"type": "String"
},
"final": {
"type": "String"
}
},
"annotations": []
}
}
The lookup has the source xml file. Since we know the category_a is a child element, I have started from there to obtain the child element names (in each set variable activity).
The following is the output image for reference:

Related

How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

I have some json files in ADLS account. The files are ingested in multiple Year/Month/Day directory structure. I want to copy all the files from ADLS to Azure SQL DB using azure data flow.
I am able to ingest the data from using data flow but I want to include the file path, file ingestion date along with the file name in three separate columns but I do not know how to get these values.
Please note that each Day directory has more than one file as following:
container_name/Dataset/Year/Month/Day/file1.json.file2.json,file3.json
Could any one help me , how do I ingest the modified date column in table with data of each files
tried using getmedata to copy each file on by one also in dataflow derived column for any modified date

I have reproduced the above and able to get the desired file by using combination of addional column option in copy activity, lookup and Get Meta data activity.
In this these are my datasets which I have used at various activities with dataset parameters.
Source_files_wild_path:
temporary_filepaths:
Each_file:
intermediate:
target_folder:
AFAIK, in ADF we can get the last modified date of files either by REST APIs or Get Meta data. But Get Meta data won't work with dynamic file paths with a folder structure like yours.
Also, we can get the file path of a blob file either from triggers or additonal column option of copy activity only. Here, as there is no usage of triggers, I have used the 2nd method.
So, First I have used a copy activity with wild card path to all source files and added $$FILEPATH as column and copied to a temporary file temp1.csv with Merge files as copy behavior.
Then I have used a lookup activity to temp1.csv to get the file as array of objects by which I can get the file paths list.
Here I have created two variables of array type.
As it is lookup output is an array objects, to get only the filename object array, use a for loop and append the #item().filepath to path_list array.
Then use the below expression to get the unique list of all file paths in unique_path_list array.
#union(variables('path_list'),variables('path_list'))
Now, use this array in a ForEach and inside Foreach, use a Get Meta data activity with each_file dataset and #item() as filename and add the filedsList like Item name and Last modified.
Then use copy activity inside Foreach, and use the same dataset. Here add the additional columns like filename, filepath and last modified and give those values.
In sink of this copy activity use another temporary folder and staging(dataset intermediate). give random file name using date function.
After ForEach, use another copy activity with intermediate dataset as source(use wild card path *.csv and give any empty string to dataset parameter) and target_folder folder as sink to get the result file by using merge files.
My pipeline JSON:
{
"name": "last_modifed_pipeline_copy1",
"properties": {
"activities": [
{
"name": "for_paths_columns",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "filepath",
"value": "$$FILEPATH"
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "*/*/*",
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Source_files_wild_card_path",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
}
]
},
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [
{
"activity": "for_paths_columns",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "append filepaths array",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable1",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "path_list",
"value": {
"value": "#item().filepath",
"type": "Expression"
}
}
}
]
}
},
{
"name": "get_unique_paths array",
"type": "SetVariable",
"dependsOn": [
{
"activity": "append filepaths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "unique_path_list",
"value": {
"value": "#union(variables('path_list'),variables('path_list'))",
"type": "Expression"
}
}
},
{
"name": "adds_last modifed column",
"type": "ForEach",
"dependsOn": [
{
"activity": "get_unique_paths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#variables('unique_path_list')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Get Metadata1",
"type": "GetMetadata",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "#item()",
"type": "Expression"
}
}
},
"fieldList": [
"itemName",
"lastModified"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "Copy data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Get Metadata1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "file_path",
"value": "$$FILEPATH"
},
{
"name": "file_name",
"value": {
"value": "#activity('Get Metadata1').output.itemName",
"type": "Expression"
}
},
{
"name": "last_modifed",
"value": {
"value": "#activity('Get Metadata1').output.lastModified",
"type": "Expression"
}
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "#item()",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": {
"value": "#concat(utcNow(),'.csv')",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Copy data3",
"type": "Copy",
"dependsOn": [
{
"activity": "adds_last modifed column",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": "No value"
}
}
],
"outputs": [
{
"referenceName": "target_folder",
"type": "DatasetReference"
}
]
}
],
"variables": {
"path_list": {
"type": "Array"
},
"unique_path_list": {
"type": "Array"
}
},
"annotations": [],
"lastPublishTime": "2023-01-27T12:40:51Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
My pipeline:
Result file:
NOTE:
If you want run this on a regular basis, use Storage event trigger by which you can use trigger parameters like #triggerBody().folderPath and #triggerBody().fileName. you can give these to Get Meta data to get last modified time and then pass it to copy activity or dataflow to add as additonal column as per your requirement.

Copy Files from a folder to multiple folders based on the file name in Azure Data Factory

I have a parent folder in ADLS Gen2 called Source which has number of subfolders and these subfolders contain the actual data files as shown in in the below example...
***Source: ***
Folder Name: 20221212
A_20221212.txt B_20221212.txt C_20221212.txt
Folder Name: 20221219
A_20221219.txt B_20221219.txt C_20221219.txt
Folder Name: 20221226
A_20221226.txt B_20221226.txt C_20221226.txt
How can I copy files from subfolders to name specific folders (should create a new folder if it does not exist) using Azure Data Factory, please see the example below...
***Target: ***
Folder Name: A
A_20221212.txt A_20221219.txt A_20221226.txt
Folder Name: B
B_20221212.txt B_20221219.txt B_20221226.txt
Folder Name: C
C_20221212.txt C_20221219.txt C_20221226.txt
Really appreciate your and help.

I have reproduced the above and got below results.
You can follow the below procedure using Get Meta data activity if you have the folder directories at same level.
This is my source folder structure.
data
20221212
A_20221212.txt
B_20221212.txt
C_20221212.txt`
20221219
A_20221219.txt
B_20221219.txt
C_20221219.txt
20221226
A_20221226.txt
B_20221226.txt
C_20221226.txt
Source dataset:
Give this to Get Meta data activity and use ChildItems.
Then Give the ChildItems array from Get Meta data activity to a ForEach activity. Inside ForEach I have used set variable for storing folder name.
#split(item().name,'_')[0]
Now, use copy activity and in source use wild card path like below.
For sink create dataset parameters and give it copy activity sink like below.
My pipeline JSON:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Get Metadata1",
"type": "GetMetadata",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "sourcetxt",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Get Metadata1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#activity('Get Metadata1').output.childItems",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [
{
"activity": "Set variable1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "*",
"wildcardFileName": {
"value": "#item().name",
"type": "Expression"
},
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "sourcetxt",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "targettxts",
"type": "DatasetReference",
"parameters": {
"folder_name": {
"value": "#variables('folder_name')",
"type": "Expression"
},
"file_name": {
"value": "#item().name",
"type": "Expression"
}
}
}
]
},
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "folder_name",
"value": {
"value": "#split(item().name,'_')[0]",
"type": "Expression"
}
}
}
]
}
}
],
"variables": {
"folder_name": {
"type": "String"
}
},
"annotations": []
}
}
Result:

Dropping columns in Azure Data Factory based on values in columns

I'm working with a dataset where I need to drop some columns which contain only NULL values. The issue is that the column names are not consistent or similar, and can change with time. I was wondering if there is a way in ADF to drop a column if all instances are NULL without having drifted columns?
I have tried unpivoting, removing rows, then re-pivoting, however after I pivot the data back to its original format, I get the following message:
"This drifted column is not in the source schema and therefore can only be referenced with pattern matching expressions"
The drifted columns don't seem to join on subsequent join functions. I have also tried setting derived columns with regex column patters to make all the drifted columns explicit, however, the byName() function doesn't seem to work with the $$ syntax; namely:
toString(byName($$))
Any ideas of how to solve this within Azure Data Factory - Data Flows would be very much appreciated!

I have used combination of both data factory pipeline activities and dataflow to achieve the requirement.
First, I have taken dataflow to output a file. I have added a new column with all values as 1 so that I can use aggregate on all other rows using this new column to group.
I have used collect() to create an array for each of the column where group by is on above created column.
Now create another derived column to replace the array by converting array to string and calculating length. If length is 2 it indicates that column contains all nulls.
Write this dataflow output to a file. The data preview of the sink will be as follows:
Create a dataflow activity to run the above dataflow and pass the following dynamic content in execute pipeline activity to filter out and write data of only required columns.
#activity('Data flow1').output.runstatus.profile.sink1.total
In pipeline2, I have used activities to get columns that are not entirely nulls, create a dynamic schema and then use this schema as mapping and write to a file only the required columns.
First, I have read the file written at the end of dataflow without header (even though the file has header). The dataset looks as shown below:
You can directly use the following pipeline JSON to build the pipeline:
{
"name": "pipeline2",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "cols",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#range(0,pipeline().parameters.count_of_rows)",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable1",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "props",
"value": {
"value": "Prop_#{item()}",
"type": "Expression"
}
}
}
]
}
},
{
"name": "ForEach2",
"type": "ForEach",
"dependsOn": [
{
"activity": "ForEach1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#variables('props')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable2",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "req_cols",
"value": {
"value": "#if(and(not(equals(activity('Lookup1').output.value[0][item()],'tp')),not(equals(activity('Lookup1').output.value[1][item()],'2'))),activity('Lookup1').output.value[0][item()],'')",
"type": "Expression"
}
}
}
]
}
},
{
"name": "Filter1",
"type": "Filter",
"dependsOn": [
{
"activity": "ForEach2",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#variables('req_cols')",
"type": "Expression"
},
"condition": {
"value": "#not(equals(item(),''))",
"type": "Expression"
}
}
},
{
"name": "ForEach3",
"type": "ForEach",
"dependsOn": [
{
"activity": "Filter1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#activity('Filter1').output.Value",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable3",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "mapping",
"value": {
"value": "#json(concat('{\"source\":{\"name\":\"',item(),'\"},\"sink\":{\"name\":\"',item(),'\"}}'))",
"type": "Expression"
}
}
}
]
}
},
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [
{
"activity": "ForEach3",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "dynamic_schema",
"value": {
"value": "#concat('{\"type\":\"TabularTranslator\",\"mappings\":',string(variables('mapping')),'}}')",
"type": "Expression"
}
}
},
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [
{
"activity": "Set variable1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"value": "#json(variables('dynamic_schema'))",
"type": "Expression"
}
},
"inputs": [
{
"referenceName": "csv1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "req_file",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"count_of_rows": {
"type": "int"
}
},
"variables": {
"props": {
"type": "Array"
},
"req_cols": {
"type": "Array"
},
"test": {
"type": "String"
},
"mapping": {
"type": "Array"
},
"dynamic_schema": {
"type": "String"
}
},
"annotations": []
}
}
NOTE: In the copy data activity, the source is the original file.

If the source column names will change, then you have to use column patterns. When you match columns based on patterns, you can project those into columns using the Select transformation. Use the rule-based mapping option in the Select transformation with true() as the matching expression and $$ as the Name As property like this:

Azure datafactory v2 Execute Pipeline with For Each

I am trying to use "Execute Pipeline" to invoke a Pipe which has a ForEach activity. I get an error.
Json for Execute pipe:
[
{
"name": "pipeline3",
"properties": {
"activities": [
{
"name": "Test_invoke1",
"type": "ExecutePipeline",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"pipeline": {
"referenceName": "MAIN_SA_copy1",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
],
"annotations": []
}
}
]
Jason for Invoke pipe for each activity :
[
{
"name": "MAIN_SA_copy1",
"properties": {
"activities": [
{
"name": "Collect_SA_Data",
"type": "ForEach",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#pipeline().parameters.TableNames",
"type": "Expression"
},
"batchCount": 15,
"activities": [
{
"name": "Sink_SAdata_toDL",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Destination",
"value": "#{pipeline().parameters.DLFilePath}/#{item()}"
}
],
"typeProperties": {
"source": {
"type": "SqlServerSource",
"sqlReaderQuery": {
"value": "#concat('SELECT * FROM ',item())",
"type": "Expression"
}
},
"sink": {
"type": "AzureBlobFSSink"
},
"enableStaging": false,
"parallelCopies": 1,
"dataIntegrationUnits": 4
},
"inputs": [
{
"referenceName": "SrcDS_StructuringAnalytics",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "ADLS",
"type": "DatasetReference",
"parameters": {
"FilePath": "#pipeline().parameters.DLFilePath",
"FileName": {
"value": "#concat(item(),'.orc')",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"DLFilePath": {
"type": "string",
"defaultValue": "extracts/StructuringAnalytics"
},
"TableNames": {
"type": "array",
"defaultValue": [
"fom.FOMLineItem_manual"
]
}
},
"variables": {
"QryTableColumn": {
"type": "String"
},
"QryTable": {
"type": "String"
}
},
"folder": {
"name": "StructuringAnalytics"
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
]
I get an error:
[
{
"errorCode": "BadRequest",
"message": "Operation on target Collect_SA_Data failed: The execution of template action 'Collect_SA_Data' failed: the result of the evaluation of 'foreach' expression '#pipeline().parameters.TableNames' is of type 'String'. The result must be a valid array.",
"failureType": "UserError",
"target": "Test_invoke1",
"details": ""
}
]
Input:
"pipeline": {
"referenceName": "MAIN_SA_copy1",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"DLFilePath": "extracts/StructuringAnalytics",
"TableNames": "[\"fom.FOMLineItem_manual\"]"
}

Please try updating your dynamic expression of ForEach Items as below:
{
"value": "#array(pipeline().parameters.TableNames)",
"type": "Expression"
}
Hope this helps.

I guess you were using the UI to set the pipeline and its parameters and I guess you expected to put the array parameter of the called pipeline as everywhere else like this:
(It is all my guess, because I just did exactly the same, with the same result)
The trick is to define the array in the Code (["table1", "table2"]):
The input in the UI will look like this:
Now it works!
It seems, that the Datafactory is otherwise treating the whole array as one element of some array. Hence, the solution with the array() function sometimes works.
It looks like a bug, defining array parameter input..
(Had to edit the answer, I first thought omiting the colons in the UI input would be enough)

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

I'm trying to backup my Cosmos Db storage using Azure Data Factory(v2). In general, it's doing its job, but I want to have each doc in Cosmos collection to correspond new json file in blobs storage.
With next copying params i'm able to copy all docs in collection into 1 file in azure blob storage:
{
"name": "ForEach_mih",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.cw_items",
"type": "Expression"
},
"activities": [
{
"name": "Copy_mih",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"userProperties": [
{
"name": "Source",
"value": "#{item().source.collectionName}"
},
{
"name": "Destination",
"value": "cosmos-backup-v2/#{item().destination.fileName}"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"path": "cosmos-backup-logs"
},
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_collectionName": "#item().source.collectionName"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_fileName": "#item().destination.fileName"
}
}
]
}
]
}
}
How I can copy each cosmos doc to separate file and give it name the as {PartitionId}-{docId}?
UPD
Source set code:
{
"name": "ClustersData",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_CosmosDb",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "directory-clusters"
}
}
}
Destination set code:
{
"name": "OutputClusters",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "",
"folderPath": "cosmos-backup-logs"
}
}
}
Pipeline code:
{
"name": "copy-clsts",
"properties": {
"activities": [
{
"name": "LookupClst",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"dataset": {
"referenceName": "ClustersData",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachClst",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupClst",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupClst').output.value",
"type": "Expression"
},
"batchCount": 8,
"activities": [
{
"name": "CpyClst",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "select #{item()}",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "ClustersData",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputClusters",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
Example of doc in input collection (all the same format):
{
"$type": "Entities.ADCluster",
"DisplayName": "TESTNetBIOS",
"OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"AllowLdapLifeCycleSynchronization": true,
"DirectoryServers": [
{
"$type": "Entities.DirectoryServer",
"AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
"Port": 389,
"Host": "192.168.342.234"
}
],
"DomainNames": [
"TESTNetBIOS"
],
"BaseDn": null,
"UseSsl": false,
"RepositoryType": 1,
"DirectoryCustomizations": null,
"_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
"LastUpdateTime": "2018-04-05T15:00:40.243Z",
"id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
"_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
"_attachments": "attachments/",
"_ts": 1522940440
}

Since your cosmosdb has array and ADF doesn't support serialize array for cosmos db, this is the workaround I can provide.
First, export all your document to json files with export json as-is (to blob or adls or file systems, any file storage). I think you already knows how to do it. In this way, each collection will have a json file.
Second, handle each json file, to exact each row in the file to a single file.
I only provide pipeline for step 2. You could use execute pipeline activity to chain step 1 and step 2. And you could even handle all the collections in step 2 with a foreach activity.
Pipeline json
{
"name": "pipeline27",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"dataset": {
"referenceName": "AzureBlob7",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select #{item()}",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "DocumentDbCollection1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob6",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"PartitionKey": {
"value": "#item().PartitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
dataset json for lookup
{
"name": "AzureBlob7",
"properties": {
"linkedServiceName": {
"referenceName": "bloblinkedservice",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "cosmos.json",
"folderPath": "aaa"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Source dataset for copy. Actually, this dataset has no use. Just want to use it to host the query (select #{item()}
{
"name": "DocumentDbCollection1",
"properties": {
"linkedServiceName": {
"referenceName": "CosmosDB-r8c",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "test"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Destination dataset. With two parameters, it also addressed your file name request.
{
"name": "AzureBlob6",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage-eastus",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "String"
},
"PartitionKey": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": {
"value": "#{dataset().PartitionKey}-#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "aaacosmos"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
please also note the limitation of Lookup activity:
The following data sources are supported for lookup. The maximum number of rows can be returned by Lookup activity is 5000, and up to 2MB in size. And currently the max duration for Lookup activity before timeout is one hour.

Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.
You could consider having an Azure Function that is triggered when documents are added / updated in your collection and have the Azure Function output the document to blob storage. This should scale well and would be relatively easy to implement.

Just take one collection as an example.
And inside the foreach:
And your lookup and copy activity source dataset reference the same cosmosdb dataset.
If you want to copy your 5 collections, you could put this pipeline into an execute activity. And the master pipeline of the execute activity has a foreach activity.

I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamps to query the Cosmos data with, then for each of those, getting the document IDs in that range, and then for each of those, getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here's the pipelines I created:
LookupTimestamps - loops through each timestamp range from a times.json file, and for each timestamp, executes the ExportFromCosmos pipeline
{
"name": "LookupTimestamps",
"properties": {
"activities": [
{
"name": "LookupTimestamps",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"dataset": {
"referenceName": "BlobStorageTimestamps",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachTimestamp",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupTimestamps",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupTimestamps').output.value",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Execute Pipeline1",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "ExportFromCosmos",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"From": {
"value": "#{item().From}",
"type": "Expression"
},
"To": {
"value": "#{item().To}",
"type": "Expression"
}
}
}
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact you can't have nested ForEach activities.
{
"name": "ExportFromCosmos",
"properties": {
"activities": [
{
"name": "LookupDocuments",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select c.id, c.partitionKey from c where c._ts >= #{pipeline().parameters.from} and c._ts <= #{pipeline().parameters.to} order by c._ts desc",
"type": "Expression"
},
"nestingSeparator": "."
},
"dataset": {
"referenceName": "CosmosDb",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachDocument",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupDocuments",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupDocuments').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select * from c where c.id = \"#{item().id}\" and c.partitionKey = \"#{item().partitionKey}\"",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "CosmosDb",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobStorageDocuments",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"partitionKey": {
"value": "#item().partitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"from": {
"type": "int"
},
"to": {
"type": "int"
}
}
}
}
BlobStorageTimestamps - dataset for the times.json file
{
"name": "BlobStorageTimestamps",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "times.json",
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
BlobStorageDocuments - dataset for where the documents will be saved
{
"name": "BlobStorageDocuments",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "string"
},
"partitionKey": {
"type": "string"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": {
"value": "#{dataset().partitionKey}/#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
The times.json file it just a list of epoch times and looks like this:
[{
"From": 1556150400,
"To": 1556236799
},
{
"From": 1556236800,
"To": 1556323199
}]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Problems with rule-based mapping in Azure Dataflow - azure

Related

How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

Copy Files from a folder to multiple folders based on the file name in Azure Data Factory

Dropping columns in Azure Data Factory based on values in columns

Azure datafactory v2 Execute Pipeline with For Each

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

Categories

Resources