ADF V2 - Parameterize a data copy pipeline based on a table column

ADF V2 - Parameterize a data copy pipeline based on a table column - azure

with Azure Data Factory V2, through the portal
https://adf.azure.com
I created a Pipeline for incremental copying of data from multiple tables, from one Azure SQL database to another Azure SQL database.
To create it, I have adapted the following example to my needs:
Incrementally load data from multiple tables
Following is the json file related to the pipeline created:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * \nfrom watermarktable \nwhere TableName = '#{item().TABLE_NAME}'",
"type": "Expression"
}
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select MAX(#{item().WaterMark_Column}) as NewWatermarkvalue \nfrom #{item().TABLE_NAME}",
"type": "Expression"
}
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * from #{item().TABLE_NAME} \nwhere #{item().WaterMark_Column} > '#{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and #{item().WaterMark_Column} <= '#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'",
"type": "Expression"
}
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"sqlWriterStoredProcedureName": {
"value": "#{item().StoredProcedureNameForMergeOperation}",
"type": "Expression"
},
"sqlWriterTableType": {
"value": "#{item().TableType}",
"type": "Expression"
}
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "#{item().TABLE_NAME}"
}
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[dbo].[sp_write_watermark]",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": {
"value": "#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "Expression"
},
"type": "DateTime"
},
"TableName": {
"value": {
"value": "#{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "Expression"
},
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService_dest",
"type": "LinkedServiceReference"
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object",
"defaultValue": [
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_project_table"
}
]
}
}
}
}
In my table I have a column that distinguishes between different companies and so I would like to add another parameter to this pipeline. I have a table like this:
NAME LASTMODIFY COMPANY
John 2015-01-01 00:00:00.000 1
Mike 2016-02-02 01:23:00.000 2
Andy 2017-03-04 05:16:00.000 3
Annie 2018-09-08 00:00:00.000 1
Someone would know how to insert a parameter into the pipeline in order to specify which company to copy and which one to not copy?
Does any suggestion? Thanks in advance to everyone!

Not exactly clear on what you're asking, so apologies if I am missing the mark, but:
Copy allows for a stored procedure that you can use to potentially solve your problem. Take a look at this example: https://learn.microsoft.com/en-us/azure/data-factory/connector-sql-server#invoking-stored-procedure-for-sql-sink
It uses a Stored Procedure to MERGE performing an UPDATE or INSERT dependent on the JOIN matching. It also allows for parameters to be passed.
So if you are trying to COPY only certain cases based on a parameter, the MERGE join may help.

Related

How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

I have some json files in ADLS account. The files are ingested in multiple Year/Month/Day directory structure. I want to copy all the files from ADLS to Azure SQL DB using azure data flow.
I am able to ingest the data from using data flow but I want to include the file path, file ingestion date along with the file name in three separate columns but I do not know how to get these values.
Please note that each Day directory has more than one file as following:
container_name/Dataset/Year/Month/Day/file1.json.file2.json,file3.json
Could any one help me , how do I ingest the modified date column in table with data of each files
tried using getmedata to copy each file on by one also in dataflow derived column for any modified date

I have reproduced the above and able to get the desired file by using combination of addional column option in copy activity, lookup and Get Meta data activity.
In this these are my datasets which I have used at various activities with dataset parameters.
Source_files_wild_path:
temporary_filepaths:
Each_file:
intermediate:
target_folder:
AFAIK, in ADF we can get the last modified date of files either by REST APIs or Get Meta data. But Get Meta data won't work with dynamic file paths with a folder structure like yours.
Also, we can get the file path of a blob file either from triggers or additonal column option of copy activity only. Here, as there is no usage of triggers, I have used the 2nd method.
So, First I have used a copy activity with wild card path to all source files and added $$FILEPATH as column and copied to a temporary file temp1.csv with Merge files as copy behavior.
Then I have used a lookup activity to temp1.csv to get the file as array of objects by which I can get the file paths list.
Here I have created two variables of array type.
As it is lookup output is an array objects, to get only the filename object array, use a for loop and append the #item().filepath to path_list array.
Then use the below expression to get the unique list of all file paths in unique_path_list array.
#union(variables('path_list'),variables('path_list'))
Now, use this array in a ForEach and inside Foreach, use a Get Meta data activity with each_file dataset and #item() as filename and add the filedsList like Item name and Last modified.
Then use copy activity inside Foreach, and use the same dataset. Here add the additional columns like filename, filepath and last modified and give those values.
In sink of this copy activity use another temporary folder and staging(dataset intermediate). give random file name using date function.
After ForEach, use another copy activity with intermediate dataset as source(use wild card path *.csv and give any empty string to dataset parameter) and target_folder folder as sink to get the result file by using merge files.
My pipeline JSON:
{
"name": "last_modifed_pipeline_copy1",
"properties": {
"activities": [
{
"name": "for_paths_columns",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "filepath",
"value": "$$FILEPATH"
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "*/*/*",
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Source_files_wild_card_path",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
}
]
},
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [
{
"activity": "for_paths_columns",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "append filepaths array",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable1",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "path_list",
"value": {
"value": "#item().filepath",
"type": "Expression"
}
}
}
]
}
},
{
"name": "get_unique_paths array",
"type": "SetVariable",
"dependsOn": [
{
"activity": "append filepaths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "unique_path_list",
"value": {
"value": "#union(variables('path_list'),variables('path_list'))",
"type": "Expression"
}
}
},
{
"name": "adds_last modifed column",
"type": "ForEach",
"dependsOn": [
{
"activity": "get_unique_paths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#variables('unique_path_list')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Get Metadata1",
"type": "GetMetadata",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "#item()",
"type": "Expression"
}
}
},
"fieldList": [
"itemName",
"lastModified"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "Copy data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Get Metadata1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "file_path",
"value": "$$FILEPATH"
},
{
"name": "file_name",
"value": {
"value": "#activity('Get Metadata1').output.itemName",
"type": "Expression"
}
},
{
"name": "last_modifed",
"value": {
"value": "#activity('Get Metadata1').output.lastModified",
"type": "Expression"
}
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "#item()",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": {
"value": "#concat(utcNow(),'.csv')",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Copy data3",
"type": "Copy",
"dependsOn": [
{
"activity": "adds_last modifed column",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": "No value"
}
}
],
"outputs": [
{
"referenceName": "target_folder",
"type": "DatasetReference"
}
]
}
],
"variables": {
"path_list": {
"type": "Array"
},
"unique_path_list": {
"type": "Array"
}
},
"annotations": [],
"lastPublishTime": "2023-01-27T12:40:51Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
My pipeline:
Result file:
NOTE:
If you want run this on a regular basis, use Storage event trigger by which you can use trigger parameters like #triggerBody().folderPath and #triggerBody().fileName. you can give these to Get Meta data to get last modified time and then pass it to copy activity or dataflow to add as additonal column as per your requirement.

Loop over each file in folder directory and check date Azure Data Factory V2 -wrong code

I want to loop over each file in a stfp folder and check whether it is new or not and then copy the new files on a Data Lake
Right now I have the below code but I don't think it is correct. There is no usage of #item() in the second GetLastModifyfromFile activity to refer to the items last date in the loop but rather to a completely different data set called SrcLocalFile.
{
"name": "IncrementalloadfromSingleFolder",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalDir",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileList').output.childItems",
"type": "Expression"
},
"activities": [
{
"name": "GetLastModifyfromFile",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
},
"fieldList": [
"lastModified"
]
}
},
{
"name": "IfNewFile",
"type": "IfCondition",
"dependsOn": [
{
"activity": "GetLastModifyfromFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "#and(less(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.current_time), greaterOrEquals(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.last_time))",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyNewFiles",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "TgtBooksBlob",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
],
"parameters": {
"current_time": {
"type": "String",
"defaultValue": "2018-04-01T00:00:00Z"
},
"last_time": {
"type": "String",
"defaultValue": "2018-03-01T00:00:00Z"
}
},
"folder": {
"name": "IncrementalLoadSingleFolder"
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Just a thought - I don't see your dataset definition but...
Should you pass in path and file name to the dataset as parameters?
i.e. add 2 parameters to the dataset definition for path and file (say pathparam and fileparam). Use those parameters in the dataset's fileName and folderName settings as #dataset().pathparam and #dataset().fileparam.
In the code above, pass in parameters a new "parameters" section of the dataset input with pathparam and fileparam equal to the folder and child item you retrieved from earlier activity.
note - there was a bug that the dataset name could not have spaces in it.

Dynamic Azure Data Factory v2 pipelines

So we've got a factory with ~400 datasets and ~200 pipelines and it's getting unwieldy. Focusing on copying from sql source to blob sink. Since we are copying to blob the schema has no impact. I'd like to have one dataset for each source, one dataset for each blob account and one pipeline for each combination of source/blob account, dynamically feeding it the config from a lookup.
We've successfully developed a pipeline that uses dummy datasets for source and sink. It works if you feed it a query, container name and folder name.
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "DynamicCopy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select 1 a"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": "raw-test",
"folder": "test"
}
}
]
}
]
}
}
When we put a lookup before it and wrap it in a foreach, it stops working. With the not so helpful
"errorCode": "400",
"message": "Activity failed because an inner activity failed",
"failureType": "UserError",
"target": "ForEach"
The lookup stored procedure [dbo].[adfdynamic] is not actually referred to in the foreach yet:
create proc adfdynamic as
select 'raw-test' container, 'test_a' folder, 'select 1 a, 2 b'
UNION ALL
select 'raw-test' container, 'test_b' folder, 'select 3 c, 2 d'
So what I desired behaviour is:
one blob in raw-test#..myblob.../test_a/out.dsv with content {'a,b','1,2'}
one blob in raw-test#..myblob.../test_b/out.dsv with content {'c,d','3,2'}
sql dataset:
{
"name": "AzureSql",
"properties": {
"linkedServiceName": {
"referenceName": "Dest",
"type": "LinkedServiceReference"
},
"type": "AzureSqlTable",
"structure": [
{
"name": "CustomerKey",
"type": "Int32"
},
{
"name": "Name",
"type": "String"
}
],
"typeProperties": {
"tableName": "[dbo].[DimCustomer]"
}
}
}
blob dataset:
{
"name": "AzureBlob",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"container": {
"type": "String"
},
"folder": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"treatEmptyAsNull": false,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": {
"value": "#{dataset().folder}/out.dsv",
"type": "Expression"
},
"folderPath": {
"value": "#dataset().container",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
and the non-working dynamic pipeline:
{
"name": "Copy",
"properties": {
"activities": [
{
"name": "ForEach",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('Lookup').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select 1 a, 2 b from dest",
"type": "Expression"
}
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "AzureSql",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob",
"type": "DatasetReference",
"parameters": {
"container": {
"value": "raw-test",
"type": "Expression"
},
"folder": {
"value": "folder",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Lookup",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
}
}
]
}
}
Apologies about the formatting. too much code in one message?

In you lookup activity, please check whether your firstRowOnly property. Is it false or True? By default, it is true.
In the UI, you could set a breakpoint to debug your lookup activity. Then you could see whether the output is what you want.

Not exactly an answer to your question, but something I did to make life simpler was to create a Dataset called GenericBlob. This had 2 parameters container and path.
This may help simplify what you're doing. I too used to have 20 blob datasets, now I have one ... (this is assuming the blobs are in the same storage account).

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

I'm trying to backup my Cosmos Db storage using Azure Data Factory(v2). In general, it's doing its job, but I want to have each doc in Cosmos collection to correspond new json file in blobs storage.
With next copying params i'm able to copy all docs in collection into 1 file in azure blob storage:
{
"name": "ForEach_mih",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.cw_items",
"type": "Expression"
},
"activities": [
{
"name": "Copy_mih",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"userProperties": [
{
"name": "Source",
"value": "#{item().source.collectionName}"
},
{
"name": "Destination",
"value": "cosmos-backup-v2/#{item().destination.fileName}"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"path": "cosmos-backup-logs"
},
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_collectionName": "#item().source.collectionName"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_fileName": "#item().destination.fileName"
}
}
]
}
]
}
}
How I can copy each cosmos doc to separate file and give it name the as {PartitionId}-{docId}?
UPD
Source set code:
{
"name": "ClustersData",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_CosmosDb",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "directory-clusters"
}
}
}
Destination set code:
{
"name": "OutputClusters",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "",
"folderPath": "cosmos-backup-logs"
}
}
}
Pipeline code:
{
"name": "copy-clsts",
"properties": {
"activities": [
{
"name": "LookupClst",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"dataset": {
"referenceName": "ClustersData",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachClst",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupClst",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupClst').output.value",
"type": "Expression"
},
"batchCount": 8,
"activities": [
{
"name": "CpyClst",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "select #{item()}",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "ClustersData",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputClusters",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
Example of doc in input collection (all the same format):
{
"$type": "Entities.ADCluster",
"DisplayName": "TESTNetBIOS",
"OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"AllowLdapLifeCycleSynchronization": true,
"DirectoryServers": [
{
"$type": "Entities.DirectoryServer",
"AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
"Port": 389,
"Host": "192.168.342.234"
}
],
"DomainNames": [
"TESTNetBIOS"
],
"BaseDn": null,
"UseSsl": false,
"RepositoryType": 1,
"DirectoryCustomizations": null,
"_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
"LastUpdateTime": "2018-04-05T15:00:40.243Z",
"id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
"_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
"_attachments": "attachments/",
"_ts": 1522940440
}

Since your cosmosdb has array and ADF doesn't support serialize array for cosmos db, this is the workaround I can provide.
First, export all your document to json files with export json as-is (to blob or adls or file systems, any file storage). I think you already knows how to do it. In this way, each collection will have a json file.
Second, handle each json file, to exact each row in the file to a single file.
I only provide pipeline for step 2. You could use execute pipeline activity to chain step 1 and step 2. And you could even handle all the collections in step 2 with a foreach activity.
Pipeline json
{
"name": "pipeline27",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"dataset": {
"referenceName": "AzureBlob7",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select #{item()}",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "DocumentDbCollection1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob6",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"PartitionKey": {
"value": "#item().PartitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
dataset json for lookup
{
"name": "AzureBlob7",
"properties": {
"linkedServiceName": {
"referenceName": "bloblinkedservice",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "cosmos.json",
"folderPath": "aaa"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Source dataset for copy. Actually, this dataset has no use. Just want to use it to host the query (select #{item()}
{
"name": "DocumentDbCollection1",
"properties": {
"linkedServiceName": {
"referenceName": "CosmosDB-r8c",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "test"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Destination dataset. With two parameters, it also addressed your file name request.
{
"name": "AzureBlob6",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage-eastus",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "String"
},
"PartitionKey": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": {
"value": "#{dataset().PartitionKey}-#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "aaacosmos"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
please also note the limitation of Lookup activity:
The following data sources are supported for lookup. The maximum number of rows can be returned by Lookup activity is 5000, and up to 2MB in size. And currently the max duration for Lookup activity before timeout is one hour.

Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.
You could consider having an Azure Function that is triggered when documents are added / updated in your collection and have the Azure Function output the document to blob storage. This should scale well and would be relatively easy to implement.

Just take one collection as an example.
And inside the foreach:
And your lookup and copy activity source dataset reference the same cosmosdb dataset.
If you want to copy your 5 collections, you could put this pipeline into an execute activity. And the master pipeline of the execute activity has a foreach activity.

I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamps to query the Cosmos data with, then for each of those, getting the document IDs in that range, and then for each of those, getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here's the pipelines I created:
LookupTimestamps - loops through each timestamp range from a times.json file, and for each timestamp, executes the ExportFromCosmos pipeline
{
"name": "LookupTimestamps",
"properties": {
"activities": [
{
"name": "LookupTimestamps",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"dataset": {
"referenceName": "BlobStorageTimestamps",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachTimestamp",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupTimestamps",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupTimestamps').output.value",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Execute Pipeline1",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "ExportFromCosmos",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"From": {
"value": "#{item().From}",
"type": "Expression"
},
"To": {
"value": "#{item().To}",
"type": "Expression"
}
}
}
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact you can't have nested ForEach activities.
{
"name": "ExportFromCosmos",
"properties": {
"activities": [
{
"name": "LookupDocuments",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select c.id, c.partitionKey from c where c._ts >= #{pipeline().parameters.from} and c._ts <= #{pipeline().parameters.to} order by c._ts desc",
"type": "Expression"
},
"nestingSeparator": "."
},
"dataset": {
"referenceName": "CosmosDb",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachDocument",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupDocuments",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupDocuments').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select * from c where c.id = \"#{item().id}\" and c.partitionKey = \"#{item().partitionKey}\"",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "CosmosDb",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobStorageDocuments",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"partitionKey": {
"value": "#item().partitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"from": {
"type": "int"
},
"to": {
"type": "int"
}
}
}
}
BlobStorageTimestamps - dataset for the times.json file
{
"name": "BlobStorageTimestamps",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "times.json",
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
BlobStorageDocuments - dataset for where the documents will be saved
{
"name": "BlobStorageDocuments",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "string"
},
"partitionKey": {
"type": "string"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": {
"value": "#{dataset().partitionKey}/#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
The times.json file it just a list of epoch times and looks like this:
[{
"From": 1556150400,
"To": 1556236799
},
{
"From": 1556236800,
"To": 1556323199
}]

Get Metadata Activity ADF V2

Can anyone explain me, what is the use of Get Metadata Activity that is newly introduced in ADF V2?
Actually, the information that is given in learn.microsoft.com isn't enough to understand the uses of this Activity.

Main purpose of the Get Metadata Activity is:
Validate the metadata information of any data
Trigger a pipeline when data is ready/ available
The following example shows how to incrementally load changed files from a folder using the Get Metadata Activity getting filenames and modified Timestamp:
{
"name": "IncrementalloadfromSingleFolder",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalDir",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileList').output.childItems",
"type": "Expression"
},
"activities": [
{
"name": "GetLastModifyfromFile",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
},
"fieldList": [
"lastModified"
]
}
},
{
"name": "IfNewFile",
"type": "IfCondition",
"dependsOn": [
{
"activity": "GetLastModifyfromFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "#and(less(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.current_time), greaterOrEquals(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.last_time))",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyNewFiles",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "TgtBooksBlob",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
],
"parameters": {
"current_time": {
"type": "String",
"defaultValue": "2018-04-01T00:00:00Z"
},
"last_time": {
"type": "String",
"defaultValue": "2018-03-01T00:00:00Z"
}
},
"folder": {
"name": "IncrementalLoadSingleFolder"
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
See also recently updated documentation.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

ADF V2 - Parameterize a data copy pipeline based on a table column - azure

Related

How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

Loop over each file in folder directory and check date Azure Data Factory V2 -wrong code

Dynamic Azure Data Factory v2 pipelines

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

Get Metadata Activity ADF V2

Categories

Resources