Get Metadata Activity ADF V2

Get Metadata Activity ADF V2 - azure

Can anyone explain me, what is the use of Get Metadata Activity that is newly introduced in ADF V2?
Actually, the information that is given in learn.microsoft.com isn't enough to understand the uses of this Activity.

Main purpose of the Get Metadata Activity is:
Validate the metadata information of any data
Trigger a pipeline when data is ready/ available
The following example shows how to incrementally load changed files from a folder using the Get Metadata Activity getting filenames and modified Timestamp:
{
"name": "IncrementalloadfromSingleFolder",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalDir",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileList').output.childItems",
"type": "Expression"
},
"activities": [
{
"name": "GetLastModifyfromFile",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
},
"fieldList": [
"lastModified"
]
}
},
{
"name": "IfNewFile",
"type": "IfCondition",
"dependsOn": [
{
"activity": "GetLastModifyfromFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "#and(less(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.current_time), greaterOrEquals(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.last_time))",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyNewFiles",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "TgtBooksBlob",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
],
"parameters": {
"current_time": {
"type": "String",
"defaultValue": "2018-04-01T00:00:00Z"
},
"last_time": {
"type": "String",
"defaultValue": "2018-03-01T00:00:00Z"
}
},
"folder": {
"name": "IncrementalLoadSingleFolder"
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
See also recently updated documentation.

Related

how to use Lookup activity and if activity together

i have a file on storage account emp.csv that contains. i want that from storage account a file and b file would go to database table.
emp_id,filename
1,a
2,b
3,anubhav
so for this i pass emp.csv file on lookup activity as source dataset then i use foreach activity
Inside foreach activity i used a if condition on expression
#equals(item().filename,'anubhav' )
if this expression is true then wait activity will come and wait for 1 sec. if this expression false then
but this pipeline is failing

The pipeline is failing because inside copy activity dataset properties it should be a string, but you have given it as an array value #activity('Lookup1').output.value. because of that, you getting an error.
Try to replace the array value with the string #item().filename as you can see, I reproduce the same thing in my environment and got this output.
You can use this Json pipeline activity
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "DelimitedText1",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "If Condition1",
"type": "IfCondition",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "#equals(item().filename, 'anubhav.csv')",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "abcsv",
"type": "DatasetReference",
"parameters": {
"file": {
"value": "#item().filename",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "DelimitedText2",
"type": "DatasetReference",
"parameters": {
"file": {
"value": "#item().filename",
"type": "Expression"
}
}
}
]
}
],
"ifTrueActivities": [
{
"name": "Wait1",
"type": "Wait",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"waitTimeInSeconds": 1
}
}
]
}
}
]
}
}
],
"annotations": []
}
}
Pipeline successfully executed

Azure Data Factory - Export data to sub container/blob

Hi I have an ADF that copies (exports Azure SQL data) CSV files to a blob.
How can I direct the the files - the destination to a 'sub' container
I have blob Named 'SQLdata' , I want the files to be create in sub-container/blob called customers
SQLdata/Customers
SQLdata/Customers/Cust1.csv
SQLdata/Customers/Cust2.csv
I have tried
"destination": {
"fileName": "Customers//Cust1.csv"
What is wrong with the following?
"activities": [
{
"name": "Export",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "dbo.#{item().source.table}"
},
{
"name": "Destination",
"value": "#{item().destination.fileName}"
}
],
"parameters": {
"cw_items": {
"type": "Array",
"defaultValue": [
{
"source": {
"table": "Cust1"
},
"destination": {
"fileName": "Cust1.csv"
}
},
{
"source": {
"table": "Cust2"
},
"destination": {
"fileName": "Cust2.csv"
}
},

I tried the same export and it works well, all the csv files is stored in containerleon/csv:
JSON code reference:
{
"name": "CopyPipeline_fls",
"properties": {
"activities": [
{
"name": "ForEach_fls",
"type": "ForEach",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#pipeline().parameters.cw_items",
"type": "Expression"
},
"activities": [
{
"name": "Copy_fls",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "dbo.#{item().source.table}"
},
{
"name": "Destination",
"value": "containerleon/csv/#{item().destination.fileName}"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "SourceDataset_fls",
"type": "DatasetReference",
"parameters": {
"cw_table": "#item().source.table"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_fls",
"type": "DatasetReference",
"parameters": {
"cw_fileName": "#item().destination.fileName"
}
}
]
}
]
}
}
],
"parameters": {
"cw_items": {
"type": "Array",
"defaultValue": [
{
"source": {
"table": "test"
},
"destination": {
"fileName": "dbotest.csv"
}
},
{
"source": {
"table": "test3"
},
"destination": {
"fileName": "dbotest3.csv"
}
}
]
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Storage preview:
Hope this helps.

Loop over each file in folder directory and check date Azure Data Factory V2 -wrong code

I want to loop over each file in a stfp folder and check whether it is new or not and then copy the new files on a Data Lake
Right now I have the below code but I don't think it is correct. There is no usage of #item() in the second GetLastModifyfromFile activity to refer to the items last date in the loop but rather to a completely different data set called SrcLocalFile.
{
"name": "IncrementalloadfromSingleFolder",
"properties": {
"activities": [
{
"name": "GetFileList",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalDir",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
]
}
},
{
"name": "ForEachFile",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileList').output.childItems",
"type": "Expression"
},
"activities": [
{
"name": "GetLastModifyfromFile",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"dataset": {
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
},
"fieldList": [
"lastModified"
]
}
},
{
"name": "IfNewFile",
"type": "IfCondition",
"dependsOn": [
{
"activity": "GetLastModifyfromFile",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"expression": {
"value": "#and(less(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.current_time), greaterOrEquals(activity('GetLastModifyfromFile').output.lastModified, pipeline().parameters.last_time))",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyNewFiles",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SrcLocalFile",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "TgtBooksBlob",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
],
"parameters": {
"current_time": {
"type": "String",
"defaultValue": "2018-04-01T00:00:00Z"
},
"last_time": {
"type": "String",
"defaultValue": "2018-03-01T00:00:00Z"
}
},
"folder": {
"name": "IncrementalLoadSingleFolder"
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

Just a thought - I don't see your dataset definition but...
Should you pass in path and file name to the dataset as parameters?
i.e. add 2 parameters to the dataset definition for path and file (say pathparam and fileparam). Use those parameters in the dataset's fileName and folderName settings as #dataset().pathparam and #dataset().fileparam.
In the code above, pass in parameters a new "parameters" section of the dataset input with pathparam and fileparam equal to the folder and child item you retrieved from earlier activity.
note - there was a bug that the dataset name could not have spaces in it.

ADF V2 - Parameterize a data copy pipeline based on a table column

with Azure Data Factory V2, through the portal
https://adf.azure.com
I created a Pipeline for incremental copying of data from multiple tables, from one Azure SQL database to another Azure SQL database.
To create it, I have adapted the following example to my needs:
Incrementally load data from multiple tables
Following is the json file related to the pipeline created:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * \nfrom watermarktable \nwhere TableName = '#{item().TABLE_NAME}'",
"type": "Expression"
}
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select MAX(#{item().WaterMark_Column}) as NewWatermarkvalue \nfrom #{item().TABLE_NAME}",
"type": "Expression"
}
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "select * from #{item().TABLE_NAME} \nwhere #{item().WaterMark_Column} > '#{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and #{item().WaterMark_Column} <= '#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'",
"type": "Expression"
}
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"sqlWriterStoredProcedureName": {
"value": "#{item().StoredProcedureNameForMergeOperation}",
"type": "Expression"
},
"sqlWriterTableType": {
"value": "#{item().TableType}",
"type": "Expression"
}
},
"enableStaging": false,
"dataIntegrationUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"SinkTableName": "#{item().TABLE_NAME}"
}
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[dbo].[sp_write_watermark]",
"storedProcedureParameters": {
"LastModifiedtime": {
"value": {
"value": "#{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type": "Expression"
},
"type": "DateTime"
},
"TableName": {
"value": {
"value": "#{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type": "Expression"
},
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "SqlServerLinkedService_dest",
"type": "LinkedServiceReference"
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object",
"defaultValue": [
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "sp_upsert_project_table"
}
]
}
}
}
}
In my table I have a column that distinguishes between different companies and so I would like to add another parameter to this pipeline. I have a table like this:
NAME LASTMODIFY COMPANY
John 2015-01-01 00:00:00.000 1
Mike 2016-02-02 01:23:00.000 2
Andy 2017-03-04 05:16:00.000 3
Annie 2018-09-08 00:00:00.000 1
Someone would know how to insert a parameter into the pipeline in order to specify which company to copy and which one to not copy?
Does any suggestion? Thanks in advance to everyone!

Not exactly clear on what you're asking, so apologies if I am missing the mark, but:
Copy allows for a stored procedure that you can use to potentially solve your problem. Take a look at this example: https://learn.microsoft.com/en-us/azure/data-factory/connector-sql-server#invoking-stored-procedure-for-sql-sink
It uses a Stored Procedure to MERGE performing an UPDATE or INSERT dependent on the JOIN matching. It also allows for parameters to be passed.
So if you are trying to COPY only certain cases based on a parameter, the MERGE join may help.

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

I'm trying to backup my Cosmos Db storage using Azure Data Factory(v2). In general, it's doing its job, but I want to have each doc in Cosmos collection to correspond new json file in blobs storage.
With next copying params i'm able to copy all docs in collection into 1 file in azure blob storage:
{
"name": "ForEach_mih",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "#pipeline().parameters.cw_items",
"type": "Expression"
},
"activities": [
{
"name": "Copy_mih",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"userProperties": [
{
"name": "Source",
"value": "#{item().source.collectionName}"
},
{
"name": "Destination",
"value": "cosmos-backup-v2/#{item().destination.fileName}"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"path": "cosmos-backup-logs"
},
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_collectionName": "#item().source.collectionName"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_fileName": "#item().destination.fileName"
}
}
]
}
]
}
}
How I can copy each cosmos doc to separate file and give it name the as {PartitionId}-{docId}?
UPD
Source set code:
{
"name": "ClustersData",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_CosmosDb",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "directory-clusters"
}
}
}
Destination set code:
{
"name": "OutputClusters",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "",
"folderPath": "cosmos-backup-logs"
}
}
}
Pipeline code:
{
"name": "copy-clsts",
"properties": {
"activities": [
{
"name": "LookupClst",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"dataset": {
"referenceName": "ClustersData",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachClst",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupClst",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupClst').output.value",
"type": "Expression"
},
"batchCount": 8,
"activities": [
{
"name": "CpyClst",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "select #{item()}",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "ClustersData",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputClusters",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
Example of doc in input collection (all the same format):
{
"$type": "Entities.ADCluster",
"DisplayName": "TESTNetBIOS",
"OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"AllowLdapLifeCycleSynchronization": true,
"DirectoryServers": [
{
"$type": "Entities.DirectoryServer",
"AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
"Port": 389,
"Host": "192.168.342.234"
}
],
"DomainNames": [
"TESTNetBIOS"
],
"BaseDn": null,
"UseSsl": false,
"RepositoryType": 1,
"DirectoryCustomizations": null,
"_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
"LastUpdateTime": "2018-04-05T15:00:40.243Z",
"id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
"_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
"_attachments": "attachments/",
"_ts": 1522940440
}

Since your cosmosdb has array and ADF doesn't support serialize array for cosmos db, this is the workaround I can provide.
First, export all your document to json files with export json as-is (to blob or adls or file systems, any file storage). I think you already knows how to do it. In this way, each collection will have a json file.
Second, handle each json file, to exact each row in the file to a single file.
I only provide pipeline for step 2. You could use execute pipeline activity to chain step 1 and step 2. And you could even handle all the collections in step 2 with a foreach activity.
Pipeline json
{
"name": "pipeline27",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"dataset": {
"referenceName": "AzureBlob7",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('Lookup1').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select #{item()}",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "DocumentDbCollection1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob6",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"PartitionKey": {
"value": "#item().PartitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
dataset json for lookup
{
"name": "AzureBlob7",
"properties": {
"linkedServiceName": {
"referenceName": "bloblinkedservice",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "cosmos.json",
"folderPath": "aaa"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Source dataset for copy. Actually, this dataset has no use. Just want to use it to host the query (select #{item()}
{
"name": "DocumentDbCollection1",
"properties": {
"linkedServiceName": {
"referenceName": "CosmosDB-r8c",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "test"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Destination dataset. With two parameters, it also addressed your file name request.
{
"name": "AzureBlob6",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage-eastus",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "String"
},
"PartitionKey": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": {
"value": "#{dataset().PartitionKey}-#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "aaacosmos"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
please also note the limitation of Lookup activity:
The following data sources are supported for lookup. The maximum number of rows can be returned by Lookup activity is 5000, and up to 2MB in size. And currently the max duration for Lookup activity before timeout is one hour.

Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.
You could consider having an Azure Function that is triggered when documents are added / updated in your collection and have the Azure Function output the document to blob storage. This should scale well and would be relatively easy to implement.

Just take one collection as an example.
And inside the foreach:
And your lookup and copy activity source dataset reference the same cosmosdb dataset.
If you want to copy your 5 collections, you could put this pipeline into an execute activity. And the master pipeline of the execute activity has a foreach activity.

I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamps to query the Cosmos data with, then for each of those, getting the document IDs in that range, and then for each of those, getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here's the pipelines I created:
LookupTimestamps - loops through each timestamp range from a times.json file, and for each timestamp, executes the ExportFromCosmos pipeline
{
"name": "LookupTimestamps",
"properties": {
"activities": [
{
"name": "LookupTimestamps",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"dataset": {
"referenceName": "BlobStorageTimestamps",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachTimestamp",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupTimestamps",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupTimestamps').output.value",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Execute Pipeline1",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "ExportFromCosmos",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"From": {
"value": "#{item().From}",
"type": "Expression"
},
"To": {
"value": "#{item().To}",
"type": "Expression"
}
}
}
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact you can't have nested ForEach activities.
{
"name": "ExportFromCosmos",
"properties": {
"activities": [
{
"name": "LookupDocuments",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select c.id, c.partitionKey from c where c._ts >= #{pipeline().parameters.from} and c._ts <= #{pipeline().parameters.to} order by c._ts desc",
"type": "Expression"
},
"nestingSeparator": "."
},
"dataset": {
"referenceName": "CosmosDb",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachDocument",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupDocuments",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('LookupDocuments').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select * from c where c.id = \"#{item().id}\" and c.partitionKey = \"#{item().partitionKey}\"",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "CosmosDb",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobStorageDocuments",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "#item().id",
"type": "Expression"
},
"partitionKey": {
"value": "#item().partitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"from": {
"type": "int"
},
"to": {
"type": "int"
}
}
}
}
BlobStorageTimestamps - dataset for the times.json file
{
"name": "BlobStorageTimestamps",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "times.json",
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
BlobStorageDocuments - dataset for where the documents will be saved
{
"name": "BlobStorageDocuments",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "string"
},
"partitionKey": {
"type": "string"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": {
"value": "#{dataset().partitionKey}/#{dataset().id}.json",
"type": "Expression"
},
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
The times.json file it just a list of epoch times and looks like this:
[{
"From": 1556150400,
"To": 1556236799
},
{
"From": 1556236800,
"To": 1556323199
}]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Get Metadata Activity ADF V2 - azure

Can anyone explain me, what is the use of Get Metadata Activity that is newly introduced in ADF V2? Actually, the information that is given in learn.microsoft.com isn't enough to understand the uses of this Activity.

Related

how to use Lookup activity and if activity together

Azure Data Factory - Export data to sub container/blob

Loop over each file in folder directory and check date Azure Data Factory V2 -wrong code

ADF V2 - Parameterize a data copy pipeline based on a table column

How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

Categories

Resources