Custom date in azure blob folder path - azure

I have looked at some posts and documentation on how to specify custom folder paths when creating an Azure blob (using Azure Data Factory).
Official documentation:
https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-azure-blob-connector#using-partitionedBy-property
Forums posts:
https://dba.stackexchange.com/questions/180487/datafactory-tutorial-blob-does-not-exist
I can successfully write to date-indexed folders; what I cannot do is write to folders whose date is incremented or decremented relative to the slice date.
I tried using $$Text.Format (as below), but it gives a compile error --> Text.Format is not a valid blob path.
"folderPath": "$$Text.Format('MyRoot/{0:yyyy/MM/dd}/', Date.AddDays(SliceEnd,-2))",
I tried using the partitionedBy section (as below), but it also gives a compile error --> Only SliceStart and SliceEnd are valid options for "date"
{
"name": "MyBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "MyLinkedService",
"typeProperties": {
"fileName": "MyTsv.tsv",
"folderPath": "MyRoot/{Year}/{Month}/{Day}/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t",
"nullValue": ""
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "Date.AddDays(SliceEnd,-2)",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "Date.AddDays(SliceEnd,-2)",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "Date.AddDays(SliceEnd,-2)",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
Any pointers are appreciated!
EDIT for response from Adam:
Following Adam's suggestion, I also put the folder structure directly in fileName, as described in this forum post:
Windows Azure: How to create sub directory in a blob container
I used it as in the sample below.
"typeProperties": {
"fileName": "$$Text.Format('{0:yyyy/MM/dd}/MyBlob.tsv', Date.AddDays(SliceEnd,-2))",
"folderPath": "MyRoot/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t",
"nullValue": ""
},
It gives no compile error and no error during deployment, but it throws an error during execution.
Runtime Error is ---> Error in Activity: ScopeJobManager:PrepareScopeScript, Unsupported unstructured stream format '.adddays(sliceend,-2))', can't convert to unstructured stream.
I think the problem is that FileName can be used to create folders but not dynamic folder names, only static ones.

You should create the blob using the convention "foldername/myfile.txt", so that you can also add further blobs under that folder name. I'd recommend checking this thread: Windows Azure: How to create sub directory in a blob container. It may help you resolve this case.
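For reference, the path the question is trying to produce can be sketched outside Data Factory. The Python below only illustrates the intended result (the slice end shifted by two days, formatted as yyyy/MM/dd under MyRoot/). The function name offset_folder_path and the sample slice end are assumptions made for illustration; this is not an ADF feature or API.

```python
from datetime import datetime, timedelta


def offset_folder_path(slice_end: datetime, days: int, root: str = "MyRoot") -> str:
    """Return root/yyyy/MM/dd/ for slice_end shifted by `days` (may be negative)."""
    shifted = slice_end + timedelta(days=days)
    # strftime-style codes: %Y = 4-digit year, %m = 2-digit month, %d = 2-digit day
    return f"{root}/{shifted:%Y/%m/%d}/"


if __name__ == "__main__":
    # Hypothetical slice end; going back two days crosses a month boundary.
    print(offset_folder_path(datetime(2017, 3, 1), -2))
```

Whatever mechanism ends up producing the folder name, this is the value it needs to yield for each slice, which makes it a handy check when testing a workaround.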

Related

How to get multiple folder paths in an Azure Logic App

I have a scenario: I want to build an Azure Logic App that gets documents from various folders in SharePoint, processes them, and sends an email notification. My confusion is: how can I give multiple input folder paths?
I'm going to make an assumption about your architecture in my answer. I'm assuming you want to process multiple files in different sites within the same SharePoint tenant, not across tenants.
To achieve what you're asking for, I created a Parse JSON action which takes in the following structure (as an example, obviously the structure is the key point here, not the data) ...
Scenario 1 - Specific Files
[
{
"SiteName": "ExampleSolution",
"FileName": "/Shared Documents/General/Book.xlsx"
},
{
"SiteName": "TestSite",
"FileName": "/Shared Documents/Test Folder/Document.docx"
}
]
The SP tenant needs to be authenticated to with the appropriate user.
Then, in a For Each action, loop through each item and retrieve the contents of each document using the Get file content using path action.
Site Address = concat('https://yourtenant.sharepoint.com/sites/', items('For_each')?['SiteName'])
File Path = File Name (from Dynamic Content)
It will then retrieve the contents dynamically using those expressions.
File 1 (Excel Document)
File 2 (Word Document)
Scenario 2 - All Files
If you want to do it for all files, just change it up slightly ...
[
{
"FolderName": "/Shared Documents/General",
"SiteName": "ExampleSolution"
},
{
"FolderName": "/Shared Documents/Test Folder",
"SiteName": "TestSite"
}
]
Site Address = concat('https://yourtenant.sharepoint.com/sites/', items('For_each')?['SiteName'])
File Identifier = Folder Name (from Dynamic Content)
Output - Folder 1
[
{
"Id": "%252fShared%2bDocuments%252fGeneral%252fBook.xlsx",
"Name": "Book.xlsx",
"DisplayName": "Book.xlsx",
"Path": "/Shared Documents/General/Book.xlsx",
"LastModified": "2021-12-24T02:56:14Z",
"Size": 15330,
"MediaType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"IsFolder": false,
"ETag": "\"{23948609-0DA0-43E0-994C-2703FEEC8567},7\"",
"FileLocator": "dataset=aHR0cHM6Ly9icmFka2RpeG9uLnNoYXJlcG9pbnQuY29tL3NpdGVzL0V4YW1wbGVTb2x1dGlvbg==,id=JTI1MmZTaGFyZWQlMmJEb2N1bWVudHMlMjUyZkdlbmVyYWwlMjUyZkJvb2sueGxzeA==",
"LastModifiedBy": null
},
{
"Id": "%252fShared%2bDocuments%252fGeneral%252fTest%2bDocument.docx",
"Name": "Test Document.docx",
"DisplayName": "Test Document.docx",
"Path": "/Shared Documents/General/Test Document.docx",
"LastModified": "2021-12-30T11:49:28Z",
"Size": 17959,
"MediaType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"IsFolder": false,
"ETag": "\"{7A3C7133-02FC-4A63-9A58-E11A815AB351},8\"",
"FileLocator": "dataset=aHR0cHM6Ly9icmFka2RpeG9u etc",
"LastModifiedBy": null
},
{
"Id": "%252fShared%2bDocuments%252fGeneral%252fHierarchy.xlsx",
"Name": "Hierarchy.xlsx",
"DisplayName": "Hierarchy.xlsx",
"Path": "/Shared Documents/General/Hierarchy.xlsx",
"LastModified": "2022-01-07T02:49:38Z",
"Size": 41719,
"MediaType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"IsFolder": false,
"ETag": "\"{C919454C-48AB-4897-AD8C-E3F873B52E50},72\"",
"FileLocator": "dataset=aHR0cHM6Ly9icmFka2RpeG9uL etc",
"LastModifiedBy": null
}
]
Output - Folder 2
[
{
"Id": "%252fShared%2bDocuments%252fTest%2bFolder%252fTest.xlsx",
"Name": "Test.xlsx",
"DisplayName": "Test.xlsx",
"Path": "/Shared Documents/Test Folder/Test.xlsx",
"LastModified": "2022-01-09T11:08:31Z",
"Size": 17014,
"MediaType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"IsFolder": false,
"ETag": "\"{CCF71CE7-89E7-4F89-B5CB-0F078E22C951},163\"",
"FileLocator": "dataset=aHR0cHM6Ly9icmFka2RpeG9u etc",
"LastModifiedBy": null
},
{
"Id": "%252fShared%2bDocuments%252fTest%2bFolder%252fDocument.docx",
"Name": "Document.docx",
"DisplayName": "Document.docx",
"Path": "/Shared Documents/Test Folder/Document.docx",
"LastModified": "2022-01-09T11:08:16Z",
"Size": 17293,
"MediaType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"IsFolder": false,
"ETag": "\"{317C5767-04EC-4264-A58B-27A3FA8E4DF3},3\"",
"FileLocator": "dataset=aHR0cHM6Ly9icmFka2RpeG etc",
"LastModifiedBy": null
}
]
From here, just process each file individually using one of the files actions like in the first scenario above.
Note: You'll need to work through sub folders and recursion. There doesn't appear to be a way to do that easily.
You've provided very little information but it should be enough for you to adapt it accordingly.
Also, I strongly recommend you use a means other than a hardcoded JSON document in the action itself. There are far better ways to house that information that would not require updating the action every time you want to add or delete a file.
The concept of the loop and the expressions are the most important parts to grasp, as they will give you what you want.
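The loop and the concat expression above can be mirrored in a few lines of Python to make the data flow explicit. This is only a sketch of the logic, not Logic Apps code: the function name resolve_targets and the output keys site_address and path are assumptions, and the tenant URL is the placeholder used in the answer.

```python
from typing import Dict, List

# Placeholder tenant URL taken from the answer's concat() expression.
TENANT_BASE = "https://yourtenant.sharepoint.com/sites/"


def resolve_targets(items: List[Dict[str, str]], path_key: str) -> List[Dict[str, str]]:
    """Mimic the For Each loop: build a site address and a path for each item."""
    resolved = []
    for item in items:
        resolved.append({
            # Same idea as concat('https://.../sites/', items('For_each')?['SiteName'])
            "site_address": TENANT_BASE + item["SiteName"],
            # "FileName" in scenario 1, "FolderName" in scenario 2
            "path": item[path_key],
        })
    return resolved


if __name__ == "__main__":
    scenario_1 = [
        {"SiteName": "ExampleSolution", "FileName": "/Shared Documents/General/Book.xlsx"},
        {"SiteName": "TestSite", "FileName": "/Shared Documents/Test Folder/Document.docx"},
    ]
    for target in resolve_targets(scenario_1, "FileName"):
        print(target["site_address"], target["path"])
```

Passing "FolderName" as path_key covers the second scenario with the same loop, which is why the input structure matters more than the sample data.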

Azure Data Factory v2 using utcnow() as a pipeline parameter

For context, I currently have a Data Factory v2 pipeline with a ForEach Activity that calls a Copy Activity. The Copy Activity simply copies data from an FTP server to a blob storage container.
Here is the pipeline json file :
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "ForEach1",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "@pipeline().parameters.InputParams",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "FtpDataset",
"type": "DatasetReference",
"parameters": {
"FtpFileName": "@item().FtpFileName",
"FtpFolderPath": "@item().FtpFolderPath"
}
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference",
"parameters": {
"BlobFileName": "@item().BlobFileName",
"BlobFolderPath": "@item().BlobFolderPath"
}
}
]
}
]
}
}
],
"parameters": {
"InputParams": {
"type": "Array",
"defaultValue": [
{
"FtpFolderPath": "/Folder1/",
"FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
"BlobFolderPath": "blobfolderpath",
"BlobFileName": "blobfile1"
},
{
"FtpFolderPath": "/Folder2/",
"FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
"BlobFolderPath": "blobfolderpath",
"BlobFileName": "blobfile2"
}
]
}
}
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
The issue I am having is that when specifying pipeline parameters, it seems I cannot use system variables and functions the same way I can when for example specifying folder paths for a blob storage dataset.
The consequence of this is that formatDateTime(utcnow(), 'yyyyMMdd') is not being interpreted as function calls but rather the actual string with value formatDateTime(utcnow(), 'yyyyMMdd').
To counter this I am guessing I should be using a trigger to execute my pipeline and pass the trigger's execution time as a parameter to the pipeline like trigger().startTime but is this the only way? Am I simply doing something wrong in my pipeline's JSON?
This should work:
File_@{formatDateTime(utcnow(), 'yyyyMMdd')}
Or complex paths as well:
rootfolder/subfolder/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}/@{formatDateTime(utcnow(),'HH')}
You can't put a dynamic expression in the default value. You should define the expression either when you create a trigger, or when you set the dataset parameters for the source/sink in the copy activity.
So one option is to create the dataset parameter FtpFileName with some default value in the source dataset, and then specify the dynamic expression for it in the source section of the copy activity.
The other option is to define a pipeline parameter and assign the dynamic expression to it when you define a trigger. Hope this is a clear answer to you. :)
Default value of parameters cannot be expressions. They must be literal strings.
You could use trigger to achieve this. Or you could extract the common part of your expressions and just put literal values into the foreach items.
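One way to read "literal strings only" is that whatever starts the pipeline run has to resolve the date first and pass plain values in. The Python below is a minimal sketch of that idea, independent of any Azure SDK: the function name build_input_params is an assumption, and how the resulting list is handed to the pipeline (trigger, REST call, or otherwise) is deliberately left out.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_input_params(now: Optional[datetime] = None) -> List[Dict[str, str]]:
    """Build the InputParams array with the date already resolved to a literal."""
    now = now or datetime.now(timezone.utc)
    # Same shape as concat('File_', formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')
    file_name = f"File_{now:%Y%m%d}.txt"
    return [
        {
            "FtpFolderPath": "/Folder1/",
            "FtpFileName": file_name,
            "BlobFolderPath": "blobfolderpath",
            "BlobFileName": "blobfile1",
        },
        {
            "FtpFolderPath": "/Folder2/",
            "FtpFileName": file_name,
            "BlobFolderPath": "blobfolderpath",
            "BlobFileName": "blobfile2",
        },
    ]


if __name__ == "__main__":
    for entry in build_input_params():
        print(entry["FtpFolderPath"], entry["FtpFileName"])
```

The alternative mentioned in the answers, keeping only the literal parts in the array and applying the date expression inside the copy activity, avoids the external step entirely.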

How to recognize file pattern in azure blob input dataset

This is my file pattern: adm_domain_20180401, adm_domain_20180402. These come from one particular source. The same folder also contains adm_agent_20180401 and adm_agent_20180402. I want to copy only the files with the prefix adm_domain from blob to ADL. Is there any way to define the file pattern in the input dataset?
DATASET:
{
"name": "CgAdmDomain",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "flk_blob_dev_ls",
"typeProperties": {
"folderPath": "incoming/{Date}/",
"format": {
"type": "TextFormat"
},
"partitionedBy": [
{
"name": "Date",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
}
]
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Are you using ADF V1 or V2? We are working on adding filename wildcard support in ADF V2.
The fileFilter property is not available for Azure Blob Storage. If you are reading files from an on-premises file system, you can achieve this by specifying a filter that selects a subset of files in the folderPath rather than all files.
To achieve this for Azure Blob Storage itself, use an Azure Data Factory custom activity. Implement the logic in custom code (.NET) and add it as an activity in the pipeline. See the Data Factory documentation on custom activities for more detail.
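The selection logic such a custom activity would need is small. The sketch below shows it in Python purely for illustration (the answer suggests .NET for the actual activity); it assumes the blob names have already been listed by some other means, and the function name select_blobs and the default pattern are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_blobs(names: Iterable[str], pattern: str = "adm_domain_*") -> List[str]:
    """Keep only the names matching a shell-style, case-sensitive pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # File names taken from the question.
    listing = [
        "adm_domain_20180401",
        "adm_domain_20180402",
        "adm_agent_20180401",
        "adm_agent_20180402",
    ]
    for name in select_blobs(listing):
        print(name)
```

fnmatchcase is used rather than fnmatch so that matching stays case-sensitive regardless of the operating system the code runs on.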

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting following error while running a USQL Activity in the pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script which I am trying to execute in the pipeline.
OutputDataset:
{
"name": "OutputDataLakeTable",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "LinkedServiceDestination",
"typeProperties": {
"folderPath": "scripts/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"script": "SearchLogProcessing.txt",
"scriptPath": "scripts\\",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
},
"inputs": [
{
"name": "InputDataLakeTable"
}
],
"outputs": [
{
"name": "OutputDataLakeTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "CopybyU-SQL",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-01-03T12:01:05.53Z",
"end": "2017-01-03T13:01:05.53Z",
"isPaused": false,
"hubName": "denojaidbfactory_hub",
"pipelineMode": "Scheduled"
}
}
Here is my U-SQL script which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
"name": "StorageLinkedService",
"properties": {
"description": "",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
}
}
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "adlascripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/input/SearchLog.tsv",
"out": "/output/Result.tsv"
}
},
...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo and there seem to be some issues like where is the definition for StorageLinkedService?, where is SearchLogProcessing.txt? OK I found it by googling but there should be a link in the webpage. I got it to work but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, without having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.

How to define a Table in Azure Data Factory

When creating an HDInsight On Demand linked service, Data Factory creates a new container for the HDInsight cluster. How can I create a Table that points to that container? Here is my Table definition:
{
"name": "AzureBlobLocation",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobLinkedService",
"typeProperties": {
"folderPath": "????/Folder1/Folder2/Output/Aggregation/",
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
What should go in place of the '????' that I put there? The keyword is not accepted.
I should use the keyword 'container' in order to point to the working container.

Resources