How to define a Table in Azure Data Factory

When creating an HDInsight On-Demand Linked Service, Data Factory creates a new container for HDInsight. I would like to know how I can create a Table that points to that container. Here is my Table definition:
{
    "name": "AzureBlobLocation",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobLinkedService",
        "typeProperties": {
            "folderPath": "????/Folder1/Folder2/Output/Aggregation/",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
What should go in place of the '????' that I put in there? The keyword is not accepted.

I should use the keyword 'container' in order to point to the working container.
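For illustration, a minimal sketch of the relevant typeProperties with that keyword in place of the '????' (assuming, per the answer above, that the literal word container is what resolves to the working container):

"typeProperties": {
    "folderPath": "container/Folder1/Folder2/Output/Aggregation/",
    "format": {
        "type": "TextFormat"
    }
}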

Related

Azure Data Factory CI/CD for Schedule Triggers Not Working

For triggers, it looks like only the pipelines and typeProperties blocks can be overridden, based on the documentation.
What I want to achieve with my CI/CD process and the overriding-parameters functionality is to have a schedule trigger disabled in the target ADF, unlike in my source ADF.
If I inspect the JSON of a trigger, it looks like the following field could do the trick: "runtimeState": "Started".
{
    "name": "name_daily",
    "properties": {
        "description": " ",
        "annotations": [],
        "runtimeState": "Started",
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "name",
                    "type": "PipelineReference"
                }
            }
        ],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2020-05-05T13:01:00.000Z",
                "timeZone": "UTC",
                "schedule": {
                    "minutes": [
                        1
                    ],
                    "hours": [
                        13
                    ]
                }
            }
        }
    }
}
But if I attempt to add it to the JSON file like this:
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"runtimeState": "-",
"typeProperties": {
"recurrence": {
"interval": "=",
"frequency": "="
}
}
}
}
it never shows up in the Override section in Azure Pipeline Releases.
Does this ADF CI/CD functionality exist for triggers? How can I achieve my target here?
It turns out that runtimeState for triggers is not obeyed in arm-template-parameters-definition.json.
The path is clearer after some more research - I can achieve what I want by either editing the PowerShell script Microsoft has provided or using an ADF custom task from the Azure DevOps marketplace.

Azure Data Factory v2 using utcnow() as a pipeline parameter

For context, I currently have a Data Factory v2 pipeline with a ForEach Activity that calls a Copy Activity. The Copy Activity simply copies data from an FTP server to a blob storage container.
Here is the pipeline JSON file:
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "ForEach1",
                "type": "ForEach",
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.InputParams",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Copy1",
                            "type": "Copy",
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false
                            },
                            "typeProperties": {
                                "source": {
                                    "type": "FileSystemSource",
                                    "recursive": true
                                },
                                "sink": {
                                    "type": "BlobSink"
                                },
                                "enableStaging": false,
                                "cloudDataMovementUnits": 0
                            },
                            "inputs": [
                                {
                                    "referenceName": "FtpDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "FtpFileName": "@item().FtpFileName",
                                        "FtpFolderPath": "@item().FtpFolderPath"
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "BlobDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "BlobFileName": "@item().BlobFileName",
                                        "BlobFolderPath": "@item().BlobFolderPath"
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "parameters": {
            "InputParams": {
                "type": "Array",
                "defaultValue": [
                    {
                        "FtpFolderPath": "/Folder1/",
                        "FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
                        "BlobFolderPath": "blobfolderpath",
                        "BlobFileName": "blobfile1"
                    },
                    {
                        "FtpFolderPath": "/Folder2/",
                        "FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
                        "BlobFolderPath": "blobfolderpath",
                        "BlobFileName": "blobfile2"
                    }
                ]
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
The issue I am having is that when specifying pipeline parameters, it seems I cannot use system variables and functions the same way I can when, for example, specifying folder paths for a blob storage dataset.
The consequence is that formatDateTime(utcnow(), 'yyyyMMdd') is not interpreted as a function call but rather as the literal string formatDateTime(utcnow(), 'yyyyMMdd').
To counter this, I am guessing I should be using a trigger to execute my pipeline and pass the trigger's execution time as a parameter to the pipeline, like trigger().startTime, but is this the only way? Am I simply doing something wrong in my pipeline's JSON?
This should work:
File_@{formatDateTime(utcnow(), 'yyyyMMdd')}
Or complex paths as well:
rootfolder/subfolder/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}/@{formatDateTime(utcnow(),'HH')}
You can't put a dynamic expression in the default value. You should define the expression either when you create the trigger, or on the dataset parameters in the source/sink of the copy activity.
So you could either create the dataset property FtpFileName with some default value in the source dataset, and then specify the dynamic expression for it in the source section of the copy activity.
Another way is to define a pipeline parameter and then add the dynamic expression to that pipeline parameter when you are defining a trigger. Hope this is a clear answer to you. :)
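As a sketch of the first suggestion, the expression can be supplied where the copy activity binds the dataset parameters instead of in the pipeline parameter's defaultValue; the expression is the one from the question and the snippet mirrors the question's inputs block:

"inputs": [
    {
        "referenceName": "FtpDataset",
        "type": "DatasetReference",
        "parameters": {
            "FtpFileName": "@concat('File_', formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
            "FtpFolderPath": "@item().FtpFolderPath"
        }
    }
]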
Default values of parameters cannot be expressions. They must be literal strings.
You could use a trigger to achieve this. Or you could extract the common part of your expressions and just put literal values into the ForEach items.
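And a sketch of the trigger-based approach both answers point at: the trigger evaluates the expression and hands the result to the pipeline as an ordinary string parameter. The trigger name and the RunDate parameter are illustrative, not from the original post, and the pipeline would need a matching String parameter:

{
    "name": "DailyFtpTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "pipeline1",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "RunDate": "@{formatDateTime(trigger().startTime, 'yyyyMMdd')}"
                }
            }
        ]
    }
}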

How to recognize file pattern in azure blob input dataset

This is my file pattern: adm_domain_20180401, adm_domain_20180402; these are from one particular source. The same folder also contains adm_agent_20180401, adm_agent_20180402. I want to copy only the files with prefix adm_domain from blob to ADL. Is there any way to define the file pattern in the input dataset?
DATASET:
{
    "name": "CgAdmDomain",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "flk_blob_dev_ls",
        "typeProperties": {
            "folderPath": "incoming/{Date}/",
            "format": {
                "type": "TextFormat"
            },
            "partitionedBy": [
                {
                    "name": "Date",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyyMMdd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        },
        "external": true,
        "policy": {}
    }
}
Are you using ADF V1 or V2? We are working on adding filename wildcard support in ADF V2.
The fileFilter property is not available for Azure Blob Storage. If your files are on-premises, you can achieve this by specifying a filter used to select a subset of files in the folderPath rather than all files - link.
To achieve this solely for Azure Blob Storage, use Azure Data Factory custom activities: implement the logic in custom code (.NET) and run it as an activity in the pipeline. More info about how to use custom activities - further reading.
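For orientation, a rough sketch of how such a custom .NET activity could sit in an ADF V1 pipeline. The assembly, class, container, output dataset, and Batch linked service names are placeholders, and the actual adm_domain prefix filtering would live in the .NET code itself:

{
    "name": "CopyAdmDomainFiles",
    "type": "DotNetActivity",
    "inputs": [ { "name": "CgAdmDomain" } ],
    "outputs": [ { "name": "AdmDomainAdlDataset" } ],
    "typeProperties": {
        "assemblyName": "AdmDomainCopy.dll",
        "entryPoint": "AdmDomainCopy.CopyActivity",
        "packageLinkedService": "flk_blob_dev_ls",
        "packageFile": "customactivities/AdmDomainCopy.zip"
    },
    "linkedServiceName": "AzureBatchLinkedService",
    "policy": {
        "timeout": "01:00:00",
        "concurrency": 1,
        "retry": 1
    }
}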

Start Azure Datafactory (v2) Trigger

I have a ScheduleTrigger in Azure Data Factory. I cannot change its runtimeState to Started. I have tried using StartWithHttpMessageAsync (and what feels like all the other start commands in the API).
The JSON for the trigger looks like this:
{
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "startTime": "2017-12-15T12:00:00Z",
                "endTime": "2099-12-31T00:00:00Z"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "DynamicFlowMaster",
                    "name": "StartMasterPipeline",
                    "type": "PipelineReference"
                },
                "parameters": {}
            }
        ]
    }
}
OK, I finally have an answer...
While the documentation explicitly says:
"interval": <>, // optional, how often to fire (default to 1)
the Activity log in the Azure portal states:
"The template trigger is not valid: Required property 'interval' not found in JSON."
Adding "interval": 1 allowed the Client.Trigger.Start() function to work; it previously only returned "Bad Request".
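For completeness, the trigger's recurrence block with the missing property added, per the fix above:

"typeProperties": {
    "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2017-12-15T12:00:00Z",
        "endTime": "2099-12-31T00:00:00Z"
    }
}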

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here are the output dataset, the pipeline, and the U-SQL script that I am trying to execute in the pipeline.
OutputDataset:
{
    "name": "OutputDataLakeTable",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "LinkedServiceDestination",
        "typeProperties": {
            "folderPath": "scripts/"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
Pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "SearchLogProcessing.txt",
                    "scriptPath": "scripts\\",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/demo/SearchLog.txt",
                        "out": "/scripts/Result.txt"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataLakeTable"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataLakeTable"
                    }
                ],
                "policy": {
                    "timeout": "06:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "CopybyU-SQL",
                "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
            }
        ],
        "start": "2017-01-03T12:01:05.53Z",
        "end": "2017-01-03T13:01:05.53Z",
        "isPaused": false,
        "hubName": "denojaidbfactory_hub",
        "pipelineMode": "Scheduled"
    }
}
Here is my U-SQL script, which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Text(delimiter:'|');

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "kota";

OUTPUT @rs1
    TO @out
    USING Outputters.Text(delimiter:'|');
Please suggest how I can resolve this issue.
Your activity definition is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage linked service, for example:
{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
        }
    }
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "adlascripts\\SearchLogProcessing.txt",
                    "scriptLinkedService": "StorageLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/input/SearchLog.tsv",
                        "out": "/output/Result.tsv"
                    }
                },
                ...
The input and output .tsv files can be in the Data Lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo, and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found it by googling, but there should be a link on the page. I got it to work but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM @in
        USING Extractors.Text(delimiter:'|');

    @rs1 =
        SELECT Start, Region, Duration
        FROM @searchlog
        WHERE Region == "kota";

    OUTPUT @rs1
        TO @out
        USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
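With the procedure registered, the U-SQL activity's typeProperties would then reference it instead of a script file, roughly like this (a sketch based on the answer above, keeping the parameter names from the original pipeline; not a verified configuration):

"typeProperties": {
    "script": "master.dbo.sp_test()",
    "degreeOfParallelism": 3,
    "priority": 100,
    "parameters": {
        "in": "/demo/SearchLog.txt",
        "out": "/scripts/Result.txt"
    }
}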
