I'm trying to specify a default value for the startDate parameter, as shown below.
I've been looking into how to do this with functions, but so far I have not found anything.
For now I'm specifying a manual value for startDate, but the idea is to get the current date every time the schedule runs.
I have this defined in a blob storage destination (where it is being used):
@{formatDateTime(pipeline().parameters.windowStart,'yyyy')}@{formatDateTime(pipeline().parameters.windowStart,'MM')}@{formatDateTime(pipeline().parameters.windowStart,'dd')}@{formatDateTime(pipeline().parameters.windowStart,'HH')}@{formatDateTime(pipeline().parameters.windowStart,'mm')}
Is there a way to avoid calling the parameter and instead use utcnow() directly?
You can use the utcnow() function, or, if you define a trigger, you can use trigger().startTime.
You can find the other date functions here.
The formatDateTime function is used when you want to return the date as a string in a specific format.
If the format is not important to you and you just want the current date, then you can use trigger().startTime or utcnow() in the expression field. Don't forget the @ sign.
trigger().startTime.utcnow is not a valid expression.
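For example, the folder path from the question could be written without the windowStart parameter by calling utcnow() directly in the string interpolation (a sketch that simply reuses the same format strings):
@{formatDateTime(utcnow(),'yyyy')}@{formatDateTime(utcnow(),'MM')}@{formatDateTime(utcnow(),'dd')}@{formatDateTime(utcnow(),'HH')}@{formatDateTime(utcnow(),'mm')}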
Finally, I was able to fix this by creating the trigger with the JSON definition shown below:
{
    "name": "yourTriggerName",
    "properties": {
        "runtimeState": "Started",
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "YourPipelineName",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "windowStart": "@trigger().scheduledTime"
                }
            }
        ],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-07-11T17:00:00Z",
                "timeZone": "UTC",
                "schedule": {
                    "minutes": [
                        20
                    ],
                    "hours": [
                        19
                    ]
                }
            }
        }
    }
}
And emphasizing, of course, the lines below:
"parameters": {
    "windowStart": "@trigger().scheduledTime"
}
After that, the copy activity started to work as expected.
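For completeness, the pipeline side only needs a plain windowStart parameter that the trigger mapping above fills in at run time; a minimal sketch of the pipeline's parameters block (no expression in the definition itself):
"parameters": {
    "windowStart": {
        "type": "String"
    }
}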
For triggers, it looks like only the pipelines and typeProperties blocks can be overridden, based on the documentation.
What I want to achieve with my CI/CD process and the parameter-override functionality is to have the schedule trigger disabled in the target ADF, unlike in my source ADF.
If I inspect the JSON of a trigger, it looks like the following field could do the trick: "runtimeState": "Started".
{
    "name": "name_daily",
    "properties": {
        "description": " ",
        "annotations": [],
        "runtimeState": "Started",
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "name",
                    "type": "PipelineReference"
                }
            }
        ],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2020-05-05T13:01:00.000Z",
                "timeZone": "UTC",
                "schedule": {
                    "minutes": [
                        1
                    ],
                    "hours": [
                        13
                    ]
                }
            }
        }
    }
}
But if I attempt to add it to the JSON file like this:
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"runtimeState": "-",
"typeProperties": {
"recurrence": {
"interval": "=",
"frequency": "="
}
}
}
}
it never shows up in the Override section in Azure Pipelines releases.
Does this ADF CI/CD functionality exist for triggers? How can I achieve my goal here?
It turns out that runtimeState for triggers is not honored in arm-template-parameters-definition.json.
The path is clearer after some more research: I can achieve what I want either by editing the PowerShell script Microsoft has provided or by using an ADF custom task from the Azure DevOps marketplace.
For context, I currently have a Data Factory v2 pipeline with a ForEach Activity that calls a Copy Activity. The Copy Activity simply copies data from an FTP server to a blob storage container.
Here is the pipeline JSON file:
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "ForEach1",
                "type": "ForEach",
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.InputParams",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Copy1",
                            "type": "Copy",
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false
                            },
                            "typeProperties": {
                                "source": {
                                    "type": "FileSystemSource",
                                    "recursive": true
                                },
                                "sink": {
                                    "type": "BlobSink"
                                },
                                "enableStaging": false,
                                "cloudDataMovementUnits": 0
                            },
                            "inputs": [
                                {
                                    "referenceName": "FtpDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "FtpFileName": "@item().FtpFileName",
                                        "FtpFolderPath": "@item().FtpFolderPath"
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "BlobDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "BlobFileName": "@item().BlobFileName",
                                        "BlobFolderPath": "@item().BlobFolderPath"
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "parameters": {
            "InputParams": {
                "type": "Array",
                "defaultValue": [
                    {
                        "FtpFolderPath": "/Folder1/",
                        "FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
                        "BlobFolderPath": "blobfolderpath",
                        "BlobFileName": "blobfile1"
                    },
                    {
                        "FtpFolderPath": "/Folder2/",
                        "FtpFileName": "@concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
                        "BlobFolderPath": "blobfolderpath",
                        "BlobFileName": "blobfile2"
                    }
                ]
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
The issue I am having is that when specifying pipeline parameters, it seems I cannot use system variables and functions the same way I can when, for example, specifying folder paths for a blob storage dataset.
The consequence of this is that formatDateTime(utcnow(), 'yyyyMMdd') is not interpreted as a function call but rather as the literal string formatDateTime(utcnow(), 'yyyyMMdd').
To counter this, I am guessing I should be using a trigger to execute my pipeline and pass the trigger's execution time as a parameter to the pipeline, like trigger().startTime, but is this the only way? Am I simply doing something wrong in my pipeline's JSON?
This should work:
File_@{formatDateTime(utcnow(), 'yyyyMMdd')}
Or complex paths as well:
rootfolder/subfolder/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}/@{formatDateTime(utcnow(),'HH')}
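To be clear about where these expressions go: into the dataset's folder path / file name fields (or any other property that accepts dynamic content), not into a parameter's default value. A rough sketch of a blob dataset's typeProperties, assuming the classic AzureBlob dataset shape with folderPath and fileName properties:
"typeProperties": {
    "folderPath": "rootfolder/subfolder/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}",
    "fileName": "File_@{formatDateTime(utcnow(), 'yyyyMMdd')}.txt"
}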
You can't put a dynamic expression in the default value. You should define this expression either when you create the trigger, or where you set the dataset parameters in the copy activity's source/sink.
So either you create a dataset parameter FtpFileName with some default value in the source dataset, and then specify the dynamic expression for it in the copy activity's source settings;
or you define a pipeline parameter and add the dynamic expression to that pipeline parameter when you define the trigger. Hope this is a clear answer for you. :)
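A rough sketch of the first option, reusing the names from the question: the FtpDataset keeps a plain FtpFileName parameter, and the Copy activity's input supplies the dynamic expression (values here are illustrative):
"inputs": [
    {
        "referenceName": "FtpDataset",
        "type": "DatasetReference",
        "parameters": {
            "FtpFileName": "@concat('File_', formatDateTime(utcnow(), 'yyyyMMdd'), '.txt')",
            "FtpFolderPath": "@item().FtpFolderPath"
        }
    }
]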
Default values of parameters cannot be expressions; they must be literal strings.
You could use a trigger to achieve this, or you could extract the common part of your expressions and put only literal values into the foreach items, as in the sketch below.
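A minimal sketch of that last idea: with the date expression moved out of the items (and computed where dynamic content is allowed, as in the sketch above), the parameter default contains only literals:
"InputParams": {
    "type": "Array",
    "defaultValue": [
        {
            "FtpFolderPath": "/Folder1/",
            "BlobFolderPath": "blobfolderpath",
            "BlobFileName": "blobfile1"
        },
        {
            "FtpFolderPath": "/Folder2/",
            "BlobFolderPath": "blobfolderpath",
            "BlobFileName": "blobfile2"
        }
    ]
}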
I have a ScheduleTrigger in Azure Data Factory, and I cannot change its runtimeState to Started. I have tried using StartWithHttpMessageAsync (and what feels like all the other start commands in the API).
The JSON for the trigger looks like this:
{
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "startTime": "2017-12-15T12:00:00Z",
                "endTime": "2099-12-31T00:00:00Z"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "DynamicFlowMaster",
                    "name": "StartMasterPipeline",
                    "type": "PipelineReference"
                },
                "parameters": {}
            }
        ]
    }
}
OK, I finally have the answer...
While the documentation explicitly says:
"interval": <>, // optional, how often to fire (default to 1)
The Activity log in the Azure portal states:
'The template trigger is not valid: Required property 'interval' not found in JSON.
Adding "interval": 1 allowed the Client.Trigger.Start() call to work; it previously only returned "Bad Request".
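For reference, this is roughly what the fixed recurrence block looks like, i.e. the trigger JSON from the question with the interval property added:
"typeProperties": {
    "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2017-12-15T12:00:00Z",
        "endTime": "2099-12-31T00:00:00Z"
    }
}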
I've set up an ADF pipeline using a sliceIdentifierColumnName, which has worked well: it populated the field with a GUID as expected. Recently, however, this field stopped being populated. The refresh would work, but the sliceIdentifierColumnName field would have a value of null, or occasionally the load would fail as it attempted to populate this field with a value of 1, which causes the slice load to fail.
This change occurred at a point in time: before it, the pipeline worked perfectly; after it, it repeatedly failed to populate the field correctly. I'm sure no changes were made to the pipeline that would have caused this to suddenly fail. Any pointers where I should be looking?
Here is an extract of the pipeline source. I'm reading from a table in Amazon Redshift and writing to an Azure SQL table.
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from mytable where eventtime >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and eventtime < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\' ' , SliceStart, SliceEnd)"
},
"sink": {
"type": "SqlSink",
"sliceIdentifierColumnName": "ColumnForADFuseOnly",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftSomeName"
}
],
"outputs": [
{
"name": "AzureSQLDatasetSomeName"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 10,
"style": "StartOfInterval",
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 2
},
"name": "Activity-somename2Hour"
}
],
Also, here is the error output text:
Copy activity encountered a user error at Sink:.database.windows.net side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'ColumnForADFuseOnly' contains an invalid value '1'.,Source=Microsoft.DataTransfer.Common,''Type=System.ArgumentException,Message=Type of value has a mismatch with column typeCouldn't store <1> in ColumnForADFuseOnly Column.
Expected type is Byte[].,Source=System.Data,''Type=System.ArgumentException,Message=Type of value has a mismatch with column type,Source=System.Data,'.
Here is part of the source dataset; it's a table with all data types as strings.
{
    "name": "AmazonRedshiftsomename_2hourly",
    "properties": {
        "structure": [
            {
                "name": "eventid",
                "type": "String"
            },
            {
                "name": "visitorid",
                "type": "String"
            },
            {
                "name": "eventtime",
                "type": "Datetime"
            }
        ]
    }
}
Finally, the target table is identical to the source table, mapping each column name to its counterpart in Azure, with the exception of the additional column in Azure defined as:
[ColumnForADFuseOnly] binary NULL,
It is this column that is now being populated with either NULLs or 1.
Thanks.
You need to define [ColumnForADFuseOnly] as binary(32); binary with no length modifier defaults to a length of 1 and is thus truncating your slice identifier, i.e. the column should be [ColumnForADFuseOnly] binary(32) NULL.
Per the documentation: "When n is not specified in a data definition or variable declaration statement, the default length is 1. When n is not specified with the CAST function, the default length is 30." See here.
I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script that I am trying to execute in the pipeline.
OutputDataset:
{
    "name": "OutputDataLakeTable",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "LinkedServiceDestination",
        "typeProperties": {
            "folderPath": "scripts/"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
Pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "script": "SearchLogProcessing.txt",
                    "scriptPath": "scripts\\",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/demo/SearchLog.txt",
                        "out": "/scripts/Result.txt"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataLakeTable"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataLakeTable"
                    }
                ],
                "policy": {
                    "timeout": "06:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "CopybyU-SQL",
                "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
            }
        ],
        "start": "2017-01-03T12:01:05.53Z",
        "end": "2017-01-03T13:01:05.53Z",
        "isPaused": false,
        "hubName": "denojaidbfactory_hub",
        "pipelineMode": "Scheduled"
    }
}
Here is my U-SQL script, which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Text(delimiter:'|');

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "kota";

OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
    "name": "StorageLinkedService",
    "properties": {
        "description": "",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
        }
    }
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "adlascripts\\SearchLogProcessing.txt",
                    "scriptLinkedService": "StorageLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/input/SearchLog.tsv",
                        "out": "/output/Result.tsv"
                    }
                },
                ...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo, and there seem to be some issues: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found them by googling, but there should be a link on the webpage. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
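In other words, the typeProperties from the question would become something like this (a sketch; the path reuses the folder from the original scriptPath, and the scriptLinkedService from the answer above would still be needed):
"typeProperties": {
    "scriptPath": "scripts\\SearchLogProcessing.txt",
    "scriptLinkedService": "StorageLinkedService",
    "degreeOfParallelism": 3,
    "priority": 100,
    "parameters": {
        "in": "/demo/SearchLog.txt",
        "out": "/scripts/Result.txt"
    }
}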
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM @in
        USING Extractors.Text(delimiter:'|');

    @rs1 =
        SELECT Start, Region, Duration
        FROM @searchlog
        WHERE Region == "kota";

    OUTPUT @rs1
    TO @out
    USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
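A sketch of how that could look in the activity's typeProperties from the question, assuming the script attribute simply invokes the registered procedure:
"typeProperties": {
    "script": "master.dbo.sp_test()",
    "degreeOfParallelism": 3,
    "priority": 100,
    "parameters": {
        "in": "/demo/SearchLog.txt",
        "out": "/scripts/Result.txt"
    }
}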