Azure Data Factory - runs all the time?

I've implemented an Azure Data Factory pipeline which executes a SQL stored procedure:
{
"name": "spLoggingProc",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "logging"
},
"outputs": [
{
"name": "spEmptyOutput15-4"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "spLogging"
}
],
"start": "2017-01-01T00:00:00Z",
"end": "2099-01-01T00:10:00Z",
"isPaused": false,
"hubName": "dwh_hub",
"pipelineMode": "Scheduled"
}
}
The dataset:
{
"name": "spEmptyOutput15-4",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "DWH",
"typeProperties": {
"tableName": "spEmptyOutput15-4"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
The problem is that the proc now runs every 2-3 seconds, even though the frequency is set to every hour. My goal is to run the proc once every hour, every day.
Can anyone please help me?
Thanks a lot!

Change the start time to today's date and you will not see the issue. Because you have set the start time to the beginning of the year, the pipeline has to produce a slice for every hour of every day since then, roughly 24 x 166 runs, before it settles into the normal routine. It is still running on an hourly schedule, but because it has to catch up on all of those past hourly slices, you see it executing every few seconds. Your proc most likely takes only 1-2 seconds to complete, so each catch-up slice finishes quickly and the next one starts right away.
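For example, a minimal sketch of the same pipeline with the start moved to a recent date so that no historical slices are generated (the date shown here is only a placeholder; use the current date):
{
"name": "spLoggingProc",
"properties": {
"activities": [ ... same spLogging activity as above ... ],
"start": "2017-06-15T00:00:00Z",
"end": "2099-01-01T00:10:00Z",
"isPaused": false,
"hubName": "dwh_hub",
"pipelineMode": "Scheduled"
}
}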

If you also want the past data, there is another way to increase the rate: run the slices in parallel (10 is the maximum value). Set the concurrency value under the activity policy, for example to 3, to run the slices in parallel:
"policy": {
"concurrency": 3,
"executionPriorityOrder": "OldestFirst",
"retry": 3,
"timeout": "00:10:00"
}

Related

How to set the ADF V1 pipeline to run every Tuesday

I am using ADF V1 in Azure.
I want my pipeline to run every Tuesday at 10:00 AM. I know how to set the time, but how do I set a particular day of the week in the dataset and pipeline?
My sample dataset:
{
"$schema": "http://datafactories.schema.management.azure.com/internalschemas/2015-09-01/Microsoft.DataFactory.table.json",
"name": "SQL-My-Table-DS",
"properties": {
"structure": [
{
"name": "ServiceName",
"type": "String"
}
],
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "MyLinkedService",
"typeProperties": {
"tableName": "[common].[MyTable_Staging]"
},
"availability": {
"frequency": "Week",
"interval": 1,
"offset": "00:00:10"
},
"external": false,
"policy": {}
}
}
If you are using Data Factory version 1, you can achieve this by setting the dataset availability with frequency Month, interval 1, and setting the offset to the day of the month on which you want the pipeline to run.
For example, if you want it to run on the 9th of each month, you would have something like this:
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "9.00:00:00",
"style": "StartOfInterval"
}
Editing the answer for the weekly case as well: the snippet below will make the pipeline run every Tuesday.
"availability": {
"frequency": "Week",
"interval": 1,
"offset": "2.00:00:00",
"style": "StartOfInterval"
}

Azure Data Factory copy pipeline: get a file with date of yesterday that arrived today

I am getting a txt file on today's date that has yesterday's date in its name, and I want to pick this filename up dynamically in my Data Factory pipeline.
The file is placed automatically on a file system and I want to copy it to blob storage. In my example below I am simulating this by copying from blob to blob.
For example:
filename_2018-03-11.txt arrived today (2018-03-12) with yesterday's date (2018-03-11). How can I pick this file up on today's date?
Yesterday's slice did run, but the file was not there yet.
Here is my example:
{"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "CopyPipeline-fromBlobToBlob",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink",
"copyBehavior": "",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"enableSkipIncompatibleRow": true
},
"inputs": [
{
"name": "InputDataset-1"
}
],
"outputs": [
{
"name": "OutputDataset-1"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "05:00:00"
},
"name": "activity_00"
}
],
"start": "2018-03-07T00:00:00Z",
"end": "2020-03-08T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
You can use EndOfInterval instead of StartOfInterval in the policy. That will use the end of the day instead of the start of the day to do the execution. You may also want to set the appropriate offset if the file is not available at midnight.
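For example, a minimal sketch of the activity policy and scheduler for this case, reusing the 05:00 offset from the pipeline above so the slice only executes after the file has had time to arrive:
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "EndOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "05:00:00"
}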
In ADF v2 you can use built-in system variables (@pipeline().TriggerTime):
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
And in the source dataset (InputDataset-1) set the file path/file name to something like this:
@concat('YOUR BASE PATH IN BLOB/', 'filename_',
adddays(pipeline().TriggerTime, -1, 'yyyy'), '-',
adddays(pipeline().TriggerTime, -1, 'MM'), '-',
adddays(pipeline().TriggerTime, -1, 'dd'), '.txt')
You can also use @trigger().scheduledTime to always get the same date even if, for example, the pipeline run fails and is rerun.
But remember that it is only available in the trigger scope; in my tests it was only evaluated for a schedule trigger.
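A minimal sketch of the same file name built from the trigger time instead, assuming a daily schedule trigger (the base path is a placeholder; formatDateTime and adddays are standard ADF v2 expression functions):
@concat('YOUR BASE PATH IN BLOB/', 'filename_',
formatDateTime(adddays(trigger().scheduledTime, -1), 'yyyy-MM-dd'), '.txt')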

How to create a periodic copy of data from OData to Azure DocumentDB

I am trying to make a periodic copy of all the data returned from an OData query into a DocumentDB collection, on a daily basis.
The copy works fine using the copy wizard, which is A REALLY GREAT option for simple tasks.  Thanks for that.
What isn't working for me though:  The copy just adds data each time, and I have NO WAY that I can SEE with a documentDB sink to "pre-delete" the data in the collection (compare to the SQL sink which has sqlWriterCleanupScript, which I could set to something like Delete * from 'table').
I know I can create an Azure Batch and do what I need, but at this point, I'm not sure that it isn't better to do a function and forego the Azure Data Factory (ADF) for this move.  I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point, I'd like to just use DocumentDB but I don't see a way to do it given the way my data works.
Here's a look at my pipeline:
{
"name": "R-------ProjectToDocDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": " "
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
/// this is where a cleanup script would be great.
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "ProjectId:ProjectId,.....:CostClassification"
}
},
"inputs": [
{
"name": "InputDataset-shc"
}
],
"outputs": [
{
"name": "OutputDataset-shc"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-_Custom query_->---Project"
}
],
"start": "2017-04-26T20:13:27.683Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "r-----datafactory01_hub",
"pipelineMode": "Scheduled"
}
}
Perhaps there's an update to the pipeline that creates parity between the SQL output and DocumentDB?
Azure Data Factory does not support a cleanup script for DocumentDB today; it's something in our backlog. If you can describe the end-to-end scenario a little more, it would help us prioritize. For example, why does appending to the same collection not work? Is that because there is no way to identify the incremental records after each run? And for the cleanup requirement, will it always be delete *, or might it be based on a timestamp, etc.? Thanks. Until support for a cleanup script is there, a custom activity is the only workaround for now, sorry.
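For reference, a custom activity in an ADF v1 pipeline is registered roughly like the sketch below (inputs, outputs and scheduler omitted); the assembly, class and linked service names are hypothetical placeholders for .NET code you would write yourself to clear the collection before the copy runs:
{
"type": "DotNetActivity",
"typeProperties": {
"assemblyName": "DocDbCleanup.dll",
"entryPoint": "DocDbCleanup.ClearCollection",
"packageLinkedService": "StorageLinkedService",
"packageFile": "customactivities/DocDbCleanup.zip"
},
"linkedServiceName": "AzureBatchLinkedService",
"name": "ClearTargetCollection"
}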
You could use a Logic App that runs on a Timer Trigger.
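A minimal sketch of the recurrence (timer) trigger in a Logic App workflow definition, assuming a daily run; the actions that actually clear and reload the collection would go in the workflow's actions section:
"triggers": {
"Recurrence": {
"type": "Recurrence",
"recurrence": {
"frequency": "Day",
"interval": 1
}
}
}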

Error while running U-SQL Activity in Pipeline in Azure Data Factory

I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script I am trying to execute in the pipeline.
OutputDataset:
{
"name": "OutputDataLakeTable",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "LinkedServiceDestination",
"typeProperties": {
"folderPath": "scripts/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"script": "SearchLogProcessing.txt",
"scriptPath": "scripts\\",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
},
"inputs": [
{
"name": "InputDataLakeTable"
}
],
"outputs": [
{
"name": "OutputDataLakeTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "CopybyU-SQL",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-01-03T12:01:05.53Z",
"end": "2017-01-03T13:01:05.53Z",
"isPaused": false,
"hubName": "denojaidbfactory_hub",
"pipelineMode": "Scheduled"
}
}
Here is my U-SQL script which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
"name": "StorageLinkedService",
"properties": {
"description": "",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
}
}
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "adlascripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/input/SearchLog.tsv",
"out": "/output/Result.tsv"
}
},
...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? I found them by googling, but there should be links in the page. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
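Applied to the pipeline from the question, the activity's typeProperties would then look something like the sketch below (paths taken from the question; the scriptLinkedService line assumes the StorageLinkedService shown in the answer above):
"typeProperties": {
"scriptPath": "scripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
}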
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Text(delimiter:'|');
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "kota";
OUTPUT @rs1
TO @out
USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.

Call stored procedure using ADF

I am loading a SQL Server table using ADF, and after the insertion is over I have to do a little manipulation. I have tried the approaches below:
Trigger (after insert) - failed; SQL Server is not able to detect the inserted records that I push using ADF. Seems to be a bug.
Stored procedure using a user-defined table type - getting the error:
Error Number '156'. Error message from database execution : Incorrect
syntax near the keyword 'select'. Must declare the table variable
"@a".
I have created below pipeline
{
"name": "CopyPipeline-xxx",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "sp_xxx",
"storedProcedureParameters": {
"stringProductData": {
"value": "str1"
}
},
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "col1:col1,col2:col2"
}
},
"inputs": [
{
"name": "InputDataset-3jg"
}
],
"outputs": [
{
"name": "OutputDataset-3jg"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 8
},
"name": "Activity-0-xxx_csv->[dbo]_[xxx_staging]"
}
],
"start": "2017-01-09T21:48:53.348Z",
"end": "2099-12-30T18:30:00Z",
"isPaused": false,
"hubName": "hub",
"pipelineMode": "Scheduled"
}
}
and I am using the stored procedure below:
create procedure [dbo].[sp_xxx] @xxx1 [dbo].[ut_xxx] READONLY, @str1 varchar(100) AS
MERGE xxx_dummy AS a
USING @xxx1 AS b
ON (a.col1 = b.col1)
WHEN NOT MATCHED
THEN INSERT(col1, col2)
VALUES(b.col1, b.col2)
WHEN MATCHED
THEN UPDATE SET a.col2 = b.col2;
Please help me to resolve the issue.
I can reproduce your first error. Inserting into a SQL Server table with Azure Data Factory (ADF) appears to use a bulk insert method (similar to BULK INSERT, bcp, SSIS, etc.), and by default these methods do not fire triggers:
insert bulk [dbo].[testADF] ([col1] Int, [col2] Int, [col3] Int, [col4] Int)
with (TABLOCK, CHECK_CONSTRAINTS)
With bcp and BULK INSERT there is a flag you can set to say 'fire triggers', but there appears to be no way to change this setting in ADF. As a workaround, move the logic from your trigger into the stored proc.
If you believe this flag is important, consider creating a feedback item.
