I've got the following issue. I'd like to schedule three U-SQL jobs at the following times every day: 02:00 UTC, 03:00 UTC and 04:00 UTC. I know that by default, jobs in the pipeline are executed at 12:00AM UTC, hence all my jobs run at the same time, which is not what I want.
I read the documentation and it says I should consider the offset parameter in the dataset template. However, when I try to set this, an error occurs.
I do not know how to set a U-SQL job to run at a time other than 12:00AM. Can you provide me some info on how to do that properly? In addition, I attach my dataset and pipeline templates:
Dataset
{
"name": "TransformedData2",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "ADLstore_linkedService_scrapper",
"typeProperties": {
"fileName": "TestOutput2.csv",
"folderPath": "transformedData/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
}
}
}
Pipeline
{
"name": "filtering",
"properties": {
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "usqljobs\\cleanStatements.txt",
"scriptLinkedService": "AzureStorageLinkedService",
"degreeOfParallelism": 5,
"priority": 100,
"parameters": {}
},
"outputs": [
{
"name": "TransformedData2"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "Brajan filtering",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-07-02T09:50:00Z",
"end": "2018-06-30T03:00:00Z",
"isPaused": false,
"hubName": "datafactoryfin_hub",
"pipelineMode": "Scheduled"
}
}
Thanks
Using the Offset attribute can get a little messy as you'll need to re-provision time slices at the dataset level.
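For reference, the offset the docs point you at lives in the dataset's availability section and would look roughly like this (just a sketch of the documented shape; I would avoid it, since every existing slice then needs re-provisioning):
"availability": {
    "frequency": "Day",
    "interval": 1,
    "style": "StartOfInterval",
    "offset": "02:00:00" // <<<<< shifts each daily slice to 02:00 UTC
}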
As an alternative I would suggest using the Delay attribute for the activity instead. This gives more control and does not require time slices to be re-provisioned.
So in your JSON...
{
"name": "filtering",
"properties": {
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "usqljobs\\cleanStatements.txt",
"scriptLinkedService": "AzureStorageLinkedService",
"degreeOfParallelism": 5,
"priority": 100,
"parameters": {}
},
"outputs": [
{
"name": "TransformedData2"
}
],
"policy": {
"delay": "02:00:00" // <<<<< 2:00am start
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "Brajan filtering",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-07-02T09:50:00Z",
"end": "2018-06-30T03:00:00Z",
"isPaused": false,
"hubName": "datafactoryfin_hub",
"pipelineMode": "Scheduled"
}
}
Then you'll of course need additional activities for the 3:00am and 4:00am versions.
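For example, a 3:00am variant could be a second activity in the same activities array with its own output dataset (a sketch only; "TransformedData3" and the activity name are placeholders you would have to create):
{
    "type": "DataLakeAnalyticsU-SQL",
    "typeProperties": {
        "scriptPath": "usqljobs\\cleanStatements.txt",
        "scriptLinkedService": "AzureStorageLinkedService",
        "degreeOfParallelism": 5,
        "priority": 100,
        "parameters": {}
    },
    "outputs": [
        {
            "name": "TransformedData3" // <<<<< placeholder dataset, create it like TransformedData2
        }
    ],
    "policy": {
        "delay": "03:00:00" // <<<<< 3:00am start
    },
    "scheduler": {
        "frequency": "Day",
        "interval": 1,
        "style": "StartOfInterval"
    },
    "name": "Brajan filtering 3am",
    "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
The 4:00am version would be the same again with "delay": "04:00:00" and its own output dataset.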
Check out this link for more info:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
Delay is mentioned about a quarter of the way down the page.
Hope this helps
Related
I have a copy pipeline set up to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure using Data Factory, and I run into a really weird issue.
Suppose there are three files in the S3 bucket. One is 30MB, another is 50MB, and the last is 70MB. If I put the 30MB file 'first' (naming it test0.tsv), Data Factory claims it successfully copies all three files to ADLS, but the second and third files are truncated to 30MB. The data in each file is correct, just truncated. If I put the 70MB file first, they are all copied properly. So it is using the first file's length as the maximum file size and truncating all subsequent longer files. This is also very worrisome to me, since Azure Data Factory claims it copied them successfully.
Here is my pipeline:
{
"name": "[redacted]Pipeline",
"properties": {
"description": "[redacted]",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"policy": {
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "[redacted]"
}
],
"start": "2018-07-06T04:00:00Z",
"end": "2018-07-30T04:00:00Z",
"isPaused": false,
"hubName": "[redacted]",
"pipelineMode": "Scheduled"
}
}
Here is my input dataset:
{
"name": "InputDataset",
"properties": {
"published": false,
"type": "AmazonS3",
"linkedServiceName": "[redacted]",
"typeProperties": {
"bucketName": "[redacted",
"key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Here is my output dataset:
{
"name": "OutputDataset",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "[redacted]",
"typeProperties": {
"folderPath": "[redacted]/{Year}/{Month}/{Day}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
I have removed the format fields in both the input and output dataset because I thought maybe having it be a binary copy would fix it, but that didn't work.
I was trying to use Azure Data Factory to copy data from an Azure SQL database to an Azure Search index.
I have created the pipeline, datasets and linked services correctly.
I am getting the following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?
The message means that, in the copy activity within the pipeline, you should use the executionLocation property within typeProperties (a peer of source and sink) to specify the region, like "executionLocation": "East US". See the docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement
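Applied to your pipeline, the copy activity's typeProperties would look roughly like this (a sketch based on your posted JSON; "East US" is only a placeholder, use the region your SQL database or Search service actually runs in):
"typeProperties": {
    "source": {
        "type": "SqlSource",
        "sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
    },
    "sink": {
        "type": "AzureSearchIndexSink"
    },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "topicid:id,createdby:sender"
    },
    "parallelCopies": 8,
    "executionLocation": "East US" // <<<<< placeholder region
}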
According to https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse, with 1000 DWU and PolyBase I should get around 200 MBps throughput, but I am getting 4.66 MBps. I have added the user to the xlargerc resource class to achieve the best possible throughput from Azure SQL Data Warehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset:
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Is anything missing in the configuration?
I took a look at run ID "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your copy activity. From our service logs, the source file is not large enough (270 MB in your case), so service call latency keeps the throughput from looking good. You could try loading bigger files to get better throughput.
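If you do split the data into several larger files, one simple option (a sketch based on your AzureBlob_Concept dataset, not a guaranteed fix) is to drop the single fileName so the copy picks up every blob under the folder:
"typeProperties": {
    "folderPath": "source/",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\t"
    }
}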
I'm very new to Data Factory and am having trouble understanding how to properly create a pipeline that executes a stored proc before performing a copy.
The stored proc is simply a TRUNCATE of the destination table, which is used as the output dataset in the second activity.
The Data Factory docs tell me that to execute the stored proc first, I should specify the proc's "output" as the "input" of the second activity.
However, there's no real "output" from the stored proc. To get it to "work", I cloned the output of the second activity, changed its name and set external=false to get past the provisioning errors, but that's obviously a total kludge.
It doesn't make sense to me why, at least in the case of a TRUNCATE performed by this stored proc, an output would even need to be defined.
But when I tried to use the output from the stored proc as an additional input, I received an error about a duplicated table name.
How can I get the TRUNCATE stored proc activity to successfully execute (and complete) prior to running the copy activity?
Here's the pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Traffic CRM - SytemUser Stage"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
}
}
Your stored procedure activity's output dataset, i.e. "Smart App - usp Truncate System User", should be the input of the next activity. If you are unsure what to put in the dataset, just create a dummy dataset like the one below:
{
"name": "DummySPDS",
"properties": {
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SQLServerLS",
"typeProperties": {
"tableName": "dummyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"IsExternal":"True"
}
}
Here is the complete pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"inputs": [
{
"name": "DummySPDS"
}
],
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
Error:
Copy activity met storage operation failure at 'Source' side. Error message from storage execution : Requested value 'AppendBlob' was not found..
I was originally trying to copy a blob to on-prem SQL and received the above error. Just for testing, I tried copying blob to blob and still received the same error.
I can see the blob in my container, but it seems the Azure Data Factory copy activity cannot access it. Do append blobs work with the Azure Data Factory copy activity? Has anyone else run into this issue, and any ideas on how to resolve it?
Thanks.
Azure Data Factory JSON definition files:
InputBlob:
{
"name": "InputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "ContractHeader.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
OutputBlob:
{
"name": "OutputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "Sample.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "PipelineBlobToBlob",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputBlobTest"
}
],
"outputs": [
{
"name": "OutputBlobTest"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": ""
}
],
"start": "2015-07-12T00:00:00Z",
"end": "2015-07-12T01:00:00Z",
"isPaused": false
}
}
Received confirmation: append blobs are currently not supported by the Azure Data Factory copy activity.