Azure data factory copy activity performance tuning

Azure data factory copy activity performance tuning - azure

https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse. According this link with 1000 DWU and polybase I should get 200MBps throughput. But I am getting 4.66 MBps. I have added user in xlargerc resource class to achieve best possible throughput from azure sql datawarehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset :
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
is anything is missing in the configuration?

I took a look on runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your Copy Activity. From our service logs, the source file is not large enough (270 MB in your case), so that the service call latency would make the throughput not good enough. You could try loading bigger files to have better throughput.

Related

Azure Data Factory truncating files in a copy pipeline from an S3 bucket to ADLS

I have a copy pipeline set up to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure using data factory. I run into this really weird issue.
Suppose there are three files in the S3 bucket. One is 30MB, another is 50MB, and the last is 70MB. If I put the 30M file 'first' (name it test0.tsv), then it claims it successfully copies all three files to ADLS. But, the second and third file are truncated to 30M. The data is correct for each file, but, it is truncated. If I put the 70M file first, then they are all copied properly. So, it is using the first file length as max file size and truncating all subsequent longer files. This is also very worrisome to me since the Azure Data Factory claims it successfully copied them.
Here is my pipeline:
{
"name": "[redacted]Pipeline",
"properties": {
"description": "[redacted]",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"policy": {
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "[redacted]"
}
],
"start": "2018-07-06T04:00:00Z",
"end": "2018-07-30T04:00:00Z",
"isPaused": false,
"hubName": "[redacted]",
"pipelineMode": "Scheduled"
}
}
Here is my input dataset:
{
"name": "InputDataset",
"properties": {
"published": false,
"type": "AmazonS3",
"linkedServiceName": "[redacted]",
"typeProperties": {
"bucketName": "[redacted",
"key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Here is my output dataset:
{
"name": "OutputDataset",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "[redacted]",
"typeProperties": {
"folderPath": "[redacted]/{Year}/{Month}/{Day}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
I have removed the format fields in both the input and output dataset because I thought maybe having it be a binary copy would fix it, but that didn't work.

Schedule U-SQL jobs in Azure Data Factory

I've got a following issue. I'd like to schedule three U-SQL jobs in following timing: 02:00UTC, 03:00UTC and 04:00UTC everyday. I know that by default, jobs in the pipeline are executed at 12:00AM UTC hence all my jobs run at the same time which is not what I want.
I red the documentation and it is written that I should consider offset parameter in dataset template. However when I try to set this the following error occurs: .
I do not knot how to set different than 12:00AM runtime of U-SQL job. Can You provide me some info on how to do that properly? In addition I attach my template of a dataset and a pipeline:
Dataset
{
"name": "TransformedData2",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "ADLstore_linkedService_scrapper",
"typeProperties": {
"fileName": "TestOutput2.csv",
"folderPath": "transformedData/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
}
}
}
Pipeline
{
"name": "filtering",
"properties": {
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "usqljobs\\cleanStatements.txt",
"scriptLinkedService": "AzureStorageLinkedService",
"degreeOfParallelism": 5,
"priority": 100,
"parameters": {}
},
"outputs": [
{
"name": "TransformedData2"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "Brajan filtering",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-07-02T09:50:00Z",
"end": "2018-06-30T03:00:00Z",
"isPaused": false,
"hubName": "datafactoryfin_hub",
"pipelineMode": "Scheduled"
}
}
Thanks

Using the Offset attribute can get a little messy as you'll need to re-provision time slices at the dataset level.
As an alternative I would suggest using the Delay attribute at for the activity. This gives more control and does not require time slices to be re-provisioned.
So in your JSON...
{
"name": "filtering",
"properties": {
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "usqljobs\\cleanStatements.txt",
"scriptLinkedService": "AzureStorageLinkedService",
"degreeOfParallelism": 5,
"priority": 100,
"parameters": {}
},
"outputs": [
{
"name": "TransformedData2"
}
],
"policy": {
"delay": "02:00:00" // <<<<< 2:00am start
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "Brajan filtering",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-07-02T09:50:00Z",
"end": "2018-06-30T03:00:00Z",
"isPaused": false,
"hubName": "datafactoryfin_hub",
"pipelineMode": "Scheduled"
}
Then you'll of course need additional activities for the 3:00am and 4:00am versions.
Check out this link for more info:
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
Delay is mentioned about a quarter of the way down the page.
Hope this helps

Can't detect data source location. Please specify ExecutionLocation in CopyActivity

I was trying to use the Azure data factory to copy data from azure sql database to Azure search index.
I have created the pipeline, dataset and linked services correctly.
I am getting following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?

The message means, in the copy activity within the pipeline, use the executionLocation property within typeProperties (peer of source and sink) to specify the region, like "executionLocation": "East US". See docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement

azure data factory recursive copy from container

Hi I am using Azure Data Factory for a Copy activity.
I want the copy to be recursive across a container and it subfolders as follows:
myfolder/Year/Month/Day/Hour}/New_Generated_File.csv
The files that I am generating and importing into the folder have always a different name.
The problem is that activity seems to waiting for ever.
The pipeline is scheduled hourly.
I'm attaching the json code of the dataset and the linked service.
Dataset:
{
"name": "Txns_In_Blob",
"properties": {
"structure": [
{
"name": "Column0",
"type": "String"
},
[....Other Columns....]
],
"published": false,
"type": "AzureBlob",
"linkedServiceName": "LinkedService_To_Blob",
"typeProperties": {
"folderPath": "uploadtransactional/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/{Custom}.csv",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": " "
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Linked Service:
{
"name": "LinkedService_To_Blob",
"properties": {
"description": "",
"hubName": "dataorchestrationsystem_hub",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=wizestorage;AccountKey=**********"
}
}
}

It is not mandatory to give the file name in the dataset's folderPath property. Just remove the file name and then all the files will be loaded by the datafactory for you.
{
"name": "Txns_In_Blob",
"properties": {
"structure": [
{
"name": "Column0",
"type": "String"
},
[....Other Columns....]
],
"published": false,
"type": "AzureBlob",
"linkedServiceName": "LinkedService_To_Blob",
"typeProperties": {
"folderPath": "uploadtransactional/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "%M" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "%d" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": " "
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
With the above folderPath it will generate the run time value
uploadtransactional/yearno=2016/monthno=05/dayno=30/hourno=07/ for a pipeline which executes UTC time zone now

Data Factory Copy Activity met storage failure at source side - AppendBlob was not found

Error:
Copy activity met storage operation failure at 'Source' side. Error message from storage execution : Requested value 'AppendBlob' was not found..
I was originally trying to copy a blob to on prem sql and I received the above error. Just for testing I am trying to copy blob to blob and I still received the same error.
I can see the blob in my container. But it seems like azure data factory copy activity cannot access it. Do AppendBlobs work with azure data factory copy activity? Any one else run into this issue, any ideas on how to resolve?
Thanks.
Azure Data Factory JSON definition files:
InputBlob:
{
"name": "InputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "ContractHeader.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
OuputBlob:
{
"name": "OutputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "Sample.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
Pipeline:
{
"name": "PipelineBlobToBlob",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputBlobTest"
}
],
"outputs": [
{
"name": "OutputBlobTest"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": ""
}
],
"start": "2015-07-12T00:00:00Z",
"end": "2015-07-12T01:00:00Z",
"isPaused": false
}

Received confirmation, Append blobs are currently not supported with azure data factory copy activity.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Azure data factory copy activity performance tuning - azure

Related

Azure Data Factory truncating files in a copy pipeline from an S3 bucket to ADLS

Schedule U-SQL jobs in Azure Data Factory

Can't detect data source location. Please specify ExecutionLocation in CopyActivity

azure data factory recursive copy from container

Data Factory Copy Activity met storage failure at source side - AppendBlob was not found

Categories

Resources