Hi, I am using Azure Data Factory for a Copy activity.
I want the copy to be recursive across a container and its subfolders, which are laid out as follows:
myfolder/Year/Month/Day/Hour/New_Generated_File.csv
The files that I generate and import into the folder always have different names.
The problem is that the activity seems to wait forever.
The pipeline is scheduled hourly.
I'm attaching the JSON of the dataset and the linked service.
Dataset:
{
"name": "Txns_In_Blob",
"properties": {
"structure": [
{
"name": "Column0",
"type": "String"
},
[....Other Columns....]
],
"published": false,
"type": "AzureBlob",
"linkedServiceName": "LinkedService_To_Blob",
"typeProperties": {
"folderPath": "uploadtransactional/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/{Custom}.csv",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": " "
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Linked Service:
{
"name": "LinkedService_To_Blob",
"properties": {
"description": "",
"hubName": "dataorchestrationsystem_hub",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=wizestorage;AccountKey=**********"
}
}
}
It is not mandatory to give the file name in the dataset's folderPath property. Just remove the file name, and Data Factory will then load all the files in that folder for you.
{
"name": "Txns_In_Blob",
"properties": {
"structure": [
{
"name": "Column0",
"type": "String"
},
[....Other Columns....]
],
"published": false,
"type": "AzureBlob",
"linkedServiceName": "LinkedService_To_Blob",
"typeProperties": {
"folderPath": "uploadtransactional/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "%M" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "%d" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": " "
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
With the above folderPath, the runtime value will be generated as
uploadtransactional/yearno=2016/monthno=05/dayno=30/hourno=07/ for a pipeline slice that executes now (UTC time zone).
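If the new files can also land in nested subfolders below the hour folder, make sure the copy activity's BlobSource has recursive set to true. A minimal sketch of the activity's typeProperties (the SqlSink here is only an assumption, since the pipeline JSON was not posted):
"typeProperties": {
    "source": {
        "type": "BlobSource",
        "recursive": true
    },
    "sink": {
        "type": "SqlSink"
    }
}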
I am trying to migrate a pipeline that already exists in ADF V1 to ADF V2 and have some issues with the concept of triggers. My pipeline has two activities: the first is an Azure Data Lake Analytics activity and the second a copy activity.
The first activity runs a U-SQL script that reads data from the partitioned folder /{yyyy}/{MM}/{dd}/, processes it, and writes it to the folder /{yyyy}-{MM}-{dd}/.
Here are some JSON files from my factory (pipeline, trigger and datasets).
Pipeline:
{
"name": "StreamCompressionBlob2SQL",
"properties": {
"activities": [
{
"name": "compress",
"type": "DataLakeAnalyticsU-SQL",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"scriptPath": "d00044653/azure-configurations/usql-scripts/stream/compression.usql",
"scriptLinkedService": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"parameters": {
"Year": {
"value": "#formatDateTime(pipeline().parameters.windowStartTime,'yyyy')",
"type": "Expression"
},
"Month": {
"value": "#formatDateTime(pipeline().parameters.windowStartTime,'MM')",
"type": "Expression"
},
"Day": {
"value": "#formatDateTime(pipeline().parameters.windowStartTime,'dd')",
"type": "Expression"
}
}
},
"linkedServiceName": {
"referenceName": "AzureDataLakeAnalytics1",
"type": "LinkedServiceReference"
}
},
{
"name": "Blob2SQL",
"type": "Copy",
"dependsOn": [
{
"activity": "compress",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000
},
"enableStaging": false,
"dataIntegrationUnits": 0,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"tag": "TAG",
"device_id": "DEVICE_ID",
"system_id": "SYSTEM_ID",
"utc": "UTC",
"ts": "TS",
"median": "MEDIAN",
"min": "MIN",
"max": "MAX",
"avg": "AVG",
"stdev": "STDEV",
"first_value": "FIRST_VALUE",
"last_value": "LAST_VALUE",
"message_count": "MESSAGE_COUNT"
}
}
},
"inputs": [
{
"referenceName": "AzureBlobDataset_COMPRESSED_ASA_v1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDataset_T_ASSET_MONITORING_WARM_ASA_v1",
"type": "DatasetReference"
}
]
}
],
"parameters": {
"windowStartTime": {
"type": "String"
}
}
}
}
Trigger:
{
"name": "trigger1",
"properties": {
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "StreamCompressionBlob2SQL",
"type": "PipelineReference"
},
"parameters": {
"windowStartTime": "#trigger().scheduledTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2018-08-17T10:46:00.000Z",
"endTime": "2018-11-04T10:46:00.000Z",
"timeZone": "UTC"
}
}
}
}
Input Dataset for Copy Activity:
{
"name": "AzureBlobDataset_COMPRESSED_ASA_v1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"parameters": {
"Year": {
"type": "String",
"defaultValue": "#formatDateTime(pipeline().parameters.windowStartTime,'yyyy')"
},
"Month": {
"type": "String",
"defaultValue": "#formatDateTime(pipeline().parameters.windowStartTime,'yyyy')"
},
"Day": {
"type": "String",
"defaultValue": "#formatDateTime(pipeline().parameters.windowStartTime,'yyyy')"
}
},
"type": "AzureBlob",
"structure": [
{
"name": "tag",
"type": "String"
},
{
"name": "device_id",
"type": "String"
},
{
"name": "system_id",
"type": "String"
},
{
"name": "utc",
"type": "DateTime"
},
{
"name": "ts",
"type": "DateTime"
},
{
"name": "median",
"type": "Double"
},
{
"name": "min",
"type": "Double"
},
{
"name": "max",
"type": "Double"
},
{
"name": "avg",
"type": "Double"
},
{
"name": "stdev",
"type": "Double"
},
{
"name": "first_value",
"type": "Double"
},
{
"name": "last_value",
"type": "Double"
},
{
"name": "message_count",
"type": "Int16"
}
],
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ";",
"nullValue": "\\N",
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": true
},
"fileName": "",
"folderPath": {
"value": "#concat('d00044653/processed/stream/compressed',dataset().Year,'-',dataset().Month,'-',dataset().Day)",
"type": "Expression"
}
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Output Dataset for Copy Activity:
{
"name": "AzureSQLDataset_T_ASSET_MONITORING_WARM_ASA_v1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureSqlDatabase1",
"type": "LinkedServiceReference"
},
"type": "AzureSqlTable",
"structure": [
{
"name": "TAG",
"type": "String"
},
{
"name": "DEVICE_ID",
"type": "String"
},
{
"name": "SYSTEM_ID",
"type": "String"
},
{
"name": "UTC",
"type": "DateTime"
},
{
"name": "TS",
"type": "DateTime"
},
{
"name": "MEDIAN",
"type": "Decimal"
},
{
"name": "MIN",
"type": "Decimal"
},
{
"name": "MAX",
"type": "Decimal"
},
{
"name": "AVG",
"type": "Decimal"
},
{
"name": "STDEV",
"type": "Decimal"
},
{
"name": "FIRST_VALUE",
"type": "Decimal"
},
{
"name": "LAST_VALUE",
"type": "Decimal"
},
{
"name": "MESSAGE_COUNT",
"type": "Int32"
}
],
"typeProperties": {
"tableName": "[dbo].[T_ASSET_MONITORING_WARM]"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
My problem is that after publishing nothing happens.
Any suggestions??
Schedule triggers do not support backfill scenarios (based on your trigger definition, you are starting from August 17th 2018). With a schedule trigger, pipeline runs can be executed only for time periods from the current time onwards.
For backfill scenarios like yours, use a tumbling window trigger instead.
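An equivalent tumbling window trigger could look roughly like the sketch below. This is only a minimal sketch based on your schedule trigger: tumbling window triggers support only Minute and Hour frequencies, so the daily window is expressed as a 24-hour interval, and the window start is passed to the pipeline via the documented @trigger().outputs.windowStartTime expression instead of the schedule trigger's scheduledTime.
{
    "name": "trigger1_tumbling",
    "properties": {
        "type": "TumblingWindowTrigger",
        "runtimeState": "Started",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,
            "startTime": "2018-08-17T10:46:00Z",
            "endTime": "2018-11-04T10:46:00Z",
            "delay": "00:00:00",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "StreamCompressionBlob2SQL",
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStartTime": "@trigger().outputs.windowStartTime"
            }
        }
    }
}
Because the startTime lies in the past, this trigger will backfill one 24-hour window at a time from that date once it is started.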
I have a copy pipeline set up in Data Factory to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure, and I'm running into a really weird issue.
Suppose there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If I put the 30 MB file 'first' (by naming it test0.tsv), the copy claims to have successfully copied all three files to ADLS, but the second and third files are truncated to 30 MB. The data in each file is correct, just truncated. If I put the 70 MB file first, all three are copied properly. So it seems to use the first file's length as the maximum file size and truncates all subsequent, longer files. This is also very worrisome to me, since Azure Data Factory claims it copied them successfully.
Here is my pipeline:
{
"name": "[redacted]Pipeline",
"properties": {
"description": "[redacted]",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"policy": {
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "[redacted]"
}
],
"start": "2018-07-06T04:00:00Z",
"end": "2018-07-30T04:00:00Z",
"isPaused": false,
"hubName": "[redacted]",
"pipelineMode": "Scheduled"
}
}
Here is my input dataset:
{
"name": "InputDataset",
"properties": {
"published": false,
"type": "AmazonS3",
"linkedServiceName": "[redacted]",
"typeProperties": {
"bucketName": "[redacted",
"key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Here is my output dataset:
{
"name": "OutputDataset",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "[redacted]",
"typeProperties": {
"folderPath": "[redacted]/{Year}/{Month}/{Day}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
I have removed the format fields from both the input and output datasets because I thought making it a binary copy might fix it, but that didn't work.
I was trying to use Azure Data Factory to copy data from an Azure SQL database to an Azure Search index.
I have created the pipeline, datasets and linked services correctly.
I am getting the following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?
The message means that in the copy activity within the pipeline, you should use the executionLocation property within typeProperties (a peer of source and sink) to specify the region, for example "executionLocation": "East US". See the docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement
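Applied to your pipeline, the copy activity's typeProperties would look roughly like this (the region value "East US" is only an example; use the region your Azure SQL database actually lives in):
"typeProperties": {
    "source": {
        "type": "SqlSource",
        "sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
    },
    "sink": {
        "type": "AzureSearchIndexSink"
    },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "topicid:id,createdby:sender"
    },
    "parallelCopies": 8,
    "executionLocation": "East US"
}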
According to https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse, with 1000 DWU and PolyBase I should get 200 MBps throughput, but I am getting 4.66 MBps. I have added the user to the xlargerc resource class to achieve the best possible throughput from Azure SQL Data Warehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset:
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Is anything missing in the configuration?
I took a look at runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your copy activity. From our service logs, the source file is not large enough (270 MB in your case), so the service-call latency keeps the throughput low. You could try loading bigger files to get better throughput.
Error:
Copy activity met storage operation failure at 'Source' side. Error message from storage execution : Requested value 'AppendBlob' was not found..
I was originally trying to copy a blob to on-premises SQL and received the above error. Just for testing, I tried copying blob to blob and still received the same error.
I can see the blob in my container, but it seems the Azure Data Factory copy activity cannot access it. Do append blobs work with the Azure Data Factory copy activity? Has anyone else run into this issue? Any ideas on how to resolve it?
Thanks.
Azure Data Factory JSON definition files:
InputBlob:
{
"name": "InputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "ContractHeader.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
OutputBlob:
{
"name": "OutputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "Sample.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "PipelineBlobToBlob",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputBlobTest"
}
],
"outputs": [
{
"name": "OutputBlobTest"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": ""
}
],
"start": "2015-07-12T00:00:00Z",
"end": "2015-07-12T01:00:00Z",
"isPaused": false
}
}
Received confirmation: append blobs are currently not supported by the Azure Data Factory copy activity.