I have an Azure Data Factory Copy Activity within a pipeline. The copy activity works, but the data is copied multiple times. My data source is an Azure DocumentDB (NoSQL) collection. How do I configure the Copy Activity so that it does not recopy records?
Here is my pipeline:
{
"name": "Copy Usage Session Data",
"properties":
{
"description": "",
"activities":
[
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "InstallationSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "machineKey: machineKey, product: product, softwareVersion: softwareVersion, id: DocumentDBId"
}
},
"inputs": [{"name": "Machine Registration Input Data"}],
"outputs": [{"name": "Machine Registration Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Machine Registration Data To History",
"description": "Copy Machine Registration Data To SQL Server DB Activity"
},
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "UsageSessionSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "id: usageSessionId, usageInstallationId: usageInstallationId, startTime: startTime, stopTime: stopTime, currentVersion: currentVersion"
}
},
"inputs": [{"name": "Usage Session Input Data"}],
"outputs": [{"name": "Usage Session Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 2,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Usage Session Data To History",
"description": "Copy Usage Session Data To SQL Server DB Activity"
}
],
"start": "2017-05-29T16:15:00Z",
"end": "2500-01-01T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Change the pipeline start date to the current date. If the pipeline start date is in the past, a data slice is created for every interval from that date up to the current date, and each of them will be copied. You have also set "concurrency": 2, which means two slices of that activity run at a time.
For example, if your output dataset availability is one day and your pipeline start date is 2017-05-29, then by today (2017-06-16) a total of 18 data slices will have been created, one per day. With a concurrency of 2, two copy activities run at a time; with a concurrency of 10, ten run in parallel.
Be careful with the output dataset availability, the pipeline start date, the concurrency, and the source query.
An example of a source query is $$Text.Format('select * from c where c.ModifiedDate >= \'{0:yyyy-MM-ddTHH:mm:ssZ}\' AND c.ModifiedDate < \'{1:yyyy-MM-ddTHH:mm:ssZ}\'', WindowStart, WindowEnd), where ModifiedDate is a property that records when the document was created in that particular collection.
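As a sketch of how that windowed query could be plugged into the source of your first copy activity (assuming the documents in your collection actually carry a ModifiedDate property), so that each hourly slice reads only its own window:
"source":
{
    "type": "DocumentDbCollectionSource",
    // ModifiedDate is an assumed document property; replace it with whatever timestamp your documents carry
    "query": "$$Text.Format('select * from c where c.ModifiedDate >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' AND c.ModifiedDate < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', WindowStart, WindowEnd)"
}
WindowStart and WindowEnd resolve to the boundaries of the slice being processed, so rerunning a slice re-reads only that hour's documents instead of the whole collection.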
Updated:
{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [
{
"name": "PersonDocumentDbTable"
}
],
"outputs": [
{
"name": "PersonBlobTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromDocDbToBlob"
}
],
"start": "2015-04-01T00:00:00Z",
"end": "2015-04-02T00:00:00Z"
}
}
Have a look at Data Factory scheduling and execution
For your Reference
You can use a query on a created/modified date column (it should exist in your table) and pick only the records for the current slice. The slice start and end dates provide that window, so each run reads only the newly created records.
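To make the windowing concrete, here is a minimal sketch of the external source dataset whose availability defines those slices (the linked service and collection names are assumptions):
{
    "name": "Machine Registration Input Data",
    "properties":
    {
        "type": "DocumentDbCollection",
        "linkedServiceName": "DocumentDbLinkedService",
        "typeProperties": {"collectionName": "machineRegistrations"},
        "external": true,
        "availability": {"frequency": "Hour", "interval": 1}
    }
}
Marking the dataset as external tells Data Factory the data is produced outside the factory, and the hourly availability is what generates one slice per hour for the copy activity to process.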
Related
I have a copy pipeline set up to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure using Data Factory, and I am running into a really weird issue.
Suppose there are three files in the S3 bucket: one is 30 MB, another is 50 MB, and the last is 70 MB. If I put the 30 MB file 'first' (name it test0.tsv), ADF claims it successfully copies all three files to ADLS, but the second and third files are truncated to 30 MB. The data in each file is correct up to that point, but it is cut off. If I put the 70 MB file first, all three are copied properly. So it appears to be using the first file's length as the maximum file size and truncating all subsequent, longer files. This is also very worrisome to me because Azure Data Factory claims it copied them successfully.
Here is my pipeline:
{
"name": "[redacted]Pipeline",
"properties": {
"description": "[redacted]",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"policy": {
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "[redacted]"
}
],
"start": "2018-07-06T04:00:00Z",
"end": "2018-07-30T04:00:00Z",
"isPaused": false,
"hubName": "[redacted]",
"pipelineMode": "Scheduled"
}
}
Here is my input dataset:
{
"name": "InputDataset",
"properties": {
"published": false,
"type": "AmazonS3",
"linkedServiceName": "[redacted]",
"typeProperties": {
"bucketName": "[redacted",
"key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Here is my output dataset:
{
"name": "OutputDataset",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "[redacted]",
"typeProperties": {
"folderPath": "[redacted]/{Year}/{Month}/{Day}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
}
]
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
I have removed the format fields in both the input and output datasets because I thought that making it a binary copy might fix it, but that didn't work.
I've created an ADF pipeline with two linked activities: the first runs a stored procedure, and the second (Copy Data) copies data from an Azure SQL DW to an Azure SQL DB table. To link the two, I set the output dataset of the stored procedure as an input of the Copy Data activity, even though that dataset is not actually used (there is a separate dataset for that).
The issue is that when I go into "Monitor and Manage" and run the second activity with "Rerun with upstream in Pipeline", the first (stored procedure) activity runs quickly, but the second activity then waits for about 5 minutes before changing to In Progress. Why is this happening? Is it due to some time-slicing issue? The pipeline code is below:
{
"name": "RunADLAProc",
"properties": {
"description": "This will run the procedure for ADLA",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.BackUpDatabaseLog",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "StoredProcedureActivityTemplate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "select * from dbo.DatabaseLog"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60.00:00:00"
}
},
"inputs": [
{
"name": "AzureSqlDWInput"
},
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"outputs": [
{
"name": "AzureSQLDatasetOutput"
}
],
"policy": {
"timeout": "7.00:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "CopyActivityTemplate"
}
],
"start": "2018-05-09T00:00:00Z",
"end": "2018-05-12T00:00:00Z",
"isPaused": false,
"hubName": "testdatafactory-v1_hub",
"pipelineMode": "Scheduled"
}}
You can try ADF v2; it is easier to debug and it has a GUI authoring tool.
The UI tool is at https://adf.azure.com.
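As a rough sketch of the v2 approach (all names here are illustrative), activities are chained with an explicit dependsOn instead of dummy datasets, so the copy starts as soon as the stored procedure succeeds rather than waiting on slice scheduling:
{
    "name": "RunProcThenCopy",
    "properties": {
        "activities": [
            {
                "name": "RunStoredProc",
                "type": "SqlServerStoredProcedure",
                "linkedServiceName": {"referenceName": "AzureSqlDWLinkedService", "type": "LinkedServiceReference"},
                "typeProperties": {"storedProcedureName": "dbo.BackUpDatabaseLog"}
            },
            {
                "name": "CopyDWToSqlDb",
                "type": "Copy",
                // runs only after RunStoredProc reports success
                "dependsOn": [{"activity": "RunStoredProc", "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "source": {"type": "SqlDWSource", "sqlReaderQuery": "select * from dbo.DatabaseLog"},
                    "sink": {"type": "SqlSink"}
                },
                "inputs": [{"referenceName": "AzureSqlDWDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "AzureSqlDbDataset", "type": "DatasetReference"}]
            }
        ]
    }
}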
Can anyone please tell me how to create a pipeline with two copy activities?
Copy activity 1 is for InputDataset1 and OutputDataset1.
Copy activity 2 is for InputDataset2 and OutputDataset2.
The pipeline should be scheduled to run both activities at the same time.
You simply need to include 2 copy activities in the same pipeline.
Like this:
{
"name": "Copy2Things",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset1"
}
],
"outputs": [
{
"name": "OutputDataset1"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1,
"style": "EndOfInterval"
},
"name": "activity1"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset2"
}
],
"outputs": [
{
"name": "OutputDataset2"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1,
},
"name": "activity2"
}
//etc...
In terms of running them at the same time, ADF will deal with that for you. Or put them in separate pipelines if you want to control each with its own start/stop/pause options.
Otherwise, you can increase the activity's concurrency value in the policy block if you want multiple time slices of the defined datasets to be copied at the same time.
Example: InputDataset1 with monthly slices for Jan, Feb, Mar, and Apr. A concurrency of 2 will copy Jan & Feb in parallel, then Mar & Apr.
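For instance, a policy block along these lines (values are illustrative) lets two slices of the same activity be processed in parallel:
"policy": {
    "timeout": "01:00:00",
    // process up to two slices of this activity concurrently
    "concurrency": 2,
    "executionPriorityOrder": "OldestFirst"
}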
Hope this helps.
I'm very new to Data Factory and am having trouble understanding how to properly create a pipeline that executes a stored proc before performing a copy.
The stored proc simply TRUNCATEs the destination table, which is used as the output dataset in the second activity.
The Data Factory docs tell me that to execute the stored proc first, I should specify the proc's "output" as the "input" of the second activity.
However, there's no real "output" from the stored proc. To get it to "work", I cloned the output of the second activity, changed its name, and set external=false to get past the provisioning errors, but that's obviously a total kludge.
It doesn't make sense to me why an output even needs to be defined, at least for a TRUNCATE performed by this stored proc.
But when I tried to use the stored proc's output as an additional input, I received an error about a duplicated table name.
How can I get the TRUNCATE stored proc activity to successfully execute (and complete) before the copy activity runs?
Here's the pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Traffic CRM - SytemUser Stage"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
}
}
Your SP activity's output dataset, i.e. "Smart App - usp Truncate System User", should be the input of the next activity. If you are unsure what to put in that dataset, just create a dummy dataset like the one below:
{
"name": "DummySPDS",
"properties": {
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SQLServerLS",
"typeProperties": {
"tableName": "dummyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"IsExternal":"True"
}
}
Here is the complete pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"inputs": [
{
"name": "DummySPDS"
}
],
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
I am in the process of integrating an existing Azure Data Factory project into my solution. While looking over the data factory pipelines, I saw that all of them use SqlSource and the destination is AzureQueueSink.
The input datasets are:
1. an on-prem table
2. the output of a stored procedure
The output is an Azure SQL table.
Now I am confused as to when this AzureQueueSink should be used. I searched on Google but did not find any information about a use case for it.
Below is the sample pipeline activity.
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "OnPremToAzureList",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.TruncateStgTable",
"storedProcedureParameters": { "TableName": "[dbo].[List]" }
},
"inputs": [
{
"name": "AzureSqlTableStart"
}
],
"outputs": [
{
"name": "AzureSqlTableTruncate"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "SPTruncateStgTable"
},
{
"name": "CopyActivityList",
"type": "Copy",
"inputs": [
{
"name": "OnPremList"
},
{
"name": "AzureSqlTableTruncate"
}
],
"outputs": [
{
"name": "AzureSqlTableList"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from dbo.List"
},
"sink": {
"type": "AzureQueueSink",
"writeBatchSize": 1000,
"writeBatchTimeout": "00:30:00"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 1,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
Any help is greatly appreciated.
Please do not use AzureQueueSink: copy into Azure Queue has never shipped, and we don't have any plans to bring it back. It leaked into our SDK/schema by mistake :)
This sink type currently gives you non-deterministic behavior that happens to work, but that behavior will not last long.
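Since the output dataset here is an Azure SQL table, a sketch of the corrected sink (reusing the batch settings from the pipeline above) would simply switch to SqlSink:
"sink": {
    "type": "SqlSink",
    // writes the rows into the Azure SQL table dataset (AzureSqlTableList)
    "writeBatchSize": 1000,
    "writeBatchTimeout": "00:30:00"
}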