Azure Data Factory Pipeline With 2 Copy Activities

Can anyone please tell me how to create a pipeline with two copy activities?
Copy activity1 is for InputDataset1 and OutputDataset1
Copy activity2 is for InputDataset2 and OutputDataset2
The pipeline should be scheduled so that both activities run at the same time.

You simply need to include 2 copy activities in the same pipeline.
Like this:
{
"name": "Copy2Things",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset1"
}
],
"outputs": [
{
"name": "OutputDataset1"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1,
"style": "EndOfInterval"
},
"name": "activity1"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset2"
}
],
"outputs": [
{
"name": "OutputDataset2"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1,
},
"name": "activity2"
}
//etc...
In terms of them running at the same time, ADF will deal with that for you. Alternatively, put them in separate pipelines if you want to control each with its own start/stop/pause options.
Otherwise, you can increase the activity's concurrency value in the policy block if you want to copy multiple time slices of a dataset at the same time.
Example: InputDataset1, monthly slices, Jan, Feb, Mar, Apr. A concurrency of 2 will copy Jan & Feb in parallel, then Mar & Apr.
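For illustration, a policy block with a higher slice concurrency might look roughly like this (the timeout, retry, and priority values are placeholders, not taken from the pipeline above):
"policy": {
    "timeout": "01:00:00",
    "concurrency": 2,
    "executionPriorityOrder": "OldestFirst",
    "retry": 1
}
With monthly slices as in the example, this lets two slices be copied in parallel.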
Hope this helps.

Related

Azure Data Factory Pipeline with Multiple Downstream Activities Slow to Start in the Scheduler

I've created an ADF pipeline with two linked activities, the first one to run a stored procedure and the 2nd activity (Copy Data) to copy data from an Azure DW to an Azure SQL DB table. To link these two, I have put the output dataset of the stored procedure as an input of the Copy Data activity, even though that dataset is not actually used (there's a separate dataset for that).
Now the issue is that when I go into "Monitor and Manage" and run the 2nd activity with "Rerun with upstream in Pipeline", the 1st (stored procedure) activity runs quickly, but the 2nd activity waits for about 5 minutes before changing to In Progress. Why is this happening? Is it due to some time-slicing issue? The pipeline code is below:
{
"name": "RunADLAProc",
"properties": {
"description": "This will run the procedure for ADLA",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.BackUpDatabaseLog",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "StoredProcedureActivityTemplate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "select * from dbo.DatabaseLog"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60.00:00:00"
}
},
"inputs": [
{
"name": "AzureSqlDWInput"
},
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"outputs": [
{
"name": "AzureSQLDatasetOutput"
}
],
"policy": {
"timeout": "7.00:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "CopyActivityTemplate"
}
],
"start": "2018-05-09T00:00:00Z",
"end": "2018-05-12T00:00:00Z",
"isPaused": false,
"hubName": "testdatafactory-v1_hub",
"pipelineMode": "Scheduled"
}}
You can try ADF v2; it is easier to debug and has a GUI authoring tool.
The UI tool is at https://adf.azure.com
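If you do move to ADF v2, the dependency between the two activities is declared explicitly with dependsOn rather than through a dummy upstream dataset, so the copy starts as soon as the stored procedure succeeds. A rough sketch reusing the names from the v1 pipeline above (the linked service name is a placeholder, and the datasets would need to be redefined as v2 datasets):
{
    "name": "RunProcThenCopy",
    "properties": {
        "activities": [
            {
                "name": "StoredProcedureActivity",
                "type": "SqlServerStoredProcedure",
                "linkedServiceName": { "referenceName": "SqlLinkedService", "type": "LinkedServiceReference" },
                "typeProperties": { "storedProcedureName": "dbo.BackUpDatabaseLog" }
            },
            {
                "name": "CopyActivity",
                "type": "Copy",
                "dependsOn": [ { "activity": "StoredProcedureActivity", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "source": { "type": "SqlDWSource", "sqlReaderQuery": "select * from dbo.DatabaseLog" },
                    "sink": { "type": "SqlSink" }
                },
                "inputs": [ { "referenceName": "AzureSqlDWInput", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "AzureSQLDatasetOutput", "type": "DatasetReference" } ]
            }
        ]
    }
}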

Azure Data Factory Copy

I have an Azure Data Factory copy activity within a pipeline. The copy activity is working, but the data is copied multiple times. My data source is an Azure NoSQL DB (DocumentDB). How do I configure the copy activity to not re-copy records?
Here is my Activity
{
"name": "Copy Usage Session Data",
"properties":
{
"description": "",
"activities":
[
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "InstallationSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "machineKey: machineKey, product: product, softwareVersion: softwareVersion, id: DocumentDBId"
}
},
"inputs": [{"name": "Machine Registration Input Data"}],
"outputs": [{"name": "Machine Registration Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Machine Registration Data To History",
"description": "Copy Machine Registration Data To SQL Server DB Activity"
},
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "UsageSessionSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "id: usageSessionId, usageInstallationId: usageInstallationId, startTime: startTime, stopTime: stopTime, currentVersion: currentVersion"
}
},
"inputs": [{"name": "Usage Session Input Data"}],
"outputs": [{"name": "Usage Session Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 2,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Usage Session Data To History",
"description": "Copy Usage Session Data To SQL Server DB Activity"
}
],
"start": "2017-05-29T16:15:00Z",
"end": "2500-01-01T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Change the pipeline start date to the current date. If the pipeline start date is in the past, data slices are created for every period from that date up to the current date, and all of them will be copied. Also, you have set concurrency: 2, which means 2 activity runs execute at a time.
E.g. if your output dataset availability is 1 day and your pipeline start date is 29-05-2017, then up to today (16-06-2017) a total of 18 data slices will be created, one per day. With a concurrency of 2, 2 copy activity runs execute at a time; with concurrency: 10, 10 run in parallel.
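Concretely, that means moving the pipeline's start forward so slices are only generated from now on, e.g. (the dates are illustrative):
"start": "2017-06-16T00:00:00Z",
"end": "2500-01-01T00:00:00Z",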
Be careful with the output dataset availability, the pipeline start date, the concurrency, and the source query.
An example of a source query is $$Text.Format('select * from c where c.ModifiedDate >= \'{0:yyyy-MM-ddTHH:mm:ssZ}\' AND c.ModifiedDate < \'{1:yyyy-MM-ddTHH:mm:ssZ}\'', WindowStart, WindowEnd), where ModifiedDate is a column that records when the document was created in that particular collection.
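In the copy activity itself, that query just goes in the source's query property (ModifiedDate here is an assumed column name; use whatever created/modified timestamp your documents actually have):
"source": {
    "type": "DocumentDbCollectionSource",
    "query": "$$Text.Format('select * from c where c.ModifiedDate >= \'{0:yyyy-MM-ddTHH:mm:ssZ}\' AND c.ModifiedDate < \'{1:yyyy-MM-ddTHH:mm:ssZ}\'', WindowStart, WindowEnd)"
}
WindowStart and WindowEnd are the slice boundaries Data Factory substitutes at run time, so each slice reads only the documents from its own window.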
Updated:
{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [
{
"name": "PersonDocumentDbTable"
}
],
"outputs": [
{
"name": "PersonBlobTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromDocDbToBlob"
}
],
"start": "2015-04-01T00:00:00Z",
"end": "2015-04-02T00:00:00Z"
}
}
Have a look at Data Factory scheduling and execution for reference.
You can use a query with a created/modified date column (it should exist in your table) and pick only the records for the current date. The date range is supplied by the slice start and end dates, so you read only newly created records on a daily basis.

Azure data factory copy activity performance tuning

https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse. According to this link, with 1000 DWU and PolyBase I should get about 200 MBps throughput, but I am getting 4.66 MBps. I have added the user to the xlargerc resource class to achieve the best possible throughput from Azure SQL Data Warehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset:
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
Is anything missing in the configuration?
I took a look at runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your copy activity. From our service logs, the source file is not large enough (270 MB in your case), so the service-call latency keeps the throughput low. You could try loading bigger files to get better throughput.

Azure DataFactory chain activities

I'm very new to Data Factory and I'm having trouble understanding how to properly create a pipeline that executes a stored proc before performing a copy.
The stored proc is simply a TRUNCATE of the destination table, which is used as the output dataset in the second activity.
The Data Factory docs tell me that to execute the stored proc first, I should specify the proc's "output" as the "input" of the second activity.
However, there's no real "output" from the stored proc. To get it to "work", I cloned the output dataset of the second activity, changed its name, and set external=false to get past the provisioning errors, but that's obviously a total kludge.
It doesn't make sense to me why, at least for a TRUNCATE performed by this stored proc, there would even need to be an output defined.
But, when I tried to use the output from the stored proc as an additional input, I received an error about having a duplicated table name.
How can I get the TRUNCATE stored proc activity to successfully execute (and complete) prior to running the copy activity?
Here's the pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Traffic CRM - SytemUser Stage"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
}
}
Your SP activity's output dataset, i.e. "Smart App - usp Truncate System User", should be the input of the next activity. If you're not sure what to put in that dataset, just create a dummy dataset like the one below:
{
"name": "DummySPDS",
"properties": {
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SQLServerLS",
"typeProperties": {
"tableName": "dummyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"IsExternal":"True"
}
}
Here is the complete pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"inputs": [
{
"name": "DummySPDS"
}
],
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"

When to use AzureQueueSink

I am in the process of integrating an existing Azure Data Factory project into my solution. While looking through the data factory pipelines, I saw that all of them use SqlSource as the source and AzureQueueSink as the destination.
The input datasets are:
1. an on-prem table
2. the output of a stored procedure
The output is an Azure SQL table.
Now I am confused as to when to use AzureQueueSink. I searched on Google but did not find any information about its use case.
Below is the sample pipeline activity.
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "OnPremToAzureList",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.TruncateStgTable",
"storedProcedureParameters": { "TableName": "[dbo].[List]" }
},
"inputs": [
{
"name": "AzureSqlTableStart"
}
],
"outputs": [
{
"name": "AzureSqlTableTruncate"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "SPTruncateStgTable"
},
{
"name": "CopyActivityList",
"type": "Copy",
"inputs": [
{
"name": "OnPremList"
},
{
"name": "AzureSqlTableTruncate"
}
],
"outputs": [
{
"name": "AzureSqlTableList"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from dbo.List"
},
"sink": {
"type": "AzureQueueSink",
"writeBatchSize": 1000,
"writeBatchTimeout": "00:30:00"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 1,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
Any help is greatly appreciated.
Please do not use AzureQueueSink: copy into Azure Queue was never shipped and we have no plans to bring it back. It leaked into our SDK/schema by mistake :)
This sink type currently gives you nondeterministic behavior that happens to work, but that behavior won't last.
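Since the destination in these pipelines is an Azure SQL table anyway, the sink presumably just needs to be a plain SqlSink instead; a minimal sketch (the batch size and timeout values are illustrative):
"sink": {
    "type": "SqlSink",
    "writeBatchSize": 1000,
    "writeBatchTimeout": "00:30:00"
}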
