Azure DataFactory chain activities - azure

I'm very new to DataFactory and having issues understanding how to properly create a Pipeline that will execute a stored proc before performing a copy function.
The stored proc is simply a TRUNCATE of the destination table which is used as the output dataset in the second activity.
From the DataFactory docs, it tells me that to execute the stored proc first, specify the proc's "output" as the "input" of the second activity.
However, there's no real "output" from the stored proc. In order to get it to "work", I cloned the output of the second activity, changed the name of it and made it external=false to make it past the provisioning errors but that's obviously a total kludge.
It doesn't make sense to me that, at least in the case of a TRUNCATE action performed by this stored proc, why there would even need to be an output defined.
But, when I tried to use the output from the stored proc as an additional input, I received an error about having a duplicated table name.
How can I get the TRUNCATE stored proc activity to successfully execute (and complete) prior to running the copy activity?
Here's the pipeline code:
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Traffic CRM - SytemUser Stage"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"
}
}

Your SP activity output dataset i.e. "name": "Smart App - usp Truncate System User" should be input for the next activity. If you have confusion on what to put in dataset, just create a dummy dataset like below
{
"name": "DummySPDS",
"properties": {
"published": false,
"type": "SqlServerTable",
"linkedServiceName": "SQLServerLS",
"typeProperties": {
"tableName": "dummyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"IsExternal":"True"
}
}
Here is complete pipeline code
{
"name": "Traffic CRM - System User Stage",
"properties": {
"description": "Move System User to Stage",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.usp_Truncate_Traffic_Crm_SystemUser",
"storedProcedureParameters": {}
},
"inputs": [
{
"name": "DummySPDS"
}
],
"outputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Smart App - SystemUser Truncate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [dbo].[Traffic_Crm_SystemUser]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "All columns mapped here"
}
},
"inputs": [
{
"name": "Smart App - usp Truncate System User"
}
],
"outputs": [
{
"name": "Smart App - System User Stage Production"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-[dbo]_[Traffic_Crm_SystemUser]->[dbo]_[Traffic_Crm_SystemUser]"
}
],
"start": "2017-01-19T14:30:57.309Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "stagingdatafactory1_hub",
"pipelineMode": "Scheduled"

Related

Error while passing arguments to Pyspark Script in Azure data Factory

I am running a PySpark script from Azure Data Factory.
I have mentioned the arguments in the given section under Script/Jar as below.
The arguments are a Key Value pair.
Arguments are being submitted fine as seen below.
--arg '--APP_NAME ABC' --arg '--CONFIG_FILE_PATH wasbs://ABC --arg '--OUTPUT_INFO wasbs://XYZ
When the pipeline is executed I am getting the below Error.
usage: Data.py [-h] --CONFIG_FILE_PATH CONFIG_FILE_PATH --OUTPUT_INFO
OUTPUT_INFO --ACTION_CODE ACTION_CODE --RUN_ID RUN_ID
--APP_NAME APP_NAME --JOB_ID JOB_ID --TASK_ID TASK_ID
--PCS_ID PCS_ID --DAG_ID DAG_ID
Data.py: error: argument --CONFIG_FILE_PATH is required.
You can passing arguments to Pyspark Script in Azure data Factory.
Code:
{
"name": "SparkActivity",
"properties": {
"activities": [
{
"name": "Spark1",
"type": "HDInsightSpark",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"rootPath": "adftutorial/spark/script",
"entryFilePath": "WordCount_Spark.py",
"arguments": [
"--input-file",
"wasb://sampledata#chepra.blob.core.windows.net/data",
"--output-file",
"wasb://sampledata#chepra.blob.core.windows.net/results"
],
"sparkJobLinkedService": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
}
},
"linkedServiceName": {
"referenceName": "HDInsight",
"type": "LinkedServiceReference"
}
}
],
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
WALKTHROUGH for passing arguments in ADF:
Some of the example for passing the parameters in Azure Data Factory:
{
"name": "SparkSubmit",
"properties": {
"description": "Submit a spark job",
"activities": [
{
"type": "HDInsightMapReduce",
"typeProperties": {
"className": "com.adf.spark.SparkJob",
"jarFilePath": "libs/spark-adf-job-bin.jar",
"jarLinkedService": "StorageLinkedService",
"arguments": [
"--jarFile",
"libs/sparkdemoapp_2.10-1.0.jar",
"--jars",
"/usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
"--mainClass",
"com.adf.spark.demo.Demo",
"--master",
"yarn-cluster",
"--driverMemory",
"2g",
"--driverExtraClasspath",
"/usr/lib/hdinsight-logging/*",
"--executorCores",
"1",
"--executorMemory",
"4g",
"--sparkHome",
"/usr/hdp/current/spark-client",
"--connectionString",
"DefaultEndpointsProtocol=https;AccountName=<YOUR_ACCOUNT>;AccountKey=<YOUR_KEY>",
"input=wasb://input#<YOUR_ACCOUNT>.blob.core.windows.net/data",
"output=wasb://output#<YOUR_ACCOUNT>.blob.core.windows.net/results"
]
},
"inputs": [
{
"name": "input"
}
],
"outputs": [
{
"name": "output"
}
],
"policy": {
"executionPriorityOrder": "OldestFirst",
"timeout": "01:00:00",
"concurrency": 1,
"retry": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Spark Launcher",
"description": "Submits a Spark Job",
"linkedServiceName": "HDInsightLinkedService"
}
],
"start": "2015-11-16T00:00:01Z",
"end": "2015-11-16T23:59:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}

Azure Data Factory Pipepline with Multiple Downstream Activity Slow in Downstream Scheduler Starting

I've created an ADF pipeline with two linked activities, the first one to run a stored procedure and the 2nd activity (Copy Data) to copy data from a Azure DW to Azure SQL DB table. To link these two I have put the output dataset of the stored procedure as an input of the Copy Data activity even though that dataset is not used (there's a separate dataset for that).
Now the issue is when I get into "Monitor and Manage" and run the 2nd activity with "Rerun with upstream in Pipeline" the 1st stored procedure activity runs quickly and then the 2nd activity waits for about 5 mins before changing to In Progress. Why is this happening? Is it due to some time slicing issue? The Pipeline code is as below:
{
"name": "RunADLAProc",
"properties": {
"description": "This will run the procedure for ADLA",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.BackUpDatabaseLog",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "StoredProcedureActivityTemplate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "select * from dbo.DatabaseLog"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60.00:00:00"
}
},
"inputs": [
{
"name": "AzureSqlDWInput"
},
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"outputs": [
{
"name": "AzureSQLDatasetOutput"
}
],
"policy": {
"timeout": "7.00:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "CopyActivityTemplate"
}
],
"start": "2018-05-09T00:00:00Z",
"end": "2018-05-12T00:00:00Z",
"isPaused": false,
"hubName": "testdatafactory-v1_hub",
"pipelineMode": "Scheduled"
}}
You can try ADFv2, it is more easily for debugging, and has a GUI authoring tool.
the UI tool is https://adf.azure.com

Azure Data Factory Copy

I have a Azure Data Factory Copy Activity within a pipeline. The copy activity is working - but the data is copied multiple times. My data source is an Azure NoSQL DB. How do I configure the Copy Activity to Not Recopy a record?
Here is my Activity
{
"name": "Copy Usage Session Data",
"properties":
{
"description": "",
"activities":
[
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "InstallationSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "machineKey: machineKey, product: product, softwareVersion: softwareVersion, id: DocumentDBId"
}
},
"inputs": [{"name": "Machine Registration Input Data"}],
"outputs": [{"name": "Machine Registration Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Machine Registration Data To History",
"description": "Copy Machine Registration Data To SQL Server DB Activity"
},
{
"type": "Copy",
"typeProperties":
{
"source": {"type": "DocumentDbCollectionSource"},
"sink":
{
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "05:00:00",
"sliceIdentifierColumnName": "UsageSessionSliceIdentifier"
},
"translator":
{
"type": "TabularTranslator",
"ColumnMappings": "id: usageSessionId, usageInstallationId: usageInstallationId, startTime: startTime, stopTime: stopTime, currentVersion: currentVersion"
}
},
"inputs": [{"name": "Usage Session Input Data"}],
"outputs": [{"name": "Usage Session Output Data"}],
"policy":
{
"timeout": "01:00:00",
"concurrency": 2,
"executionPriorityOrder": "OldestFirst"
},
"scheduler":
{
"frequency": "Hour",
"interval": 1
},
"name": "Usage Session Data To History",
"description": "Copy Usage Session Data To SQL Server DB Activity"
}
],
"start": "2017-05-29T16:15:00Z",
"end": "2500-01-01T00:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Change pipeline start date as current date. If the pipeline start date is in the past then many data slices are created from that date to the current date and they will be copied. Also, you have set Concurrency : 2. This means 2 activities will run at a time.
e.g If your output dataset availability is 1 day and your pipeline start date is 29 - 05 -2017 then until today 16-06-2017 total 18 data slices will be created for each day. If you set the concurrency as 2 then 2 copy activities are run at a time. If Concurrency : 10 then 10 copy activities are run parallel.
Be careful with Output Dataset availability, Pipeline Start Date, Concurrency and Source Query.
example of a source query is $$Text.Format('select * from c where c.ModifiedDate >= \'{0:yyyy-MM-ddTHH:mm:ssZ}\' AND c.ModifiedDate < \'{1:yyyy-MM-ddTHH:mm:ssZ}\'', WindowStart, WindowEnd) Where ModifiedDate is a column which tell the time of document created in that particular collection.
Updated :
{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [
{
"name": "PersonDocumentDbTable"
}
],
"outputs": [
{
"name": "PersonBlobTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromDocDbToBlob"
}
],
"start": "2015-04-01T00:00:00Z",
"end": "2015-04-02T00:00:00Z"
}
}
Have a look at Data Factory scheduling and execution
For your Reference
You can use the query with created/modified date (it should exists in your table) and only pick the records for current date. This will be provided by slice start or end date and that way you can read only newly created records on daily basis.

Can't detect data source location. Please specify ExecutionLocation in CopyActivity

I was trying to use the Azure data factory to copy data from azure sql database to Azure search index.
I have created the pipeline, dataset and linked services correctly.
I am getting following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?
The message means, in the copy activity within the pipeline, use the executionLocation property within typeProperties (peer of source and sink) to specify the region, like "executionLocation": "East US". See docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement

Azure data factory copy activity performance tuning

https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse. According this link with 1000 DWU and polybase I should get 200MBps throughput. But I am getting 4.66 MBps. I have added user in xlargerc resource class to achieve best possible throughput from azure sql datawarehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset :
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
is anything is missing in the configuration?
I took a look on runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your Copy Activity. From our service logs, the source file is not large enough (270 MB in your case), so that the service call latency would make the throughput not good enough. You could try loading bigger files to have better throughput.

Resources