I am in the process of integrating an existing Azure Data Factory project into my solution. While reviewing the Data Factory pipelines, I saw that all of them use SqlSource as the source and AzureQueueSink as the sink.
The input datasets are
1. on-prem table
2. The output of a stored procedure
The output is an azure sql table.
Now I am confused about when to use AzureQueueSink. I searched on Google but did not find any information about its use case.
Below is the sample pipeline activity.
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "OnPremToAzureList",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.TruncateStgTable",
"storedProcedureParameters": { "TableName": "[dbo].[List]" }
},
"inputs": [
{
"name": "AzureSqlTableStart"
}
],
"outputs": [
{
"name": "AzureSqlTableTruncate"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "SPTruncateStgTable"
},
{
"name": "CopyActivityList",
"type": "Copy",
"inputs": [
{
"name": "OnPremList"
},
{
"name": "AzureSqlTableTruncate"
}
],
"outputs": [
{
"name": "AzureSqlTableList"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from dbo.List"
},
"sink": {
"type": "AzureQueueSink",
"writeBatchSize": 1000,
"writeBatchTimeout": "00:30:00"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 1,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
]
}
}
Any help is greatly appreciated.
Please do not use AzureQueueSink. Copy into Azure Queue has not been shipped, and we don't have any plan to bring it back. It leaked into our SDK/schema by mistake :)
This sink type currently gives you nondeterministic behavior that happens to work, but that behavior will not last long.
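To find where an existing factory still relies on this sink, a small script can scan the pipeline JSON. The following is a minimal Python sketch, not an official tool: the function name `find_unsupported_sinks` is made up for illustration, and the sample pipeline is a trimmed-down copy of the one in the question.

```python
import json

UNSUPPORTED_SINK = "AzureQueueSink"  # sink type the answer above says not to use


def find_unsupported_sinks(pipeline: dict) -> list[str]:
    """Return the names of Copy activities whose sink type is AzureQueueSink."""
    names = []
    for activity in pipeline.get("properties", {}).get("activities", []):
        if activity.get("type") != "Copy":
            continue  # only Copy activities have a source/sink pair
        sink = activity.get("typeProperties", {}).get("sink", {})
        if sink.get("type") == UNSUPPORTED_SINK:
            names.append(activity.get("name", "<unnamed>"))
    return names


# Trimmed-down version of the pipeline from the question.
pipeline = json.loads("""
{
  "name": "OnPremToAzureList",
  "properties": {
    "activities": [
      {"name": "SPTruncateStgTable", "type": "SqlServerStoredProcedure"},
      {
        "name": "CopyActivityList",
        "type": "Copy",
        "typeProperties": {
          "source": {"type": "SqlSource"},
          "sink": {"type": "AzureQueueSink", "writeBatchSize": 1000}
        }
      }
    ]
  }
}
""")

print(find_unsupported_sinks(pipeline))
```

Since the output dataset in the question is an Azure SQL table, SqlSink is the likely replacement for the flagged activities, but that is an assumption to verify against your own datasets.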
I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:
{
"resourceType": "Bundle",
"id": "som-id",
"type": "searchset",
"link": [
{
"relation": "next",
"url": "https://fhirserver/?ct=token"
},
{
"relation": "self",
"url": "https://fhirserver/"
}
],
"entry": [
{
"fullUrl": "https://fhirserver/Organization/1234",
"resource": {
"resourceType": "Organization",
"id": "1234",
// More fields
}
},
{
"fullUrl": "https://fhirserver/Organization/456",
"resource": {
"resourceType": "Organization",
"id": "456",
// More fields
}
},
// More resources
]
}
Basically a bundle of resources. I would like to transform that into a newline delimited (aka ndjson) file where each line is just the json for a resource:
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
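To make the target format concrete, here is a minimal Python sketch of the transformation, independent of Data Factory. The function name `bundle_to_ndjson` and the sample bundle are illustrative only; the bundle is a shortened version of the one shown above.

```python
import json


def bundle_to_ndjson(bundle: dict) -> str:
    """Serialize each entry's resource as one compact JSON line."""
    lines = [
        json.dumps(entry["resource"], separators=(",", ":"))
        for entry in bundle.get("entry", [])
        if "resource" in entry  # skip entries that carry no resource
    ]
    return "\n".join(lines) + ("\n" if lines else "")


bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"fullUrl": "https://fhirserver/Organization/1234",
         "resource": {"resourceType": "Organization", "id": "1234"}},
        {"fullUrl": "https://fhirserver/Organization/456",
         "resource": {"resourceType": "Organization", "id": "456"}},
    ],
}

print(bundle_to_ndjson(bundle), end="")
```

Each output line is a complete JSON document, which is what makes the file safe to split and process line by line.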
I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:
{
"name": "AzureBlob1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": "myout.json",
"folderPath": "outfhirfromadf"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
And configure a copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"resource": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
But in the end (in spite of configuring the schema mapping), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.
Any suggestions would be much appreciated.
So I sort of found a solution. If I first do the original copy, where the bundles are simply dumped into the JSON file, and then do another copy from that JSON file into another blob that I pretend is a text file, I can get the ndjson file created.
Basically, define another blob dataset:
{
"name": "AzureBlob2",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"structure": [
{
"name": "Prop_0",
"type": "String"
}
],
"typeProperties": {
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "",
"quoteChar": "",
"nullValue": "\\N",
"encodingName": null,
"treatEmptyAsNull": true,
"skipLineCount": 0,
"firstRowAsHeader": false
},
"fileName": "myout.json",
"folderPath": "adfjsonout2"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Note that this one uses TextFormat, and also note that the quoteChar is blank. If I then add another Copy Activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['resource']": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
},
{
"name": "Copy Data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Copy Data1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"columnMappings": {
"resource": "Prop_0"
}
}
},
"inputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob2",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
I would still love to hear about it if somebody has a one-step solution.
As briefly discussed in the comment, the Copy Activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy activity does the following operations:
1. Reads data from a source data store.
2. Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
3. Writes data to the sink/destination data store.
It does not look like the Copy Activity does anything else aside from efficiently copying data around.
What I found to work was using Databricks.
Here are the steps:
Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala or Python; .NET support was recently announced).
The script would do the following:
Read the data from the Blob storage;
Filter out & transform the data as needed;
Write the data back to a Blob storage;
You can test your script from there and, once ready, you can go back to your pipeline and create a Notebook activity that will point to your notebook containing the script.
I struggled coding in Scala but it was worth it :)
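The script steps above (read, filter and transform, write) could look roughly like the following. This is a plain-Python sketch rather than the Scala notebook described: in-memory text streams stand in for the input and output blobs, the input is assumed to hold one bundle per line, and the names `resources_from_bundles` and `convert` are hypothetical. In a real notebook the reads and writes would go through Spark or the storage SDK instead.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def resources_from_bundles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield every resource found in a stream of one-bundle-per-line JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between bundles
        bundle = json.loads(line)
        for entry in bundle.get("entry", []):
            if "resource" in entry:
                yield entry["resource"]


def convert(source: TextIO, sink: TextIO, resource_type: str) -> int:
    """Read bundles, keep one resource type, write ndjson; return the row count."""
    written = 0
    for resource in resources_from_bundles(source):
        if resource.get("resourceType") != resource_type:
            continue  # filter step: drop everything except the wanted type
        sink.write(json.dumps(resource) + "\n")
        written += 1
    return written


# In-memory stand-ins for the input and output blobs (assumption for the demo).
source = io.StringIO(
    '{"resourceType": "Bundle", "entry": [{"resource": {"resourceType": "Organization", "id": "1234"}}]}\n'
    '{"resourceType": "Bundle", "entry": [{"resource": {"resourceType": "Patient", "id": "p1"}}, '
    '{"resource": {"resourceType": "Organization", "id": "456"}}]}\n'
)
sink = io.StringIO()
count = convert(source, sink, "Organization")
print(count)
print(sink.getvalue(), end="")
```

Working on streams keeps memory use flat even when the intermediate blob holds many bundles.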
For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you must have a storage account linked to your FHIR server.
https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
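As a rough illustration of the kick-off request described in that specification, the Python sketch below builds (but does not send) a system-level $export request. The base URL `https://fhirserver` is the placeholder from the question, the bearer token is an assumption since authentication depends on your server, and `build_export_request` is a made-up helper name.

```python
import urllib.request


def build_export_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the system-level bulk export kick-off request without sending it."""
    url = base_url.rstrip("/") + "/$export"
    return urllib.request.Request(
        url,
        method="GET",
        headers={
            # Headers the Bulk Data kick-off request calls for.
            "Accept": "application/fhir+json",
            "Prefer": "respond-async",
            # Assumption: the server expects an OAuth bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_export_request("https://fhirserver", "<access-token>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would start the export. Per the specification, a successful kick-off returns 202 Accepted with a Content-Location header to poll for status.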
I've created an ADF pipeline with two linked activities: the first runs a stored procedure, and the second (Copy Data) copies data from an Azure DW to an Azure SQL DB table. To link the two, I have put the output dataset of the stored procedure as an input of the Copy Data activity, even though that dataset is not used (there's a separate dataset for that).
Now the issue is when I get into "Monitor and Manage" and run the 2nd activity with "Rerun with upstream in Pipeline" the 1st stored procedure activity runs quickly and then the 2nd activity waits for about 5 mins before changing to In Progress. Why is this happening? Is it due to some time slicing issue? The Pipeline code is as below:
{
"name": "RunADLAProc",
"properties": {
"description": "This will run the procedure for ADLA",
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "dbo.BackUpDatabaseLog",
"storedProcedureParameters": {}
},
"outputs": [
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "StoredProcedureActivityTemplate"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "select * from dbo.DatabaseLog"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60.00:00:00"
}
},
"inputs": [
{
"name": "AzureSqlDWInput"
},
{
"name": "AzureSQLDatasetOutputforProc"
}
],
"outputs": [
{
"name": "AzureSQLDatasetOutput"
}
],
"policy": {
"timeout": "7.00:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "CopyActivityTemplate"
}
],
"start": "2018-05-09T00:00:00Z",
"end": "2018-05-12T00:00:00Z",
"isPaused": false,
"hubName": "testdatafactory-v1_hub",
"pipelineMode": "Scheduled"
}}
You can try ADFv2; it is easier to debug and has a GUI authoring tool.
The UI tool is https://adf.azure.com
I was trying to use Azure Data Factory to copy data from an Azure SQL database to an Azure Search index.
I have created the pipeline, datasets and linked services correctly.
I am getting the following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?
The message means that, in the copy activity within the pipeline, you need to set the executionLocation property within typeProperties (a peer of source and sink) to specify the region, like "executionLocation": "East US". See docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement
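To show exactly where the property lands, here is a small Python sketch that patches a pipeline definition loaded as a dict. The helper name `with_execution_location` is hypothetical, and the sample pipeline is a cut-down version of the one in the question; "East US" is just the example region from the answer.

```python
import copy
import json


def with_execution_location(pipeline: dict, region: str) -> dict:
    """Return a copy of the pipeline with executionLocation set on every Copy activity."""
    patched = copy.deepcopy(pipeline)  # leave the caller's definition untouched
    for activity in patched["properties"]["activities"]:
        if activity.get("type") == "Copy":
            # executionLocation sits next to source and sink inside typeProperties.
            activity.setdefault("typeProperties", {})["executionLocation"] = region
    return patched


pipeline = {
    "name": "Copy_notifyDB",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "SqlSource"},
                    "sink": {"type": "AzureSearchIndexSink"},
                },
            }
        ]
    },
}

patched = with_execution_location(pipeline, "East US")
print(json.dumps(patched["properties"]["activities"][0]["typeProperties"], indent=2))
```

Pick the region closest to your data stores; the linked documentation lists the supported values.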
Can anyone please tell me how to create a pipeline with two copy activities?
Copy activity1 is for InputDataset1 and OutputDataset1
Copy activity2 is for InputDataset2 and OutputDataset2
The pipeline should be scheduled to run both activities at the same time.
You simply need to include 2 copy activities in the same pipeline.
Like this:
{
"name": "Copy2Things",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset1"
}
],
"outputs": [
{
"name": "OutputDataset1"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1,
"style": "EndOfInterval"
},
"name": "activity1"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": false
}
},
"inputs": [
{
"name": "InputDataset2"
}
],
"outputs": [
{
"name": "OutputDataset2"
}
],
"policy": {
//etc...
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "activity2"
}
//etc...
In terms of them running at the same time, ADF will deal with that for you. Or put them in separate pipelines if you want to control each with start/stop/pause options.
Otherwise you can increase the activity concurrency value in the policy block if you want to copy multiple datasets within the scope of the defined time slices at the same time.
Example: InputDataset1, monthly slices, Jan, Feb, Mar, Apr. A concurrency of 2 will copy Jan & Feb, then Mar & Apr in parallel.
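The grouping in that example can be sketched in a few lines of Python. This is only a simplified model of the description above, not how the ADF scheduler is implemented (the real service may well start the next slice as soon as a slot frees up); `slice_batches` is a made-up name.

```python
def slice_batches(slices: list[str], concurrency: int) -> list[list[str]]:
    """Group time slices into batches of at most `concurrency` slices each."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [slices[i:i + concurrency] for i in range(0, len(slices), concurrency)]


# Monthly slices from the example, copied two at a time.
print(slice_batches(["Jan", "Feb", "Mar", "Apr"], 2))
```

With a concurrency of 1 every slice ends up in its own batch, which corresponds to the sequential default.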
Hope this helps.
Error:
Copy activity met storage operation failure at 'Source' side. Error message from storage execution : Requested value 'AppendBlob' was not found..
I was originally trying to copy a blob to on-prem SQL and I received the above error. Just for testing, I tried to copy blob to blob and I still received the same error.
I can see the blob in my container, but it seems like the Azure Data Factory copy activity cannot access it. Do append blobs work with the Azure Data Factory copy activity? Has anyone else run into this issue? Any ideas on how to resolve it?
Thanks.
Azure Data Factory JSON definition files:
InputBlob:
{
"name": "InputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "ContractHeader.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
OutputBlob:
{
"name": "OutputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "Sample.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "PipelineBlobToBlob",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputBlobTest"
}
],
"outputs": [
{
"name": "OutputBlobTest"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": ""
}
],
"start": "2015-07-12T00:00:00Z",
"end": "2015-07-12T01:00:00Z",
"isPaused": false
}
}
I received confirmation that append blobs are currently not supported by the Azure Data Factory copy activity.