Azure Data Factory Pipeline + ML - azure

I am trying to do a pipeline in Azure Data factory V1 which will do an Azure Batch Execution on a file. I implemented it using a blob storage as input and output and it worked. However, I am not trying to change the input and output to a folder in my data lake store. When I try to deploy it, it gives me the following error:
Entity provisioning failed: AzureML Activity 'MLActivity' specifies 'DatalakeInput' in a property that requires an Azure Blob Dataset reference.
How can I have the input and output as a datalakestore instead of a blob?
Pipeline:
{
"name": "MLPipeline",
"properties": {
"description": "use AzureML model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "DatalakeInput",
"webServiceOutputs": {
"output1": "DatalakeOutput"
},
"webServiceInputs": {},
"globalParameters": {}
},
"inputs": [
{
"name": "DatalakeInput"
}
],
"outputs": [
{
"name": "DatalakeOutput"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "description",
"linkedServiceName": "MyAzureMLLinkedService"
}
],
"start": "2016-02-08T00:00:00Z",
"end": "2016-02-08T00:00:00Z",
"isPaused": false,
"hubName": "hubname",
"pipelineMode": "Scheduled"
}
}
Output dataset:
{
"name": "DatalakeOutput",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "AzureDataLakeStoreLinkedService",
"typeProperties": {
"folderPath": "/DATA_MANAGEMENT/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Input dataset:
{
"name": "DatalakeInput",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "AzureDataLakeStoreLinkedService",
"typeProperties": {
"fileName": "data.csv",
"folderPath": "/RAW/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
AzureDatalakeStoreLinkedService:
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"description": "",
"hubName": "xyzdatafactoryv1_hub",
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://xyzdatastore.azuredatalakestore.net/webhdfs/v1",
"authorization": "**********",
"sessionId": "**********",
"subscriptionId": "*****",
"resourceGroupName": "xyzresourcegroup"
}
}
}
The linked service was done following this tutorial based on data factory V1.

I assume there is some issue with AzureDataLakeStoreLinkedService. Please verify.
Depending on the authentication used for access data store, your AzureDataLakeStoreLinkedService json must look like below -
Using service principal authentication
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Using managed service identity authentication
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
This is Microsoft Document for reference - Copy data to or from Azure Data Lake Store by using Azure Data Factory

Related

Azure Data Factory v2 fails at sending data to Azure Search

I'm trying to transfer some data to Azure Search, but for some reason it fails with Invalid linked service reference. Name: AzureSearch1
I have set up an Azure Search Linked Service like this:
{
"name": "AzureSearch1",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://xxxxxx.search.windows.net",
"key": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault",
"type": "LinkedServiceReference"
},
"secretName": "Search-AdminKey"
}
},
"connectVia": {
"referenceName": "integrationRuntime1",
"type": "IntegrationRuntimeReference"
}
}
}
"Test connection" works fine.
Now, I'm trying to create an Azure Search Indexer like this:
{
"name": "AzureSearchIndex_PriceSheet",
"properties": {
"linkedServiceName": {
"referenceName": "AzureSearch1",
"type": "LinkedServiceReference"
},
"type": "AzureSearchIndex",
"typeProperties": {
"indexName": "pricesheet"
}
}
}
but it fails when I click "Preview Data" or "Import Schema" with this error:
Invalid linked service reference. Name: AzureSearch1. Activity ID:2fa29fe9-ca5d-4308-af62-xxxxxxxxx
The integration pipeline is set to "West Europe" and Azure Search in provisioned in that region too.
Any thoughts?
Thanks!
I tried to reproduce your issue but failed.Please refer to my working configuration:
My Azure Search Linked Service:
{
"name": "AzureSearch1",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://***.search.windows.net",
"key": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "AzureKeyVault1",
"type": "LinkedServiceReference"
},
"secretName": "testas"
}
}
},
"type": "Microsoft.DataFactory/factories/linkedservices"
}
My Azure Search Indexer:
{
"name": "AzureSearchIndex1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureSearch1",
"type": "LinkedServiceReference"
},
"folder": {
"name": "azureSearch"
},
"type": "AzureSearchIndex",
"typeProperties": {
"indexName": "documentdb-index"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Perview Data:
I presumed it may because the secret which is stored in AKV has expired,then the link lost connection.I suggest you re-creating secret in AKV(just set default configuration) and try again.
Hope it helps you.Any concern,let me know.

Why are my Azure Data Factory Stored Procedures Timing Out after 30 Minutes?

Why are my Azure Data Factory Stored Procedures Timing Out after 30 Minutes?
Here is the Azure SQL Linked Service definition:
{
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:db1.database.windows.net,1433;Database=myDb;User ID=User;Password=***;Trusted_Connection=False;Encrypt=True;Connection Timeout=120"
}
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "shir-01"
}
},
"name": "LinkedService"
}
And, here is the Pipeline definition:
{
"name": BulkLoad11_Pipeline",
"properties": {
"activities": [
{
"name": "BulkLoad",
"type": "SqlServerStoredProcedure",
"policy": {
"timeout": {
"value": "1.00:00:00",
"type": "Expression"
},
"retry": 0,
"retryIntervalInSeconds": 30
},
"typeProperties": {
"storedProcedureName": "[dbo].[BulkLoad]",
"storedProcedureParameters": {
"FileName": {
"value": "FILE11",
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "LinkedService",
"type": "LinkedServiceReference"
}
}
]
}
}
When I trigger the Pipeline, it fails at 30 minutes with a TimeOut error:
Activity BulkLoad failed: Execution Timeout Expired. The timeout period
elapsed prior to completion of the operation or the server is not responding.

Can't detect data source location. Please specify ExecutionLocation in CopyActivity

I was trying to use the Azure data factory to copy data from azure sql database to Azure search index.
I have created the pipeline, dataset and linked services correctly.
I am getting following error message after pipeline/activity execution:
Can't detect data source location. Please specify ExecutionLocation in CopyActivity.
Input Dataset
{
"name": "Input-notifyDB",
"properties": {
"structure": [
{
"name": "topicid",
"type": "String"
},
{
"name": "createdby",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-notifyDB",
"typeProperties": {},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": true,
"policy": {}
}
}
Output Dataset:
{
"name": "OutputD-notifyDB",
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "sender",
"type": "String"
}
],
"published": false,
"type": "AzureSearchIndex",
"linkedServiceName": "Destination-notifyDB",
"typeProperties": {
"indexName": "test"
},
"availability": {
"frequency": "Minute",
"interval": 15
},
"external": false,
"policy": {}
}
}
Pipeline:
{
"name": "Copy_notifyDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select convert(varchar(25), topicid) topicid, createdby from [vMessageDetail]"
},
"sink": {
"type": "AzureSearchIndexSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "topicid:id,createdby:sender"
},
"parallelCopies": 8
},
"inputs": [
{
"name": "Input-notifyDB"
}
],
"outputs": [
{
"name": "OutputD-notifyDB"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "Activity-0-_Custom query_->test"
}
],
"start": "2017-05-22T10:17:00Z",
"end": "2017-05-23T18:30:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
Any idea how to resolve this error?
The message means, in the copy activity within the pipeline, use the executionLocation property within typeProperties (peer of source and sink) to specify the region, like "executionLocation": "East US". See docs here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities#a-nameglobalaglobally-available-data-movement

Azure data factory copy activity performance tuning

https://learn.microsoft.com/en-us/azure/data-factory/data-factory-load-sql-data-warehouse. According this link with 1000 DWU and polybase I should get 200MBps throughput. But I am getting 4.66 MBps. I have added user in xlargerc resource class to achieve best possible throughput from azure sql datawarehouse.
Below is the Pipeline JSON.
{
"name": "UCBPipeline-Copy",
"properties": {
"description": "pipeline with copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"cloudDataMovementUnits": 4
},
"inputs": [
{
"name": "USBBlob_Concept
}
],
"outputs": [
{
"name": "AzureDW_Concept"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureBlobtoSQLDW_Concept",
"description": "Copy Activity"
}
],
"start": "2017-02-28T18:00:00Z",
"end": "2017-03-01T19:00:00Z",
"isPaused": false,
"hubName": "sampledf1_hub",
"pipelineMode": "Scheduled"
}
}
Input dataset :
{
"name": "AzureBlob_Concept",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureZRSStorageLinkedService",
"typeProperties": {
"fileName": "conceptTab.txt",
"folderPath": "source/",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
output dataset:
{
"name": "AzureDW_Concept",
"properties": {
"published": false,
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "concept"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
is anything is missing in the configuration?
I took a look on runId "e98ac557-a507-4a6e-8833-978eff1723c3", which should belong to your Copy Activity. From our service logs, the source file is not large enough (270 MB in your case), so that the service call latency would make the throughput not good enough. You could try loading bigger files to have better throughput.

Data Factory Copy Activity met storage failure at source side - AppendBlob was not found

Error:
Copy activity met storage operation failure at 'Source' side. Error message from storage execution : Requested value 'AppendBlob' was not found..
I was originally trying to copy a blob to on prem sql and I received the above error. Just for testing I am trying to copy blob to blob and I still received the same error.
I can see the blob in my container. But it seems like azure data factory copy activity cannot access it. Do AppendBlobs work with azure data factory copy activity? Any one else run into this issue, any ideas on how to resolve?
Thanks.
Azure Data Factory JSON definition files:
InputBlob:
{
"name": "InputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "ContractHeader.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
OuputBlob:
{
"name": "OutputBlobTest",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "Sample.csv",
"folderPath": "testcontainer/",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
Pipeline:
{
"name": "PipelineBlobToBlob",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputBlobTest"
}
],
"outputs": [
{
"name": "OutputBlobTest"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": ""
}
],
"start": "2015-07-12T00:00:00Z",
"end": "2015-07-12T01:00:00Z",
"isPaused": false
}
Received confirmation, Append blobs are currently not supported with azure data factory copy activity.

Resources