I'd like to execute a PySpark job with dependencies (either egg or zip files) using Data Factory V2.
When running the command directly on the cluster's head node (HDInsight) as a spark-submit call, it looks like this (and works):
spark-submit --py-files 0.3-py3.6.egg main.py 1
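For reference, a minimal sketch of what such a main.py entry point might look like; the module name `dimension` inside the egg is a hypothetical assumption, and only the spark-submit line above is from my actual setup:

```python
import sys

def parse_args(argv):
    """Return the integer passed as the first CLI argument,
    e.g. the trailing `1` in `spark-submit ... main.py 1`."""
    if len(argv) < 2:
        raise ValueError("expected one argument, e.g. `main.py 1`")
    return int(argv[1])

def main(argv):
    run_id = parse_args(argv)
    # Modules shipped via --py-files (or spark.submit.pyFiles) become
    # importable on the driver and executors once the egg is distributed:
    #   from dimension import run   # `dimension` is a hypothetical module
    #   run(run_id)
    return run_id
```

spark-submit invokes the script with `sys.argv`, so `main(sys.argv)` would be called from an `if __name__ == "__main__":` guard.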
In Data Factory (V2) I tried to define the following:
{
  "name": "dimension",
  "properties": {
    "activities": [
      {
        "name": "Spark1",
        "type": "HDInsightSpark",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "typeProperties": {
          "rootPath": "adfspark",
          "entryFilePath": "main.py",
          "getDebugInfo": "Always",
          "sparkConfig": {
            "spark.submit.pyFiles": "0.3-py3.6.egg"
          },
          "sparkJobLinkedService": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
          }
        },
        "linkedServiceName": {
          "referenceName": "hdinsightlinkedService",
          "type": "LinkedServiceReference"
        }
      }
    ]
  }
}
I tried specifying the exact path of the dependencies ("wasb://.../0.3-py3.6.egg" or "adfspark/pyFiles/0.3-py3.6.egg"), as suggested in this thread:
How to setup custom Spark parameter in HDInsights cluster with Data Factory
All of this is in the context where "adfspark" is the container and the dependencies are located in the "pyFiles" folder, much as suggested in the Azure documentation:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-transform-data-spark-powershell
Running the job just on the head node would be a sufficient start, although distributed execution is the real target here.
Related
I'm trying to reproduce the architecture from the following GitHub repo: https://github.com/Azure/cortana-intelligence-price-optimization
The problem is the ADF part, since the guide uses the old version of ADF: I don't know how to map the "input" and "output" properties of a single activity in ADF v2 so that they point to a dataset.
The pipeline runs a Spark activity that does nothing more than execute a Python script, which I think should then write data into the dataset I already defined.
Here is the JSON of the ADF V1 pipeline from the guide, which I cannot replicate:
"activities": [
  {
    "type": "HDInsightSpark",
    "typeProperties": {
      "rootPath": "adflibs",
      "entryFilePath": "Sales_Data_Aggregation_2.0_blob.py",
      "arguments": [ "modelsample" ],
      "getDebugInfo": "Always"
    },
    "outputs": [
      { "name": "BlobStoreAggOutput" }
    ],
    "policy": {
      "timeout": "00:30:00",
      "concurrency": 1,
      "retry": 1
    },
    "scheduler": {
      "frequency": "Hour",
      "interval": 1
    },
    "name": "AggDataSparkJob",
    "description": "Submits a Spark Job",
    "linkedServiceName": "HDInsightLinkedService"
  },
The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities. When you use an on-demand Spark linked service, Data Factory automatically creates a Spark cluster for you just-in-time to process the data and then deletes the cluster once the processing is complete.
Upload "Sales_Data_Aggregation_2.0_blob.py" to the storage account attached to the HDInsight cluster, then modify the sample definition of a Spark activity, create a schedule trigger, and run the code:
Here is the sample JSON definition of a Spark activity:
{
  "name": "Spark Activity",
  "description": "Description",
  "type": "HDInsightSpark",
  "linkedServiceName": {
    "referenceName": "MyHDInsightLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "sparkJobLinkedService": {
      "referenceName": "MyAzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "rootPath": "adfspark",
    "entryFilePath": "test.py",
    "sparkConfig": {
      "ConfigItem1": "Value"
    },
    "getDebugInfo": "Failure",
    "arguments": [
      "SampleHadoopJobArgument1"
    ]
  }
}
Hope this helps.
I'm using data factory with blob storage.
I sometimes get the error below, intermittently; it can occur on different pipelines/data sources. However, I always get the same error regardless of which task fails: 400 The specified block list is invalid.
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorBlobUploadFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload blob 'https://blob.core.windows.net/', detailed message: The remote server returned an error: (400) Bad Request.,Source=,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.
Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage
This seems to be most common when more than one task is writing data to the storage account at a time. Is there anything I can do to make this process more reliable? Is it possible something has been misconfigured? It's causing slices to fail in Data Factory, so I'd really love to know what I should be investigating.
A sample pipeline that has suffered from this issue:
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
  "name": "Pipeline",
  "properties": {
    "description": "Pipeline to copy Processed CSV from Data Lake to blob storage",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "AzureDataLakeStoreSource"
          },
          "sink": {
            "type": "BlobSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [ { "name": "DataLake" } ],
        "outputs": [ { "name": "Blob" } ],
        "policy": {
          "concurrency": 10,
          "executionPriorityOrder": "OldestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "CopyActivity"
      }
    ],
    "start": "2016-02-28",
    "end": "2016-02-29",
    "isPaused": false,
    "pipelineMode": "Scheduled"
  }
}
I'm only using LRS standard storage, but I still wouldn't expect it to throw errors intermittently.
EDIT: adding linked service json
{
  "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
  "name": "Ls-Staging-Storage",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=;AccountKey="
    }
  }
}
Such errors are mostly caused by race conditions, e.g. multiple concurrent activity runs writing to the same blob file.
Could you check your pipeline settings to see whether that is the case (the copy activity above runs with "concurrency": 10)? If so, please avoid that configuration.
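For intuition, here is a toy simulation (plain Python, not the Azure SDK) of why two concurrent writers to one blob can trigger this error: block blobs are uploaded as staged blocks followed by a final Put Block List, and committing a block list discards the remaining uncommitted blocks, so the slower writer's commit then references block IDs the service no longer holds.

```python
class ToyBlockBlob:
    """Toy model of an Azure block blob's commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # staged blocks awaiting commit
        self.committed = []

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        if any(b not in self.uncommitted for b in block_ids):
            raise ValueError("(400) The specified block list is invalid.")
        self.committed = [self.uncommitted[b] for b in block_ids]
        self.uncommitted = {}   # commit discards leftover staged blocks

blob = ToyBlockBlob()
blob.put_block("w1-0001", b"writer 1, chunk 1")  # slice A stages a block
blob.put_block("w2-0001", b"writer 2, chunk 1")  # slice B stages a block
blob.put_block_list(["w2-0001"])                 # slice B commits first
try:
    blob.put_block_list(["w1-0001"])             # slice A's commit now fails
except ValueError as err:
    print(err)
```

This is why serializing the writers, or pointing each at a distinct blob path, makes the failures stop.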
I am trying to make a periodic copy, on a daily basis, of all the data returned from an OData query into a DocumentDB collection.
The copy works fine using the Copy Wizard, which is A REALLY GREAT option for simple tasks. Thanks for that.
What isn't working for me, though: the copy just appends data each time, and I can see no way with a DocumentDB sink to "pre-delete" the data in the collection (compare the SQL sink, which has sqlWriterCleanupScript, which I could set to something like DELETE FROM table).
I know I can create an Azure Batch custom activity and do what I need, but at this point I'm not sure it isn't better to write a Function and forgo Azure Data Factory (ADF) for this move. I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point I'd like to just use DocumentDB, but I don't see a way to do it given the way my data works.
Here's a look at my pipeline:
{
  "name": "R-------ProjectToDocDB",
  "properties": {
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "RelationalSource",
            "query": " "
          },
          "sink": {
            "type": "DocumentDbCollectionSink",
            "nestingSeparator": ".",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
            /// this is where a cleanup script would be great.
          },
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": "ProjectId:ProjectId,.....:CostClassification"
          }
        },
        "inputs": [
          { "name": "InputDataset-shc" }
        ],
        "outputs": [
          { "name": "OutputDataset-shc" }
        ],
        "policy": {
          "timeout": "1.00:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "style": "StartOfInterval",
          "retry": 3,
          "longRetry": 0,
          "longRetryInterval": "00:00:00"
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        },
        "name": "Activity-0-_Custom query_->---Project"
      }
    ],
    "start": "2017-04-26T20:13:27.683Z",
    "end": "2099-12-31T05:00:00Z",
    "isPaused": false,
    "hubName": "r-----datafactory01_hub",
    "pipelineMode": "Scheduled"
  }
}
Perhaps there's an update to the pipeline coming that creates parity between the SQL output and DocumentDB?
Azure Data Factory does not support a cleanup script for DocumentDB today. It's in our backlog. If you could describe the end-to-end scenario in a bit more detail, it would help us prioritize. For example, why does appending to the same collection not work? Is it because there's no way to identify the incremental records after each run? For the cleanup requirement, will it always be delete *, or might it be based on a timestamp, etc.? Thanks. Until cleanup-script support arrives, a custom activity is the only workaround, sorry.
You could use a Logic App that runs on a Timer Trigger.
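As a sketch of what such a custom activity, Function, or Logic App step could do before each copy runs: select the documents for the slice, then delete them one by one, since DocumentDB has no server-side delete-by-query. The `sliceStart` property name is my assumption (it would have to be stamped onto each document via the copy activity's column mappings), and the client is duck-typed here to match the 2017-era pydocumentdb methods `QueryDocuments`/`DeleteDocument`:

```python
def build_cleanup_query(slice_start, slice_end):
    # Hypothetical convention: every copied document carries a sliceStart
    # property, so a slice can be purged and re-run idempotently.
    return ("SELECT c._self FROM c WHERE c.sliceStart >= '%s' "
            "AND c.sliceStart < '%s'" % (slice_start, slice_end))

def purge_slice(client, collection_link, query):
    """Select the documents matched by `query`, then delete them one by
    one; there is no server-side delete-by-query in DocumentDB."""
    deleted = 0
    for doc in client.QueryDocuments(collection_link, query):
        client.DeleteDocument(doc["_self"])
        deleted += 1
    return deleted
```

A real DocumentDB client object would be passed in as `client`; the loop also serves as the "identify the incremental records" convention the ADF team asked about above.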
I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
"Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE @in string = \"/demo/SearchLog.txt\";\nDECLARE @out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
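The error message itself shows what was submitted: ADF prepends one DECLARE per entry in the activity's `parameters` map, then appends the contents of the `script` property, which here was the literal filename SearchLogProcessing.txt rather than the script text. A sketch of that substitution (my reading of the error output above, not documented ADF internals):

```python
def build_submitted_script(script_text, parameters):
    # One DECLARE per parameter, prepended to the script body; the values
    # come from the activity's `parameters` map in the pipeline JSON.
    decls = "".join('DECLARE @%s string = "%s";\n' % (name, value)
                    for name, value in parameters.items())
    return decls + script_text

submitted = build_submitted_script(
    "SearchLogProcessing.txt",            # the literal `script` value
    {"in": "/demo/SearchLog.txt", "out": "/scripts/Result.txt"})
print(submitted)   # ends with a bare `SearchLogProcessing.txt`, hence
                   # "Final statement did not end with a semicolon"
```

This is why the fix below is to drop the `script` attribute and point `scriptPath`/`scriptLinkedService` at the actual file.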
Here is the code of the output dataset, the pipeline, and the U-SQL script that I am trying to execute in the pipeline.
OutputDataset:
{
  "name": "OutputDataLakeTable",
  "properties": {
    "published": false,
    "type": "AzureDataLakeStore",
    "linkedServiceName": "LinkedServiceDestination",
    "typeProperties": {
      "folderPath": "scripts/"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
Pipeline:
{
  "name": "ComputeEventsByRegionPipeline",
  "properties": {
    "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
    "activities": [
      {
        "type": "DataLakeAnalyticsU-SQL",
        "typeProperties": {
          "script": "SearchLogProcessing.txt",
          "scriptPath": "scripts\\",
          "degreeOfParallelism": 3,
          "priority": 100,
          "parameters": {
            "in": "/demo/SearchLog.txt",
            "out": "/scripts/Result.txt"
          }
        },
        "inputs": [
          { "name": "InputDataLakeTable" }
        ],
        "outputs": [
          { "name": "OutputDataLakeTable" }
        ],
        "policy": {
          "timeout": "06:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 1
        },
        "scheduler": {
          "frequency": "Minute",
          "interval": 15
        },
        "name": "CopybyU-SQL",
        "linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
      }
    ],
    "start": "2017-01-03T12:01:05.53Z",
    "end": "2017-01-03T13:01:05.53Z",
    "isPaused": false,
    "hubName": "denojaidbfactory_hub",
    "pipelineMode": "Scheduled"
  }
}
Here is my U-SQL script, which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Text(delimiter:'|');

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "kota";

OUTPUT @rs1
    TO @out
    USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your activity definition is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage linked service, for example:
{
  "name": "StorageLinkedService",
  "properties": {
    "description": "",
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
    }
  }
}
Create this linked service, replacing the Blob Storage account name myAzureBlobStorageAccount with your relevant account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
  "name": "ComputeEventsByRegionPipeline",
  "properties": {
    "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
    "activities": [
      {
        "type": "DataLakeAnalyticsU-SQL",
        "typeProperties": {
          "scriptPath": "adlascripts\\SearchLogProcessing.txt",
          "scriptLinkedService": "StorageLinkedService",
          "degreeOfParallelism": 3,
          "priority": 100,
          "parameters": {
            "in": "/input/SearchLog.tsv",
            "out": "/output/Result.tsv"
          }
        },
        ...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo and there seem to be some issues, like: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found it by googling, but there should be a link on the webpage. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition and provide the complete path to your script (including filename) in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM @in
        USING Extractors.Text(delimiter:'|');

    @rs1 =
        SELECT Start, Region, Duration
        FROM @searchlog
        WHERE Region == "kota";

    OUTPUT @rs1
        TO @out
        USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there is no need to copy script files to Blob Storage.
When creating an HDInsight on-demand linked service, Data Factory creates a new container for the HDInsight cluster. I would like to know how I can create a table that points to that container. Here is my table definition:
{
  "name": "AzureBlobLocation",
  "properties": {
    "published": false,
    "type": "AzureBlob",
    "linkedServiceName": "AzureBlobLinkedService",
    "typeProperties": {
      "folderPath": "????/Folder1/Folder2/Output/Aggregation/",
      "format": {
        "type": "TextFormat"
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
What should go in place of the '????' I put there? The keyword is not accepted.
I should use the keyword 'container' in order to point to the working container.