I'm trying to reproduce the architecture described in this GitHub repo: https://github.com/Azure/cortana-intelligence-price-optimization
The problem is the part related to ADF, since the guide uses the old version of ADF (v1): I don't know how to map the "input" and "output" properties of a single activity in ADF v2 so that they point to a dataset.
The pipeline runs a Spark activity that does nothing more than execute a Python script, which should then write data into the dataset I have already defined.
Here is the JSON of the ADF v1 pipeline from the guide, which I cannot replicate:
"activities": [
{
"type": "HDInsightSpark",
"typeProperties": {
"rootPath": "adflibs",
"entryFilePath": "Sales_Data_Aggregation_2.0_blob.py",
"arguments": [ "modelsample" ],
"getDebugInfo": "Always"
},
"outputs": [
{
"name": "BlobStoreAggOutput"
}
],
"policy": {
"timeout": "00:30:00",
"concurrency": 1,
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AggDataSparkJob",
"description": "Submits a Spark Job",
"linkedServiceName": "HDInsightLinkedService"
},
The Spark activity in a Data Factory pipeline executes a Spark program on your own or an on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities. When you use an on-demand Spark linked service, Data Factory automatically creates a Spark cluster for you just in time to process the data, and then deletes the cluster once the processing is complete.
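For reference, here is a minimal sketch of an on-demand HDInsight Spark linked service in ADF v2 (the names and placeholder values below are illustrative, not taken from the repo):
{
    "name": "MyOnDemandSparkLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "spark",
            "clusterSize": 2,
            "timeToLive": "00:15:00",
            "hostSubscriptionId": "<subscription id>",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant id>",
            "clusterResourceGroup": "<resource group>",
            "linkedServiceName": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}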
Upload "Sales_Data_Aggregation_2.0_blob.py" to storage account attached to the HDInsight cluster and the modify the sample definition of a spark activity and create a schedule trigger and run the code:
Here is the sample JSON definition of a Spark activity:
{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}
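Note two differences from the v1 JSON above: in ADF v2 the HDInsightSpark activity has no "inputs"/"outputs" arrays and no per-activity "scheduler" block. The Python script itself writes its results to blob storage, and scheduling is handled by a trigger attached to the pipeline. As a rough sketch (the trigger and pipeline names are placeholders, not from the repo), an hourly schedule trigger looks something like this:
{
    "name": "HourlyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2018-06-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "AggDataSparkPipeline"
                },
                "parameters": {}
            }
        ]
    }
}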
Hope this helps.
Related
I'd like to execute a PySpark Job with dependencies (either egg or zip files) using Data Factory V2.
When I run the command directly on the head node of the HDInsight cluster as a spark-submit call, it works and looks like this:
spark-submit --py-files 0.3-py3.6.egg main.py 1
In Data Factory (V2) I tried to define the following:
{
"name": "dimension",
"properties": {
"activities": [{
"name": "Spark1",
"type": "HDInsightSpark",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"rootPath": "adfspark",
"entryFilePath": "main.py",
"getDebugInfo": "Always",
"sparkConfig": {
"spark.submit.pyFiles": "0.3-py3.6.egg"
},
"sparkJobLinkedService": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"linkedServiceName": {
"referenceName": "hdinsightlinkedService",
"type": "LinkedServiceReference"
}
}
]
}
}
I tried specifying the exact path of the dependencies ("wasb://.../0.3-py3.6.egg" or "adfspark/pyFiles/0.3-py3.6.egg"), as suggested in this thread:
How to setup custom Spark parameter in HDInsights cluster with Data Factory
All this is in the context that "adfspark" is the container and the dependencies are located in the "pyFiles" folder, much as suggested in the Azure documentation:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-transform-data-spark-powershell
Running the job just on the head node would be a sufficient start, although distributed execution is the real target here.
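For reference, this is roughly what the fully qualified WASB form of that setting would look like (a sketch only; <yourstorageaccount> is a placeholder, and this thread does not confirm that it resolves the issue):
"sparkConfig": {
    "spark.submit.pyFiles": "wasb://adfspark@<yourstorageaccount>.blob.core.windows.net/pyFiles/0.3-py3.6.egg"
}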
I have a U-SQL script stored in my ADL store and I am trying to execute it. The script file is quite big, about 250 MB.
So far I have a Data Factory, I have created a Linked Service, and I am trying to create a Data Lake Analytics U-SQL activity.
The code for my U-SQL Activity looks like this:
{
"name": "RunUSQLScript1",
"properties": {
"description": "Runs the USQL Script",
"activities": [
{
"name": "DataLakeAnalyticsUSqlActivityTemplate",
"type": "DataLakeAnalyticsU-SQL",
"linkedServiceName": "AzureDataLakeStoreLinkedService",
"typeProperties": {
"scriptPath": "/Output/dynamic.usql",
"scriptLinkedService": "AzureDataLakeStoreLinkedService",
"degreeOfParallelism": 3,
"priority": 1000
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 3,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
],
"start": "2017-05-02T00:00:00Z",
"end": "2017-05-02T00:00:00Z"
}
}
However, I get the following error:
Error
Activity 'DataLakeAnalyticsUSqlActivityTemplate' from pipeline 'RunUSQLScript1' has no output(s) and no schedule. Please add an output dataset or define activity schedule.
What I would like is to have this activity run on-demand, i.e. I do not want it scheduled at all, and I also do not understand what inputs and outputs would be in my case. The U-SQL script I am trying to run operates on millions of files in my ADL storage and saves them back after some modification of the content.
Currently ADF does not support running a U-SQL script stored in ADLS for a U-SQL activity, i.e. the "scriptLinkedService" under "typeProperties" has to be an Azure Blob Storage linked service. We will update the documentation for the U-SQL activity to make this clearer.
Support for running U-SQL scripts stored in ADLS is on our product backlog, but we don't have a committed date for it yet.
Shirley Wang
Currently ADF does not support executing the activity on-demand; it needs to be configured with a schedule. You will need at least one output to drive the scheduled execution of the activity. The output can be a dummy Azure Storage dataset that doesn't actually write any data out; ADF just leverages its availability properties to drive the scheduled execution. For example:
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "dummyoutput.txt",
"folderPath": "adf/output",
"format": {
"type": "TextFormat",
"columnDelimiter": "\t"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
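The U-SQL activity then references this dummy dataset in its outputs array and gets a matching scheduler block, roughly like this (a sketch using the same ADF v1 schema as the pipeline in the question):
"outputs": [
    {
        "name": "OutputDataset"
    }
],
"scheduler": {
    "frequency": "Day",
    "interval": 1
}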
I am trying to make a periodic copy of all the data returned by an OData query into a DocumentDB collection, on a daily basis.
The copy works fine using the copy wizard, which is A REALLY GREAT option for simple tasks. Thanks for that.
What isn't working for me, though: the copy just adds data each time, and I can see no way with a DocumentDB sink to "pre-delete" the data in the collection (compare this to the SQL sink, which has sqlWriterCleanupScript, which I could set to something like Delete * from 'table').
I know I can create an Azure Batch job and do what I need, but at this point I'm not sure it isn't better to write a function and forego Azure Data Factory (ADF) for this move. I'm using ADF for replicating on-prem SQL stuff just fine, because it has the writer cleanup script.
At this point, I'd like to just use DocumentDB, but I don't see a way to do it given the way my data works.
Here's a look at my pipeline:
{
"name": "R-------ProjectToDocDB",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": " "
},
"sink": {
"type": "DocumentDbCollectionSink",
"nestingSeparator": ".",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
/// this is where a cleanup script would be great.
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "ProjectId:ProjectId,.....:CostClassification"
}
},
"inputs": [
{
"name": "InputDataset-shc"
}
],
"outputs": [
{
"name": "OutputDataset-shc"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-_Custom query_->---Project"
}
],
"start": "2017-04-26T20:13:27.683Z",
"end": "2099-12-31T05:00:00Z",
"isPaused": false,
"hubName": "r-----datafactory01_hub",
"pipelineMode": "Scheduled"
}
}
Perhaps there's an update in the pipeline that will create parity between the SQL sink and DocumentDB?
Azure Data Factory does not support a cleanup script for DocumentDB today; it's something in our backlog. If you can describe the end-to-end scenario in a bit more detail, that would help us prioritize. For example, why does appending to the same collection not work? Is that because there's no way to identify the incremental records after each run? For the cleanup requirement, will it always be delete *, or might it be based on a timestamp, etc.? Thanks. Until support for a cleanup script is there, a custom activity is the only workaround for now, sorry.
You could use a Logic App that runs on a Timer Trigger.
I have two datasets, a "FileShare" one (DS1) and a "BlobSource" one (DS2). I define a pipeline with one copy activity, which needs to copy the files from DS1 to DS3 (BlobSource), with a dependency specified on DS2. The activity is specified below:
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileShare"
},
"sink": {
"type": "BlobSource"
}
},
"inputs": [
{
"name": "FoodGroupDescriptionsFileSystem"
},
{
"name": "FoodGroupDescriptionsInputBlob"
}
],
"outputs": [
{
"name": "FoodGroupDescriptionsAzureBlob"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst"
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "FoodGroupDescriptions",
"description": "#1 Bulk Import FoodGroupDescriptions"
}
Here, how can I specify multiple source types (both FileShare and BlobSource)? It throws an error when I try to pass them as a list.
The copy activity doesn't like multiple inputs or outputs. It can only perform a 1-to-1 copy... It won't even change the filename for you in the output dataset, never mind merging files!
This is probably intentional so Microsoft can charge you more for additional activities. But let's not digress into that one.
I suggest having one pipeline copy both files into some sort of Azure storage using separate activities (one per file). Then have a second downstream pipeline with a custom activity that reads and merges/concatenates the files to produce a single output.
Remember that ADF isn't an ETL tool like SSIS. It's just there to invoke other Azure services. Copying is about as complex as it gets.
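A minimal sketch of what the first pipeline's activities array could look like, assuming the usual ADF v1 copy types (FileSystemSource for the file share, BlobSource/BlobSink for blobs) and two hypothetical staging output datasets:
"activities": [
    {
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "FileSystemSource" },
            "sink": { "type": "BlobSink" }
        },
        "inputs": [ { "name": "FoodGroupDescriptionsFileSystem" } ],
        "outputs": [ { "name": "FoodGroupDescriptionsStagingBlob1" } ],
        "name": "CopyFileShareFile"
    },
    {
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "BlobSink" }
        },
        "inputs": [ { "name": "FoodGroupDescriptionsInputBlob" } ],
        "outputs": [ { "name": "FoodGroupDescriptionsStagingBlob2" } ],
        "name": "CopyBlobFile"
    }
]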
I am getting the following error while running a U-SQL activity in a pipeline in ADF:
Error in Activity:
{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC",
"source":"USER","message":"syntax error.
Final statement did not end with a semicolon","details":"at token 'txt', line 3\r\nnear the ###:\r\n**************\r\nDECLARE #in string = \"/demo/SearchLog.txt\";\nDECLARE #out string = \"/scripts/Result.txt\";\nSearchLogProcessing.txt ### \n",
"description":"Invalid syntax found in the script.",
"resolution":"Correct the script syntax, using expected token(s) as a guide.","helpLink":"","filePath":"","lineNumber":3,
"startOffset":109,"endOffset":112}].
Here is the code of the output dataset, the pipeline, and the U-SQL script which I am trying to execute in the pipeline.
OutputDataset:
{
"name": "OutputDataLakeTable",
"properties": {
"published": false,
"type": "AzureDataLakeStore",
"linkedServiceName": "LinkedServiceDestination",
"typeProperties": {
"folderPath": "scripts/"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"script": "SearchLogProcessing.txt",
"scriptPath": "scripts\\",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/demo/SearchLog.txt",
"out": "/scripts/Result.txt"
}
},
"inputs": [
{
"name": "InputDataLakeTable"
}
],
"outputs": [
{
"name": "OutputDataLakeTable"
}
],
"policy": {
"timeout": "06:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Minute",
"interval": 15
},
"name": "CopybyU-SQL",
"linkedServiceName": "AzureDataLakeAnalyticsLinkedService"
}
],
"start": "2017-01-03T12:01:05.53Z",
"end": "2017-01-03T13:01:05.53Z",
"isPaused": false,
"hubName": "denojaidbfactory_hub",
"pipelineMode": "Scheduled"
}
}
Here is my U-SQL script, which I am trying to execute using the "DataLakeAnalyticsU-SQL" activity type.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM @in
    USING Extractors.Text(delimiter:'|');
@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "kota";
OUTPUT @rs1
    TO @out
    USING Outputters.Text(delimiter:'|');
Please suggest how to resolve this issue.
Your script is missing the scriptLinkedService attribute. You also (currently) need to place the U-SQL script in Azure Blob Storage to run it successfully. Therefore you also need an AzureStorage Linked Service, for example:
{
"name": "StorageLinkedService",
"properties": {
"description": "",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myAzureBlobStorageAccount;AccountKey=**********"
}
}
}
Create this linked service, replacing the Blob storage name myAzureBlobStorageAccount with your relevant Blob Storage account, then place the U-SQL script (SearchLogProcessing.txt) in a container there and try again. In my example pipeline below, I have a container called adlascripts in my Blob store and the script is in there:
Make sure the scriptPath is complete, as Alexandre mentioned. Start of the pipeline:
{
"name": "ComputeEventsByRegionPipeline",
"properties": {
"description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
"activities": [
{
"type": "DataLakeAnalyticsU-SQL",
"typeProperties": {
"scriptPath": "adlascripts\\SearchLogProcessing.txt",
"scriptLinkedService": "StorageLinkedService",
"degreeOfParallelism": 3,
"priority": 100,
"parameters": {
"in": "/input/SearchLog.tsv",
"out": "/output/Result.tsv"
}
},
...
The input and output .tsv files can be in the data lake and use the AzureDataLakeStoreLinkedService linked service.
I can see you are trying to follow the demo from https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity#script-definition. It is not the most intuitive demo and there seem to be some issues: where is the definition for StorageLinkedService? Where is SearchLogProcessing.txt? OK, I found it by googling, but there should be a link on the page. I got it to work, but felt a bit like Harry Potter in the Half-Blood Prince.
Remove the script attribute in your U-SQL activity definition (it is treated as inline U-SQL text, which is why the compiler tripped over the literal 'SearchLogProcessing.txt') and provide the complete path to your script, including the filename, in the scriptPath attribute.
Reference: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity
I had a similar issue, where Azure Data Factory would not recognize my script files. A way to avoid the whole issue, while not having to paste a lot of code, is to register a stored procedure. You can do it like this:
DROP PROCEDURE IF EXISTS master.dbo.sp_test;
CREATE PROCEDURE master.dbo.sp_test()
AS
BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM @in
        USING Extractors.Text(delimiter:'|');
    @rs1 =
        SELECT Start, Region, Duration
        FROM @searchlog
        WHERE Region == "kota";
    OUTPUT @rs1
        TO @out
        USING Outputters.Text(delimiter:'|');
END;
After running this, you can use
"script": "master.dbo.sp_test()"
in your JSON pipeline definition. Whenever you update the U-SQL script, simply re-run the definition of the procedure. Then there will be no need to copy script files to Blob Storage.
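For illustration, the typeProperties of the U-SQL activity from the question would then look roughly like this (a sketch; whether the externally declared @in/@out parameters remain visible inside the procedure body is not settled in this thread):
"typeProperties": {
    "script": "master.dbo.sp_test()",
    "degreeOfParallelism": 3,
    "priority": 100,
    "parameters": {
        "in": "/demo/SearchLog.txt",
        "out": "/scripts/Result.txt"
    }
}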