Azure Databricks CLI: update workflow/job definition


I have created a pipeline in Azure DevOps to perform the following three steps:
1. Retrieve the job definition from one Databricks workspace and save it as JSON (Databricks CLI configuration is omitted):
   databricks jobs get --job-id $(job_id) > workflow.json
2. Use this JSON to update the workflow in a second (separate) Databricks workspace (the Databricks CLI is first reconfigured to point to the new workspace):
   databricks jobs reset --job-id $(job_id) --json-file workflow.json
3. Run the updated job in the second Databricks workspace:
   databricks jobs run-now --job-id $(job_id)
However, my pipeline fails at step 2 with the following error, even though the existing_cluster_id is already defined inside the workflow.json. Any idea?
Error: b'{"error_code":"INVALID_PARAMETER_VALUE","message":"One of job_cluster_key, new_cluster, or existing_cluster_id must be specified."}' 
Here is what my workflow.json looks like (hiding some of the details):
{
  "job_id": 123,
  "creator_user_name": "user1",
  "run_as_user_name": "user1",
  "run_as_owner": true,
  "settings": {
    "name": "my-workflow",
    "existing_cluster_id": "abc-def-123-xyz",
    "email_notifications": {
      "no_alert_for_skipped_runs": false
    },
    "webhook_notifications": {},
    "timeout_seconds": 0,
    "notebook_task": {
      "notebook_path": "notebooks/my-notebook",
      "base_parameters": {
        "environment": "production"
      },
      "source": "GIT"
    },
    "max_concurrent_runs": 1,
    "git_source": {
      "git_url": "https://my-org@dev.azure.com/my-project/_git/my-repo",
      "git_provider": "azureDevOpsServices",
      "git_branch": "master"
    },
    "format": "SINGLE_TASK"
  },
  "created_time": 1676477563075
}

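I figured out that you must not pass the entire job definition from step 1 to the reset command, only its "settings" part. In the full definition, existing_cluster_id is nested under "settings", so the reset command does not find it where it expects it. Modifying step 1 as follows solved my issue: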
databricks jobs get --job-id $(job_id) | jq .settings > workflow.json
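For pipelines where jq is not available, the same extraction can be sketched in Python. This is only an illustration of what `jq .settings` does to the file: the function name `extract_settings` and the trimmed sample definition are assumptions made for the example, not part of the Databricks CLI.

```python
import json


def extract_settings(job_definition: dict) -> dict:
    """Return only the "settings" object of a `databricks jobs get` response.

    Hypothetical helper: it mirrors `jq .settings` and nothing more.
    """
    if "settings" not in job_definition:
        raise KeyError('expected a top-level "settings" key in the job definition')
    return job_definition["settings"]


# Trimmed stand-in for the output of `databricks jobs get` (values from the question).
full_definition = {
    "job_id": 123,
    "creator_user_name": "user1",
    "settings": {
        "name": "my-workflow",
        "existing_cluster_id": "abc-def-123-xyz",
        "max_concurrent_runs": 1,
    },
    "created_time": 1676477563075,
}

settings = extract_settings(full_definition)

# This string is what would be written to workflow.json for the reset step.
workflow_json = json.dumps(settings, indent=2)
```

After the extraction, existing_cluster_id sits at the top level of the document, which is where the error message says it is missing.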

Related

Azure Data Factory Get Metadata activity returning "(404) not found" error when getting column count

I am trying to implement a Get Metadata activity to return the column count of files I have in a single blob storage container.
The Get Metadata activity is returning a "(404) not found" error (screenshot omitted).
I'm fairly new to Azure Data Factory and cannot solve this. Here's what I have:
Dataset: Source dataset
  Name: ten_eighty_split_CSV
  Connection: Blob storage
  Schema: imported from blob storage file
  Parameters: "FileName"; string; "@pipeline().parameters.SourceFile"
Pipeline:
  Name: ten eighty split
  Parameters: "SourceFile"; string; "@pipeline().parameters.SourceFile"
  Settings: Concurrency: 1
Get Metadata activity: Get Metadata
  Only argument is "Column count"
Throws the error upon debugging. I am not sure what to do, (404) not found is so broad I could not ascertain a specific solution. Thanks!

The error occurs because the file name you passed is incorrect or refers to a file that does not exist.
Since you are using a blob-created event trigger to find the column count, you can use the following procedure:
1. After configuring the Get Metadata activity, create a storage event trigger. Go to Add trigger -> choose trigger -> Create new.
2. Click Continue. You will get a Trigger Run Parameters tab. There, set the parameter value to @triggerBody().fileName.
3. Complete the trigger creation and publish the pipeline. From then on, whenever a file is uploaded to the container that the storage event trigger watches, the pipeline is triggered automatically (no need to debug). If the container is empty and you debug with an arbitrary value for the sourceFile parameter, you get the same error.
4. Upload a sample file to your container. It will trigger the pipeline and give the desired result.
The following is the trigger JSON that I created for my container:
{
  "name": "trigger1",
  "properties": {
    "annotations": [],
    "runtimeState": "Started",
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "pipeline1",
          "type": "PipelineReference"
        },
        "parameters": {
          "sourceFile": "@triggerBody().fileName"
        }
      }
    ],
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/data/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "scope": "/subscriptions/b83c1ed3-c5b6-44fb-b5ba-2b83a074c23f/resourceGroups/<user>/providers/Microsoft.Storage/storageAccounts/blb1402",
      "events": [
        "Microsoft.Storage.BlobCreated"
      ]
    }
  }
}

renaming files in a nested directory with azure data factory

I have a daily export set up for several subscriptions - the files export like so
with 7 different directories within daily -- i'm simply trying to rename the files to get rid of the underscore for data flows
my parent pipeline looks like so
get metadata gets the folder names and for each invokes the child pipeline like so
here are the screen grabs of the child pipeline
copy data within the foreach1 -- the source
and now the sink - this is where i want to rename the file, the first time i debugged it simply copied them to the correct place with a .txt extension, the next time it got the extension right but it is not renaming the file,
i replaced @replace(item().name, '_', '-') with @replace(activity('FileInfo').output.itemName, '_','-') and got the following error
The expression '@replace(activity('FileInfo').output.itemName, '_','-')' cannot be evaluated because property 'itemName' doesn't exist, available properties are 'childItems, effectiveIntegrationRuntime, executionDuration, durationInQueue, billingReference'.
so then I replaced that with
@replace(activity('FileInfo').output.childItems, '_', '-')
but that gives the following error
Cannot fit childItems return type into the function parameter string
I'm not sure where to go from here
edit 7/14
making the change from the answer below
here is my linked service for the sink dataset with the parameter renamedFile
here is the sink on the copy data1 for the child_Rename pipeline, it grayed out the file extension as this was mentioned
now here is the sink container after running the pipeline
this is the directory structure of the source data - it's dynamically created from scheduled daily azure exports
here is the output of get metadata - FileInfo from the child pipeline
{
"childItems": [
{
"name": "daily",
"type": "Folder"
}
],
"effectiveIntegrationRuntime": "integrationRuntime1 (Central US)",
"executionDuration": 0,
"durationInQueue": {
"integrationRuntimeQueue": 0
},
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
}
}
allsubs - source container
daily - directory created by the scheduled export
sub1 - subN - the different subs with scheduled exports
previous-month -> this-month - monthly folders are created automatically
this_fileXX.csv -- files are automatically generated with the underscore in the name - it is my understanding that data flows cannot handle these characters in the file name
allsubs/
└── daily/
    ├── sub1/
    │   ├── previous-month/
    │   │   ├── this_file.csv
    │   │   └── this_file1.csv
    │   ├── previous-month/
    │   │   ├── this_file11.csv
    │   │   └── this_file12.csv
    │   └── this-month/
    └── subN/
        ├── previous-month/
        ├── previous-month/
        └── this-month/
            └── this_fileXX.csv
edit 2 - july 20
I think i'm getting closer but there are still some small errors i do not see
the pipeline now moves all the files from the container allsubs to the container renamed-files but it is not renaming the files - it looks like so
Get Metadata -from the dataset allContainers it retrieves the folders with the Child Items
dataset allContainers shown (preview works, linked service works, no parameters in this dataset)
next the forEach activity calls the output of get metadata
for the items @activity('Get Metadata1').output.childItems
next shown is the copy data within ForEach
the source is the allContainers dataset with the wildcard file path selected, recursively selected and due to the following error max concurrent connections set at 1 -- but this did not resolve the error
error message:
Failure happened on 'Sink' side.
ErrorCode=AzureStorageOperationFailedConcurrentWrite,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error occurred when trying to upload a file.
It's possible because you have multiple concurrent copy activities
runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'.
Check your ADF configuration.
,Source=Microsoft.DataTransfer.ClientLibrary,
''Type=Microsoft.WindowsAzure.Storage.StorageException,
Message=The remote server returned an error: (400) Bad
Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.
RequestId:b519219f-601e-000d-6c4c-9c9c5e000000
Time:2022-07-
20T15:23:51.4342693Z,,''Type=System.Net.WebException,
Message=The remote server returned an error: (400) Bad
Request.,Source=Microsoft.WindowsAzure.Storage,'
copy data source:
copy data sink - the dataset is dsRenamesink, it's simply another container in a different storage account, linked service is set up correctly, it has the parameter renamedFile but I suspect this is the source of my error. still testing that.
sink dataset dsRenamesink:
parameter page:
here's the sink in the copy data where the renamed file is passed the iterator from ForEach1 like so:
@replace(item().name,'_','renameworked')
so the underscore would be replaced with 'renameworked' easy enough to test
debugging the pipeline
the errors look to be consistent for the 7 failures which was shown above as the 'failure happened on the sink side'
however - going into the storage account sink i can see that all of the files from the source were copied over to the sink but the files were not renamed like so
pipeline output:
error messages:
{
"dataRead": 28901858,
"dataWritten": 10006989,
"filesRead": 4,
"filesWritten": 0,
"sourcePeakConnections": 1,
"sinkPeakConnections": 1,
"copyDuration": 7,
"throughput": 4032.067,
"errors": [
{
"Code": 24107,
"Message": "Failure happened on 'Sink' side. ErrorCode=AzureStorageOperationFailedConcurrentWrite,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload a file. It's possible because you have multiple concurrent copy activities runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'. Check your ADF configuration.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.\nRequestId:b519219f-601e-000d-6c4c-9c9c5e000000\nTime:2022-07-20T15:23:51.4342693Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,'",
"EventType": 0,
"Category": 5,
"Data": {
"FailureInitiator": "Sink"
},
"MsgId": null,
"ExceptionType": null,
"Source": null,
"StackTrace": null,
"InnerEventInfos": []
}
],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Central US)",
"usedDataIntegrationUnits": 4,
"billingReference": {
"activityType": "DataMovement",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.06666666666666667,
"unit": "DIUHours"
}
]
},
"usedParallelCopies": 1,
"executionDetails": [
{
"source": {
"type": "AzureBlobFS",
"region": "Central US"
},
"sink": {
"type": "AzureBlobStorage"
},
"status": "Failed",
"start": "Jul 20, 2022, 10:23:44 am",
"duration": 7,
"usedDataIntegrationUnits": 4,
"usedParallelCopies": 1,
"profile": {
"queue": {
"status": "Completed",
"duration": 3
},
"transfer": {
"status": "Completed",
"duration": 2,
"details": {
"listingSource": {
"type": "AzureBlobFS",
"workingDuration": 0
},
"readingFromSource": {
"type": "AzureBlobFS",
"workingDuration": 0
},
"writingToSink": {
"type": "AzureBlobStorage",
"workingDuration": 0
}
}
}
},
"detailedDurations": {
"queuingDuration": 3,
"transferDuration": 2
}
}
],
"dataConsistencyVerification": {
"VerificationResult": "NotVerified"
},
"durationInQueue": {
"integrationRuntimeQueue": 0
}
}
all i wanted to do was remove the underscore from the file name to work with data flows....I'm not sure what else to try next
next attempt july 20
it appears that now I have been able to copy and rename some of the files -
changing the sink dataset as follows
@concat(replace(dataset().renamedFile,'_','-'),'',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.csv')
and removing this parameter from the sink in the copy activity
upon debugging this pipeline I get 1 file in the sink and it is named correctly but there is still something wrong
third attempt 7/20
further updating to be closer to the original answer
sink dataset
copy data activity in the sink - concat works
now after debugging i'm left with 1 file for each of the subs - so there is something still not quite correct

I reproduced the same scenario in my environment.
Open the sink dataset. First create a parameter, then add dynamic content for the file name. I used the expression @dataset().sinkfilename.
In the copy activity sink, under dataset properties, pass the file name using the expression @replace(item().name,'_','-') to replace _ with -.
When you create a dataset parameter to pass the file name, the File extension property is automatically disabled.
When the pipeline runs, you can see that the file has been renamed accordingly.
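The difference between the failing and the working expression can be sketched in Python: replace operates on one string, so it has to be applied to each item's name inside the ForEach, not to the whole childItems array. The sample items below are assumptions modelled on the file names in the question, and `renamed` is a hypothetical helper.

```python
def renamed(file_name: str) -> str:
    """Mimic replace(item().name, '_', '-') for a single file name."""
    return file_name.replace("_", "-")


# Assumed shape of a Get Metadata childItems array that lists files.
child_items = [
    {"name": "this_file.csv", "type": "File"},
    {"name": "this_file1.csv", "type": "File"},
]

# Equivalent of ForEach over childItems, passing the per-item name to the sink.
sink_names = [renamed(item["name"]) for item in child_items]

# Calling renamed(child_items) would fail, because a list is not a string.
# This parallels the "Cannot fit childItems return type" error in the question.
```

If childItems contains folders rather than files (as in the FileInfo output shown in the question), the names being replaced are folder names, so the file names themselves are never touched.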

Azure DevOps API - how to reference other pipeline as resource parameter

I have an Azure DevOps pipeline and want to reference other pipeline that my pipeline will fetch the artefacts from. I am struggling to find a way to actually do it over REST API.
https://learn.microsoft.com/en-us/rest/api/azure/devops/pipelines/runs/run%20pipeline?view=azure-devops-rest-6.1 specifies there is a BuildResourceParameters or PipelineResourceParameters but I cannot find a way to get it to work.
For example:
Source pipeline A produces an artefact B in run C. I want to tell API to reference the artefact B from run C of pipeline A rather than refer to the latest.
Anyone?

You can use a request body like the one below to select the version of the referenced pipeline.
{
  "stagesToSkip": [],
  "resources": {
    "repositories": {
      "self": {
        "refName": "refs/heads/master"
      }
    },
    "pipelines": {
      "myresourcevars": {
        "version": "1313"
      }
    }
  },
  "variables": {}
}
Note: 'myresourcevars' is the name you gave the pipeline resource in your YAML file.
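As a minimal sketch, the request body above can be assembled in Python before it is posted to the Run Pipeline endpoint. The function name `build_run_request` and its parameters are assumptions for illustration; only the JSON structure comes from the answer, and the HTTP call itself is left out.

```python
import json


def build_run_request(resource_name: str, run_version: str,
                      branch: str = "refs/heads/master") -> dict:
    """Build a Run Pipeline request body that pins one pipeline resource.

    resource_name: the pipeline resource name from the YAML file.
    run_version:   the version of the source pipeline run to consume.
    """
    return {
        "stagesToSkip": [],
        "resources": {
            "repositories": {"self": {"refName": branch}},
            "pipelines": {resource_name: {"version": run_version}},
        },
        "variables": {},
    }


body = build_run_request("myresourcevars", "1313")

# Serialized form, ready to be sent as the JSON body of the POST request.
payload = json.dumps(body)
```

Keeping the resource name and version as parameters makes it easy to pin a different run without editing the JSON by hand.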

Add steps to a build definition in AzureDevOps 2019

I'm trying to create ADOS build definitions programmatically. I found a similar question with an answer here: How to create Build Definitions through VSTS REST API
In the answer's example, the steps property is empty. I included some steps (taken from the JSON of another build definition retrieved through the same API), but the created build definition has no steps.
I dug into the .NET API browser and found that there is a BuildProcess class with a Process property, which should take a DesignerProcess for TFVC pipelines (since YAML is only supported for Git repos). DesignerProcess has a Phase property that is read-only, which may be the reason why my steps are not created.
However, I still need to find a way to create build steps programmatically.

However I still need to find out a way to create a builds steps programmatically.
If you don't know what to put in the steps property, you can capture the request body in the browser's developer console when saving a Classic UI pipeline.
Here are the detailed steps:
1. Create a Classic UI pipeline with the steps you want in ADOS. (Don't save it yet.)
2. If you are using Edge, press F12 to open the developer console, then choose 'Network'.
3. Click Save and you will find a record called 'definitions'.
4. Click it; the request body is at the bottom of the page. You will find the step-related information in the process and processParameters properties.
If you are using a different browser, there might be slight differences in steps 2, 3 and 4.
Then you can edit the captured body and use it in your REST API request.
Here is a simple example of a request body fragment that includes a Command Line task.
"process": {
"phases": [
{
"condition": "succeeded()",
"dependencies": [],
"jobAuthorizationScope": 1,
"jobCancelTimeoutInMinutes": 0,
"jobTimeoutInMinutes": 0,
"name": "Agent job 1",
"refName": "Job_1",
"steps": [
{
"displayName": "Command Line Script",
"refName": null,
"enabled": true,
"continueOnError": false,
"timeoutInMinutes": 0,
"alwaysRun": false,
"condition": "succeeded()",
"inputs": {
"script": "echo Hello world\n",
"workingDirectory": "",
"failOnStderr": "false"
},
"overrideInputs": {},
"environment": {},
"task": {
"id": "d9bafed4-0b18-4f58-968d-86655b4d2ce9",
"definitionType": "task",
"versionSpec": "2.*"
}
}
],
"target": {
"type": 1,
"demands": [],
"executionOptions": {
"type": 0
}
},
"variables": {}
}
],
"type": 1,
"target": {
"agentSpecification": {
"metadataDocument": "https://mmsprodea1.vstsmms.visualstudio.com/_apis/mms/images/VS2017/metadata",
"identifier": "vs2017-win2016",
"url": "https://mmsprodea1.vstsmms.visualstudio.com/_apis/mms/images/VS2017"
}
},
"resources": {}
}
Also, creating YAML pipelines through the REST API is not currently supported.
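Once a "process" object has been captured this way, adding further steps is plain JSON manipulation. The sketch below appends a step to the first phase; `add_step` is a hypothetical helper, the process object is heavily trimmed, and the step values are taken from the Command Line example above.

```python
import copy


def add_step(process: dict, step: dict, phase_index: int = 0) -> dict:
    """Return a copy of `process` with `step` appended to the chosen phase."""
    updated = copy.deepcopy(process)  # leave the captured body untouched
    updated["phases"][phase_index]["steps"].append(step)
    return updated


# Trimmed process object; a real one carries the other fields shown above.
process = {
    "phases": [{"name": "Agent job 1", "refName": "Job_1", "steps": []}],
    "type": 1,
}

command_line_step = {
    "displayName": "Command Line Script",
    "enabled": True,
    "condition": "succeeded()",
    "inputs": {"script": "echo Hello world\n"},
    "task": {
        "id": "d9bafed4-0b18-4f58-968d-86655b4d2ce9",
        "definitionType": "task",
        "versionSpec": "2.*",
    },
}

new_process = add_step(process, command_line_step)
```

The resulting dictionary would replace the "process" property in the build definition body sent to the REST API.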

How to get the Azure Data Factory parameters into the ARM template parameters file (ARMTemplateParametersForFactory.json) after publishing

I am trying to create my Azure DevOps release pipeline for Azure Data Factory.
I have followed the rather cryptic guide from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment ) regarding adding additional parameters to the ARM template that gets generated when you do a publish (https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment#use-custom-parameters-with-the-resource-manager-template )
I created an arm-template-parameters-definition.json file in the root of the master branch. When I publish, the ARMTemplateParametersForFactory.json in the adf_publish branch remains completely unchanged. I have tried many configurations.
I have defined some Pipeline Parameters in Data Factory and want them to be configurable in my deployment pipeline. Seems like an obvious requirement to me.
Have I missed something fundamental? Help please!
The JSON is as follows:
{
  "Microsoft.DataFactory/factories/pipelines": {
    "*": {
      "properties": {
        "parameters": {
          "*": "="
        }
      }
    }
  },
  "Microsoft.DataFactory/factories/integrationRuntimes": {
    "*": "="
  },
  "Microsoft.DataFactory/factories/triggers": {},
  "Microsoft.DataFactory/factories/linkedServices": {},
  "Microsoft.DataFactory/factories/datasets": {}
}

I've been struggling with this for a few days and did not find much information, so here is what I found out. You have to put arm-template-parameters-definition.json in the configured root folder of your collaboration branch.
In my example, the file sits directly in that root folder (screenshot omitted).
If you work in a separate branch, you can test your configuration by downloading the ARM templates from the data factory. When you change the parameters definition, you have to reload the browser page (F5) to refresh the configuration.
If you really want to parameterize all of the parameters in all of the pipelines, the following should work:
"Microsoft.DataFactory/factories/pipelines": {
  "properties": {
    "parameters": {
      "*": {
        "defaultValue": "="
      }
    }
  }
}
I prefer specifying the parameters that I want to parameterize:
"Microsoft.DataFactory/factories/pipelines": {
  "properties": {
    "parameters": {
      "LogicApp_RemoveFileFromADLSURL": {
        "defaultValue": "=:-LogicApp_RemoveFileFromADLSURL:"
      },
      "LogicApp_RemoveBlob": {
        "defaultValue": "=:-LogicApp_RemoveBlob:"
      }
    }
  }
}

Just to clarify on top of Simon's great answer. If you have non standard git hierarchy (i.e. you move the root to a sub-folder like I have done below with "Source"), it can be confusing when the doc refers to the "repo root". Hopefully this diagram helps.

You've got the right idea, but the arm-template-parameters-definition.json file needs to follow the hierarchy of the element you want to parameterize.
Here is my pipeline activity I want to parameterize. The "url" should change based on the environment it's deployed in
{
  "name": "[concat(parameters('factoryName'), '/ExecuteSPForNetPriceExpiringContractsReport')]",
  "type": "Microsoft.DataFactory/factories/pipelines",
  "apiVersion": "2018-06-01",
  "properties": {
    "description": "",
    "activities": [
      {
        "name": "NetPriceExpiringContractsReport",
        "description": "Passing values to the Logic App to generate the CSV file.",
        "type": "WebActivity",
        "typeProperties": {
          "url": "[parameters('ExecuteSPForNetPriceExpiringContractsReport_properties_1_typeProperties')]",
          "method": "POST",
          "headers": {
            "Content-Type": "application/json"
          },
          "body": {
            "resultSet": "@activity('NetPriceExpiringContractsReportLookup').output"
          }
        }
      }
    ]
  }
}
Here is the arm-template-parameters-definition.json file that turns that URL into a parameter.
{
  "Microsoft.DataFactory/factories/pipelines": {
    "properties": {
      "activities": [{
        "typeProperties": {
          "url": "-::string"
        }
      }]
    }
  },
  "Microsoft.DataFactory/factories/integrationRuntimes": {},
  "Microsoft.DataFactory/factories/triggers": {},
  "Microsoft.DataFactory/factories/linkedServices": {
    "*": "="
  },
  "Microsoft.DataFactory/factories/datasets": {
    "*": "="
  }
}
So basically in the pipelines of the ARM template, it looks for properties -> activities -> typeProperties -> url in the JSON and parameterizes it.
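The path that the definition file describes can be illustrated with a small Python sketch that walks a pipeline resource along properties -> activities -> typeProperties -> url. This only shows which values the path points at; it is not how Data Factory implements parameterization, and `find_activity_urls` and the sample URL are hypothetical.

```python
def find_activity_urls(pipeline_resource: dict) -> list:
    """Collect the "url" of every activity that has one under typeProperties."""
    activities = pipeline_resource.get("properties", {}).get("activities", [])
    return [
        activity["typeProperties"]["url"]
        for activity in activities
        if "url" in activity.get("typeProperties", {})
    ]


# Trimmed stand-in for a pipeline resource; the URL is a made-up placeholder.
pipeline_resource = {
    "type": "Microsoft.DataFactory/factories/pipelines",
    "properties": {
        "activities": [
            {
                "name": "NetPriceExpiringContractsReport",
                "type": "WebActivity",
                "typeProperties": {"url": "https://example.invalid/logic-app", "method": "POST"},
            },
            {"name": "OtherActivity", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}},
        ]
    },
}

urls = find_activity_urls(pipeline_resource)
```

Only activities that actually carry a url under typeProperties are matched, which is why the definition file has to mirror the nesting of the element you want to parameterize.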

Here are the necessary steps to clear up confusion:
Add the arm-template-parameters-definition.json to your master branch.
Close and re-open your Dev ADF portal
Do a new Publish
Your ARMTemplateParametersForFactory.json will then be updated.

I have experienced similar problems with the ARMTemplateParametersForFactory.json file not being updated whenever I publish and have changed the arm-template-parameters-definition.json.
I figured that I can force update the Publish branch by doing the following:
Update the custom parameter definition file as you wish.
Delete ARMTemplateParametersForFactory.json from the Publish branch.
Refresh (F5) the Data Factory portal.
Publish.
The easiest way to validate your custom parameter .json syntax seems to be by exporting the ARM template, just as Simon mentioned.
