How to delete an experiment from an azure machine learning workspace - azure-machine-learning-service

I create experiments in my workspace using the Python SDK (azureml-sdk). I now have a lot of 'test' experiments littering our workspace. How can I delete individual experiments, either through the API or on the portal? I know I can delete the whole workspace, but there are some good experiments we don't want to delete.
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-export-delete-data#delete-visual-interface-assets suggests it is possible, but my workspace view does not look anything like what is shown there.

Experiment deletion is a common request and we in the Azure ML team are working on it. Unfortunately, it's not supported quite yet.

Starting from the 2021-08-24 Azure ML Workspace release you can delete an experiment - but only by clicking in the UI (select the experiment in the Experiments view -> 'Delete').
Watch out - deleting the experiment will delete all the underlying runs - and deleting a run will delete the child runs, run metrics, metadata, outputs, logs and working directories!
Only for experiments without any underlying runs can you use the Python SDK (azureml-core==1.34.0) - the Experiment class's static delete method, for example:
from azureml.core import Workspace, Experiment
aml_workspace = Workspace.from_config()
# Look up the experiment id by name, then delete the (empty) experiment by id
experiment_id = Experiment(aml_workspace, '<experiment_name>').id
Experiment.delete(aml_workspace, experiment_id)
If an experiment has runs you will get an error:
CloudError: Azure Error: UserError
Message: Only empty Experiments can be deleted. This experiment contains run(s)
I hope Azure ML team gets this functionality to Python SDK soon!
Also, on a sad note - it would be great if the deletion were optimized - for now it is an extremely slow, synchronous call (an async variant would be welcome as well)...
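If you have many leftover test experiments without runs, you can combine this with the workspace's experiment listing. A minimal sketch, assuming a hypothetical 'test' naming prefix for the throwaway experiments:
from azureml.core import Workspace, Experiment
ws = Workspace.from_config()
for name, exp in ws.experiments.items():
    if not name.startswith('test'):  # hypothetical naming convention for throwaway experiments
        continue
    try:
        Experiment.delete(ws, exp.id)
        print(f'Deleted {name}')
    except Exception as err:  # e.g. the CloudError above when the experiment still contains runs
        print(f'Could not delete {name}: {err}')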

You can hide your experiment from the workspace with the following code (note that this archives the experiment rather than deleting it):
# Declare your experiment
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name="<your_experiment>")
# Archive the experiment
experiment.archive()
# Now check the list of experiments in your AML workspace and see that it no longer appears

This issue is still open at the moment. What I have figured out, to avoid having many experiments in the workspace, is to run locally with the Python SDK and then upload the output files to the run's outputs folder when the run completes.
You can do it like this:
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
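For context, a minimal local-run sketch might look like this (the experiment name is just a placeholder):
from azureml.core import Workspace, Experiment
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name='local-dev')  # placeholder experiment name
run = exp.start_logging()  # start an interactive/local run
# ... produce ./sample.csv locally ...
run.upload_file(name='outputs/sample.csv', path_or_stream='./sample.csv')
run.complete()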

Follow these two steps:
1. Delete the experiment's child jobs in Azure ML Studio (via the UI).
2. Delete the (now empty) experiment with the Python API:
from azureml.core import Workspace, Experiment
# choose the workspace and experiment
ws = Workspace.from_config()
exp_name = 'digits_recognition'
# ... first delete the experiment's child jobs in Azure ML Studio (step 1)
exp = Experiment(ws, exp_name)
Experiment.delete(ws, exp.id)
Note: for more fine-grained control over deletions, use the Azure CLI.

Related

Azure Pipelines: check whether any builds are running in a different pipeline from an Azure pipeline task

So to give you a bit of context: we have a service which has been split into two different services, one for the read side and one for the write side operations. The read side is called ProductStore and the write side is called ProductCatalog. The issue we're facing is on the write side, as the load tests create 100 products in the write-side resource web app, which are then transferred to the read side for the load test to read x number of times. If a build is launched in ProductCatalog because something new was merged to master, this will cause issues in the ProductStore pipeline if it gets run concurrently.
The question I want to ask is: is there a way, in the ProductStore YAML file, to directly query - via a specified Azure task or via an AzurePowerShell script - whether a build is currently running in the ProductCatalog pipeline?
The second part of this would be to loop/wait until that pipeline has successfully finished before resuming the ProductStore pipeline.
Hope this is clear - I'm not sure how best to ask this question as I'm very new to the DevOps pipelines flow, but it would massively help if there was a good way of checking this sort of thing.
As a workaround, you can set a pipeline completion trigger in the ProductStore pipeline.
To trigger a pipeline upon the completion of another, specify the triggering pipeline as a pipeline resource.
Or configure build completion triggers in the UI: choose Triggers from the settings menu and navigate to the YAML pane.
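If you'd rather poll from a script, as the question suggests, here is a rough sketch using the Azure DevOps Builds REST API (the organization, project, pipeline definition id and PAT environment variable are all placeholders):
import os
import time
import requests

ORG = "my-org"                 # placeholder organization
PROJECT = "my-project"         # placeholder project
DEFINITION_ID = 42             # placeholder ProductCatalog pipeline definition id
PAT = os.environ["AZDO_PAT"]   # personal access token (placeholder env var)

url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds"
       f"?definitions={DEFINITION_ID}&statusFilter=inProgress,notStarted&api-version=6.0")

# Wait until no ProductCatalog builds are queued or running
while True:
    resp = requests.get(url, auth=("", PAT))
    resp.raise_for_status()
    if resp.json()["count"] == 0:
        break
    time.sleep(30)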

Azure Data Factory V2 multiple environments like in SSIS

I'm coming from a long SSIS background, and we're looking to use Azure Data Factory v2, but I'm struggling to find any (clear) way of working with multiple environments. In SSIS we would have project parameters tied to the Visual Studio project configuration (e.g. development/test/production etc...) and, say there were 2 parameters for SourceServerName and DestinationServerName, these would point to different servers depending on whether we were in development or test.
From my initial playing around I can't see any way to do this in Data Factory. I've searched Google of course, but any information I've found seems to be around CI/CD, then talks about Git 'branches' and is difficult to follow.
I'm basically looking for a very simple explanation and example of how this would be achieved in Azure data factory v2 (if it is even possible).
It works differently. You create an instance of data factory per environment and your environments are effectively embedded in each instance.
So here's one simple approach:
Create three data factories: dev, test, prod
Create your linked services in the dev environment pointing at dev sources and targets
Create the same named linked services in test, but of course these point at your test systems
Now when you "migrate" your pipelines from dev to test, they use the same logical name (just like a connection manager)
So you don't designate an environment at execution time or map variables or anything... everything in test just runs against test because that's the way the linked services have been defined.
That's the first step.
The next step is to connect only the dev ADF instance to Git. If you're a newcomer to Git it can be daunting but it's just a version control system. You save your code to it and it remembers every change you made.
Once your pipeline code is in git, the theory is that you migrate code out of git into higher environments in an automated fashion.
If you go through the links provided in the other answer, you'll see how you set it up.
I do have an issue with this approach though - you have to look up all of your environment values in a key store, which to me is silly: why do we need to designate the test server's hostname every time we deploy to test?
One last thing: if you have a pipeline that doesn't use a linked service (say a REST pipeline), I haven't found a way to make it environment aware. I ended up building logic around the current data factory's name to dynamically change endpoints.
This is a bit of a brain dump but feel free to ask questions.
Although it's not recommended - yes, you can do it.
Take a look at Linked Service - in this case, I have a connection to Azure SQL Database:
You have the possibility to use dynamic content for both the server name and the database name.
Just add a parameter to your pipeline, pass it to the Linked Service and use it in the required field.
Let me know whether I explained it clearly enough.
Yes, it's possible, although not as simple as it was in Visual Studio for SSIS.
1) First of all: there is no desktop application for developing ADF, only the browser.
Therefore developers should make their changes in the DEV environment and, for many reasons, the best way to do that is to work with a Git repository connected.
2) Then, you need "only":
a) publish the changes (this creates/updates the adf_publish branch in Git)
b) with Azure DevOps, deploy the code from adf_publish, replacing the required parameters for the target environment.
I know that at the beginning it sounds horrible, but the sooner you set up an environment like this the more time you save while developing pipelines.
How to do these things step by step?
I describe all the steps in the following posts:
- Setting up Code Repository for Azure Data Factory v2
- Deployment of Azure Data Factory with Azure DevOps
I hope this helps.

What is a good Databricks workflow

I'm using Azure Databricks for data processing, with notebooks and pipeline.
I'm not satisfied with my current workflow:
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Git integration is very simple, but this is not my main concern.
Great question. Definitely don't modify your production code in place.
One recommended pattern is to keep separate folders in your workspace for dev-staging-prod. Do your dev work and then run tests in staging before finally promoting to production.
You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
Regarding your second point about IDEs - Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give you more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.
Once you have your scripts ready to go you can always import them into the workspace as a notebook and run it as a job. Also know that you can run .py scripts as a job using the REST API.
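As an illustration of the REST route, a rough sketch of submitting a .py script as a one-off job run might look like this (the workspace URL, token variable, cluster settings and DBFS path are all placeholders):
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # workspace URL (placeholder)
token = os.environ["DATABRICKS_TOKEN"]  # personal access token (placeholder)

payload = {
    "run_name": "nightly-etl",                   # placeholder run name
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",     # placeholder runtime
        "node_type_id": "Standard_DS3_v2",       # placeholder node type
        "num_workers": 2,
    },
    "spark_python_task": {
        "python_file": "dbfs:/scripts/etl.py",   # placeholder script location
        "parameters": ["--process-date", "2021-01-01"],
    },
}

resp = requests.post(
    f"{host}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print("Submitted run:", resp.json()["run_id"])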
I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved either by having separate DEV/TST/PRD environments, or by having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Yes, using the versioned packages method I mentioned in combination with databricks-connect, you are totally able to use your IDE, implement tests and have proper Git integration.
Git integration is very simple, but this is not my main concern.
Built-in Git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat and linear accumulation of changes that is shared with your colleagues. Besides that, you have to link and unlink repositories, which is prone to human error and can cause your notebooks to be synchronized into the wrong folders, breaking runs because notebooks can't be imported. I advise you to also use my packaging solution.
The packaging solution works as follows (reference):
On your desktop, install pyspark
Download some anonymized data to work with
Develop your code with small bits of data, writing unit tests
When ready to test on big data, uninstall pyspark, install databricks-connect
When performance and integration is sufficient, push code to your remote repo
Create a build pipeline that runs automated tests, and builds the versioned package
Create a release pipeline that copies the versioned package to DBFS
In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package
Pass the arguments to your module to run your tested code
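A minimal sketch of such a runner notebook, assuming a hypothetical my_etl package has been installed from the DBFS wheel:
# Databricks notebook cell - dbutils and spark are available in the notebook context
dbutils.widgets.text("process_date", "2021-01-01")    # placeholder default
dbutils.widgets.text("data_path", "/mnt/raw/sample")  # placeholder default

process_date = dbutils.widgets.get("process_date")
data_path = dbutils.widgets.get("data_path")

# Import from the versioned, tested package installed from the DBFS wheel
from my_etl.jobs import run_job  # hypothetical module and function

run_job(spark, process_date=process_date, data_path=data_path)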
The way we are doing it -
- Integrate the Dev notebooks with Azure DevOps.
- Create custom Build and Deployment tasks for notebook, job, package and cluster deployments. This is sort of easy to do with the Databricks REST API (a rough sketch of one such call follows this list):
https://docs.databricks.com/dev-tools/api/latest/index.html
- Create a Release pipeline for Test, Staging and Production deployments.
- Deploy on Test and test.
- Deploy on Staging and test.
- Deploy on Production.
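For instance, a notebook deployment step can boil down to a call to the Databricks Workspace Import API. A rough Python sketch, where the host, token and paths are placeholders:
import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # workspace URL (placeholder)
token = os.environ["DATABRICKS_TOKEN"]  # personal access token (placeholder)

# Read the notebook source produced by the build and base64-encode it
with open("my_notebook.py", "rb") as f:  # hypothetical local notebook source
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Production/my_notebook",  # target workspace path (placeholder)
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()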
Hope this can help.

Variable substitution in build pipeline

There are tons of resources online on how to replace JSON configuration files in a release pipeline like this one. I configured this. It works. However, we have multiple integration tests which reach the database too. These tests are run during build time. I haven't seen any option yet to replace config values in the build pipeline. Does it exist? Or do I really have to use this custom task (see screenshot below)?
There is now an out-of-the-box task from Microsoft. It's called File Transform. It's currently in preview but it works really well! I haven't had any issues whatsoever with it, and it works the same as you would configure it in the release pipeline. I would recommend this any day!
Below you can see my configuration.
There is no out-of-the-box task solely for replacing tokens/values in files (also, in the release pipeline the task is Azure App Service Deploy, which is not only for replacing JSON configuration).
You need to use an external extension from here or write a PowerShell script for that.
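The script route can be quite small. Here is a sketch in Python (the same idea works in PowerShell, as suggested above) that overwrites selected keys of a JSON config with values taken from pipeline variables exposed as environment variables; the file name and variable names are placeholders:
import json
import os

CONFIG_PATH = "appsettings.json"  # placeholder config file

# Map of JSON keys to the pipeline variables (environment variables) that override them
OVERRIDES = {
    "ConnectionStrings:Database": "DB_CONNECTION_STRING",  # placeholder names
    "ApiBaseUrl": "API_BASE_URL",
}

with open(CONFIG_PATH) as f:
    config = json.load(f)

for key, env_var in OVERRIDES.items():
    if env_var in os.environ:
        # Support simple "Section:Key" paths
        node = config
        *parents, leaf = key.split(":")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = os.environ[env_var]

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)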

Rerun (reschedule) onetime pipeline

I've created a new onetime pipeline in Azure Data Factory using Copy Data wizard.
The pipeline has 4 activities and it was run just fine - all 4 activities succeeded.
Now I'd like to rerun the pipeline. So I do:
Clone the pipeline.
Change name to [name]_rev2.
Remove start and end properties.
Deploy the cloned pipeline.
Now the status of the new cloned pipeline is Running.
But no activities are executed at all.
What's wrong?
Mmmmm. Where to start!
In short: you can't just clone the pipeline. If it's a one-time pipeline that you've created, it won't have a schedule attached and therefore won't have any time slices provisioned that now require execution.
If you're unsure how time slices in ADF work, I recommend some reading around this concept.
Next, I recommend opening up a Visual Studio 2015 solution and downloading the data factory that did run as a project. Check out the various JSON blocks for scheduling and availability intervals in the datasets and activities, and the time frame in the pipeline.
Here's a handy guide for scheduling and execution to understand how to control your pipelines.
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
Once you've made all the required changes (not just the name), publish the project to a new ADF service.
Hope this helps
