How to submit local jobs with dsl.pipeline - azure-machine-learning-service

I am trying to run and debug a pipeline locally. The pipeline is implemented with azure.ml.component.dsl.pipeline. When I try to set default_compute_target='local', the compute target cannot be found:
local not found in workspace, assume this is an AmlCompute
...
File "/home/amirabdi/miniconda3/envs/stm/lib/python3.8/site-packages/azure/ml/component/run_settings.py", line 596, in _get_compute_type
raise InvalidTargetSpecifiedError(message="Cannot find compute '{}' in workspace.".format(compute_name))
azure.ml.component._util._exceptions.InvalidTargetSpecifiedError: InvalidTargetSpecifiedError:
Message: Cannot find compute 'local' in workspace.
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Cannot find compute 'local' in workspace."
}
}
A local run, for example, can be achieved with azureml.core.ScriptRunConfig:
src = ScriptRunConfig(script="train.py", compute_target="local", environment=myenv)
run = exp.submit(src)

Azure ML supports different types of compute targets, and one of them is the local computer.
Create an experiment
from azureml.core import Experiment
experiment_name = 'my_experiment'
experiment = Experiment(workspace=ws, name=experiment_name)
Select the compute target on which to run:
compute_target='local'
If no compute_target is specified in the ScriptRunConfig, Azure ML runs the script on the local machine.
from azureml.core import Environment
myenv = Environment("user-managed-env")
myenv.python.user_managed_dependencies = True
Create the script run configuration based on the procedure mentioned in the link (a minimal sketch is shown below).
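A minimal sketch of what that script run configuration might look like, assuming a placeholder train.py script in the current directory and reusing compute_target and myenv from above:
from azureml.core import ScriptRunConfig
# 'train.py' and the source directory are placeholders; myenv and compute_target come from above
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=myenv)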
Submit the experiment
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)
For troubleshooting the procedure, check the link.

Related

Cannot create Repo with Databricks CLI

I am using Azure DevOps and Databricks. I created a simplified CI/CD Pipeline which triggers the following Python script:
import json
import time
from datetime import datetime
from databricks_cli.configure.config import _get_api_client
from databricks_cli.configure.provider import EnvironmentVariableConfigProvider
from databricks_cli.sdk import JobsService, ReposService
existing_cluster_id = 'XXX'
notebook_path = './'
repo_path = '/Repos/abc#def.at/DevOpsProject'
git_url = 'https://dev.azure.com/XXX/DDD/'
config = EnvironmentVariableConfigProvider().get_config()
api_client = _get_api_client(config, command_name="cicdtemplates-")
repos_service = ReposService(api_client)
repo = repos_service.create_repo(url=git_url, provider="azureDevOpsServices", path=repo_path + "_new")
When I run the pipeline I always get an error (from the last line):
2022-12-07T23:09:23.5318746Z raise requests.exceptions.HTTPError(message, response=e.response)
2022-12-07T23:09:23.5320017Z requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://adb-XXX.azuredatabricks.net/api/2.0/repos
2022-12-07T23:09:23.5321095Z Response from server:
2022-12-07T23:09:23.5321811Z { 'error_code': 'BAD_REQUEST',
2022-12-07T23:09:23.5322485Z 'message': 'Remote repo not found. Please ensure that:\n'
2022-12-07T23:09:23.5323156Z '1. Your remote Git repo URL is valid.\n'
2022-12-07T23:09:23.5323853Z '2. Your personal access token or app password has the correct '
2022-12-07T23:09:23.5324513Z 'repo access.'}
In Databricks, I connected my repo with Azure DevOps: in Azure DevOps I created a full-access token, which I added to Databricks' Git integration, and I am able to pull and push in Databricks.
For my CI/CD pipeline, I created variables containing my Databricks Host address and my token. When I change the token, I get a different error message (403 http code) - so the token seems to be fine.
Here is a screenshot of my variables.
I have really no clue what I am doing wrong. I tried to run a simplified version of the official Databricks code here.
I tried to reproduce the error with the Databricks CLI and found out that _git was simply missing from the Git repo URL.
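For reference, a hedged sketch of what the corrected call could look like, reusing repos_service and repo_path from the snippet above; the organization, project and repository names are placeholders, and the exact URL depends on your Azure DevOps setup (Azure DevOps Git clone URLs normally contain a /_git/ segment):
# Placeholder names; Azure DevOps Git URLs usually look like
# https://dev.azure.com/<organization>/<project>/_git/<repository>
git_url = 'https://dev.azure.com/XXX/DDD/_git/DevOpsProject'
repo = repos_service.create_repo(url=git_url, provider="azureDevOpsServices", path=repo_path + "_new")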

Get local workspace in azureml

I am trying to run a machine learning experiment in azureml.
I can't figure out how to get the workspace context from the control script. Examples like this one in the Microsoft docs use Workspace.from_config(). When I use this in the control script I get the following error:
"message": "We could not find config.json in: [path] or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories."
I've also tried including my subscription id and the resource specs like so:
subscription_id = 'id'
resource_group = 'name'
workspace_name = 'name'
workspace = Workspace(subscription_id, resource_group, workspace_name)
In this case I have to monitor the log and authenticate on each run as I would locally.
How do you get the local workspace from a control script for azureml?
Using the Workspace.from_config() method:
The workspace configuration file is a JSON file that tells the SDK how to communicate with your Azure Machine Learning workspace. The file is named config.json, and it has the following format:
{"subscription_id": "<subscription-id>",
"resource_group": "<resource-group>",
"workspace_name": "<workspace-name>"}
IMPORTANT: This JSON file must be in the directory structure that contains your
Python scripts or Jupyter Notebooks. It can be in the same directory,
a subdirectory named .azureml, or in a parent directory.
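If the config file lives somewhere else, from_config() also accepts an explicit path; a minimal sketch, assuming a hypothetical location for the file:
from azureml.core import Workspace
# Hypothetical path; point this at wherever your config.json actually lives
ws = Workspace.from_config(path="/path/to/.azureml/config.json")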
Alternatively, use the get method to load an existing workspace without using a configuration file (in your case, your code is missing the .get()):
ws = Workspace.get(name="myworkspace",
                   subscription_id='<azure-subscription-id>',
                   resource_group='myresourcegroup')
What is the development system that you are using? A DSVM in the AML workspace or your local dev system?
If it is your local machine, then use the following to write a config file to your project root directory under the path /.azureml/config.json:
from azureml.core import Workspace
subscription_id = 'xxxx-xxxx-xxxx-xxxx-xxxx'
resource_group = 'your_resource_group'
workspace_name = 'your_workspace_name'
try:
    ws = Workspace(subscription_id=subscription_id,
                   resource_group=resource_group,
                   workspace_name=workspace_name)
    # Persist the workspace details to .azureml/config.json for later use
    ws.write_config()
    print('Library configuration succeeded')
except Exception:
    print('Workspace not found')
Otherwise, if it is a DSVM, you are all set; Workspace.from_config() should work.
Note: you should see the .config directory under your user name in AML studio.
This had no answers for 10 months, and now they are coming in :). I figured this out quite a while ago but haven't gotten around to posting the answer. Here it is.
From the training script, you can get the workspace from the run context as follows:
from azureml.core import Run
# Get the context of the current (submitted) run
run = Run.get_context()
ws = run.experiment.workspace
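One caveat, offered as an assumption rather than a guarantee: this works when the script is executed as a submitted run; in a plain local session Run.get_context() returns an offline run, so a fallback such as the following sketch may be needed:
from azureml.core import Run, Workspace
run = Run.get_context()
try:
    # Inside a submitted run the workspace is available from the run context
    ws = run.experiment.workspace
except AttributeError:
    # Offline/local execution: fall back to the config file (assumption)
    ws = Workspace.from_config()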

Cannot list pipeline steps using AzureML CLI

I'm trying to list steps in a pipeline using AzureML CLI extension, but get an error:
>az ml run list -g <group> -w <workspace> --pipeline-run-id 00886abe-3f4e-4412-aec3-584e8c991665
UserErrorException:
Message: Cannot specify ['--last'] for pipeline runs
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Cannot specify ['--last'] for pipeline runs"
}
}
From the help it looks like the --last option takes a default value of 10, despite the fact that it is not supported together with --pipeline-run-id. How is the latter supposed to work?
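As a possible workaround, offered only as a hedged, untested sketch (the experiment name below is a placeholder), the step runs of a pipeline can also be listed with the Python SDK via PipelineRun.get_steps():
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import PipelineRun
ws = Workspace.from_config()
experiment = Experiment(ws, 'my-experiment')  # placeholder experiment name
pipeline_run = PipelineRun(experiment, run_id='00886abe-3f4e-4412-aec3-584e8c991665')
for step_run in pipeline_run.get_steps():
    print(step_run.id, step_run.get_status())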

Azure Machine learning fails when trying to deploy model

I'm currently trying to deploy a model on Azure and expose its endpoint to my application, but I keep running into errors.
DEPLOYMENT CODE
model = run.register_model(model_name='pytorch-modeloldage', model_path="outputs/model")
print("Starting.........")
inference_config = InferenceConfig(runtime= "python",
entry_script="pytorchscore.py",
conda_file="myenv.yml")
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,auth_enabled=True,
memory_gb=1,
tags={'name':'oldageml', 'framework': 'pytorch'},
description='oldageml training')
service = Model.deploy(workspace=ws,
name='pytorch-olageml-run',
models=[model],
inference_config=inference_config,
overwrite=True,
deployment_config=aciconfig)
service.wait_for_deployment(True)
# print(service.get_logs())
print("bruh did you run", service.scoring_uri)
print(service.state)
ERROR
ERROR - Service deployment polling reached non-successful terminal state, current service state: Transitioning
More information can be found here:
Error:
{
"code": "EnvironmentBuildFailed",
"statusCode": 400,
"message": "Failed Building the Environment."
}
I had this error, too, and I was convinced it was working a few days ago!
Anyway, I realised that I was using Python 3.5 in my environment definition.
I changed that to 3.6 and it works! I notice that there was a new release of azureml-core on 9 Dec 2019.
This is my code for changing the environment; I add the environment as a variable rather than from a file as you do, so that's a bit different.
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import InferenceConfig
myenv = Environment(name="env-keras")
conda_packages = ['numpy']
pip_packages = ['tensorflow==2.0.0', 'keras==2.3.1', 'azureml-sdk', 'azureml-defaults']
# Pin Python to 3.6, the fix described above
mycondaenv = CondaDependencies.create(conda_packages=conda_packages, pip_packages=pip_packages, python_version='3.6.2')
myenv.python.conda_dependencies = mycondaenv
myenv.register(workspace=ws)
inference_config = InferenceConfig(entry_script='score.py', source_directory='.', environment=myenv)
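For completeness, a hedged sketch of plugging this environment-based inference_config back into the deployment call from the question (names reused from the snippets above, untested here):
from azureml.core.model import Model
from azureml.core.webservice import AciWebservice
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)
service = Model.deploy(workspace=ws, name='pytorch-olageml-run', models=[model],
                       inference_config=inference_config, deployment_config=aciconfig, overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.state)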

MLflow remote execution on databricks from windows creates an invalid dbfs path

I'm researching the use of MLflow as part of our data science initiatives and I wish to set up a minimum working example of remote execution on databricks from windows.
However, when I perform the remote execution, a path is created locally on Windows inside the MLflow package and sent to Databricks. This path specifies the upload location of the '.tar.gz' file corresponding to the GitHub repo containing the MLflow project. In cmd this path has a combination of '\' and '/', but on Databricks there are no separators at all in this path, which raises the 'rsync: No such file or directory (2)' error.
To be more general, I reproduced the error using a standard MLflow example and following this guide from Databricks. The MLflow example is sklearn_elasticnet_wine, but I had to add a default value to a parameter, so I forked it; the MLproject that can be executed remotely can be found in the forked repo.
The Project can be executed remotely by the following command (assuming a databricks instance has been set up)
mlflow run https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine -b databricks -c db-clusterconfig.json --experiment-id <insert-id-here>
where "db-clusterconfig.json" correspond to the cluster to set up in databricks and is in this example set to
{
    "autoscale": {
        "min_workers": 1,
        "max_workers": 2
    },
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
}
When running the project remotely, this is the output in cmd:
2019/10/04 10:09:50 INFO mlflow.projects: === Fetching project from https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine into C:\Users\ARNTS\AppData\Local\Temp\tmp2qzdyq9_ ===
2019/10/04 10:10:04 INFO mlflow.projects.databricks: === Uploading project to DBFS path /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Finished uploading project to /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2019/10/04 10:10:06 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 8. Getting run status page URL... ===
2019/10/04 10:10:18 INFO mlflow.projects.databricks: === Check the run's status at https://<region>.azuredatabricks.net/?o=<databricks-id>#job/8/run/1 ===
where the DBFS path has a leading '/' while the remaining separators are '\'.
The command spins up a cluster in databricks and is ready to execute the job, but ends up with the following error message on the databricks side:
rsync: link_stat "/dbfsmlflow-experiments3947403843428882projects-codeaa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.1]
where we can see the same path but with the '\' separators stripped out. I narrowed down the creation of this path to this file in the MLflow GitHub repo, where the following code creates the path (line 133):
dbfs_path = os.path.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
"projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = os.path.join("/dbfs", dbfs_path)
My current hypothesis is that os.path.join() in the first line joins the strings together in a Windows fashion, so they are separated by backslashes, and the following call to os.path.join() then adds a '/'. The Databricks file system is unable to handle this path, which causes the '.tar.gz' file to either not be uploaded properly or to be accessed at the wrong path.
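A minimal sketch illustrating this hypothesis: ntpath mimics Windows os.path behaviour on any platform, posixpath always uses forward slashes, and the experiment id and (shortened) tarball name are taken from the log output above:
import ntpath
import posixpath
experiment_id = '3947403843428882'
tarball = 'aa5fbb4769e2.tar.gz'  # shortened hash, for readability only
# What os.path.join produces when the client runs on Windows:
print(ntpath.join('/dbfs', ntpath.join('mlflow-experiments', experiment_id, 'projects-code', tarball)))
# -> /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e2.tar.gz
# Building the DBFS path with posixpath.join instead keeps it POSIX-style:
print(posixpath.join('/dbfs', posixpath.join('mlflow-experiments', experiment_id, 'projects-code', tarball)))
# -> /dbfs/mlflow-experiments/3947403843428882/projects-code/aa5fbb4769e2.tar.gz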
It should also be mentioned that the project runs fine locally.
I'm running the following versions:
Windows 10
Python 3.6.8
MLflow 1.3.0 (also replicated the fault with 1.2.0)
Any feedback or suggestions are greatly appreciated!
Thanks for the catch, you're right that using os.path.join when working with DBFS paths is incorrect, resulting in a malformed path that breaks project execution. I've filed https://github.com/mlflow/mlflow/issues/1926 to track this; if you're interested in making a bugfix PR (see the MLflow contributor guide for info on how to do this) to replace os.path.join here with posixpath.join, I'd be happy to review :)
Thanks for posting this issue.
I also encountered the same on Windows 10.
I resolved this issue by replacing all 'os.path' references with 'posixpath' in the 'databricks.py' file.
It worked perfectly fine for me.
