Azure ML Pipeline - Error: Message: "Missing data for required field". Path: "environment". value: "null"

I am trying to create a pipeline with Python SDK v2 in Azure Machine Learning Studio. Been stuck on this error for many.. MANY.. hours now, so now I am reaching out.
I have been following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-python-sdk
My setup is very similar, but I split "data_prep" into two separate steps, and I am using a custom ml model.
How the pipeline is defined:
# the dsl decorator tells the SDK that we are defining an Azure ML pipeline
from azure.ai.ml import dsl, Input, Output
import pathlib
import os

@dsl.pipeline(
    compute=cpu_compute_target,
    description="Car predict pipeline",
)
def car_predict_pipeline(
    pipeline_job_data_input,
    pipeline_job_registered_model_name,
):
    # using data_prep_function like a python call with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
    )

    print('-----------------------------------------------')
    print(os.path.realpath(str(pipeline_job_data_input)))
    print(os.path.realpath(str(data_prep_job.outputs.prepared_data)))
    print('-----------------------------------------------')

    train_test_split_job = traintestsplit_component(
        prepared_data=data_prep_job.outputs.prepared_data
    )

    # using train_func like a python call with its own inputs
    train_job = train_component(
        train_data=train_test_split_job.outputs.train_data,  # note: using outputs from previous step
        test_data=train_test_split_job.outputs.test_data,    # note: using outputs from previous step
        registered_model_name=pipeline_job_registered_model_name,
    )

    # a pipeline returns a dictionary of outputs
    # keys will code for the pipeline output identifier
    return {
        # "pipeline_job_train_data": train_job.outputs.train_data,
        # "pipeline_job_test_data": train_job.outputs.test_data,
        "pipeline_job_model": train_job.outputs.model,
    }
I managed to run every single component successfully, in order, via the command line, and produced a trained model. Ergo the components and data work fine, but the pipeline won't run.
I can provide additional info, but I am not sure what is needed and I do not want to clutter the post.
I have tried googling. I have tried comparing the tutorial pipeline with my own. I have tried using print statements to isolate the issue. Nothing has worked so far, and nothing I have changed has altered the error; it is the same error no matter what.
Edit:
Some additional info about my environment:
from azure.ai.ml.entities import Environment

custom_env_name = "pipeline_test_environment_pricepredict_model"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Environment for testing out Jeppes model in pipeline building",
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
    version="1.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)
Build status of the environment: the build had already completed successfully.
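For reference, the linked tutorial wires a registered environment into each command component roughly like this. This is a sketch only: the component name, file paths, and command string below are placeholders, not the actual components from this pipeline.

from azure.ai.ml import command, Input, Output

# Sketch: a component definition following the tutorial's pattern.
data_prep_component = command(
    name="data_prep_car_predict",          # placeholder name
    display_name="Data preparation",
    inputs={"data": Input(type="uri_folder")},
    outputs={"prepared_data": Output(type="uri_folder", mode="rw_mount")},
    code="./components/data_prep/",         # placeholder path to the component code
    command="python data_prep.py --data ${{inputs.data}} --prepared_data ${{outputs.prepared_data}}",
    # the registered environment is referenced as "name:version"
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)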

In Azure Machine Learning Studio, when the application is running and a model is deployed, we have the option of using either curated environments or custom environments. If the environment was created based on an existing deployment, we need to check whether its build was successful.
Until the deployment succeeds, the environment details are not available to the program and cannot be retrieved from code.
Select the environment that needs to be used, then choose the existing version that was created.
If the environment was created with Docker and conda, we will also get the mount location details and the Dockerfile.
Once the environment is up and running, we can retrieve the environment information using the asset ID or the mount details, for example:
/mnt/batch/tasks/shared/LS_root/mounts/clusters/workspace-name/code/files/docker/Dockerfile
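If it helps, the registered environment can also be checked from code. A minimal sketch, assuming the ml_client, environment name, and version from the question:

# Sketch: confirm the workspace can resolve the registered environment
# before the pipeline references it ("1.0" matches the version used above).
env = ml_client.environments.get(name=custom_env_name, version="1.0")
print(env.name, env.version, env.image)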

Related

Getting artifacts from an mlflow GridSearch run

I'm running a sklearn pipeline with hyperparameter search (let's say GridSearch). Now, I am logging artifacts such as test results and whole-dataset predictions. I'd like to retrieve these artifacts but the mlflow API is getting in the way...
import mlflow
mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
mlflow.set_registry_uri("./mlruns/")
run_ids = [r.run_id for r in mlflow.list_run_infos(mlflow.get_experiment_by_name("My Experiment").experiment_id)]
With the above code, I can retrieve all runs but I have no way of telling which one is a toplevel run with artifacts logged or a sub-run spawned by the GridSearch procedure.
Is there some way of querying only for parent runs, so I can retrieve these csv files in order to plot the results? I can of course go to the web api and manually select the run then copy the URI for the file, but I'd like to do it programmatically instead of opening a tab and clicking things.
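For what it's worth, a minimal sketch of one way to do that filtering, assuming the GridSearch child runs carry MLflow's standard mlflow.parentRunId tag (which sklearn autologging sets on nested runs); the artifact listing at the end is just illustrative:

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
client = MlflowClient()

experiment = client.get_experiment_by_name("My Experiment")
runs = client.search_runs([experiment.experiment_id])

# Child runs spawned by the GridSearch carry the "mlflow.parentRunId" tag;
# top-level (parent) runs do not, so keep only the runs without that tag.
parent_runs = [r for r in runs if "mlflow.parentRunId" not in r.data.tags]

for run in parent_runs:
    # list the logged artifacts (e.g. test results / prediction CSVs) per parent run
    for artifact in client.list_artifacts(run.info.run_id):
        print(run.info.run_id, artifact.path)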

Issues Triggering Dataflow Job from Cloud Function: ModuleNotFoundError: No module named 'functions_framework'

I am building a simple Cloud Function that is triggered by a file upload into GCS and launches a Dataflow job. For the sake of simplicity, my current pipeline simply reads the file from GCS and writes it to another bucket. While this Dataflow job works well when run on its own, launching it from the Cloud Function behaves differently: the function logs the file details correctly and triggers a Dataflow job, but the Dataflow job then fails with a "module not found" error. So while the function executes and triggers the job properly, the Dataflow job itself does not come through. Here is the code that I have:
def hello_gcs(event, context):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    input_file = f"gs://{event['bucket']}/{event['name']}"
    output_path = 'gs://<gcs_output_path>'
    dataflow_options = [
        '--project=<project_name>',
        '--runner=DataflowRunner',
        '--region=<region>',
        '--temp_location=gs://<temp_location>',
    ]
    options = PipelineOptions(dataflow_options, save_main_session=True)

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

    p = beam.Pipeline(options=options)
    print_files = (p | beam.io.ReadFromText(input_file) | beam.io.WriteToText(output_path, file_name_suffix='.txt'))
    result = p.run()
I also have a "requirements.txt" file added in the same directory as my function for the following two dependencies:
apache-beam[gcp]==2.39.0
functions-framework==3.*
I have seen in multiple comments that making a Dataflow template bypasses this issue, but I am wondering whether anyone has an idea why this error is thrown, whether it can be circumvented by modifying the current setup, and if not, how to create a template such that this input file can be fed in as a parameter.
Thank you!
This is probably a limitation of the save_main_session approach to staging dependencies. The functions-framework is not needed for Beam or Dataflow, but is just something that is loaded into the interpreter during the execution of your Cloud Function.
I suggest disabling the save_main_session option and/or using the --requirements_file or --setup_file options to provide a specification of the dependencies your pipeline will need at runtime.
Detailed documentation for dependency management is at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
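For example, a minimal sketch of the adjusted options; the placeholders are the same as in the question, and dataflow_requirements.txt is a hypothetical file listing only the pipeline's runtime dependencies (e.g. apache-beam[gcp]), not the Cloud Function's requirements.txt:

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: drop save_main_session and declare the pipeline's own
# dependencies explicitly instead.
dataflow_options = [
    '--project=<project_name>',
    '--runner=DataflowRunner',
    '--region=<region>',
    '--temp_location=gs://<temp_location>',
    '--requirements_file=dataflow_requirements.txt',  # hypothetical file, pipeline deps only
]
options = PipelineOptions(dataflow_options)  # note: no save_main_session=True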

How to import custom functions on my experiment script for Azure ML?

I can successfully submit an experiment for processing on a remote compute target in Azure ML.
In my notebook, for submitting the experiment, I have:
# estimator
estimator = Estimator(
    source_directory='scripts',
    entry_script='exp01.py',
    compute_target='pc2',
    conda_packages=['scikit-learn'],
    inputs=[data.as_named_input('my_dataset')],
)

# Submit
exp = Experiment(workspace=ws, name='my_exp')

# Run the experiment based on the estimator
run = exp.submit(config=estimator)

RunDetails(run).show()
run.wait_for_completion(show_output=True)
However, in order to keep things clean, I want to define my general-use functions in an auxiliary script that the entry script will import.
In my experiment script exp01.py, I wanted:
from azureml.core import Run
import custom_functions as custom

# azure experiment start
run = Run.get_context()

# the data from azure datasets/datastorage
df = run.input_datasets['my_dataset'].to_pandas_dataframe()

# prepare data
df_transformed = custom.prepare_data(df)

# split data
X_train, X_test, y_train, y_test = custom.split_data(df_transformed)

# run my models.....
model_name = 'RF'
model = custom.model_x(model_name, a_lot_of_args)

# log the results
run.log(model_name, results)

# azure finish
run.complete()
The thing is: Azure won't let me import custom_functions.py.
How are you doing it?
TL;DR: any files you put inside the source_directory (in your case, scripts) will be available to the Estimator.
To make this happen, simply create a file called custom_functions.py in the scripts folder that contains your prepare_data(), split_data(), model_x() functions.
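For illustration, a minimal sketch of such a file; the function bodies and the "target" column name are placeholders inferred from the question, not a definitive implementation:

# scripts/custom_functions.py  -- lives next to exp01.py inside source_directory
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    # placeholder: whatever cleaning / feature engineering the experiment needs
    return df.dropna()

def split_data(df: pd.DataFrame):
    X = df.drop(columns=["target"])   # "target" column name is assumed
    y = df["target"]
    return train_test_split(X, y, test_size=0.2, random_state=42)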
I also recommend that you include only exactly what you need in the source_directory folder and make distinct folders for each Estimator, because:
the entire folder's contents will be uploaded when you use a remote compute_target, and
when you start using ML Pipelines (which are awesome), the PythonScriptStep's allow_reuse parameter will look at whether any files in the source_directory have changed when determining whether the step needs to run again.
Lastly, when you want to share general utility functions across PythonScriptSteps or Estimators without having to copy and paste code, that's when you might want to consider creating a custom python package.
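A minimal packaging sketch for that last option, assuming a setuptools-based package; the package name and dependencies are placeholders:

# setup.py at the root of a shared-utilities repository (placeholder names)
from setuptools import setup, find_packages

setup(
    name="mlutils",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pandas", "scikit-learn"],
)

The package can then be installed into the run's environment (for example via the Estimator's pip_packages argument or from a built wheel) instead of being copied into each source_directory.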

Delete a run in the experiment of mlflow from the UI so the run does not exist in backend store

I found that deleting a run only changes its state from active to deleted, because the run is still visible in the UI when searching for deleted runs.
Is it possible to remove a run from the UI to save space?
When a run is removed, is the artifact corresponding to the run also removed?
If not, can the run be removed through a REST call?
The accepted answer indeed deletes the experiment, not a run of the experiment.
In order to remove the run directory one can use the mlflow API. Here is a script that removes all deleted runs:
import mlflow
import shutil

def get_run_dir(artifacts_uri):
    # strip the "file://" scheme prefix and the trailing "/artifacts" segment
    # to get the run's directory on disk
    return artifacts_uri[7:-10]

def remove_run_dir(run_dir):
    shutil.rmtree(run_dir, ignore_errors=True)

experiment_id = 1
deleted_runs = 2  # corresponds to ViewType.DELETED_ONLY

exp = mlflow.tracking.MlflowClient(tracking_uri='./mlflow/mlruns')
runs = exp.search_runs(str(experiment_id), run_view_type=deleted_runs)
_ = [remove_run_dir(get_run_dir(run.info.artifact_uri)) for run in runs]
You can't do it via the web UI, but you can from a Python terminal:
import mlflow
mlflow.delete_experiment(69)
where 69 is the experiment ID.
Whilst Grzegorz already provided a solution, I just wanted to offer an alternative using the MLflow CLI.
The CLI has a command, mlflow gc, which permanently deletes runs in the deleted lifecycle stage.
See https://mlflow.org/docs/latest/cli.html#mlflow-gc

Saving virtual machine into an image using azure python sdk

I have been working with Microsoft Azure to build virtual machines using the Azure SDK for Python, and now I want to create a managed image from an existing virtual machine.
I saw that there is a way to do it in PowerShell here, but after a lot of research I did not find how to do it with the Python SDK.
My goal is to be able to save a virtual machine into an image and load it afterwards (I am using ARM, not ASM).
After a long time trying to figure this out, I was finally able to capture an image from a VM.
First, the VM needs to be deallocated and generalized:
# Deallocate
async_vm_deallocate = self.compute_client.virtual_machines.deallocate(resource_group.name, names.vm)
async_vm_deallocate.wait()
# Generalize (possible because deallocated)
self.compute_client.virtual_machines.generalize(resource_group.name, names.vm)
I found that there are two options for creating the image:
1. compute_client.virtual_machines.capture(resource_group_name=resource_group.name, vm_name=vm.name, parameters=parameters)
This way requires creating a ComputeManagementClient and the following import:
from azure.mgmt.compute.v2015_06_15.models import VirtualMachineCaptureParameters
The parameters have to be of type azure.mgmt.compute.v2015_06_15.models.VirtualMachineCaptureParameters, which has three required params: a VHD name prefix (str), a destination container name (str), and an overwrite-VHDs flag (bool). What these actually mean, I have no idea, and there is no explanation for them, so I did not use this approach.
2. (the way I chose) compute_client.images.create_or_update(resource_group_name=resource_group, image_name=unique_name, parameters=params)
This way also requires creating a ComputeManagementClient, plus the following import:
from azure.mgmt.compute.v2020_06_01.models import Image, SubResource
It is pretty straightforward:
sub_resource = SubResource(id=vm.id)
params = Image(location=LOCATION, source_virtual_machine=sub_resource)
i = compute_client.images.create_or_update(resource_group_name=resource_group, image_name=image_name, parameters=params)
i.wait()
Creating the SubResource() and Image() objects is mandatory, as those are the parameter types expected.
Create a compute client:
https://github.com/Azure-Samples/virtual-machines-python-manage
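A minimal sketch of that step using the current azure-identity credential flow (the subscription id is a placeholder; older samples use ServicePrincipalCredentials instead):

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholder subscription id; any azure-identity credential type works here.
compute_client = ComputeManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)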
Deallocate and generalize:
# Deallocate
async_vm_deallocate = self.compute_client.virtual_machines.deallocate(resource_group.name, names.vm)
async_vm_deallocate.wait()
# Generalize (possible because deallocated)
self.compute_client.virtual_machines.generalize(resource_group.name, names.vm)
To create an image, there is an operations group compute_client.images. I have no exact example like yours, but see this one for creating an image from blob storage (it can be adapted to your scenario):
https://learn.microsoft.com/python/azure/python-sdk-azure-samples-managed-disks#create-an-image-from-blob-storage
