Getting artifacts from mlflow GridSearch run - mlflow

I'm running a sklearn pipeline with hyperparameter search (let's say GridSearch). Now, I am logging artifacts such as test results and whole-dataset predictions. I'd like to retrieve these artifacts but the mlflow API is getting in the way...
import mlflow
mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
mlflow.set_registry_uri("./mlruns/")
run_ids = [r.run_id for r in mlflow.list_run_infos(mlflow.get_experiment_by_name("My Experiment").experiment_id)]
With the above code, I can retrieve all runs, but I have no way of telling which one is a top-level run with artifacts logged and which is a sub-run spawned by the GridSearch procedure.
Is there some way of querying only for parent runs, so I can retrieve these csv files in order to plot the results? I can of course go to the web UI, manually select the run, and copy the URI for the file, but I'd like to do it programmatically instead of opening a tab and clicking things.
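One approach, hedged: child runs spawned by an autologged grid search are normally tagged with mlflow.parentRunId, while the top-level run is not, so you can filter on that tag with mlflow.search_runs. A sketch assuming an MLflow 1.x-style client (to match the list_run_infos call above; newer releases expose mlflow.artifacts.download_artifacts instead), with test_results.csv standing in for whatever artifact name you actually logged:
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")

experiment = mlflow.get_experiment_by_name("My Experiment")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])  # pandas DataFrame

# sub-runs carry the mlflow.parentRunId tag; top-level runs leave it empty/NaN
if "tags.mlflow.parentRunId" in runs.columns:
    parent_runs = runs[runs["tags.mlflow.parentRunId"].isna()]
else:
    parent_runs = runs  # no nested runs were logged at all

client = MlflowClient()
for run_id in parent_runs["run_id"]:
    local_path = client.download_artifacts(run_id, "test_results.csv")  # hypothetical artifact name
    print(run_id, local_path)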

Related

Azure ML Pipeline - Error: Message: "Missing data for required field". Path: "environment". value: "null"

I am trying to create a pipeline with Python SDK v2 in Azure Machine Learning Studio. I've been stuck on this error for many... MANY... hours now, so I am reaching out.
I have been following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-python-sdk
My setup is very similar, but I split "data_prep" into two separate steps, and I am using a custom ml model.
How the pipeline is defined:
# the dsl decorator tells the sdk that we are defining an Azure ML pipeline
from azure.ai.ml import dsl, Input, Output
import pathlib
import os

@dsl.pipeline(
    compute=cpu_compute_target,
    description="Car predict pipeline",
)
def car_predict_pipeline(
    pipeline_job_data_input,
    pipeline_job_registered_model_name,
):
    # using data_prep_function like a python call with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
    )
    print('-----------------------------------------------')
    print(os.path.realpath(str(pipeline_job_data_input)))
    print(os.path.realpath(str(data_prep_job.outputs.prepared_data)))
    print('-----------------------------------------------')
    train_test_split_job = traintestsplit_component(
        prepared_data=data_prep_job.outputs.prepared_data
    )
    # using train_func like a python call with its own inputs
    train_job = train_component(
        train_data=train_test_split_job.outputs.train_data,  # note: using outputs from previous step
        test_data=train_test_split_job.outputs.test_data,  # note: using outputs from previous step
        registered_model_name=pipeline_job_registered_model_name,
    )
    # a pipeline returns a dictionary of outputs
    # keys will code for the pipeline output identifier
    return {
        # "pipeline_job_train_data": train_job.outputs.train_data,
        # "pipeline_job_test_data": train_job.outputs.test_data,
        "pipeline_job_model": train_job.outputs.model,
    }
I managed to run every single component successfully, in order, via the command line, and produced a trained model. Ergo the components and data work fine, but the pipeline won't run.
I can provide additional info, but I am not sure what is needed and I do not want to clutter the post.
I have tried googling. I have tried comparing the tutorial pipeline with my own. I have tried using print statements to isolate the issue. Nothing has worked so far, and nothing I have done has changed the error; it's the same error no matter what.
Edit:
Some additional info about my environment:
from azure.ai.ml.entities import Environment

custom_env_name = "pipeline_test_environment_pricepredict_model"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Environment for testing out Jeppes model in pipeline building",
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
    version="1.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)
Build status of the environment: it had already built successfully.
In Azure Machine Learning Studio, when the application is running and the model is deployed, we have default options to pick either curated environments or custom environments. If the environment was created based on an existing deployment, we need to check whether its build was successful.
Until the deployment succeeds, the environment variables are not recorded for the program and cannot be retrieved through a code block.
Select the environment that needs to be used.
Choose the existing version that was created.
We will get the mount location details, and the Dockerfile if it was created using a Docker and conda environment.
Once the environment is up and running successfully, we can retrieve the environment variable information using the asset ID or the mount details, for example:
/mnt/batch/tasks/shared/LS_root/mounts/clusters/workspace-name/code/files/docker/Dockerfile
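Separately, a hedged note: in the SDK v2 tutorial this pipeline follows, each step is declared with command() and given an environment, and a component whose environment field is left empty is a common trigger for the "Missing data for required field ... environment" message. A minimal sketch under that assumption, reusing the environment registered in the question's edit (the data_prep.py script and the code folder path are hypothetical):
from azure.ai.ml import command, Input, Output

data_prep_component = command(
    name="data_prep_car_predict",
    display_name="Data preparation",
    inputs={"data": Input(type="uri_folder")},
    outputs={"prepared_data": Output(type="uri_folder")},
    code="./components/data_prep/",  # hypothetical folder holding data_prep.py
    command="python data_prep.py --data ${{inputs.data}} --prepared_data ${{outputs.prepared_data}}",
    # the environment field must be set; reference the registered environment as name:version
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)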

Azure AutoML download metrics

I was wondering if there is a way to download the metrics for a model after a run has completed in AutoML in Azure? For example, I want to download the generated confusion matrix as a png file along with the other available metrics.
You can use AutoMLRun's get_output() method to do so -- check out this notebook example.
If you're using the UI to create AutoML runs, or need an output from a previously submitted run, you'll have to create a new AutoMLRun() instance using an Experiment object and the run_id, like below.
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun
ws = Workspace.from_config()
experiment_name = 'YOUREXPERIMENTNAME'
experiment = Experiment(ws, experiment_name)
run_automl = AutoMLRun(experiment, run_id="YOUR RUN ID")
best_run, fitted_model = run_automl.get_output()
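From there, a sketch of pulling the recorded metric values and any files logged against the best child run, using the azureml.core Run API (whether the chart images themselves are stored as downloadable files depends on the run; see the note below):
# metric values recorded by AutoML for the best child run
metrics = best_run.get_metrics()
print(metrics)

# list the files logged against the run and download the ones you want
for file_name in best_run.get_file_names():
    print(file_name)
# example only -- substitute a path that get_file_names() actually lists
best_run.download_file('outputs/model.pkl', output_file_path='model.pkl')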
You cannot download the confusion matrix or other visualizations from AutoML. You can get a link to the UI from the run and view visualizations there. Why do you need this from the Python SDK?
Also, you can see visualizations through the RunDetails widget.
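For completeness, a minimal sketch of the widget call in a notebook, reusing run_automl from the snippet above:
from azureml.widgets import RunDetails

RunDetails(run_automl).show()  # renders the run's charts and metrics inline in the notebook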

How to import custom functions on my experiment script for Azure ML?

I can successfully submit an experiment to processing on a remote compute target on Azure ML.
In my notebook, for submitting the experiment, I have:
# estimator
estimator = Estimator(
    source_directory='scripts',
    entry_script='exp01.py',
    compute_target='pc2',
    conda_packages=['scikit-learn'],
    inputs=[data.as_named_input('my_dataset')],
)
# Submit
exp = Experiment(workspace=ws, name='my_exp')
# Run the experiment based on the estimator
run = exp.submit(config=estimator)
RunDetails(run).show()
run.wait_for_completion(show_output=True)
However, in order to keep things clean, I want to define my general-use functions in an auxiliary script that the entry script will import.
On my script experiment file exp01.py, I wanted:
import custom_functions as custom
from azureml.core import Run  # needed for Run.get_context()
# azure experiment start
run = Run.get_context()
# the data from azure datasets/datastorage
df = run.input_datasets['my_dataset'].to_pandas_dataframe()
# prepare data
df_transformed = custom.prepare_data(df)
# split data
X_train, X_test, y_train, y_test = custom.split_data(df_transformed)
# run my models.....
model_name = 'RF'
model = custom.model_x(model_name, a_lot_of_args)
# log the results
run.log(model_name, results)
# azure finish
run.complete()
The thing is: Azure won't let me import custom_functions.py.
How are you doing it?
TL;DR: any files you put inside the source_directory (in your case, scripts) will be available to the Estimator.
To make this happen, simply create a file called custom_functions.py in the scripts folder that contains your prepare_data(), split_data(), model_x() functions.
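Concretely, the layout would look like this, with stub signatures assumed from the names used in the question:
scripts/
    exp01.py              # the entry_script passed to the Estimator
    custom_functions.py   # shared helpers, importable with `import custom_functions`

# scripts/custom_functions.py -- stubs only; the bodies come from your existing code
def prepare_data(df):
    """Clean and transform the raw dataframe."""
    ...

def split_data(df_transformed):
    """Return X_train, X_test, y_train, y_test."""
    ...

def model_x(model_name, *args, **kwargs):
    """Build and fit the requested model."""
    ...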
I also recommend that you include only exactly what you need in the source_directory folder and make distinct folders for each Estimator because:
the entire folder's contents will be uploaded when you use a remote compute_target, and
when you start using ML Pipelines (which are awesome), the PythonScriptStep's allow_reuse parameter will look to see if any files in the source_directory have changed when determining whether the step needs to run again or not.
Lastly, when you want to share general utility functions across PythonScriptSteps or Estimators without having to copy and paste code, that's when you might want to consider creating a custom python package.

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I've got a stream analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all the .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted to a data lake and available, and my code iterates over the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
a. From an audit perspective, I have version control, execution time, and time-stamped read-only data. Doing a replay on this would require additional coding, but it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input to the step. When you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ds = ws.get_default_datastore()

blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
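From there, iterating the mounted folder is plain Python. A short sketch, assuming the telemetry files are newline-delimited JSON (adjust read_json if each file is a single JSON document):
import glob
import os
import pandas as pd

# find every .json file under the mounted data folder, including subfolders
json_files = glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True)

# lines=True assumes one JSON record per line, which is how Stream Analytics typically writes blobs
frames = [pd.read_json(f, lines=True) for f in json_files]
df = pd.concat(frames, ignore_index=True)
print('Loaded', len(df), 'rows from', len(json_files), 'files')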
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, it's pretty easy to schedule it to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder and iterate over the .json files.

CrfSharp file not found

When I try to run CRFSharp, I get the following error in VS2012:
+err{"Could not find file 'C:\codeplex\POIParser\data\training\POIParser_corpus.train.tag'.":"C:\codeplex\POIParser\data\training\POIParser_corpus.train.tag"} System.Exception {System.IO.FileNotFoundException}
Where can I find this file "POIParser_corpus.train.tag"? I have downloaded both the source code and the main program of CRFSharp and am running it in VS2012.
Also, can I use CRFSharp to extract aspects by using training templates?
How do you run it?
To train a CRF model, you need to prepare a training corpus and a template file first, then run CRFSharpConsole.exe with some parameters. CRFSharpConsole.exe will show its usage if you run it without any parameters.
Actually, I recommend downloading the demo package from the [DOWNLOADS] section of the CRFSharp project web site (http://crfsharp.codeplex.com) first, and then playing with the demos. The demo package shows how to run CRFSharp from the command line. For example, you can download the English named entity recognition demo and run its batch file to train a new model and test it.
As for the POIParser_corpus.train.tag you mentioned, it is the training corpus for the Chinese POI inner-structure parser. You can also download it, run build_model.bat to train the model, and run test_model.bat to test it.
