Is there a way to download the metrics for a model after an AutoML run has completed in Azure? For example, I want to download the generated confusion matrix as a PNG file along with the other available metrics.
You can use AutoMLRun's get_output() method to do so -- check out this notebook example.
If you're using the UI to create AutoML runs, or need an output from a previously submitted run, you'll have to create a new AutoMLRun() instance using an Experiment object and the run_id, like below.
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun

ws = Workspace.from_config()

experiment_name = 'YOUREXPERIMENTNAME'
experiment = Experiment(ws, experiment_name)

# Rehydrate the AutoML run from its run ID, then fetch the best child run
remote_run = AutoMLRun(experiment, run_id="YOUR RUN ID")
best_run, fitted_model = remote_run.get_output()
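From there, the metrics and files logged for the best run can be pulled down programmatically. A minimal sketch using standard Run methods (the artifact path below is only an example; list the file names first to see what your run actually produced):

# All metrics logged for the best child run, as a dict
metrics = best_run.get_metrics()
print(metrics)

# List the artifacts attached to the run, then download one of them
for name in best_run.get_file_names():
    print(name)

# Example path only -- substitute a name printed above
best_run.download_file("outputs/model.pkl", output_file_path="model.pkl")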
You cannot download the confusion matrix or other visualizations from AutoML. You can get a link to the UI from the run and view visualizations there. Why do you need this from the Python SDK?
Also, you can see visualizations through the RunDetails widget.
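For instance, a short sketch (get_portal_url() is a standard Run method, and RunDetails ships in the azureml-widgets package):

from azureml.widgets import RunDetails

# Direct link to the run in the studio UI, where the confusion matrix is rendered
print(remote_run.get_portal_url())

# Or render the run's metrics and charts inline in the notebook
RunDetails(remote_run).show()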
I'm running a sklearn pipeline with hyperparameter search (let's say GridSearch). Now, I am logging artifacts such as test results and whole-dataset predictions. I'd like to retrieve these artifacts but the mlflow API is getting in the way...
import mlflow

mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
mlflow.set_registry_uri("./mlruns/")

experiment = mlflow.get_experiment_by_name("My Experiment")
run_ids = [r.run_id for r in mlflow.list_run_infos(experiment.experiment_id)]
With the above code, I can retrieve all runs, but I have no way of telling which is a top-level run with artifacts logged and which is a sub-run spawned by the GridSearch procedure.
Is there some way of querying only for parent runs, so I can retrieve these csv files in order to plot the results? I can of course go to the web UI, manually select the run, and copy the URI for the file, but I'd like to do it programmatically instead of opening a tab and clicking things.
I am sure this is something basic but I have been banging my head against the wall for a while now and I can't figure it out.
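One way to sketch it: runs spawned inside a parent (for example by autologging during a grid search) carry the mlflow.parentRunId tag, so top-level runs are the ones where that tag is absent. A minimal sketch using mlflow.search_runs, which returns a pandas DataFrame with tags as columns:

import mlflow

mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")

experiment = mlflow.get_experiment_by_name("My Experiment")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Child runs carry the mlflow.parentRunId tag; for top-level runs the
# corresponding column is NaN, so filter on its absence.
if "tags.mlflow.parentRunId" in runs.columns:
    parent_runs = runs[runs["tags.mlflow.parentRunId"].isna()]
else:
    parent_runs = runs  # no nested runs were logged

print(parent_runs["run_id"].tolist())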
I have trained and registered a model using automl in AzureML. The model is visible in the registry.
When I try to load it in order to do something with it, I use this basic/standard code:
from azureml.core import Workspace
from azureml.core.model import Model
import joblib

ws = Workspace.from_config()

# Download the registered model locally, then deserialize it
model_obj = Model(ws, "ModelName")
model_path = model_obj.download(exist_ok=True)
model = joblib.load(model_path)
And I get this lovely error
ImportError: cannot import name 'HoltWintersResultsWrapper' from 'statsmodels.tsa.holtwinters' (/anaconda/envs/azureml_py38/lib/python3.8/site-packages/statsmodels/tsa/holtwinters/__init__.py)
My statsmodels and automl packages are updated.
I have even tried removing exponential models from the automl configuration to see if it was a specific issue with these models.
I have also tried changing the environment to a curated one but nothing seems to work.
I didn't get anywhere searching online either, so here I am.
Does anyone know what the heck is going on here?
Thanks!
The issue is with the way the module is being called. Some modules must be referenced through their parent package name. In the function that references HoltWintersResultsWrapper, replace the bare import with the fully qualified path.
See the example procedure from programtalk:
import statsmodels.tsa.holtwinters.results

# statsmodels_loaded is the unpickled model object; verify it via the
# fully qualified class path rather than the flat import
assert isinstance(
    statsmodels_loaded,
    statsmodels.tsa.holtwinters.results.HoltWintersResultsWrapper,
)
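Applied to the loading code above, one workaround sketch (hedged: aligning the statsmodels versions used for training and loading is the cleaner fix; this merely aliases the class back onto the module path the pickle expects):

import joblib
import statsmodels.tsa.holtwinters as hw
import statsmodels.tsa.holtwinters.results as hw_results

# If the installed statsmodels no longer exposes the class at the flat
# location the pickle references, alias it there before unpickling.
if not hasattr(hw, "HoltWintersResultsWrapper"):
    hw.HoltWintersResultsWrapper = hw_results.HoltWintersResultsWrapper

model = joblib.load(model_path)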
PyCaret seems like a great AutoML tool. It works fast and simply, and I would like to download the generated pipeline code into .py files to double-check and, if needed, customize some parts. Unfortunately, I don't know how to do it, and reading the documentation has not helped. Is it possible or not?
It is not possible to get the underlying code, since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take, e.g.:
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)

# Create a model (uses training split only)
model = create_model("lr")

# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)

# Finalize model (so that the best hyperparameters are retrained on the entire dataset)
final = finalize_model(tuned)

# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for use later:
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
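To load it back later, a short sketch (load_model and predict_model are standard PyCaret calls; new_data stands in for your own DataFrame):

from pycaret.classification import load_model, predict_model

# Reload the saved pipeline + model and score new data with it
loaded = load_model("my_best_model")
predictions = predict_model(loaded, data=new_data)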
You may get a partial (incomplete) answer with get_config("prep_pipe") in 2.6.10 or in 3.0.0rc1.
Just run setup() as in the examples, store the result as cdf1, and try cdf1.pipeline; you may get output like Pipeline(...).
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *

# Load the data and hold out 5% as unseen data
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)

# Set up the experiment and compare candidate models
exp_clf101 = setup(data=data, target='default', session_id=123)
best = compare_models()
evaluate_model(best)

# OPTION 1
get_config("pipeline")

# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']
I can successfully submit an experiment to processing on a remote compute target on Azure ML.
In my notebook, for submitting the experiment, I have:
from azureml.core import Experiment
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

# estimator (ws and data are defined earlier in the notebook)
estimator = Estimator(
    source_directory='scripts',
    entry_script='exp01.py',
    compute_target='pc2',
    conda_packages=['scikit-learn'],
    inputs=[data.as_named_input('my_dataset')],
)

# Submit and run the experiment based on the estimator
exp = Experiment(workspace=ws, name='my_exp')
run = exp.submit(config=estimator)

RunDetails(run).show()
run.wait_for_completion(show_output=True)
However, in order to keep things clean, I want to define my general-purpose functions in an auxiliary script that the entry script will import.
In my experiment script file exp01.py, I wanted:
from azureml.core import Run

import custom_functions as custom

# azure experiment start
run = Run.get_context()

# the data from azure datasets/datastorage
df = run.input_datasets['my_dataset'].to_pandas_dataframe()

# prepare data
df_transformed = custom.prepare_data(df)

# split data
X_train, X_test, y_train, y_test = custom.split_data(df_transformed)

# run my models (model_x and its args are placeholders)
model_name = 'RF'
model = custom.model_x(model_name, a_lot_of_args)

# log the results
run.log(model_name, results)

# azure finish
run.complete()
The thing is: Azure won't let me import custom_functions.py.
How are you doing it?
TL;DR: any files you put inside the source_directory (in your case, scripts) will be available to the Estimator.
To make this happen, simply create a file called custom_functions.py in the scripts folder that contains your prepare_data(), split_data(), model_x() functions.
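For illustration, a minimal sketch of what scripts/custom_functions.py could look like (the function bodies and the target_column default are hypothetical placeholders; keep your real logic):

# scripts/custom_functions.py
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_data(df):
    # placeholder transformation -- replace with your real preparation steps
    return df.dropna()

def split_data(df, target_column='target'):
    # placeholder split -- assumes a label column named by target_column
    X = df.drop(columns=[target_column])
    y = df[target_column]
    return train_test_split(X, y, test_size=0.2, random_state=42)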
I also recommend that you include only exactly what you need in the source_directory folder and make distinct folders for each Estimator, because:
the entire folder's contents will be uploaded when you use a remote compute_target, and
when you start using ML Pipelines (which are awesome), a PythonScriptStep's allow_reuse parameter will look to see whether any files in the source_directory have changed when determining if the step needs to run again or not.
Lastly, when you want to share general utility functions across PythonScriptSteps or Estimators without having to copy and paste code, that's when you might want to consider creating a custom python package.
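A minimal layout sketch for such a package (names here are illustrative):

my_ml_utils/
    setup.py
    my_ml_utils/
        __init__.py
        prep.py    # prepare_data, split_data, etc.

Installing it into the run's environment (for example with pip) then lets every script import it without copy-pasting.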
I'm using a Python 3.4 Jupyter notebook to load a dataset in Azure ML which is stored in the cloud as a dataset in the Azure ML project environment. But using the default template created by Azure ML, I can't load the data due to a mixed datatypes error.
from azureml import Workspace
import pandas as pd
ws = Workspace()
ds = ws.datasets['rossmann-train.csv']
df = ds.to_dataframe()
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/kernel/main.py:6: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
In my local environment I just import the dataset as follows:
df = pd.read_csv('train.csv',low_memory=False)
But I'm not sure how to do this in Azure using the ds object.
df = pd.read_csv(ds)
and
pd.DataFrame.from_csv(ds)
raise the error:
OSError: Expected file path name or file-like object, got type
Edit: more info on the ds object:
In [1]: type(ds)
Out [1]: azureml.SourceDataset
In [2]: print (ds)
Out [2]: rossmann-train.csv
First of all, I am not sure from your question what the ds object is. But I'm pretty sure it is not a csv file, since, if it were, you'd have processed it yourself and you wouldn't be having this question.
Now, I am not sure whether pandas has a native way of dealing with Azure, but this piece of documentation indicates that you must first download the data from Azure using their package and save it into your local file system.
But that assumes the data you downloaded is already in csv format. If not, use the appropriate reader (or parse it by hand) to tabulate the data for a pandas.DataFrame.
According to the docs on the azureml library, one workaround would be to import the file as text and then parse it into csv, but this seems unnecessary since the data is already recognised as having csv structure.
text_data = ds.read_as_text()
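If you do go that route, wrapping the text in a file-like object lets you pass pandas the dtype options directly (a sketch building on read_as_text(); io.StringIO is standard library):

import io
import pandas as pd

text_data = ds.read_as_text()
# StringIO gives read_csv a file-like object, so low_memory/dtype options apply
df = pd.read_csv(io.StringIO(text_data), low_memory=False)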