'AzureBlobDatastore' object is not subscriptable - azure-machine-learning-service

I'm trying to create an Azure ML pipeline with Python steps, and I run into the following issue: "'AzureBlobDatastore' object is not subscriptable". I run it from a notebook, and every cell runs fine except the last one, which is fairly standard:
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
# create the pipeline
pipeline = Pipeline(ws, steps=[data_prep_step, train_step])
# create the experiment and submit and pipeline run
run1 = Experiment(workspace=ws, name="Skylabs-first-model").submit(pipeline)
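For context, that TypeError means something in one of the step definitions indexes the datastore object with square brackets. A hedged v1-style sketch of the difference, with a hypothetical relative path:
from azureml.core import Datastore

datastore = ws.get_default_datastore()

# datastore["training-data"]  # raises: 'AzureBlobDatastore' object is not subscriptable

# In SDK v1, a reference to data on the store is built with Datastore.path():
data_ref = datastore.path("training-data").as_mount()  # "training-data" is a hypothetical path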

Azure ML Pipeline - Error: Message: "Missing data for required field". Path: "environment". value: "null"

I am trying to create a pipeline with the Python SDK v2 in Azure Machine Learning Studio. I've been stuck on this error for many, MANY hours now, so I am reaching out.
I have been following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-python-sdk
My setup is very similar, but I split "data_prep" into two separate steps, and I am using a custom ML model.
How the pipeline is defined:
# the dsl decorator tells the sdk that we are defining an Azure ML pipeline
from azure.ai.ml import dsl, Input, Output
import pathlib
import os

@dsl.pipeline(
    compute=cpu_compute_target,
    description="Car predict pipeline",
)
def car_predict_pipeline(
    pipeline_job_data_input,
    pipeline_job_registered_model_name,
):
    # using data_prep_function like a python call with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
    )

    print('-----------------------------------------------')
    print(os.path.realpath(str(pipeline_job_data_input)))
    print(os.path.realpath(str(data_prep_job.outputs.prepared_data)))
    print('-----------------------------------------------')

    train_test_split_job = traintestsplit_component(
        prepared_data=data_prep_job.outputs.prepared_data
    )

    # using train_func like a python call with its own inputs
    train_job = train_component(
        train_data=train_test_split_job.outputs.train_data,  # note: using outputs from previous step
        test_data=train_test_split_job.outputs.test_data,  # note: using outputs from previous step
        registered_model_name=pipeline_job_registered_model_name,
    )

    # a pipeline returns a dictionary of outputs
    # keys will code for the pipeline output identifier
    return {
        # "pipeline_job_train_data": train_job.outputs.train_data,
        # "pipeline_job_test_data": train_job.outputs.test_data,
        "pipeline_job_model": train_job.outputs.model,
    }
I managed to run every single component successfully, in order, via the command line, and produced a trained model. Ergo the components and data work fine, but the pipeline won't run.
I can provide additional info, but I am not sure what is needed and I do not want to clutter the post.
I have tried googling. I have tried comparing the tutorial pipeline with my own. I have tried using print statements to isolate the issue. Nothing has worked so far, and nothing I have done has changed the error; it's the same error no matter what.
Edit:
Some additional info about my environment:
from azure.ai.ml.entities import Environment

custom_env_name = "pipeline_test_environment_pricepredict_model"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Environment for testing out Jeppes model in pipeline building",
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
    version="1.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)
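The environment can also be fetched back from the workspace, which at least confirms it is registered and resolvable (a small check using the names above):
env_check = ml_client.environments.get(name=custom_env_name, version="1.0")
print(env_check.name, env_check.version, env_check.image)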
The build status of the environment shows it had already built successfully.
In Azure Machine Learning Studio, when a model is deployed, you can choose between curated environments and custom environments. If the environment was created from an existing deployment, first check whether its build succeeded.
Until the build and deployment succeed, the environment's details cannot be read into the program or retrieved from code.
Select the environment that needs to be used.
Choose the existing version you created.
If the environment was built from a Docker image and a conda file, you will see the mount location details and the Dockerfile itself.
Once the environment is up and running, you can retrieve its details through its asset ID or the mount path, for example:
/mnt/batch/tasks/shared/LS_root/mounts/clusters/workspace-name/code/files/docker/Dockerfile
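Beyond the studio checks, the error message itself points at a component whose environment field is empty: in SDK v2, each command component names its environment as name:version. A hedged sketch following the linked tutorial's pattern (the component name, paths, and I/O here are assumptions):
from azure.ai.ml import command, Input, Output

# Hypothetical re-statement of one component from the question.
data_prep_component = command(
    name="data_prep_car_predict",
    inputs={"data": Input(type="uri_folder")},
    outputs={"prepared_data": Output(type="uri_folder", mode="rw_mount")},
    code="./components/data_prep",  # hypothetical source directory
    command="python data_prep.py --data ${{inputs.data}} --prepared_data ${{outputs.prepared_data}}",
    # A component created without an environment is a common cause of
    # "Missing data for required field ... environment".
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)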

Getting artifacts from MLflow GridSearch run

I'm running a sklearn pipeline with hyperparameter search (let's say GridSearch). Now, I am logging artifacts such as test results and whole-dataset predictions. I'd like to retrieve these artifacts but the mlflow API is getting in the way...
import mlflow
mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
mlflow.set_registry_uri("./mlruns/")
run_ids = [r.run_id for r in mlflow.list_run_infos(mlflow.get_experiment_by_name("My Experiment").experiment_id)]
With the above code, I can retrieve all runs, but I have no way of telling which one is a top-level run with artifacts logged and which is a sub-run spawned by the GridSearch procedure.
Is there some way of querying only for parent runs, so I can retrieve these csv files in order to plot the results? I can of course go to the web UI, manually select the run, and copy the URI for the file, but I'd like to do it programmatically instead of opening a tab and clicking things.
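Nested runs spawned by the grid search carry the mlflow.parentRunId tag, while top-level runs do not, so one approach is to filter on that tag. A sketch, assuming an MLflow version where MlflowClient.download_artifacts is available:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")

exp = mlflow.get_experiment_by_name("My Experiment")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])  # pandas DataFrame

# Child runs carry the mlflow.parentRunId tag; for parent runs the column is NaN.
if "tags.mlflow.parentRunId" in runs.columns:
    runs = runs[runs["tags.mlflow.parentRunId"].isna()]

# Download the logged artifacts of each parent run.
client = mlflow.tracking.MlflowClient()
for run_id in runs["run_id"]:
    local_dir = client.download_artifacts(run_id, path="")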

Apache Beam + Databricks Notebook - map function error

I am trying to run a simple pipeline using Apache Beam on DataBricks Notebooks, but I am unable to create any custom functions. Here is a simple example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def my_func(s):
    print(s)

pipeline_options = PipelineOptions([
    "--runner=DirectRunner",
])

with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.Map(my_func)
    )
Produces:
RuntimeError: Unable to pickle fn CallableWrapperDoFn(<function Map.<locals>.<lambda> at 0x7fb5a17a6b80>): It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
This error also occurs if I use a lambda function. Everything works as expected if I use the built-in print function. Why am I getting this error? How can I pass custom functions to my pipeline in this environment?
The reference to SparkContext seems to be getting pulled in when the lambda function is serialized by Beam.
Instead of using beam.Map, can you try defining a new Beam DoFn and using the ParDo transform?
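A minimal sketch of that suggestion, reusing the pipeline from the question (the DoFn class name is mine):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class PrintFn(beam.DoFn):
    # A DoFn defined at module level keeps its closure small, so Beam's
    # pickler does not drag in notebook globals such as the SparkContext.
    def process(self, element):
        print(element)
        yield element

pipeline_options = PipelineOptions(["--runner=DirectRunner"])
with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | "Create data" >> beam.Create(['foo', 'bar', 'baz'])
        | "print result" >> beam.ParDo(PrintFn())
    )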

Azure ML model deployment fail: Module not found error

I'm trying to deploy a model locally using Azure ML before deploying to AKS. I have a custom script that I want to import into my entry script (scoring script), but it's saying it is not found.
Here is the error:
Here's my entry script with the custom script import on line 1:
import rake_refactored as rake
from operator import itemgetter
import pandas as pd
import datetime
import re
import operator
import numpy as np
import json
import os  # needed for os.getenv / os.path.join below

# Called when the deployed service starts
def init():
    global stopword_path
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # For multiple models, it points to the folder containing all deployed models (./azureml-models)
    stopword_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'models/SmartStoplist.txt')

# load models
def preprocess(df):
    df = rake.prepare_data(df)
    text = rake.process_response(df, "RESPNS")
    return text

# Use model to make predictions
def predict(df):
    text = preprocess(df)
    return rake.extract_keywords(stopword_path, text)

def run(data):
    try:
        # Find the data property of the JSON request
        df = pd.read_json(json.loads(data))
        prediction = predict(df)
        return json.dumps(prediction)  # dumps, not dump: serialize to a JSON string
    except Exception as e:
        return str(e)
And here is my model artifact directory in Azure ML showing that it is in the same directory as the entry script (rake_score.py).
What am I doing wrong? I had a similar issue before with a sklearn package that I was able to add to the pip-package list when I built the environment, but my custom script isn't a pip package.
I was not able to find rake_refactored in the documentation or anywhere on the internet.
You can try the steps below for importing rake.
Using pip
pip install rake-nltk
Directly from the repository
git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install
Sample Code:
from rake_nltk import Rake
# Uses stopwords for english from NLTK, and all punctuation characters by
# default
r = Rake()
# Extraction given the text.
r.extract_keywords_from_text(<text to process>)
# Extraction given the list of strings where each string is a sentence.
r.extract_keywords_from_sentences(<list of sentences>)
# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()
# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
Refer - https://github.com/csurfer/rake-nltk
In order to access my custom script in my scoring script, I needed to explicitly define the source directory in my inference configuration:
from azureml.core.model import InferenceConfig

inference_config = InferenceConfig(
    environment=env,
    entry_script="rake_score.py",
    source_directory='./models',
)
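For completeness, a hedged sketch of how that configuration feeds a local deployment; the service name and port are hypothetical, and model is assumed to be the registered Model object:
from azureml.core.model import Model
from azureml.core.webservice import LocalWebservice

deployment_config = LocalWebservice.deploy_configuration(port=6789)  # hypothetical port
service = Model.deploy(
    workspace=ws,
    name="rake-local",  # hypothetical service name
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
)
service.wait_for_deployment(show_output=True)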

Azure AutoML download metrics

I was wondering if there is a way to download the metrics for a model after a run has completed in AutoML in Azure? For example, I want to download the generated confusion matrix as a png file along with the other available metrics.
You can use AutoMLRun's get_output() method to do so -- check out this notebook example.
If you're using the UI to create AutoML runs, or need an output from a previously submitted run, you'll have to create a new AutoMLRun() instance using an Experiment object and the run_id, like below.
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun

ws = Workspace.from_config()
experiment_name = 'YOUREXPERIMENTNAME'
experiment = Experiment(ws, experiment_name)
run_automl = AutoMLRun(experiment, run_id="YOUR RUN ID")
best_run, fitted_model = run_automl.get_output()
You cannot download the confusion matrix or other visualizations from AutoML. You can get a link to the UI from the run and view visualizations there. Why do you need this from the Python SDK?
Also, you can see visualizations through the RunDetails widget.
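A quick sketch of both suggestions, assuming the run_automl object created above:
from azureml.widgets import RunDetails

# Render the run's charts (including the confusion matrix) inline in a notebook.
RunDetails(run_automl).show()

# Or print a link to the same visualizations in the studio UI.
print(run_automl.get_portal_url())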
