How to Deploy trained TensorFlow 2.0 models using Amazon SageMaker? - python-3.x

I am trying to deploy a custom-trained TensorFlow model using Amazon SageMaker. I have trained XLM-RoBERTa with TF 2.2.0 for a multilingual sentiment analysis task (please refer to this notebook: https://www.kaggle.com/mobassir/understanding-cross-lingual-models).
Now, using the trained weight file of my model, I am trying to deploy it in SageMaker. I was following this tutorial: https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/ and converted some Keras code from there to tensorflow.keras for 2.2.0.
But when I run !ls export/Servo/1/variables, I can see that the SavedModel export generates an empty variables directory, like in this issue: https://github.com/tensorflow/models/issues/1988
I can't find any documentation on deploying a TF 2.2.0 trained model; I need an example like the tutorial above, but for TF 2.x models rather than Keras.
Even though !ls export/Servo/1/variables shows an empty directory, an endpoint was created successfully. Now I am not sure whether my model was actually deployed, because when I test the deployment inside the AWS notebook with
predictor = sagemaker.tensorflow.model.TensorFlowPredictor(endpoint_name, sagemaker_session)
predictor.predict(data)
I get the following error message:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{
"error": "Session was not created with a graph before Run()!"
}"
A related problem: Inference error with TensorFlow C++ on iOS: "Invalid argument: Session was not created with a graph before Run()!"
The code I tried can be found here: https://pastebin.com/sGuTtnSD
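No answer is recorded in this thread, but the empty variables directory usually means the model was exported with TF 1.x-style Session/builder code, which does not capture a tf.keras model's weights in TF 2.x; the serving container then loads a graph with no variables, which matches the "Session was not created with a graph" error. Below is a minimal sketch of the TF 2.x route, assuming the SageMaker Python SDK v2 and a trained in-memory model named model (the bucket prefix and instance type are illustrative, not taken from the pastebin):
import tarfile
import tensorflow as tf
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

# In TF 2.x, save the Keras model itself; this writes saved_model.pb
# plus a populated variables/ directory, unlike TF 1.x builder code.
tf.saved_model.save(model, 'export/Servo/1')

# The SageMaker TensorFlow Serving container expects a model.tar.gz.
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('export', arcname='export')

session = sagemaker.Session()
model_data = session.upload_data(path='model.tar.gz', key_prefix='model')

# framework_version should match the TF version used for training.
sm_model = TensorFlowModel(model_data=model_data,
                           role=sagemaker.get_execution_role(),
                           framework_version='2.2')
predictor = sm_model.deploy(initial_instance_count=1,
                            instance_type='ml.m5.xlarge')
With the endpoint up, predictor.predict(data) should return model output rather than the Session error, since the SavedModel now actually contains variables.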

Related

Databricks error when trying to load a model

I am trying to train a model in Databricks with MLflow and RoBERTa transformers. I am able to register the model, but when I call it for testing I get the following error:
OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like dbfs:/databricks/mlflow-tracking/3638851642935524/cd4eae6034684211933b97b178e5f062/artifacts/checkpoint-36132/artifacts/checkpoint-36132 is not the path to a directory containing a config.json file.
However, when I check the saved model, I can see that config.json and the other files are saved in the mentioned artifact, but with the following error:
Couldn't load model information due to an error.
For more details, I followed this link for creating an MLflow run with transformers on Databricks:
https://gitlab.com/juliensimon/huggingface-demos/-/tree/main/mlflow
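No answer is recorded here either. One hedged workaround (assuming a recent MLflow and that the checkpoint directory really contains config.json, as the question states) is to download the artifacts to a local directory first, because transformers cannot read dbfs:/ URIs directly and falls back to querying huggingface.co:
import mlflow
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pull the logged checkpoint down to local disk; run_id and artifact_path
# here are read off the error message and may need adjusting.
local_dir = mlflow.artifacts.download_artifacts(
    run_id="cd4eae6034684211933b97b178e5f062",
    artifact_path="checkpoint-36132")

# from_pretrained works once it is handed a real local directory.
model = AutoModelForSequenceClassification.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)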

I cannot register a model in my Azure ML experiment using the run context

I am trying to register a model inside one of my Azure ML experiments. I am able to register it via Model.register but not via run_context.register_model.
These are the two calls I use; the run_context.register_model one is the one that fails:
learn.path = Path('./outputs').absolute()
Model.register(run_context.experiment.workspace, "outputs/login_classification.pkl","login_classification", tags=metrics)
run_context.register_model("login_classification", "outputs/login_classification.pkl", tags=metrics)
I receive the following error:
Message: Could not locate the provided model_path outputs/login_classification.pkl
But the model is stored at that path.
Before calling run_context.register_model(), make sure you have obtained the context with run_context = Run.get_context().
I was able to fix the problem by explicitly uploading the model into the run history record before registering it:
run.upload_file("output/model.pickle", "output/model.pickle")
See the documentation on Run.upload_file and the Run class for details.
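Putting the answer's two steps together, a minimal sketch inside the experiment script (paths and the metrics dict follow the question; everything else is an assumption):
from azureml.core import Run

# Handle to the current run; register_model resolves model_path against
# the run's uploaded artifacts, not the local filesystem.
run_context = Run.get_context()

run_context.upload_file("outputs/login_classification.pkl",
                        "outputs/login_classification.pkl")

run_context.register_model(model_name="login_classification",
                           model_path="outputs/login_classification.pkl",
                           tags=metrics)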

MLFlow Model Registry ENDPOINT_NOT_FOUND: No API found for ERROR

I'm currently using MLflow in Azure Databricks and trying to load a model from the Model Registry. Currently I reference the version, but I will want to reference the stage 'Production' (I get the same error when referencing the stage as well).
I keep encountering this error:
ENDPOINT_NOT_FOUND: No API found for 'POST /mlflow/model-versions/get-download-uri'
My artifacts are stored in the DBFS filestore, and I have not been able to identify why this is happening.
Code:
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
import mlflow.pyfunc
model_name = "model_name"
model_version_uri = "models:/{model_name}/4".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_version_uri))
model_version_4 = mlflow.pyfunc.load_model(model_version_uri)
model_production_uri = "models:/{model_name}/production".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_production_uri))
model_production = mlflow.pyfunc.load_model(model_production_uri)
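No answer is recorded for this one. As a diagnostic step (not a confirmed fix), the unused ModelVersionStatus import in the question suggests checking that the registry actually answers for this model and version before calling load_model; if the calls below fail with the same ENDPOINT_NOT_FOUND, the tracking server in use does not expose the Model Registry API at all:
client = MlflowClient()

# Fetch the registered version directly; this hits the same
# model-versions endpoint that load_model needs.
version = client.get_model_version(name=model_name, version="4")
print("Status:", ModelVersionStatus.from_string(version.status))
print("Download URI:", client.get_model_version_download_uri(model_name, "4"))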

Azure ML: how to access logs of a failed Model deployment

I'm deploying a Keras model that is failing with the error below. The exception says that I can retrieve the logs by running print(service.get_logs()), but that gives me empty results. I am deploying the model from my Azure Notebook, using the same service variable to retrieve the logs.
Also, how can I retrieve the logs from the container instance? I'm deploying to an AKS compute cluster I created. Sadly, the docs link in the exception doesn't detail how to retrieve these logs either.
Error:
{ "code":
"KubernetesDeploymentFailed", "statusCode": 400, "message":
"Kubernetes Deployment failed", "details": [
{
"code": "CrashLoopBackOff",
"message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check
the logs for your container instance: my-model-service. From
the AML SDK, you can run print(service.get_logs()) if you have service
object to fetch the logs. \nYou can also try to run image
mlwks.azurecr.io/azureml/azureml_3c0c34b65cf18c8644e8d745943ab7d2:latest
locally. Please refer to http://aka.ms/debugimage#service-launch-fails
for more information."
} ] }
UPDATE
Here's my code to deploy the model:
environment = Environment('my-environment')
environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "azureml-dataprep[pandas,fuse]",
                  "tensorflow", "keras", "matplotlib"])
service_name = 'my-model-service'

# Remove any existing service under the same name.
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    pass

inference_config = InferenceConfig(entry_script='score.py', environment=environment)
comp = ComputeTarget(workspace=ws, name="ml-inference-dev")
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_target=comp)
service.wait_for_deployment(show_output=True)
And my score.py:
import joblib
import numpy as np
import os
import keras
from keras.models import load_model
from azureml.core.model import Model  # needed for Model.get_model_path below
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType

def init():
    global model
    model_path = Model.get_model_path('model.h5')
    model = load_model(model_path)

# The run() method is called each time a request is made to the scoring API.
#
# Shown here are the optional input_schema and output_schema decorators
# from the inference-schema pip package. Using these decorators on your
# run() method parses and validates the incoming payload against
# the example input you provide here. This will also generate a Swagger
# API document for your web service.
#@input_schema('data', NumpyParameterType(np.array([[0.1, 1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9, 9.0]])))
#@output_schema(NumpyParameterType(np.array([4429.929236457418])))
def run(data):
    return [123]  # test
Update 2:
Here is a screencap of the endpoint page. Is it normal for the CPU to be 0.1? Also, when I hit the Swagger URL in the browser, I get the error: "No ready replicas for service doc-classify-env-service".
Update 3
After finally getting to the container logs, it turned out score.py was choking on this error:
ModuleNotFoundError: No module named 'inference_schema'
I then ran a test that commented out the references to input_schema and output_schema and also simplified my pip_packages, and the REST endpoint came up! I was also able to get a prediction out of the model.
pip_packages=["azureml-defaults", "tensorflow", "keras"]
So my question is, how should I set up my pip_packages so the scoring file can use the inference_schema decorators? I'm assuming I need to include the azureml-sdk[automl] pip package, but when I do so, the image creation fails and I see several dependency conflicts.
Try retrieving your service from the workspace directly:
ws.webservices[service_name].get_logs()
Also, I found deploying an image as an endpoint to be easier than the inference config + Model.deploy route (depending on your use case):
my_image = Image(ws, name='test', version='26')
service = AksWebservice.deploy_from_image(ws, "test1", my_image, deployment_config, aks_target)
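On the remaining pip_packages question: inference-schema is published as a standalone pip package, so one option (an assumption, not verified against this exact image) is to list it directly instead of pulling in all of azureml-sdk[automl]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

environment = Environment('my-environment')
# The [numpy-support] extra provides the NumpyParameterType used in score.py.
environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults",
                  "inference-schema[numpy-support]",
                  "tensorflow",
                  "keras"])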

Unable to register an ONNX model in azure machine learning service workspace

I was trying to register an ONNX model to an Azure Machine Learning service workspace in two different ways, but I am getting errors I couldn't solve.
First method: via Jupyter Notebook and Python script:
model = Model.register(model_path = MODEL_FILENAME,
                       model_name = "MyONNXmodel",
                       tags = {"onnx": "V0"},
                       description = "test",
                       workspace = ws)
The error is: HttpOperationError: Operation returned an invalid status code 'Service invocation failed!Request: GET https://cert-westeurope.experiments.azureml.net/rp/workspaces'
Second method: via the Azure portal, where the upload fails with error 413 (screenshot not shown).
Can anyone help, please?
Error 413 means the payload is too large. Using the Azure portal, you can only upload a model up to 25 MB in size. Please use the Python SDK to upload models larger than 25 MB.
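For completeness, a minimal sketch of the SDK route for a model over 25 MB (the workspace config file is an assumption; the registration call mirrors the question's first method):
from azureml.core import Workspace
from azureml.core.model import Model

# from_config() reads a config.json downloaded from the portal; the SDK
# streams the upload, so the portal's 25 MB limit does not apply here.
ws = Workspace.from_config()

model = Model.register(workspace=ws,
                       model_path=MODEL_FILENAME,  # local path to the .onnx file
                       model_name="MyONNXmodel",
                       tags={"onnx": "V0"},
                       description="test")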
