Azure ML: how to access logs of a failed Model deployment

I'm deploying a Keras model that is failing with the error below. The exception says that I can retrieve the logs by running print(service.get_logs()), but that gives me empty results. I am deploying the model from my Azure Notebook, and I'm using the same "service" variable to retrieve the logs.
Also, how can I retrieve the logs from the container instance? I'm deploying to an AKS compute cluster I created. Sadly, the docs link in the exception also doesn't detail how to retrieve these logs.
More information can be found using '.get_logs()'
Error:
{
  "code": "KubernetesDeploymentFailed",
  "statusCode": 400,
  "message": "Kubernetes Deployment failed",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: my-model-service. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image mlwks.azurecr.io/azureml/azureml_3c0c34b65cf18c8644e8d745943ab7d2:latest locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information."
    }
  ]
}
UPDATE
Here's my code to deploy the model:
from azureml.core import Environment, Model
from azureml.core.compute import ComputeTarget
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice
from azureml.exceptions import WebserviceException

environment = Environment('my-environment')
environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "azureml-dataprep[pandas,fuse]",
                  "tensorflow", "keras", "matplotlib"])

service_name = 'my-model-service'

# Remove any existing service under the same name.
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    pass

inference_config = InferenceConfig(entry_script='score.py', environment=environment)
comp = ComputeTarget(workspace=ws, name="ml-inference-dev")

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_target=comp)
service.wait_for_deployment(show_output=True)
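Incidentally, one way to make failure logs easier to grab is to catch the deployment exception and print the logs while the service object is still in scope; a small sketch using the same objects as above:

try:
    service.wait_for_deployment(show_output=True)
except WebserviceException:
    # On a failed rollout the service object can still return
    # whatever container logs were collected so far.
    print(service.get_logs())
    raise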
And my score.py:
import os
import joblib
import numpy as np
import keras
from keras.models import load_model
from azureml.core.model import Model
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType

def init():
    global model
    model_path = Model.get_model_path('model.h5')
    model = load_model(model_path)

# The run() method is called each time a request is made to the scoring API.
#
# Shown here are the optional input_schema and output_schema decorators
# from the inference-schema pip package. Using these decorators on your
# run() method parses and validates the incoming payload against
# the example input you provide here. This will also generate a Swagger
# API document for your web service.
#@input_schema('data', NumpyParameterType(np.array([[0.1, 1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9, 9.0]])))
#@output_schema(NumpyParameterType(np.array([4429.929236457418])))
def run(data):
    return [123]  # test
Update 2:
Here is a screencap of the endpoint page. Is it normal for the CPU to be 0.1? Also, when I hit the Swagger URL in the browser, I get the error: "No ready replicas for service doc-classify-env-service"
Update 3
After finally getting to the container logs, it turns out it was choking on this error in my score.py:
ModuleNotFoundError: No module named 'inference_schema'
I then ran a test that commented out the references to input_schema and output_schema and also simplified my pip_packages, and the REST endpoint came up! I was also able to get a prediction out of the model.
pip_packages=["azureml-defaults","tensorflow", "keras"])
So my question is: how should I set up my pip_packages so the scoring file can use the inference_schema decorators? I'm assuming I need to include the azureml-sdk[automl] pip package, but when I do so, the image creation fails and I see several dependency conflicts.
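For what it's worth, inference-schema is published as its own pip package, so one option worth trying, rather than pulling in the whole azureml-sdk metapackage, is to list it explicitly. A sketch of the dependency list under that assumption:

environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults",
                  "inference-schema[numpy-support]",  # brings in the decorators used in score.py
                  "tensorflow",
                  "keras"])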

Try retrieving your service from the workspace directly:
ws.webservices[service_name].get_logs()
Also, I found deploying an image as an endpoint easier than the InferenceConfig + Model.deploy route (depending on your use case):
my_image = Image(ws, name='test', version='26')
service = AksWebservice.deploy_from_image(ws, "test1", my_image, deployment_config, aks_target)
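If get_logs() stays empty, and assuming you have kubectl credentials for the AKS cluster, you can also read the pod logs directly (AML services typically land in the azureml namespace; the resource group and cluster names below are placeholders):

az aks get-credentials --resource-group <rg> --name <aks-cluster-name>
kubectl get pods --namespace azureml
kubectl logs <pod-name> --namespace azureml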

Related

Azure : Error 404: AciDeploymentFailed / Error 400 ACI Service request failed

I am trying to deploy a machine learning model through an ACI (Azure Container Instances) service. I am working in Python and I followed the code from the official documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli
The entry script file is the following (score.py):
import os
import json
import dill
import joblib

def init():
    global model
    # Get the path where the deployed model can be found
    model_path = os.getenv('AZUREML_MODEL_DIR')
    # Load existing model
    model = joblib.load(os.path.join(model_path, 'model.pkl'))

# Handle request to the service
def run(data):
    try:
        # Pick out the text property of the JSON request
        # Expected JSON details {"text": "some text to evaluate"}
        data = json.loads(data)
        prediction = model.predict(data['text'])
        return prediction
    except Exception as e:
        error = str(e)
        return error
And the model deployment workflow is as follows:
from azureml.core import Workspace

# Connect to workspace
ws = Workspace(subscription_id="my-subscription-id",
               resource_group="my-resource-group-name",
               workspace_name="my-workspace-name")

from azureml.core.model import Model

model = Model.register(workspace=ws,
                       model_path='model.pkl',
                       model_name='my-model',
                       description='my-description')

from azureml.core.environment import Environment

# Name environment and call requirements file
# requirements: numpy, tensorflow
myenv = Environment.from_pip_requirements(name='myenv', file_path='requirements.txt')

from azureml.core.model import InferenceConfig

# Create inference configuration
inference_config = InferenceConfig(environment=myenv, entry_script='score.py')

from azureml.core.webservice import AciWebservice  # AksWebservice

# Set the virtual machine capabilities
deployment_config = AciWebservice.deploy_configuration(cpu_cores=0.5, memory_gb=3)

# Deploy ML model (Azure Container Instances)
service = Model.deploy(workspace=ws,
                       name='my-service-name',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)
I succeeded once with the previous code. I noticed that during the deployment, Model.deploy created a container registry with a specific name (6e07ce2cc4ac4838b42d35cda8d38616).
The problem:
The API was working well, and I wanted to deploy another model from scratch. I deleted the API service and the model from Azure ML Studio, and the container registry from the Azure resources.
Unfortunately, I am not able to deploy anything again.
Everything goes fine until the last step (the Model.deploy step), where I get the following error message:
Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 46243f9b-3833-4650-8d47-3ac54a39dc5e
More information can be found here: https://machinelearnin2812599115.blob.core.windows.net/azureml/ImageLogs/46245f8b-3833-4659-8d47-3ac54a39dc5e/build.log?sv=2019-07-07&sr=b&sig=45kgNS4sbSZrQH%2Fp29Rhxzb7qC5Nf1hJ%2BLbRDpXJolk%3D&st=2021-10-25T17%3A20%3A49Z&se=2021-10-27T01%3A24%3A49Z&sp=r
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 404,
  "message": "No definition exists for Environment with Name: myenv Version: Autosave_2021-10-25T17:24:43Z_b1d066bf Reason: Container registry 6e07ce2cc4ac4838b42d35cda8d38616.azurecr.io not found. If private link is enabled in workspace, please verify ACR is part of private link and retry..",
  "details": []
}
I do not understand why a new container registry was created the first time, but now the old one is being looked up (the message says the container registry named 6e07ce2cc4ac4838b42d35cda8d38616 is missing). I never found a way to force the creation of a new container registry resource in Python, nor to specify a name for it in AciWebservice.deploy_configuration or Model.deploy.
Could anyone help me move on with this? The best solution, I think, would be to completely remove the reference to the 6e07ce2cc4ac4838b42d35cda8d38616 container registry, but I can't find where that reference is set, so Model.deploy always fails to find it.
Another solution would be to force Model.deploy to generate a new container registry, but I couldn't find how to do that.
I've been on this for 2 days and I really need your help!
PS: I am not at all a DevOps/MLOps guy; I make data science and good models, but infrastructure and deployment is not really my thing, so please be gentle on this part! :-)
What I tried
Creating the container registry with the same name
I tried to create the container registry by hand, but this time it is the container that cannot be created. The Python output of Model.deploy is the following:
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-10-25 19:25:10+02:00 Creating Container Registry if not exists.
2021-10-25 19:25:10+02:00 Registering the environment.
2021-10-25 19:25:13+02:00 Building image..
2021-10-25 19:30:45+02:00 Generating deployment configuration.
2021-10-25 19:30:46+02:00 Submitting deployment to compute.
Failed
Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 93780de6-7662-40d8-ab9e-4e1556ef880f
Current sub-operation type not known, more logs unavailable.
Error:
{
  "code": "InaccessibleImage",
  "statusCode": 400,
  "message": "ACI Service request failed. Reason: The image '6e07ce2cc4ac4838b42d35cda8d38616.azurecr.io/azureml/azureml_684133370d8916c87f6230d213976ca5' in container group 'my-service-name-LM4HbqzEBEi0LTXNqNOGFQ' is not accessible. Please check the image and registry credential.. Refer to https://learn.microsoft.com/azure/container-registry/container-registry-authentication#admin-account and make sure Admin user is enabled for your container registry."
}
Setting admin user enabled
I tried to follow the recommendation in the last message saying to enable the admin user for the container registry. All I saw in the Azure interface is that a username and password appeared when enabling the admin user.
Unfortunately, the same error message appears again when I relaunch my code, and I am stuck here...
Changing the name of the environment and model
This does not produce any change. Same errors.
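One quick check that may help locate the stale reference: the container registry a workspace is bound to shows up in its details. A minimal sketch, assuming the v1 azureml-core API (the containerRegistry key is my assumption about the returned dict):

# Inspect which ACR the workspace currently references.
details = ws.get_details()
print(details.get('containerRegistry'))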
Your first attempt worked; after deleting the API service and model from Azure ML Studio and the container registry from the Azure resources, you are not able to redeploy.
My assumption is that on your first attempt you had already registered the model and environment. So when you try to re-register using the same model name while deploying, it gives you the error.
Thanks @anders swanson, your solution worked for me.
If you have already registered your env, myenv, and none of the details of your environment have changed, there is no need to re-register it with myenv.register(). You can simply get the already-registered env using Environment.get(), like so:
myenv = Environment.get(ws, name='myenv', version=11)
My suggestion is to give your environment a new name, e.g. "model_scoring_env". Register it once, then pass it to the InferenceConfig.
Refer here
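A minimal sketch of that suggestion, assuming the v1 azureml-core API (the name model_scoring_env is just an example):

from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig

# Build the environment once under a fresh name and register it
# in the workspace.
env = Environment.from_pip_requirements(name='model_scoring_env',
                                        file_path='requirements.txt')
env.register(workspace=ws)

# Later deployments fetch the registered environment by name.
env = Environment.get(ws, name='model_scoring_env')
inference_config = InferenceConfig(environment=env, entry_script='score.py')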

Azure function failing : "statusCode": 413, "message": "request entity too large"

I have a Python script that creates a CSV file and loads it into an Azure container. I want to run it as an Azure Function, but it fails once I deploy it to Azure.
The code works fine in Google Colab (code here minus the connection string). It also works fine when I run it locally as an Azure Function via the CLI (func start command).
Here is the code from the __init__.py file that is deployed:
import logging
import os
import tempfile
from datetime import datetime

import azure.functions as func
import numpy as np
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__

def main(mytimer: func.TimerRequest) -> None:
    logging.info('Python trigger function.')

    temp_path = tempfile.gettempdir()
    dateTimeObj = datetime.now()
    timestampStr = dateTimeObj.strftime("%d%b%Y%H%M%S")
    filename = f"{timestampStr}.csv"
    local_path = os.path.join(temp_path, filename)

    df = pd.DataFrame(np.random.randn(5, 3), columns=['Column1', 'Column2', 'Column3'])
    df.to_csv(local_path, index=False)

    blob = BlobClient.from_connection_string(
        conn_str="My connection string",
        container_name="container2",
        blob_name=filename)

    with open(local_path, "rb") as data:
        blob.upload_blob(data)
The deployment to Azure is successful, but the function fails in Azure. When I look in the Azure Functions portal, the output from the Code + Test menu says:
{
  "statusCode": 413,
  "message": "request entity too large"
}
The CSV file produced by the script is 327 B, so tiny. Besides that error message, I can't see any good information about what is causing the failure. Can anyone suggest a solution / way forward?
Here are the requirements file contents:
# Do not include azure-functions-worker as it may conflict with the Azure Functions platform
azure-functions
azure-storage-blob==12.8.1
logging
numpy==1.19.3
pandas==1.3.0
DateTime==4.3
backports.tempfile==1.0
Here is what I have tried.
This article suggests the issue is linked to CORS (Cross-Origin Resource Sharing). However, adding https://functions.azure.com to my CORS allowed domains in the Resource sharing menu in the storage settings of my storage account didn't solve the problem (I also republished the Azure Function).
Any help, or suggestions would be appreciated.
I tried reproducing the issue with all the information you provided and got the same 413 error as below:
Later, after adding the value * in CORS, we got rid of the issue. Below are the fixed screenshot and CORS values:
CORS: (screenshot)
Test/Run: (screenshot)
Alternatively, we can add the following domain names to avoid the error:
https://functions.azure.com
https://functions-staging.azure.com
This will also fix the issue.
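For anyone preferring the CLI to the portal, the same CORS entries can be added with the Azure CLI; a small sketch, with placeholder app and resource group names:

az functionapp cors add --name <my-function-app> --resource-group <my-resource-group> --allowed-origins https://functions.azure.com https://functions-staging.azure.com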

MLFlow Model Registry ENDPOINT_NOT_FOUND: No API found for ERROR

I'm currently using MLflow in Azure Databricks and trying to load a model from the Model Registry. I'm currently referencing the version, but I will want to reference the stage 'Production' (I get the same error when referencing the stage as well).
I keep encountering this error:
ENDPOINT_NOT_FOUND: No API found for 'POST /mlflow/model-versions/get-download-uri'
My artifacts are stored in the DBFS FileStore.
I have not been able to identify why this is happening.
Code:
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
import mlflow.pyfunc

model_name = "model_name"

model_version_uri = "models:/{model_name}/4".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_version_uri))
model_version_4 = mlflow.pyfunc.load_model(model_version_uri)

model_production_uri = "models:/{model_name}/production".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_production_uri))
model_production = mlflow.pyfunc.load_model(model_production_uri)

How can I deploy a re-trained Sagemaker model to an endpoint?

With a sagemaker.estimator.Estimator, I want to re-deploy a model after retraining (calling fit with new data).
When I call this
estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
I get an error
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Cannot create already existing endpoint "arn:aws:sagemaker:eu-east-1:1776401913911:endpoint/zyx".
Apparently I want to use functionality like UpdateEndpoint. How do I access that functionality from this API?
Yes, under the hood model.deploy creates a model, an endpoint configuration and an endpoint. When you call the method again on an already-deployed, trained estimator, it will raise an error because a similarly-configured endpoint is already deployed. What I encourage you to try:
Use the update_endpoint=True parameter. From the SageMaker SDK doc:
"Additionally, it is possible to deploy a different endpoint configuration, which links to your model, to an already existing SageMaker endpoint. This can be done by specifying the existing endpoint name for the endpoint_name parameter along with the update_endpoint parameter as True within your deploy() call."
Alternatively, if you want to create a separate model, you can specify a new model_name in your deploy() call.
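A minimal sketch of that first option, assuming a pre-v2 SageMaker Python SDK (the update_endpoint flag was removed in v2) and an existing endpoint named 'zyx':

# Re-deploy the retrained estimator onto the existing endpoint
# instead of creating a new one.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name='zyx',      # the already-existing endpoint
    update_endpoint=True)     # update it rather than create it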
update_endpoint has since been deprecated, AFAIK. To re-create the UpdateEndpoint functionality from this API and deploy a newly fitted training job to an existing endpoint, we can do something like this (this example uses the SageMaker SKLearn API):

import logging
from datetime import datetime

import boto3
from sagemaker.sklearn.estimator import SKLearn

logger = logging.getLogger(__name__)

sklearn_estimator = SKLearn(
    entry_point='model.py',
    instance_type=<instance_type>,
    framework_version=<framework_version>,
    role=<role>,
    dependencies=[
        # comma-separated names of files
    ],
    hyperparameters={
        'key_1': value,
        'key_2': value,
        # ...
    }
)
sklearn_estimator.fit()

sm_client = boto3.client('sagemaker')

# Create the model
sklearn_model = sklearn_estimator.create_model()

# Define an endpoint config and an endpoint
endpoint_config_name = 'endpoint-' + datetime.utcnow().strftime("%Y%m%d%H%M%S")
current_endpoint = endpoint_config_name

# From the model: create the endpoint config and a temporary endpoint
sklearn_model.deploy(
    initial_instance_count=<count>,
    instance_type=<instance_type>,
    endpoint_name=current_endpoint
)

# Update the existing endpoint if it exists, or create a new one
try:
    sm_client.update_endpoint(
        EndpointName=DESIRED_ENDPOINT_NAME,  # the prod/existing endpoint name
        EndpointConfigName=endpoint_config_name
    )
except Exception:
    try:
        sm_client.create_endpoint(
            EndpointName=DESIRED_ENDPOINT_NAME,  # the prod endpoint name
            EndpointConfigName=endpoint_config_name
        )
    except Exception as e:
        logger.info(e)

# Clean up the temporary endpoint
sm_client.delete_endpoint(EndpointName=current_endpoint)
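One caveat with this pattern: update_endpoint returns before the endpoint has finished updating, so deleting the temporary endpoint immediately can be racy. A hedged addition, using the standard boto3 waiter:

# Wait until the production endpoint is back in service
# before tearing down the temporary one.
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=DESIRED_ENDPOINT_NAME)
sm_client.delete_endpoint(EndpointName=current_endpoint)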

Cannot deploy a trained model to an existing AKS compute target

I have a model that was trained on a Machine Learning Compute in Azure Machine Learning service. The registered model already lives in my workspace and I would like to deploy it to a pre-existing AKS instance that I previously provisioned in my workspace. I am able to successfully configure and register the container image:
# retrieve cloud representations of the models
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)

<azureml.core.model.Model object at 0x7f66ab3b1f98>
<azureml.core.model.Model object at 0x7f66ab7e49b0>
<azureml.core.model.Model object at 0x7f66ab85e710>

package_list = [
    'category-encoders==1.3.0',
    'numpy==1.15.0',
    'pandas==0.24.1',
    'scikit-learn==0.20.2']

# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)
conda_yml = os.path.join(os.getcwd(), 'myenv.yml')
with open(conda_yml, "w") as f:
    f.write(myenv.serialize_to_string())
Configuring and registering the image works:
# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py',
                                                  runtime='python',
                                                  conda_file='myenv.yml',
                                                  description='Pumps Random Forest model')

# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name=Config.IMAGE_NAME,
                              models=[rf, le, ohc],
                              image_config=image_config,
                              workspace=ws)

Creating image
Running....................
SucceededImage creation operation finished for image pumpsrfimage:2, operation "Succeeded"
Attaching to an existing cluster also works:
# Attach the cluster to your workspace
attach_config = AksCompute.attach_configuration(resource_group=Config.RESOURCE_GROUP,
                                                cluster_name=Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws,
                                  name=Config.DEPLOY_COMPUTE,
                                  attach_configuration=attach_config)

# Wait for the operation to complete
aks_target.wait_for_completion(True)

SucceededProvisioning operation finished, operation "Succeeded"
However, when I try to deploy the image to the existing cluster, it fails with a WebserviceException:
# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()

# Deploy from image
service = Webservice.deploy_from_image(workspace=ws,
                                       name='pumps-aks-service-1',
                                       image=image,
                                       deployment_config=aks_config,
                                       deployment_target=aks_target)

# Wait for the deployment to complete
service.wait_for_deployment(show_output=True)
print(service.state)

WebserviceException: Unable to create service with image pumpsrfimage:1 in non "Succeeded" creation state.
---------------------------------------------------------------------------
WebserviceException                       Traceback (most recent call last)
<command-201219424688503> in <module>()
      7                          image = image,
      8                          deployment_config = aks_config,
----> 9                          deployment_target = aks_target)
     10 # Wait for the deployment to complete
     11 service.wait_for_deployment(show_output = True)

/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/webservice.py in deploy_from_image(workspace, name, image, deployment_config, deployment_target)
    284             return child._deploy(workspace, name, image, deployment_config, deployment_target)
    285
--> 286     return deployment_config._webservice_type._deploy(workspace, name, image, deployment_config, deployment_target)
    287
    288     @staticmethod

/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/aks.py in _deploy(workspace, name, image, deployment_config, deployment_target)
Any ideas on how to solve this issue? I am writing the code in a Databricks notebook. Also, I am able to create and deploy the cluster using the Azure Portal with no problem, so this appears to be an issue with my code, the Python SDK, or the way Databricks works with AMLS.
UPDATE:
I was able to deploy my image to AKS using Azure Portal and the webservice works as expected. This means the issue lies somewhere between Databricks, the Azureml Python SDK and Machine Learning Service.
UPDATE 2:
I'm working with Microsoft to fix this issue. Will report back once we have a solution.
In my initial code, when creating the image, I was not calling:
image.wait_for_creation(show_output=True)
As a consequence, I was calling CreateImage and DeployImage before the image was created, which errored out. Can't believe it was that simple...
UPDATED IMAGE CREATION SNIPPET:
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name=Config.IMAGE_NAME,
                              models=[rf, le, ohc],
                              image_config=image_config,
                              workspace=ws)
image.wait_for_creation(show_output=True)
From personal experience, I would say the error message you see might suggest there is some error in the script inside the image. Such errors don't necessarily prevent the image from being created successfully, but they might prevent the image from being used in a service. However, if you have successfully deployed the image in other services, you should be able to rule out this option.
You can follow this guide for more information on how to debug the Docker image locally, as well as finding logs and other useful information.
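In the same spirit, the SDK can run the scoring container as a local Docker web service, which makes init()/run() errors visible immediately. A minimal sketch, assuming azureml-core, a local Docker daemon, and an InferenceConfig-style setup (inference_config here is a hypothetical stand-in for your scoring configuration):

from azureml.core.model import Model
from azureml.core.webservice import LocalWebservice

# Deploy the model to a local Docker container for debugging.
local_config = LocalWebservice.deploy_configuration(port=8890)
local_service = Model.deploy(workspace=ws,
                             name='pumps-local-debug',
                             models=[rf, le, ohc],
                             inference_config=inference_config,  # hypothetical scoring config
                             deployment_config=local_config)
local_service.wait_for_deployment(show_output=True)
print(local_service.get_logs())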
Agreed with Arvid's answer. Were you able to run it successfully? You can also try deploying it to ACI; if the problem is in score.py you'd hit the same issue, but it's quick to try. Also, a bit more painful, but if you want to debug the deployment, you can expose TCP port 5678 on your local Docker deployment and use VS Code and PTVSD to connect to it and debug step by step.
