POST request fails with large data sent to model deployed on Azure Container - azure-machine-learning-service

Summary
I have a PyTorch model deployed on an Azure Container Instance via the Azure Machine Learning Service SDK. The model takes (large) images for classification, passed as standard NumPy arrays.
It seems I'm hitting an HTTP request size limit on the server side. Requests to the model succeed with PNG images in the 8-9 MB range and fail with images of 15 MB or more, specifically with 413 Request Entity Too Large.
I assume the limit is set in Nginx in the Docker image built as part of the deployment process. My question: given that the issue is the HTTP request size limit, is there any way to increase this limit via the azureml API?
Deployment process
The deployment process succeeds as expected.
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice, Webservice
from azureml.exceptions import WebserviceException
from pathlib import Path

PATH = Path('/data/home/azureuser/my_project')

ws = Workspace.from_config()
model = ws.models['my_pytorch_model']

inference_config = InferenceConfig(source_directory=PATH/'src',
                                   runtime='python',
                                   entry_script='deployment/scoring/scoring.py',
                                   conda_file='deployment/environment/env.yml')

deployment_config = AciWebservice.deploy_configuration(cpu_cores=2, memory_gb=4)

aci_service_name = 'azure-model'

try:
    service = Webservice(ws, name=aci_service_name)
    if service:
        service.delete()
except WebserviceException as e:
    print()

service = Model.deploy(ws, aci_service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(True)
print(service.state)
Testing via requests
A simple test using requests:
import os
import json

import numpy as np
import requests
from PIL import Image as PilImage

test_data = np.array(PilImage.open(PATH/'src/deployment/test/test_image.png')).tolist()
test_sample = json.dumps({'raw_data': test_data})
test_sample_encoded = bytes(test_sample, encoding='utf8')

headers = {'Content-Type': 'application/json'}

response = requests.post(
    service.scoring_uri,
    data=test_sample_encoded,
    headers=headers,
    verify=True,
    timeout=10
)
For larger files, this produces the following error in requests:
ConnectionError: ('Connection aborted.', BrokenPipeError(32, 'Broken pipe'))
This is, I believe, a known behaviour in requests when the server closes the connection before the data upload has completed.
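As a quick sanity check, I can print the size of the encoded payload before POSTing (a small addition of my own, assuming the test_sample_encoded variable from the snippet above):

# Rough size check on the JSON payload before sending it
payload_mb = len(test_sample_encoded) / (1024 * 1024)
print(f"Encoded payload size: {payload_mb:.1f} MB")
if payload_mb > 10:
    print("Payload exceeds ~10 MB; roughly where the 413 responses start.")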
Testing via pycurl
Using the curl wrapper, I get a more interpretable response.
import pycurl
from io import BytesIO
c = pycurl.Curl()
b = BytesIO()
c.setopt(c.URL, service.scoring_uri)
c.setopt(c.POST, True)
c.setopt(c.HTTPHEADER,['Content-Type: application/json'])
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(c.POSTFIELDS, test_sample)
c.setopt(c.VERBOSE, True)
c.perform()
out = b.getvalue()
b.close()
c.close()
print(out)
For large files, this yields the following error:
<html>
<head><title>413 Request Entity Too Large</title></head>
<body bgcolor="white">
<center><h1>413 Request Entity Too Large</h1></center>
<hr><center>nginx/1.10.3 (Ubuntu)</center>
</body>
</html>
This leads me to believe that the issue lies in the Nginx configuration. Specifically, I guess that client_max_body_size is set to around 10 MB.
Question summarised
Given that I am indeed hitting an issue with the Nginx configuration, can I change it somehow? If not through the Azure Machine Learning Service SDK, then maybe by overwriting the /etc/nginx/nginx.conf file?

In the built images, client_max_body_size in the nginx configuration is set to 100m. Changing the nginx configuration is not supported.

For scoring large data, you might want to look into a different architecture. Passing large amounts of raw data in the HTTP request body is not the most efficient way.
For example, you can build a scoring pipeline. See this notebook for an example: Pipeline Style Transfer.
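A minimal sketch of what such a batch-scoring pipeline could look like (an illustration only, not taken from the linked notebook; the compute target 'cpu-cluster' and the script batch_score.py are placeholders):

# Hypothetical batch-scoring pipeline; 'cpu-cluster' and batch_score.py are placeholders.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

score_step = PythonScriptStep(
    name="batch_score",
    script_name="batch_score.py",
    source_directory="src",
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[score_step])
run = pipeline.submit(experiment_name="batch-scoring")
run.wait_for_completion(show_output=True)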
Alternatively, if you want to use the ACI web service, you could stage the images in a blob store and then pass them by reference to the blob location.
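A rough sketch of that approach, assuming an azure-storage-blob client and a hypothetical 'image_url' field that the scoring script would download from (the connection string, container name and field name are placeholders; service is the deployed webservice from the question):

# Stage the image in blob storage and send only its URL to the scoring endpoint.
import json
import requests
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection string>",
    container_name="scoring-images",
    blob_name="test_image.png",
)
with open("test_image.png", "rb") as f:
    blob.upload_blob(f, overwrite=True)

response = requests.post(
    service.scoring_uri,
    data=json.dumps({"image_url": blob.url}),
    headers={"Content-Type": "application/json"},
)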

Related

Azure function failing : "statusCode": 413, "message": "request entity too large"

I have a Python script that creates a CSV file and loads it into an Azure container. I want to run it as an Azure Function, but it fails once I deploy it to Azure.
The code works fine in Google Colab (code here minus the connection string). It also works fine when I run it locally as an Azure Function via the CLI (func start command).
Here is the code from the __init__.py file that is deployed:
import logging
import os
import tempfile
from datetime import datetime

import azure.functions as func
import numpy as np
import pandas as pd
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__


def main(mytimer: func.TimerRequest) -> None:
    logging.info('Python trigger function.')

    temp_path = tempfile.gettempdir()
    dateTimeObj = datetime.now()
    timestampStr = dateTimeObj.strftime("%d%b%Y%H%M%S")
    filename = f"{timestampStr}.csv"
    file_path = os.path.join(temp_path, filename)  # join directory and file name explicitly

    df = pd.DataFrame(np.random.randn(5, 3), columns=['Column1', 'Column2', 'Column3'])
    df.to_csv(file_path, index=False)

    blob = BlobClient.from_connection_string(
        conn_str="My connection string",
        container_name="container2",
        blob_name=filename)

    with open(file_path, "rb") as data:
        blob.upload_blob(data)
The deployment to Azure is successful, but the function fails in Azure. When I look in the Azure Functions portal, the output from the Code + Test menu says:
{
"statusCode": 413,
"message": "request entity too large"
}
The CSV file produced by the script is 327 B, so tiny. Beyond that error message, I can't see any good information on what is causing the failure. Can anyone suggest a solution / way forward?
Here are the contents of the requirements file:
# Do not include azure-functions-worker as it may conflict with the Azure Functions platform
azure-functions
azure-storage-blob==12.8.1
logging
numpy==1.19.3
pandas==1.3.0
DateTime==4.3
backports.tempfile==1.0
azure-storage-blob==12.8.1
Here is what I have tried.
This article suggests the issue is linked to CORS (Cross-Origin Resource Sharing). However, adding https://functions.azure.com to the CORS allowed domains in the Resource sharing menu in the storage settings of my storage account didn't solve the problem (I also republished the Azure Function).
Any help, or suggestions would be appreciated.
I tried reproducing the issue with all the information you have provided and got the same 404 error, as below.
Later, after adding the value * in CORS, we can get rid of this issue; below are the fixed screenshot and CORS values:
CORS:
Test/Run:
Also, we can add the following domain names to avoid the 404 error:
https://functions.azure.com
https://functions-staging.azure.com
This will also fix the issue.

MLFlow Model Registry ENDPOINT_NOT_FOUND: No API found for ERROR

I'm currently using MLflow in Azure Databricks and trying to load a model from the Model Registry. I'm currently referencing the version, but will eventually want to reference the stage 'Production' (I get the same error when referencing the stage as well).
I keep encountering an error:
ENDPOINT_NOT_FOUND: No API found for 'POST /mlflow/model-versions/get-download-uri'
My artifacts are stored in the DBFS FileStore.
I have not been able to identify why this is happening.
Code:
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
import mlflow.pyfunc
model_name = "model_name"
model_version_uri = "models:/{model_name}/4".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_version_uri))
model_version_4 = mlflow.pyfunc.load_model(model_version_uri)
model_production_uri = "models:/{model_name}/production".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_production_uri))
model_production = mlflow.pyfunc.load_model(model_production_uri)
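One diagnostic worth trying (a suggestion, not part of the original question) is to call the registry API named in the error directly and print the URIs the client is actually configured with:

# Exercise the call behind 'POST /mlflow/model-versions/get-download-uri' directly;
# 'model_name' and version 4 are the values used above.
import mlflow
from mlflow.tracking.client import MlflowClient

print("Tracking URI:", mlflow.get_tracking_uri())
print("Registry URI:", mlflow.get_registry_uri())

client = MlflowClient()
print(client.get_model_version_download_uri("model_name", "4"))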

Lambda function gets stuck when calling RDS via SQLAlchemy URI

I have a FastAPI application. Initially, I was passing my DB URI via an ngrok tunnel, like this, in my SAM template. In this setup, Lambda uses my local machine's PostgreSQL DB.
DbConnnectionString:
  Type: String
  Default: postgresql://<uname>:<pwd>@x.tcp.ngrok.io:PORT/DB
This is how I read the URI in my Python code
# config.py
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DATABASE_URL = os.environ.get('DB_URI')

db_engine = create_engine(DATABASE_URL)
db_session = sessionmaker(autocommit=False, autoflush=False, bind=db_engine)

print(f"Configs initialized for {API_V1_STR}")  # API_V1_STR is defined elsewhere in the config
# app.py
# 3rd party
from fastapi import FastAPI

# Custom
from config.app_config import PROJECT_NAME, db_engine
from models.db_models import Base

print("Creating all database")
Base.metadata.create_all(bind=db_engine)

app = FastAPI(title=PROJECT_NAME)
print("APP created")
In this setup, everything seems to work as expected.
But whenever I replace the DB URL with the RDS DB, the call gets stuck at the "Creating all database" step, as shown in the image below. When this happens, the Lambda always times out and throws exceptions.
If I run the code locally using uvicorn, this error doesn't occur; everything works as expected.
When I use sam local invoke, even with the RDS URL, the API call works without any issues.
This problem occurs only when deployed in AWS Lambda.
I notice that the configs are initialized twice in this setup, once before START RequestId and once after.
I have tried reading up on it, but it is not clear what I could do to fix this. Any help would be much appreciated.
It was my bad! I didn't pay attention to security groups. It was a connection timeout all along. Once I fixed the port access in the security groups, the Lambda started working as expected.
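For reference, a sketch of the kind of rule that was missing, using boto3 (the group IDs are placeholders; the same change can be made in the console):

# Allow the Lambda's security group to reach PostgreSQL (port 5432) on the
# security group attached to the RDS instance. Both group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # RDS instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],  # Lambda's security group
    }],
)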

Azure ML: how to access logs of a failed Model deployment

I'm deploying a Keras model that is failing with the error below. The exception says that I can retrieve the logs by running print(service.get_logs()), but that's giving me empty results. I am deploying the model from my Azure Notebook and I'm using the same "service" variable to retrieve the logs.
Also, how can I retrieve the logs from the container instance? I'm deploying to an AKS compute cluster I created. Sadly, the docs link in the exception also doesn't detail how to retrieve these logs.
More information can be found using '.get_logs()' Error:
{
  "code": "KubernetesDeploymentFailed",
  "statusCode": 400,
  "message": "Kubernetes Deployment failed",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.
                  Please check the logs for your container instance: my-model-service.
                  From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.
                  You can also try to run image mlwks.azurecr.io/azureml/azureml_3c0c34b65cf18c8644e8d745943ab7d2:latest locally.
                  Please refer to http://aka.ms/debugimage#service-launch-fails for more information."
    }
  ]
}
UPDATE
Here's my code to deploy the model:
environment = Environment('my-environment')
environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "azureml-dataprep[pandas,fuse]",
                  "tensorflow", "keras", "matplotlib"])

service_name = 'my-model-service'

# Remove any existing service under the same name.
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    pass

inference_config = InferenceConfig(entry_script='score.py', environment=environment)

comp = ComputeTarget(workspace=ws, name="ml-inference-dev")

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_target=comp)

service.wait_for_deployment(show_output=True)
And my score.py
import joblib
import numpy as np
import os
import keras
from keras.models import load_model
from azureml.core.model import Model  # needed for Model.get_model_path
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType


def init():
    global model
    model_path = Model.get_model_path('model.h5')
    model = load_model(model_path)
    model = keras.models.load_model(model_path)


# The run() method is called each time a request is made to the scoring API.
#
# Shown here are the optional input_schema and output_schema decorators
# from the inference-schema pip package. Using these decorators on your
# run() method parses and validates the incoming payload against
# the example input you provide here. This will also generate a Swagger
# API document for your web service.
# @input_schema('data', NumpyParameterType(np.array([[0.1, 1.2, 2.3, 3.4, 4.5, 5.6, 6.7, 7.8, 8.9, 9.0]])))
# @output_schema(NumpyParameterType(np.array([4429.929236457418])))
def run(data):
    return [123]  # test
Update 2:
Here is a screencap of the endpoint page. Is it normal for the CPU to be .1? Also, when I hit the Swagger URL in the browser, I get the error: "No ready replicas for service doc-classify-env-service".
Update 3
After finally getting to the container logs, it turns out that it was choking with this error in my score.py:
ModuleNotFoundError: No module named 'inference_schema'
I then ran a test that commented out the references to input_schema and output_schema and also simplified my pip_packages, and the REST endpoint came up! I was also able to get a prediction out of the model.
pip_packages=["azureml-defaults","tensorflow", "keras"])
So my question is, how should I set up my pip_packages for the scoring file to use the inference_schema decorators? I'm assuming I need to include the azureml-sdk[automl] pip package, but when I do so, the image creation fails and I see several dependency conflicts.
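One option worth trying, rather than the full azureml-sdk[automl] meta-package, is to add the inference-schema package itself; this is a guess based on the missing module name, not something verified against this exact environment:

# Pull in inference-schema directly so the decorators and NumpyParameterType
# used in score.py are available in the image.
environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=[
        "azureml-defaults",
        "inference-schema[numpy-support]",
        "tensorflow",
        "keras",
    ]
)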
Try retrieving your service from the workspace directly:
ws.webservices[service_name].get_logs()
Also, I found deploying an image as an endpoint to be easier than InferenceConfig + Model.deploy (depending on your use case):
my_image = Image(ws, name='test', version='26')
service = AksWebservice.deploy_from_image(ws, "test1", my_image, deployment_config, aks_target)
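For completeness, deployment_config and aks_target are not defined in the snippet above; one plausible setup (the cluster name is taken from the question and may differ) would be:

# Hypothetical setup for the deploy_from_image call above.
from azureml.core.compute import AksCompute
from azureml.core.webservice import AksWebservice

aks_target = AksCompute(ws, "ml-inference-dev")
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)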

Crashing of a Cloud Functions HTTP Trigger

I have written an HTTP trigger that takes the ECG database name and record number as arguments, reads the record from Cloud Storage, calculates the parameters, and writes them to Firestore. I observed a very strange thing: the code crashes without indicating the reason in the console. This is all I get in the console:
Function execution took 58017 ms, finished with status: 'crash'
It generally stops when it reads the record from Cloud Storage. I am using the MIT-BIH records.
from google.cloud import storage
from flask import escape
import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore
import numpy as np
import os
from pathlib import Path
from os import listdir
from os.path import isfile, join
from random import randint
import wfdb                   # missing in the original snippet
from wfdb import processing   # missing in the original snippet


def GCFname(request):
    recordno = request.args['recordno']
    database = request.args['database']

    client = storage.Client()
    bucket = client.get_bucket('bucket_name')

    # it crashes here
    record = wfdb.rdrecord(recordno, channels=[0], pb_dir='mitdb')
    sig = record.p_signal[:, 0]
    test_qrs = processing.gqrs_detect(record.p_signal[:, 0], fs=record.fs)
    ann_test = wfdb.rdann(recordno, 'atr', pb_dir='mitdb')

    ## Calculate Parameters

    cred = credentials.ApplicationDefault()
    firebase_admin.initialize_app(cred, {
        'projectId': 'project_name',
    })
    db = firestore.client()

    doc_ref = db.collection('xyz').document(database).collection('abc').document(recordno)
    doc_ref.set({
        u'fieldname': fieldvalue
    })
I have deployed it using gcloud:
gcloud functions deploy GCFname --runtime python37 --trigger-http --allow-unauthenticated --timeout 540s
But when using the same URL after some time, it works. What could be the reason for this? It's definitely not a timeout issue.
It would be difficult to see why the Cloud Function is crashing without the logs, as it can be due to many reasons. Currently, there is a bug open for Cloud Functions crashing before flushing log entries or writing any error trace to Stackdriver. You can follow the issue tracker here for any updates: https://issuetracker.google.com/155215191
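Until that is resolved, one workaround sketch (a suggestion, not from the original answer) is to wrap the step that appears to crash in a try/except and log the exception explicitly, so at least some trace reaches the logs before the instance dies:

# Log failures around the record read before re-raising.
import logging
import wfdb

def read_record(recordno):
    try:
        return wfdb.rdrecord(recordno, channels=[0], pb_dir='mitdb')
    except Exception:
        logging.exception("Failed to read record %s from PhysioNet", recordno)
        raise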
