Azure-ML Deployment does NOT see AzureML Environment (wrong version number) - azure

I've followed the documentation pretty well as outlined here.
I've set up my Azure Machine Learning environment the following way:
from azureml.core import Workspace
# Connect to the workspace
ws = Workspace.from_config()
from azureml.core import Environment
from azureml.core import ContainerRegistry
myenv = Environment(name = "myenv")
myenv.inferencing_stack_version = "latest" # This will install the inference specific apt packages.
# Docker
myenv.docker.enabled = True
myenv.docker.base_image_registry.address = "myazureregistry.azurecr.io"
myenv.docker.base_image_registry.username = "myusername"
myenv.docker.base_image_registry.password = "mypassword"
myenv.docker.base_image = "4fb3..."
myenv.docker.arguments = None
# Environment variables (I need Python to look in these folders)
myenv.environment_variables = {"PYTHONPATH":"/root"}
# python
myenv.python.user_managed_dependencies = True
myenv.python.interpreter_path = "/opt/miniconda/envs/myenv/bin/python"
from azureml.core.conda_dependencies import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-defaults")
myenv.python.conda_dependencies=conda_dep
myenv.register(workspace=ws) # works!
I have a score.py file configured for inference (not relevant to the problem I'm having)...
I then set up the inference configuration:
from azureml.core.model import InferenceConfig
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
I set up my compute cluster:
from azureml.core.compute import ComputeTarget, AksCompute
from azureml.exceptions import ComputeTargetException
# Choose a name for your cluster
aks_name = "theclustername"
# Check to see if the cluster already exists
try:
    aks_target = ComputeTarget(workspace=ws, name=aks_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    prov_config = AksCompute.provisioning_configuration(vm_size="Standard_NC6_Promo")
    aks_target = ComputeTarget.create(workspace=ws, name=aks_name, provisioning_configuration=prov_config)
    aks_target.wait_for_completion(show_output=True)
from azureml.core.webservice import AksWebservice
# Example
gpu_aks_config = AksWebservice.deploy_configuration(autoscale_enabled=False,
                                                    num_replicas=3,
                                                    cpu_cores=4,
                                                    memory_gb=10)
Everything succeeds; then I try and deploy the model for inference:
from azureml.core.model import Model
model = Model(ws, name="thenameofmymodel")
# Name of the web service that is deployed
aks_service_name = 'tryingtodeply'
# Deploy the model
aks_service = Model.deploy(ws,
                           aks_service_name,
                           models=[model],
                           inference_config=inference_config,
                           deployment_config=gpu_aks_config,
                           deployment_target=aks_target,
                           overwrite=True)
aks_service.wait_for_deployment(show_output=True)
print(aks_service.state)
And it fails, saying that it can't find the environment. More specifically, my environment is at version 11, but the deployment keeps trying to find an environment one version higher (i.e., version 12) than the current environment:
Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: 0f03a025-3407-4dc1-9922-a53cc27267d4
More information can be found here:
Error:
{
  "code": "BadRequest",
  "statusCode": 400,
  "message": "The request is invalid",
  "details": [
    {
      "code": "EnvironmentDetailsFetchFailedUserError",
      "message": "Failed to fetch details for Environment with Name: myenv Version: 12."
    }
  ]
}
I have tried to manually edit the environment JSON to match the version that azureml is trying to fetch, but nothing works. Can anyone see anything wrong with this code?
Update
Changing the name of the environment (e.g., my_inference_env) and passing it to InferenceConfig seems to be on the right track. However, the error now changes to the following:
Running..........
Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: f0dfc13b-6fb6-494b-91a7-de42b9384692
More information can be found here: https://some_long_http_address_that_leads_to_nothing
Error:
{
  "code": "DeploymentFailed",
  "statusCode": 404,
  "message": "Deployment not found"
}
Solution
The answer from Anders below is indeed correct regarding the use of Azure ML environments. However, the last error I was getting was because I was setting the container image using the digest value (a SHA) and NOT the image name and tag (e.g., imagename:tag). Note this line of code in the first block:
myenv.docker.base_image = "4fb3..."
I reference the digest value, but it should be changed to
myenv.docker.base_image = "imagename:tag"
Once I made that change, the deployment succeeded! :)

One concept that took me a while to get was the bifurcation of registering and using an Azure ML Environment. If you have already registered your env, myenv, and none of the details of your environment have changed, there is no need to re-register it with myenv.register(). You can simply get the already registered env using Environment.get() like so:
myenv = Environment.get(ws, name='myenv', version=11)
My recommendation would be to name your environment something new: like "model_scoring_env". Register it once, then pass it to the InferenceConfig.
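To make the register/get distinction concrete: each register() call creates a new immutable version, and a deployment that references a version which was never registered fails just like the error above. The following is a toy, stdlib-only sketch of that versioning pattern (an illustration only, not the real azureml SDK):

```python
# Toy model of AzureML's immutable environment versioning.
class EnvRegistry:
    def __init__(self):
        self._versions = {}  # name -> latest registered version number

    def register(self, name):
        """Each register() creates a new version, like myenv.register(ws)."""
        self._versions[name] = self._versions.get(name, 0) + 1
        return self._versions[name]

    def get(self, name, version):
        """Exact-version lookup, like Environment.get(ws, name=..., version=...)."""
        if version > self._versions.get(name, 0):
            raise LookupError(f"Failed to fetch Environment {name} Version: {version}")
        return (name, version)

registry = EnvRegistry()
for _ in range(11):
    registry.register("myenv")   # after 11 registrations, latest is version 11

print(registry.get("myenv", 11))  # fine: ('myenv', 11)
# registry.get("myenv", 12) would raise, mirroring the deployment error
```

Registering once and then always fetching with get() keeps the version the deployment references in sync with what is actually stored.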

Related

Input format for Tensorflow models on GCP AI Platform

I have uploaded a model to GCP AI Platform Models. It's a simple Keras multistep model with 5 features, trained on 168 lagged values. When I try to test the model, I get this strange error message:
"error": "Prediction failed: Error during model execution: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.FAILED_PRECONDITION\n\tdetails = \"Error while reading resource variable dense_7/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_7/bias)\n\t [[{{node model_2/dense_7/BiasAdd/ReadVariableOp}}]]\"\n\tdebug_error_string = \"{\"created\":\"#1618946146.138507164\",\"description\":\"Error received from peer ipv4:127.0.0.1:8081\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1061,\"grpc_message\":\"Error while reading resource variable dense_7/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_7/bias)\\n\\t [[{{node model_2/dense_7/BiasAdd/ReadVariableOp}}]]\",\"grpc_status\":9}\"\n>"
The input is in the following format: a list of shape (1, 168, 5).
See the example below:
{
"instances":
[[[ 3.10978284e-01, 2.94650396e-01, 8.83664149e-01,
1.60210423e+00, -1.47402699e+00],
[ 3.10978284e-01, 2.94650396e-01, 5.23466315e-01,
1.60210423e+00, -1.47402699e+00],
[ 8.68576328e-01, 7.78699823e-01, 2.83334426e-01,
1.60210423e+00, -1.47402699e+00]]]
}
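Before posting the request, it can help to verify that the nested-list dimensions actually match the (1, 168, 5) shape the model expects. A minimal sketch using only the standard library (the zero values are placeholders, not real feature data):

```python
import json

# Build a dummy payload with shape (1, 168, 5): one window of 168 time steps x 5 features.
window = [[0.0] * 5 for _ in range(168)]
payload = {"instances": [window]}

# Check the nested-list dimensions before sending the request.
assert len(payload["instances"]) == 1          # batch of one window
assert len(payload["instances"][0]) == 168     # 168 lagged time steps
assert len(payload["instances"][0][0]) == 5    # 5 features per step

body = json.dumps(payload)                     # JSON request body for the predict call
```

If these checks pass and the service still fails, the problem is likely on the model-export side rather than in the request shape.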

Unable to import google logging metric using terraform

I have created the following logging metric resource in Terraform:
resource "google_logging_metric" "proservices_run" {
  name    = "user/proservices-run"
  filter  = "resource.type=gae_app AND severity>=ERROR"
  project = "${google_project.service.project_id}"
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
I also have a custom metric named user/proservices-run on Stackdriver.
However, the following two import attempts fail:
$ terraform import google_logging_metric.proservices_run proservices-run
google_logging_metric.proservices_run: Importing from ID "proservices-run"...
google_logging_metric.proservices_run: Import complete!
Imported google_logging_metric (ID: proservices-run)
google_logging_metric.proservices_run: Refreshing state... (ID: proservices-run)
Error: google_logging_metric.proservices_run (import id: proservices-run): 1 error occurred:
* import google_logging_metric.proservices_run result: proservices-run: google_logging_metric.proservices_run: project: required field is not set
$ terraform import google_logging_metric.proservices_run user/proservices-run
google_logging_metric.proservices_run: Importing from ID "user/proservices-run"...
google_logging_metric.proservices_run: Import complete!
Imported google_logging_metric (ID: user/proservices-run)
google_logging_metric.proservices_run: Refreshing state... (ID: user/proservices-run)
Error: google_logging_metric.proservices_run (import id: user/proservices-run): 1 error occurred:
* import google_logging_metric.proservices_run result: user/proservices-run: google_logging_metric.proservices_run: project: required field is not set
Using
Terraform v0.11.14
and
provider.google = 2.11.0
provider.google-beta 2.11.0
Edit: I noticed the "project: required field is not set" part of the error message and added the project field to my TF code; however, the outcome is still the same.
I ran into the same issue trying to import a log-based metric.
The solution was to set the env var GOOGLE_PROJECT=<your-project-id> when running the command.
GOOGLE_PROJECT=MyProjectId \
terraform import \
"google_logging_metric.create_user_count" \
"create_user_count"

Unable to build local AMLS environment with private wheel

I am trying to write a small program using the AzureML Python SDK (v1.0.85) to register an Environment in AMLS and use that definition to construct a local Conda environment when experiments are run (for a pre-trained model). The code works fine for simple scenarios where all dependencies are loaded from Conda/public PyPI, but when I introduce a private dependency (e.g., a utils library) I get an InternalServerError with the message "Error getting recipe specifications".
The code I am using to register the environment is (after having authenticated to Azure and connected to our workspace):
from pathlib import Path
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

environment_name = config['environment']['name']
py_version = "3.7"
conda_packages = ["pip"]
pip_packages = ["azureml-defaults"]
private_packages = ["./env-wheels/utils-0.0.3-py3-none-any.whl"]
print(f"Creating environment with name {environment_name}")
environment = Environment(name=environment_name)
conda_deps = CondaDependencies()
print(f"Adding Python version: {py_version}")
conda_deps.set_python_version(py_version)
for conda_pkg in conda_packages:
    print(f"Adding Conda dependency: {conda_pkg}")
    conda_deps.add_conda_package(conda_pkg)
for pip_pkg in pip_packages:
    print(f"Adding Pip dependency: {pip_pkg}")
    conda_deps.add_pip_package(pip_pkg)
for private_pkg in private_packages:
    print(f"Uploading private wheel from {private_pkg}")
    private_pkg_url = Environment.add_private_pip_wheel(workspace=ws, file_path=Path(private_pkg).absolute(), exist_ok=True)
    print(f"Adding private Pip dependency: {private_pkg_url}")
    conda_deps.add_pip_package(private_pkg_url)
environment.python.conda_dependencies = conda_deps
environment.register(workspace=ws)
And the code I am using to create the local Conda environment is:
amls_environment = Environment.get(ws, name=environment_name, version=environment_version)
print(f"Building environment...")
amls_environment.build_local(workspace=ws)
The exact error message returned when build_local(...) is called is:
Traceback (most recent call last):
  File "C:\Anaconda\envs\AMLSExperiment\lib\site-packages\azureml\core\environment.py", line 814, in build_local
    raise error
  File "C:\Anaconda\envs\AMLSExperiment\lib\site-packages\azureml\core\environment.py", line 807, in build_local
    recipe = environment_client._get_recipe_for_build(name=self.name, version=self.version, **payload)
  File "C:\Anaconda\envs\AMLSExperiment\lib\site-packages\azureml\_restclient\environment_client.py", line 171, in _get_recipe_for_build
    raise Exception(message)
Exception: Error getting recipe specifications. Code: 500
: {
  "error": {
    "code": "ServiceError",
    "message": "InternalServerError",
    "detailsUri": null,
    "target": null,
    "details": [],
    "innerError": null,
    "debugInfo": null
  },
  "correlation": {
    "operation": "15043e1469e85a4c96a3c18c45a2af67",
    "request": "19231be75a2b8192"
  },
  "environment": "westeurope",
  "location": "westeurope",
  "time": "2020-02-28T09:38:47.8900715+00:00"
}
Process finished with exit code 1
Has anyone seen this error before or able to provide some guidance around what the issue may be?
The issue was with our firewall blocking the required requests between AMLS and the storage container (I presume to get the environment definitions/private wheels).
We resolved this by adding appropriate ALLOW rules to the firewall so the AMLS service could contact and read from the attached storage container.
Assuming that you'd like to run the script on a remote compute, my suggestion would be to pass the environment you just "got" to a RunConfiguration, then pass that to a ScriptRunConfig, Estimator, or PythonScriptStep:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
src = ScriptRunConfig(source_directory=project_folder, script='train.py')
# Set compute target to the one created in previous step
src.run_config.target = cpu_cluster.name
# Set environment
amls_environment = Environment.get(ws, name=environment_name, version=environment_version)
src.run_config.environment = amls_environment
run = experiment.submit(config=src)
run
Check out the rest of the notebook here.
If you're looking for a local run, this notebook might help.

Unable to download terraform modules from azure repo (Private repo)

My terraform-modules repo location is like this:
https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster
I have three directories/modules at root level, namely compute, resourcegroup and sqlserver.
However, when I run terraform init, Terraform is unable to download the required modules.
main.tf
module "app_vms" {
  source             = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster"
  rg_name            = var.resource_group_name
  location           = module.resource_group.external_rg_location
  vnet_name          = var.virtual_network_name
  subnet_name        = var.sql_subnet_name
  app_nsg            = var.application_nsg
  vm_count           = var.count_vm
  base_hostname      = var.app_host_basename
  sto_acc_suffix     = var.storage_account_suffix
  vm_size            = var.virtual_machine_size
  vm_publisher       = var.virtual_machine_image_publisher
  vm_offer           = var.virtual_machine_image_offer
  vm_sku             = var.virtual_machine_image_sku
  vm_img_version     = var.virtual_machine_image_version
  username           = var.username
  password           = var.password
  allowed_source_ips = var.ip_list
}
module "resource_group" {
  source  = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fresourcegroup&version=GBmaster"
  rg_name = "test_rg"
}
module "azure_paas_sqlserver" {
  source = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fsqlserver&version=GBmaster"
}
It gives me a series of errors like the one below (a sample only, not all of the errors, as they are all the same):
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"https://teamabc.visualstudio.com/dummpproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
no source URL was returned
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
no source URL was returned
I tried removing the https:// part, but no luck. The repo does require a username and password to log in.
I'm wondering if I should make a public repo in GitHub, but pushes within the organization are supposed to use Azure Repos.
Update after first comment
Thanks for the lead; I did try, but still no luck.
My source URL now looks like below:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster"
I get the error below:
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
/usr/bin/git exited with 128: Cloning into '.terraform/modules/sql_vms'...
fatal: repository
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster/'
not found
Here:
teamabc.visualstudio.com is the parent Azure DevOps URL
dummyproject is the project name
After Charles' response
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git":
error downloading
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git':
/usr/bin/git exited with 128: Cloning into '.terraform/modules/sql_vms'...
fatal: repository
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git/'
not found
You can take a look at the Generic Git Repository documentation; the URL should be a Git URL. Finally, it should look like this:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git"
Or you can select a branch from your Git repository like this:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git?ref=<branch>"
Finally, I got it working with the source format below:
git::https://<PAT TOKEN>@<Azure DevOps URL>/DefaultCollection/<PROJECT NAME>/_git/<REPO NAME>//<sub directory>
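To make the anatomy of that source string explicit, here is a sketch that assembles it from its parts (all values below are placeholders, and whether the DefaultCollection segment applies depends on how your Azure DevOps organization is set up):

```python
# All values below are placeholders for illustration.
pat = "PAT_TOKEN"                      # personal access token used for auth
org_url = "teamabc.visualstudio.com"   # parent Azure DevOps URL
project = "dummyproject"               # project name
repo = "terraform-modules"             # repository name
subdir = "compute"                     # '//' selects a subdirectory within the repo

source = (
    f"git::https://{pat}@{org_url}/DefaultCollection/"
    f"{project}/_git/{repo}//{subdir}"
)
print(source)
```

Note that the PAT is separated from the host by `@` (standard Git URL auth), and the module subdirectory is selected with Terraform's `//` syntax rather than the `?path=...` query string the web UI uses.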

Authentication error using new Pulumi azuread module

I've installed the latest Pulumi azuread module, and I get this error when I try a pulumi preview:
Previewing update (int):
Type Name Plan Info
pulumi:pulumi:Stack test-int
└─ azuread:index:Application test 1 error
Diagnostics:
azuread:index:Application (test):
error: Error obtaining Authorization Token from the Azure CLI: Error waiting for the Azure CLI: exit status 1
my index.ts is very basic:
import * as pulumi from "@pulumi/pulumi";
import * as azure from "@pulumi/azure";
import * as azuread from "@pulumi/azuread";
const projectName = pulumi.getProject();
const stack = pulumi.getStack();
const config = new pulumi.Config(projectName);
const baseName = `${projectName}-${stack}`;
const testRg = new azure.core.ResourceGroup(baseName, {
    name: baseName
});
const test = new azuread.Application("test", {
    availableToOtherTenants: false,
    homepage: "https://homepage",
    identifierUris: ["https://uri"],
    oauth2AllowImplicitFlow: true,
    replyUrls: ["https://replyurl"],
    type: "webapp/api",
});
Creating resources and an AD application with the old azure.ad module works fine.
I have no clue what I'm missing now...
EDIT:
index.ts the old way
import * as pulumi from "@pulumi/pulumi";
import * as azure from "@pulumi/azure";
const projectName = pulumi.getProject();
const stack = pulumi.getStack();
const config = new pulumi.Config(projectName);
const baseName = `${projectName}-${stack}`;
const testRg = new azure.core.ResourceGroup(baseName, {
    name: baseName
});
const test = new azure.ad.Application("test", {
    homepage: "https://homepage",
    availableToOtherTenants: false,
    identifierUris: ["https://uri"],
    oauth2AllowImplicitFlow: true,
    replyUrls: ["https://replyurl"]
});
Result of pulumi preview:
Previewing update (int):
Type Name Plan Info
pulumi:pulumi:Stack test-int
+ └─ azure:ad:Application test create 1 warning
Diagnostics:
azure:ad:Application (test):
warning: urn:pulumi:int::test::azure:ad/application:Application::test verification warning: The Azure Active Directory resources have been split out into their own Provider.
Information on migrating to the new AzureAD Provider can be found here: https://terraform.io/docs/providers/azurerm/guides/migrating-to-azuread.html
As such the Azure Active Directory resources within the AzureRM Provider are now deprecated and will be removed in v2.0 of the AzureRM Provider.
Resources:
+ 1 to create
2 unchanged
EDIT 2:
I'm running this on Windows 10:
az cli = 2.0.68
pulumi cli = 0.17.22
@pulumi/azure = 0.19.2
@pulumi/azuread = 0.18.2
@pulumi/pulumi = 0.17.21
My service principal has permissions for both the Azure Active Directory Graph and Microsoft Graph (permission screenshots omitted).
I ran into this issue, and after hours I realized Fiddler was somehow interfering with the Az CLI.
