Retrieving current job for Azure ML v2

Using the v2 Azure ML Python SDK (azure-ai-ml), how do I get an instance of the currently running job?
In v1 (azureml-core) I would do:
from azureml.core import Run

run = Run.get_context()
if isinstance(run, Run):
    print("Running on compute...")
What is the equivalent in the v2 SDK?

This is a little more involved in v2 than it was in v1. The reason is that v2 makes a clear distinction between the control plane (where you start/stop your job, deploy compute, etc.) and the data plane (where you run your data science code, load data from storage, etc.).
Jobs can do control plane operations, but they need to do that with a proper identity that was explicitly assigned to the job by the user.
Let me show you the code for this first. This script creates an MLClient and then uses that client to connect to the service and retrieve the job's metadata, from which it extracts the name of the user that submitted the job:
# control_plane.py
import os

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential


def get_ml_client():
    # Parse the workspace coordinates out of the MLflow tracking URI
    # that Azure ML sets on every job.
    uri = os.environ["MLFLOW_TRACKING_URI"]
    uri_segments = uri.split("/")
    subscription_id = uri_segments[uri_segments.index("subscriptions") + 1]
    resource_group_name = uri_segments[uri_segments.index("resourceGroups") + 1]
    workspace_name = uri_segments[uri_segments.index("workspaces") + 1]
    credential = AzureMLOnBehalfOfCredential()
    client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group_name,
        workspace_name=workspace_name,
    )
    return client


ml_client = get_ml_client()
# MLFLOW_RUN_ID holds the name of the current job.
this_job = ml_client.jobs.get(os.environ["MLFLOW_RUN_ID"])
print("This job was created by:", this_job.creation_context.created_by)
As you can see, the code uses a special AzureMLOnBehalfOfCredential to create the MLClient. Options that you would use locally (AzureCliCredential or InteractiveBrowserCredential) won't work for a remote job since you are not authenticated through az login or through the browser prompt on that remote run. For your credentials to be available on the remote job, you need to run the job with user_identity. And you need to retrieve the corresponding credential from the environment by using the AzureMLOnBehalfOfCredential class.
So, how do you run a job with user_identity? Below is the YAML that will achieve it:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
command: |
  pip install azure-ai-ml
  python control_plane.py
code: code
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
identity:
  type: user_identity
Note the identity section at the bottom. Also note that I am lazy and install the azure-ai-ml SDK as part of the job. In a real setting, I would of course create an environment with the package already installed.
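As an aside, the same job can also be submitted from Python instead of YAML. Here is a minimal sketch, assuming you run it locally with an active az login session; the workspace coordinates are placeholders:

# Sketch: submit the job with user_identity from Python (run locally, not on the job).
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment, UserIdentityConfiguration
from azure.identity import AzureCliCredential

ml_client = MLClient(
    AzureCliCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)
job = command(
    command="pip install azure-ai-ml && python control_plane.py",
    code="code",
    environment=Environment(image="library/python:latest"),
    compute="cpu-cluster",
    identity=UserIdentityConfiguration(),  # same as identity.type: user_identity in YAML
)
ml_client.jobs.create_or_update(job)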
These are the valid settings for the identity type:
aml_token: this is the default, which will not allow you to access the control plane.
managed or managed_identity: the job will run under the given managed identity (aka compute identity). This is accessed in your job via azure.identity.ManagedIdentityCredential. Of course, you need to grant the chosen compute identity access to the workspace so it can read job information.
user_identity: the job will run under the submitting user's identity. It is to be used with the azure.ai.ml.identity.AzureMLOnBehalfOfCredential credential, as shown above.
So, for your use case, you have two options:
You could run the job with user_identity and use the AzureMLOnBehalfOfCredential class to create the MLClient, as shown above.
You could create the compute with a managed identity, give that identity access to the workspace, run the job with managed_identity, and use the ManagedIdentityCredential class to create the MLClient.
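If you go with the second option, a minimal sketch of the in-job code could look like this; it assumes the compute's managed identity has already been granted read access to the workspace, and the workspace coordinates are placeholders (you could also parse them from MLFLOW_TRACKING_URI as above):

# Sketch: MLClient via the compute's managed identity (identity.type: managed_identity).
import os

from azure.ai.ml import MLClient
from azure.identity import ManagedIdentityCredential

credential = ManagedIdentityCredential()  # system-assigned compute identity
ml_client = MLClient(
    credential=credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)
this_job = ml_client.jobs.get(os.environ["MLFLOW_RUN_ID"])
print("This job was created by:", this_job.creation_context.created_by)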

Related

Azure Machine Learning compute cluster - avoid using docker?

I would like to use an Azure Machine Learning Compute Cluster as a compute target but do not want it to containerize my project. Is there a way to deactivate this "feature"?
The main reasons behind this request are:
I already set up a docker-compose file that specifies 3 containers for Apache Airflow and want to avoid a Docker-in-Docker situation, especially since I have already tried to do so but failed (here's the link to my other related SO question).
I prefer not to use a Compute Instance, as it is tied to an Azure account, which is not ideal for automation purposes.
Thanks in advance!
Use the provisioning_configuration method of the AmlCompute class to specify configuration parameters.
In the following example, a persistent compute target provisioned by AmlCompute is created. The provisioning_configuration parameter in this example is of type AmlComputeProvisioningConfiguration, which is a child class of ComputeTargetProvisioningConfiguration.
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)
Refer - https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute(class)?view=azure-ml-py

prefect.io kubernetes agent and task execution

While reading the Kubernetes agent documentation, I am getting confused by the line below:
"Configure a flow-run to run as a Kubernetes Job."
Does it mean that the process which is in charge of submitting the flow and communicating with the API server will run as a Kubernetes job?
On the other side, the use case which I am trying to solve is:
1. Set up the backend server
2. Execute a flow composed of 2 tasks
3. If k8s infra is available, the tasks should be executed as Kubernetes jobs
4. If Docker-only infra is available, the tasks should be executed as Docker containers
Can somebody suggest how to solve the above scenario in prefect.io?
That's exactly right. When you use KubernetesAgent, Prefect deploys your flow runs as Kubernetes jobs.
For #1 - you can do that in your agent YAML file as follows:
env:
  - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
    value: ''
  - name: PREFECT__CLOUD__API
    value: "http://some_ip:4200/graphql" # paste your GraphQL Server endpoint here
  - name: PREFECT__BACKEND
    value: server
#2 - write your flow
#3 and #4 - this is more challenging to do in Prefect, as there is currently no load balancing mechanism aware of your infrastructure. There are some hacky solutions that you may try, but there is no first-class way to handle this in Prefect.
One hack would be to build a parent flow that checks your infrastructure resources and, depending on the outcome, spins up your flow run with either a DockerRun or a KubernetesRun run config:
from prefect import Flow, task, case
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run
from prefect.run_configs import DockerRun, KubernetesRun


@task
def check_the_infrastructure():
    return "kubernetes"


with Flow("parent_flow") as flow:
    infra = check_the_infrastructure()
    with case(infra, "kubernetes"):
        child_flow_run_id = create_flow_run(
            flow_name="child_flow_name", run_config=KubernetesRun()
        )
        k8_child_flowrunview = wait_for_flow_run(
            child_flow_run_id, raise_final_state=True, stream_logs=True
        )
    with case(infra, "docker"):
        child_flow_run_id = create_flow_run(
            flow_name="child_flow_name", run_config=DockerRun()
        )
        docker_child_flowrunview = wait_for_flow_run(
            child_flow_run_id, raise_final_state=True, stream_logs=True
        )
But note that this would require you to have two agents running at all times: a Kubernetes agent and a Docker agent.
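For completeness, here is a rough sketch of starting both agents from Python with Prefect 1.x (in practice you would typically run them as two separate long-lived processes; the import paths below assume Prefect 1.x):

from prefect.agent.docker import DockerAgent
# from prefect.agent.kubernetes import KubernetesAgent

# Each agent polls the backend for flow runs matching its run config.
# start() blocks, so each agent needs its own process:
DockerAgent().start()        # serves DockerRun flow runs
# KubernetesAgent().start()  # serves KubernetesRun flow runs (separate process)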

Getting the Azure ML environment build status

I am trying to set up a ML pipeline on Azure ML using the Python SDK.
I have scripted the creation of a custom environment from a Dockerfile as follows:
from azureml.core import Environment
from azureml.core.environment import ImageBuildDetails
from other_modules import workspace, env_name, dockerfile
custom_env : Environment = Environment.from_dockerfile(name=env_name, dockerfile=dockerfile)
custom_env.register(workspace=workspace)
build : ImageBuildDetails = custom_env.build(workspace=workspace)
build.wait_for_completion()
However, the ImageBuildDetails object that the build method returns invariably times out while executing the last wait_for_completion() line, likely due to network constraints that I cannot change.
So, how can I possibly check the build status via the SDK in a way that doesn't exclusively depend on the returned ImageBuildDetails object?
My first suggestion would be to use:
build.wait_for_completion(show_output=True)
This will help you debug better rather than assuming you have network issues: the images can take quite a long time to build, and from my experience creating environments it's very likely you have an issue related to your Dockerfile.
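If you need to check the status later, without holding on to the ImageBuildDetails object returned by build(), one option is to fetch the registered environment again and ask for its image details. This is a sketch: it assumes azureml-core's Environment.get_image_details method, whose return shape can vary across SDK versions.

from azureml.core import Environment

# Re-fetch the registered environment and inspect its image build details.
# Caveat: get_image_details and its return value may differ across
# azureml-core versions; treat this as a sketch, not a guarantee.
env = Environment.get(workspace=workspace, name=env_name)
print(env.get_image_details(workspace=workspace))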
A good alternative option is to build your docker image locally and optionally push it to the container registry associated with the workspace:
from azureml.core import Environment
myenv = Environment(name="myenv")
registered_env = myenv.register(workspace)
registered_env.build_local(workspace, useDocker=True, pushImageToWorkspaceAcr=True)
Another preferred method is to create an environment object from a conda environment specification YAML file:
from_conda_specification(name, file_path)
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-environments#use-conda-dependencies-or-pip-requirements-files
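A minimal sketch of that route (the conda_env.yml path is a hypothetical placeholder):

from azureml.core import Environment

# Create and register an environment from a conda specification file.
# "./conda_env.yml" is a hypothetical file name.
myenv = Environment.from_conda_specification(name="myenv", file_path="./conda_env.yml")
myenv.register(workspace=ws)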
from_conda_specification returns the Environment; to verify it has been created:
for name, env in ws.environments.items():
    print("Name {} \t version {}".format(name, env.version))

restored_environment = Environment.get(workspace=ws, name="myenv", version="1")
print("Attributes of restored environment")
print(restored_environment)

Azure: Error 404: AciDeploymentFailed / Error 400 ACI Service request failed

I am trying to deploy a machine learning model through an ACI (Azure Container Instances) service. I am working in Python and I followed the code below (from the official documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).
The entry script file is the following (score.py):
import os
import json
import dill
import joblib


def init():
    global model
    # Get the path where the deployed model can be found
    model_path = os.getenv('AZUREML_MODEL_DIR')
    # Load existing model
    model = joblib.load('model.pkl')


# Handle request to the service
def run(data):
    try:
        # Pick out the text property of the JSON request
        # Expected JSON details {"text": "some text to evaluate"}
        data = json.loads(data)
        prediction = model.predict(data['text'])
        return prediction
    except Exception as e:
        error = str(e)
        return error
And the model deployment workflow is as follows:
from azureml.core import Workspace

# Connect to workspace
ws = Workspace(subscription_id="my-subscription-id",
               resource_group="my-resource-group-name",
               workspace_name="my-workspace-name")

from azureml.core.model import Model

# Register the model
model = Model.register(workspace=ws,
                       model_path='model.pkl',
                       model_name='my-model',
                       description='my-description')

from azureml.core.environment import Environment

# Name environment and call requirements file
# requirements: numpy, tensorflow
myenv = Environment.from_pip_requirements(name='myenv', file_path='requirements.txt')

from azureml.core.model import InferenceConfig

# Create inference configuration
inference_config = InferenceConfig(environment=myenv, entry_script='score.py')

from azureml.core.webservice import AciWebservice  # AksWebservice

# Set the virtual machine capabilities
deployment_config = AciWebservice.deploy_configuration(cpu_cores=0.5, memory_gb=3)

# Deploy ML model (Azure Container Instances)
service = Model.deploy(workspace=ws,
                       name='my-service-name',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)
I succeeded once with the previous code. I noticed that during the deployment, Model.deploy created a container registry with a specific name (6e07ce2cc4ac4838b42d35cda8d38616).
The problem:
The API was working well, and I wanted to deploy another model from scratch. I deleted the API service and model from Azure ML Studio, and the container registry from Azure resources.
Unfortunately, I am not able to deploy anything again.
Everything goes fine until the last step (the Model.deploy step), where I get the following error message:
Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 46243f9b-3833-4650-8d47-3ac54a39dc5e
More information can be found here: https://machinelearnin2812599115.blob.core.windows.net/azureml/ImageLogs/46245f8b-3833-4659-8d47-3ac54a39dc5e/build.log?sv=2019-07-07&sr=b&sig=45kgNS4sbSZrQH%2Fp29Rhxzb7qC5Nf1hJ%2BLbRDpXJolk%3D&st=2021-10-25T17%3A20%3A49Z&se=2021-10-27T01%3A24%3A49Z&sp=r
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 404,
  "message": "No definition exists for Environment with Name: myenv Version: Autosave_2021-10-25T17:24:43Z_b1d066bf Reason: Container registry 6e07ce2cc4ac4838b42d35cda8d38616.azurecr.io not found. If private link is enabled in workspace, please verify ACR is part of private link and retry..",
  "details": []
}
I do not understand why a new container registry was created the first time, while now it seems to be looked up (the message says that the container registry named 6e07ce2cc4ac4838b42d35cda8d38616 is missing). I never found where I can force the creation of a new container registry resource in Python, nor where to specify a name for it in AciWebservice.deploy_configuration or Model.deploy.
Could anyone help me move on with this? The best solution would be, I think, to delete this 6e07ce2cc4ac4838b42d35cda8d38616 container registry completely, but I can't find where the reference is set, so Model.deploy always fails to find it.
Another solution would be to force Model.deploy to generate a new container registry, but I couldn't find how to do that.
It's been 2 days that I am on this and I really need your help!
PS: I am not at all a DevOps/MLOps guy; I do data science and good models, but infrastructure and deployment is not really my thing, so please be gentle on this part! :-)
What I tried
Creating the container registry with the same name
I tried to create the container registry by hand, but this time it is the container itself that cannot be created. The Python output of Model.deploy is the following:
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-10-25 19:25:10+02:00 Creating Container Registry if not exists.
2021-10-25 19:25:10+02:00 Registering the environment.
2021-10-25 19:25:13+02:00 Building image..
2021-10-25 19:30:45+02:00 Generating deployment configuration.
2021-10-25 19:30:46+02:00 Submitting deployment to compute.
Failed
Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 93780de6-7662-40d8-ab9e-4e1556ef880f
Current sub-operation type not known, more logs unavailable.
Error:
{
  "code": "InaccessibleImage",
  "statusCode": 400,
  "message": "ACI Service request failed. Reason: The image '6e07ce2cc4ac4838b42d35cda8d38616.azurecr.io/azureml/azureml_684133370d8916c87f6230d213976ca5' in container group 'my-service-name-LM4HbqzEBEi0LTXNqNOGFQ' is not accessible. Please check the image and registry credential.. Refer to https://learn.microsoft.com/azure/container-registry/container-registry-authentication#admin-account and make sure Admin user is enabled for your container registry."
}
Setting admin user enabled
I tried to follow the recommendation in the last message, saying to enable the Admin user for the container registry. All I saw in the Azure interface is that a username and password appeared when enabling the admin user.
Unfortunately, the same error message appears again when I relaunch my code, and I am stuck here...
Changing the name of the environment and model
This does not produce any change. Same errors.
As you mentioned, your first attempt worked; after deleting the API service and model from Azure ML Studio and the container registry from Azure resources, you are not able to redeploy.
My assumption is that on your first attempt the environment was already registered. So when you try to re-register it under the same name while deploying, it gives you the error.
Thanks @anders swanson, your solution worked for me.
If you have already registered your env, myenv, and none of the details of your environment have changed, there is no need to re-register it with myenv.register(). You can simply get the already registered env using Environment.get() like so:
myenv = Environment.get(ws, name='myenv', version=11)
My suggestion is to give your environment a new name, e.g. "model_scoring_env". Register it once, then pass it to the InferenceConfig, as sketched below.

AWS - Neptune restore from snapshot using SDK

I'm trying to test restoring Neptune instances from a snapshot using Python (boto3). Long story short, we want to spin up and delete the Dev instance daily using automation.
When restoring, my code seems to only create the cluster without creating the attached instance. I have also tried creating an instance once the cluster is up and adding it to the cluster, but that doesn't work either. (ref: client.create_db_instance)
My code does the following: get the most recent snapshot, then use that variable to create the cluster so the most recent data is there.
import time

import boto3

client = boto3.client('neptune')

response = client.describe_db_cluster_snapshots(
    DBClusterIdentifier='neptune',
    MaxRecords=100,
    IncludeShared=False,
    IncludePublic=False
)
snaps = response['DBClusterSnapshots']
snaps.sort(key=lambda c: c['SnapshotCreateTime'], reverse=True)
latest_snapshot = snaps[0]
snapshot_ID = latest_snapshot['DBClusterSnapshotIdentifier']
print("Latest snapshot: " + snapshot_ID)

db_response = client.restore_db_cluster_from_snapshot(
    AvailabilityZones=['us-east-1c'],
    DBClusterIdentifier='neptune-test',
    SnapshotIdentifier=snapshot_ID,
    Engine='neptune',
    Port=8182,
    VpcSecurityGroupIds=['sg-randomString'],
    DBSubnetGroupName='default-vpc-groupID'
)

time.sleep(60)

db_instance_response = client.create_db_instance(
    DBName='neptune',
    DBInstanceIdentifier='brillium-neptune',
    DBInstanceClass='db.r4.large',
    Engine='neptune',
    DBSecurityGroups=[
        'sg-string',
    ],
    AvailabilityZone='us-east-1c',
    DBSubnetGroupName='default-vpc-string',
    BackupRetentionPeriod=7,
    Port=8182,
    MultiAZ=False,
    AutoMinorVersionUpgrade=True,
    PubliclyAccessible=False,
    DBClusterIdentifier='neptune-test',
    StorageEncrypted=True
)
The documentation doesn't help much at all. It's very good at providing the variables needed for basic cluster creation, but not for the attached instance. If I attempt to create an instance using the same cluster name, it either errors out or creates a new cluster with the same name appended with '-1'.
If you want to programmatically do a restore from snapshot, then you need to:
1. Create the cluster snapshot using create-db-cluster-snapshot
2. Restore the cluster from the snapshot using restore-db-cluster-from-snapshot
3. Create an instance in the new cluster using create-db-instance
You mentioned that you did a create-db-instance call in the end. If that call succeeded, you should see an instance provisioned inside the restored cluster.
When you do a restore from Snapshot using the Neptune Console, it does steps #2 and #3 for you.
It seems like you did the following:
Create the snapshot via CLI
Create the cluster via CLI
Create an instance in the cluster, via Console
Today, we recommend restoring the snapshot entirely via the Console or entirely using the CLI.
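As a side note, instead of the fixed time.sleep(60) between the restore and the create_db_instance call, you can poll the cluster status until it becomes available. A minimal sketch with boto3, reusing the identifiers from the question:

import time

import boto3

client = boto3.client('neptune')

# Poll until the restored cluster is ready to accept an instance.
while True:
    status = client.describe_db_clusters(
        DBClusterIdentifier='neptune-test'
    )['DBClusters'][0]['Status']
    if status == 'available':
        break
    time.sleep(30)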
