Azure Stream Analytics: ML Service function call in cloud job results in no output events

I've got a problem with an Azure Stream Analytics (ASA) job that should call an Azure ML Service function to score the provided input data.
The query was developed and tested in Visual Studio (VS) 2019 with the "Azure Data Lake and Stream Analytics Tools" extension.
As input the job uses an Azure IoT Hub, and as output the VS local output for testing purposes (and later also Blob storage).
Within this environment everything works fine; the call to the ML Service function is successful and it returns the desired response.
When I use the same query, user-defined functions and aggregates as in VS in the cloud job, no output events are generated (with neither Blob storage nor Power BI as output).
In the ML web service it can be seen that ASA successfully calls the function, but somehow no response data is returned.
Deleting the ML function call from the query results in a successful run of the job with output events.
For the deployment of the ML web service I tried the following (each works with VS, none produces output in the cloud):
ACI (1 CPU, 1 GB RAM)
AKS dev/test (Standard_B2s VM)
AKS production (Standard_D3_v2 VM)
The inference script function schema:
input: array
output: record
Inference script input schema looks like:
@input_schema('data', NumpyParameterType(input_sample, enforce_shape=False))
@output_schema(NumpyParameterType(output_sample))  # other parameter type for record caused an error in ASA
def run(data):
    response = {'score1': 0,
                'score2': 0,
                'score3': 0,
                'score4': 0,
                'score5': 0,
                'highest_score': None}
And the return value:
return [response]
The ASA job subquery with ML function call:
with raw_scores as (
    select
        time, udf.HMMscore(udf.numpyfySeq(Sequence)) as score
    from Sequence
)
and the UDF "numpyfySeq" looks like:
// creates an N x 18 array
function numpyfySeq(Sequence) {
    'use strict';
    var transpose = m => m[0].map((x, i) => m.map(x => x[i]));
    var array = [];
    for (var feature in Sequence) {
        if (feature != "time") {
            array.push(Sequence[feature]);
        }
    }
    return transpose(array);
}
"Sequence" is a subquery that aggregates the data into sequences (arrays) with an user-defined aggregate.
In VS the data comes from the IoT Hub (cloud input selected).
The "function signature" is recognized correctly in the portal, as seen in the image: Function signature
I hope the provided information is sufficient and you can help me.
Edit:
The authentication for the Azure ML webservice is key-based.
In ASA, when you select an "Azure ML Service" function, it automatically detects and uses the keys of the deployed ML model within the subscription and ML workspace.
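For reference, with key-based auth the keys can also be read programmatically from the deployed service object; a small assumed snippet with the azureml SDK (service is the Webservice object from the deployment code below):
primary_key, secondary_key = service.get_keys()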
Deployment code used (in this example for ACI, but looks nearly the same for AKS deployment):
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
env = Environment(name='scoring_env')
deps = CondaDependencies(conda_dependencies_file_path='./deps')
env.python.conda_dependencies = deps

inference_config = InferenceConfig(source_directory='./prediction/',
                                   entry_script='score.py',
                                   environment=env)
deployment_config = AciWebservice.deploy_configuration(auth_enabled=True, cpu_cores=1,
                                                       memory_gb=1)

model = Model(ws, 'HMM')
service = Model.deploy(ws, 'hmm-scoring', [model],
                       inference_config,
                       deployment_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
with the following conda dependencies:
name: project_environment
dependencies:
  # The Python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
  - python=3.7.5
  - pip:
    - sklearn
    - azureml-core
    - azureml-defaults
    - inference-schema[numpy-support]
    - hmmlearn
    - numpy
  - pip
channels:
  - anaconda
  - conda-forge
The code used in the score.py is just a regular score operation with the loaded models and formatting like so:
score1 = model1.score(data)
score2 = model2.score(data)
score3 = model3.score(data)
# Same scoring with model4 and model5
# scaling of the scores to a defined interval and determination of model that delivered highest score
response['score1'] = score1
response['score2'] = score2
# and so on
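Putting the fragments above together, a rough consolidated sketch of score.py looks like the following (the model loading is only indicative: it assumes the five HMMs are packed into the registered 'HMM' artifact and loadable with joblib, which is not necessarily how the real entry script does it):
import numpy as np
import joblib
from azureml.core.model import Model
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType

input_sample = np.zeros((1, 18))
output_sample = np.array([{'score1': 0.0, 'score2': 0.0, 'score3': 0.0,
                           'score4': 0.0, 'score5': 0.0, 'highest_score': 'model1'}])


def init():
    global models
    # assumption: the registered 'HMM' model contains the five fitted HMMs
    models = joblib.load(Model.get_model_path('HMM'))


@input_schema('data', NumpyParameterType(input_sample, enforce_shape=False))
@output_schema(NumpyParameterType(output_sample))
def run(data):
    response = {'score1': 0, 'score2': 0, 'score3': 0,
                'score4': 0, 'score5': 0, 'highest_score': None}
    scores = {f'score{i + 1}': m.score(data) for i, m in enumerate(models)}
    # scaling of the scores to a defined interval is omitted here
    response.update(scores)
    response['highest_score'] = max(scores, key=scores.get)
    return [response]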

Related

How to get list of running VMs from AzureML

I am a beginner with Python and with AzureML.
Currently, my task is to list all the running VMs (or Compute Instances) with their status and (if running) how long they have been running.
I managed to connect to AzureML and list Subscriptions, Resource Groups and Workspaces, but I'm stuck on how to list running VMs now.
Here's the code that I have currently:
# NOTE: credentials, column_width and separator are defined earlier in the script
from azure.mgmt.resource import SubscriptionClient, ResourceManagementClient
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget

# get subscriptions list using credentials
subscription_client = SubscriptionClient(credentials)
sub_list = subscription_client.subscriptions.list()
print("Subscription ID".ljust(column_width) + "Display name")
print(separator)
for group in list(sub_list):
    print(f'{group.subscription_id:<{column_width}}{group.display_name}')
    subscription_id = group.subscription_id
    resource_client = ResourceManagementClient(credentials, subscription_id)
    group_list = resource_client.resource_groups.list()
    print(" Resource Groups:")
    for group in list(group_list):
        print(f" {group.name} {group.location}")
        print(" Workspaces:")
        my_ml_client = Workspace.list(subscription_id, credentials, group.name)
        for ws in list(my_ml_client):
            try:
                print(f" {ws}")
                if ws:
                    compute = ComputeTarget(workspace=ws, name=group.name)
                    print('Found existing compute: ' + group.name)
            except Exception:
                pass
Please note that this is more or less a learning exercise and it's not the final shape of the code; I will refactor once I get it to work.
Edit: I found an easy way to do this:
workspace = Workspace(
    subscription_id=subscription_id,
    resource_group=group.name,
    workspace_name=ws,
)
print(workspace.compute_targets)
Edit2: If anyone stumbles on this question and is just beginning to understand Python + Azure like I am, all of this information is in the official documentation (which is just hard to follow as a beginner).
The result from 'workspace.compute_targets' will contain both Compute Instances and AML Instances.
If you need to retrieve only the VMs (like I do) you need to take an extra step to filter the result like this:
if type(compute_list[vm]) == ComputeInstance:
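For completeness, a rough sketch of that filtering step (assuming the workspace object from the edit above; the status attributes are taken from the ComputeInstance docs and should be double-checked):
from azureml.core.compute import ComputeInstance

compute_list = workspace.compute_targets  # dict: name -> ComputeTarget
for name, target in compute_list.items():
    if isinstance(target, ComputeInstance):
        status = target.get_status()  # ComputeInstanceStatus object
        print(name, status.state)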

How do we do Batch Inferencing on Azure ML Service with Parameterized Dataset/DataPath input?

The ParallelRunStep Documentation suggests the following:
A named input Dataset (DatasetConsumptionConfig class)
path_on_datastore = iris_data.path('iris/')
input_iris_ds = Dataset.Tabular.from_delimited_files(path=path_on_datastore, validate=False)
named_iris_ds = input_iris_ds.as_named_input(iris_ds_name)
Which is just passed as an Input:
distributed_csv_iris_step = ParallelRunStep(
    name='example-iris',
    inputs=[named_iris_ds],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'iris-prs'],
    allow_reuse=False
)
The Documentation to submit Dataset Inputs as Parameters suggests the following:
The Input is a DatasetConsumptionConfig class element
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
Which is passed both in arguments and in inputs:
train_step = PythonScriptStep(
    name="train_step",
    script_name="train_with_dataset.py",
    arguments=["--param2", tabular_ds_consumption],
    inputs=[tabular_ds_consumption],
    compute_target=compute_target,
    source_directory=source_directory)
When submitting with a new parameter, we create a new Dataset instance:
iris_tabular_ds = Dataset.Tabular.from_delimited_files('some_link')
And submit it like this:
pipeline_run_with_params = experiment.submit(pipeline, pipeline_parameters={'tabular_ds_param': iris_tabular_ds})
However, how do we combine this: How do we pass a Dataset Input as a Parameter to the ParallelRunStep?
If we create a DatasetConsumptionConfig class element like so:
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
And pass it as an argument to the ParallelRunStep, it throws an error.
References:
Notebook with Dataset Input Parameter
ParallelRunStep Notebook
AML ParallelRunStep GA is a managed solution to scale up and out large ML workloads, including batch inference, training and large data processing. Please check out the documents below for details.
• Overview doc: run batch inference using ParallelRunStep
• Sample notebooks
• AI Show: How to do Batch Inference using AML ParallelRunStep
• Blog: Batch Inference in Azure Machine Learning
For the inputs we create Dataset class instances:
tabular_ds1 = Dataset.Tabular.from_delimited_files('some_link')
tabular_ds2 = Dataset.Tabular.from_delimited_files('some_link')
ParallelRunStep produces an output file; we use the PipelineData class to create a folder that will store this output:
from azureml.pipeline.core import Pipeline, PipelineData
output_dir = PipelineData(name="inferences", datastore=def_data_store)
ParallelRunStep depends on the ParallelRunConfig class for details about the environment, entry script, output file name and other necessary definitions:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script=script_file,
    mini_batch_size=PipelineParameter(name="batch_size_param", default_value="5"),
    error_threshold=10,
    output_action="append_row",
    append_row_file_name="mnist_outputs.txt",
    environment=batch_env,
    compute_target=compute_target,
    process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2
)
The input to ParallelRunStep is created using the following code
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_ds1)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
The PipelineParameter helps us run the pipeline for different datasets.
ParallelRunStep consumes this as an input:
parallelrun_step = ParallelRunStep(
name="some-name",
parallel_run_config=parallel_run_config,
inputs=[ tabular_ds_consumption ],
output=output_dir,
allow_reuse=False
)
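For completeness, a minimal sketch of assembling and submitting the pipeline (the experiment name below is illustrative and not from the original answer):
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
experiment = Experiment(ws, 'prs-tabular-param')  # illustrative experiment name
# the first run uses the default_value of the PipelineParameter, i.e. tabular_ds1
pipeline_run_1 = experiment.submit(pipeline)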
To consume with another dataset:
pipeline_run_2 = experiment.submit(
    pipeline,
    pipeline_parameters={"tabular_ds_param": tabular_ds2}
)
There is an error currently: DatasetConsumptionConfig and PipelineParameter cannot be reused

Restarting nested notebook runs in Databricks Job Workflow

I have a scheduled Databricks job which runs 5 different notebooks sequentially, and each notebook contains, let's say, 5 different command cells. When the job fails in notebook 3, cmd cell 3, I can recover from the failure properly, but I'm not sure whether there's any way to restart the scheduled job from notebook 3, cell 4, or even from the beginning of notebook 4, if I've manually completed the remaining cmds in notebook 3. Here's an example of one of my jobs:
%python
import sys

try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/Load CS Feedback Firmware STG", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db"),
        "env_stg_db": dbutils.widgets.get("env_stg_db"),
        "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware STG ({error})')

try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db"),
        "env_stg_db": dbutils.widgets.get("env_stg_db"),
        "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware TECH ({error})')

try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/STA_6S - CS Firmware Success", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db"),
        "env_stg_db": dbutils.widgets.get("env_stg_db"),
        "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in STA_6S - CS Firmware Success ({error})')
You should not use sys.exit, because it quits the Python interpreter. Just let the exception bubble up if it happens.
You must change the architecture of your application and add some sort of idempotency to the ETL (online course), which would mean propagating a date to the child notebooks or something like that.
Run %pip install retry at the beginning of the notebook to install the retry package.
from retry import retry, retry_call


@retry(Exception, tries=3)
def idempotent_run(notebook, timeout=6000, **args):
    # This is only approximate code to be used for inspiration; adjust it to your
    # needs. It's not guaranteed to work for your case.
    did_it_run_before = spark.sql(f"SELECT COUNT(*) FROM meta.state WHERE notebook = '{notebook}' AND args = '{sorted(args.items())}'").first()[0]
    if did_it_run_before > 0:
        return
    result = dbutils.notebook.run(notebook, timeout, args)
    spark.sql(f"INSERT INTO meta.state SELECT '{notebook}' AS notebook, '{sorted(args.items())}' AS args")
    return result


pd = dbutils.widgets.get("env_parent_directory")

# call this within the respective cells
idempotent_run(
    f"/01. SMETS1Mig/{pd}/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH",
    # set it to something that defines the frequency of the job
    this_date='2020-09-28',
    env_ingest_db=dbutils.widgets.get("env_ingest_db"),
    env_stg_db=dbutils.widgets.get("env_stg_db"),
    env_tech_db=dbutils.widgets.get("env_tech_db"))
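The sketch above assumes a meta.state table already exists; a minimal (assumed) one-off setup could look like this:
# run once, e.g. in a setup notebook; the schema matches what idempotent_run reads and writes
spark.sql("CREATE DATABASE IF NOT EXISTS meta")
spark.sql("CREATE TABLE IF NOT EXISTS meta.state (notebook STRING, args STRING)")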

I want to list all resources in an organisation using a Python function in Google Cloud

I went through the official Google Cloud docs, but I don't see how to use them to list the resources of a specific organization by providing the organization ID:
organizations = CloudResourceManager.Organizations.Search()
projects = emptyList()
parentsToList = queueOf(organizations)
while (parent = parentsToList.pop()) {
    // NOTE: Don't forget to iterate over paginated results.
    // TODO: handle PERMISSION_DENIED appropriately.
    projects.addAll(CloudResourceManager.Projects.List(
        "parent.type:" + parent.type + " parent.id:" + parent.id))
    parentsToList.addAll(CloudResourceManager.Folders.List(parent))
}
You can use Cloud Asset Inventory for this. I wrote this code to perform an export (sink) into BigQuery:
import os

from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2


def asset_to_bq(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(os.getenv('PROJECT_ID'),
                                                                                  os.getenv('DATASET'))
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True
    response = client.export_assets(parent, output_config)
    # To wait for the export to finish:
    # response.result()
    # Do stuff after export
    return "done", 200


if __name__ == "__main__":
    asset_to_bq('')
Be careful if you use it: the export must be written to an empty or non-existing table, or you must set force to true.
In my case, a few minutes after the Cloud Scheduler job that triggers my function and extracts the data to BigQuery, I have a scheduled query in BigQuery that copies the data to another table, to keep the history.
Note: It's also possible to configure an extract in Cloud Storage if you prefer.
I hope this is a starting point for achieving what you want to do.
I am able to list the projects, but I also want to list the folders and the resources under each folder (folder.name and tags), and I also want to specify the organization ID so that I get resource information from a specific organization.
import os
from google.cloud import resource_manager


def export_resource(organizations):
    client = resource_manager.Client()
    for project in client.list_projects():
        print("%s, %s" % (project.project_id, project.status))

Dump series back into InfluxDB after querying with replaced field value

Scenario
I want to send data to an MQTT Broker (Cloud) by querying measurements from InfluxDB.
I have a field in the schema which is called status. It can be either 1 or 0. status=0 indicates that the series has not been sent to the cloud. If I get an acknowledgment from the MQTT broker, then I wish to write the queried points back into the database with status=1.
As mentioned in the InfluxDB FAQ regarding duplicate data: if a point has the same timestamp as an existing one but a different field value, then the updated field value will be shown.
In order to test this I created the following:
CREATE DATABASE dummy
USE dummy
INSERT meas_1,type=t1 status=0,value=123 1536157064275338300
query:
SELECT * FROM meas_1
provides
time                  status  type  value
1536157064275338300   0       t1    123
Now if I want to overwrite the series, I do the following:
INSERT meas_1,type=t1 status=1,value=123 1536157064275338300
which will overwrite the series
time                  status  type  value
1536157064275338300   1       t1    123
(Note: this is not possible via Tags currently in InfluxDB)
Usage
1. Query some information using the client with "status"=0.
2. Restructure the JSON to be sent to the cloud.
3. Send the information to the cloud.
4. If successful, write the output from step 1 back into the DB, but with status=1.
I am using the Python 3 InfluxDBClient to create the application (MQTT + InfluxDB).
Within the write_points API there is a batch_size parameter, which requires an int as input.
I am not sure how I can use this in the application I want to build. Can someone guide me with this, or with the schema of the DB, so that I can upload current and non-redundant information to the cloud?
The batch_size is actually the length of the list of measurements that needs to be passed to write_points.
Steps
Create client and query from measurement (here, we query gps information)
client = InfluxDBClient(database='dummy')
op = client.query('SELECT * FROM gps WHERE "status"=0', epoch='ns')
Make the ResultSet into a list:
batch = list(op.get_points('gps'))
Create an empty list for the updated points:
updated_batch = []
Iterate through each measurement and change the status flag to 1. Note: numeric field values in InfluxDB are floats by default.
for each in batch:
    new_mes = {
        'measurement': 'gps',
        'tags': {
            'type': 'gps'
        },
        'time': each['time'],
        'fields': {
            'lat': float(each['lat']),
            'lon': float(each['lon']),
            'alt': float(each['alt']),
            'status': float(1)
        }
    }
    updated_batch.append(new_mes)
Finally, dump the points back via the client with batch_size set to the length of updated_batch:
client.write_points(updated_batch, batch_size=len(updated_batch))
This overwrites the series, because it contains the same timestamps with the status field now set to 1.
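As a quick sanity check (a small assumed follow-up), you could query the measurement again and confirm that no points with status=0 remain:
remaining = list(client.query('SELECT * FROM gps WHERE "status"=0', epoch='ns').get_points('gps'))
print(len(remaining))  # expected to be 0 after the rewrite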
