MLFlow model not logging to Azure Blob Storage - databricks

I am trying to use MLflow to log artifacts to Azure Blob Storage. Logging to DBFS works fine, but when I log to Azure Blob Storage I only see a folder with the corresponding run ID, and it is empty.
Here is what I do:
Create an experiment from Azure Databricks, give it a name and set the artifact location to wasbs://mlartifacts@myazurestorageaccount.blob.core.windows.net/ .
In the Spark cluster, in the Environment Variables section, pass AZURE_STORAGE_ACCESS_KEY="ValueoftheKey".
In the notebook, use MLflow to log metrics, params and finally the model, using a snippet like the one below:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)
    # eval_metrics is a helper defined elsewhere in the notebook
    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print(" RMSE: %s" % rmse)
    print(" MAE: %s" % mae)
    print(" R2: %s" % r2)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.sklearn.log_model(lr, "model")
Of course, before using it, I set the experiment to the one where I have defined the artifact store to be Azure Blob Storage:
experiment_name = "/Users/user#domain.com/mltestazureblob"
mlflow.set_experiment(experiment_name)
I can see the metrics and params in the MLflow UI within Databricks, but since my artifact location is Azure Blob Storage, I expect the model, the .pkl and the conda.yaml file to be in the container in Azure Blob Storage. When I go to check it, I only see a folder corresponding to the run ID of the experiment, with nothing inside.
I do not know what I am missing. In case someone needs additional details, I will be happy to provide them.
Note that everything works fine when I use the default location, i.e. DBFS.
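For reference, a minimal sketch (my assumption of the programmatic equivalent of step 1, not necessarily what was done in the Databricks UI) that creates the experiment with the blob container as its artifact location:

import mlflow

# Create the experiment once, pointing its artifact location at the container,
# then select it before logging runs.
experiment_name = "/Users/user@domain.com/mltestazureblob"
mlflow.create_experiment(
    experiment_name,
    artifact_location="wasbs://mlartifacts@myazurestorageaccount.blob.core.windows.net/",
)
mlflow.set_experiment(experiment_name)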

It seems the problem was with Azure Storage Explorer. It does not show the contents of the folder (the .pkl, conda.yaml and model files). However, when I used Storage Explorer (preview) from the Azure portal, I was able to view the contents (though that also does not seem very stable).
I will raise a bug with the Azure Storage Explorer team so they can take a look at this. I used version 1.10.1 of Azure Storage Explorer.
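Since the artifacts turned out to exist and were simply not displayed by that Storage Explorer version, one way to double-check is to list the run's blobs directly with the azure-storage-blob SDK. A hedged sketch, assuming the account and container names from the question and a placeholder run ID:

import os
from azure.storage.blob import BlobServiceClient

# Authenticate with the same account key that was exposed to the cluster.
service = BlobServiceClient(
    account_url="https://myazurestorageaccount.blob.core.windows.net",
    credential=os.environ["AZURE_STORAGE_ACCESS_KEY"],
)
container = service.get_container_client("mlartifacts")

# List everything under the run's folder; expect entries such as
# <run_id>/artifacts/model/model.pkl, .../conda.yaml and .../MLmodel.
for blob in container.list_blobs(name_starts_with="<run_id>"):
    print(blob.name)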

Related

Load data from public Azure blob in Matillion

I am going through Matillion Academy (Building a Data Warehouse). There is a slide deck to follow online and I am running my own instance of Matillion to recreate the building of the warehouse.
My Matillion is on Azure, as is my Snowflake database.
The training is AWS-based, but gives information about the adjustments needed for Azure or GS.
One of the steps shows how to load data from blob storage; it is S3-based.
For Azure, different components need to be used (as the S3 ones don't exist there), and data needs to be loaded from Azure Storage instead of S3.
It also explains that for Snowflake on Azure yet another component needs to be used.
I have created a Stage in Snowflake:
CREATE STAGE "onlinemtlntrainingazure_flights"
URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights'
The stage shows in Snowflake (external stage) and in Matillion (when using 'manage stages' on the database). The code is taken from the json file I imported to create the job to do this (see first step below).
I have created the target table in my database. It is accessible and visible in Matillion IDE.
The adjusted component I am to use is 'Azure Blob Storage Load'.
According to the documentation, I will need:
For Snowflake on Azure:
Create a Stage in Snowflake:
You should create a Stage in Snowflake which will be pointing to the
public data we provide. Please, find below the .json file containing
the job that will help you to do this. Don't forget to change the SQL
Script for pointing to your own schema
After Creating the Stage in Snowflake:
You should use the 'Create Table' and the 'Azure Blob Storage Load'
components individually, as the 'Azure Blob Load Generator' won't let
you select the Stage previously created. We have attached below the
Create Table metadata to save you some time.
'Azure Blob Storage Load' Settings:
Stage: onlinemtlntrainingazure_flights
Pattern: training_azure_flights_2016.gz
Target Table: training_flights
Record Delimiter: 0x0a
Skip Header: 1
The source data on Azure is located here:
Azure Blob Container (with flights data)
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz
https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz
Unfortunately, when using these settings on the 'Azure Blob Storage Load' component, it complains:
The stage does not appear in the list, and manually entering the stage name yields an error (unrecognised option). Prefixing the stage name with my schema (and even the database) does not help.
The Azure Storage Location property does not accept the https://... URI to the data files. When I replace 'https' with 'azure', or remove the part after the last '/', it complains with 'Unable to find an account with name: [onlinemtlntrainingazure]'.
Using [Custom] for the Stage property removes the error message, but when running the component, the 'Unable to find account' error comes back.
Any thoughts?
Edit: I found a workaround by using the Data Transfer component, which first copies the files from the public https location to my own Azure blob location; I then process them further from there. But I would like to know how to do it as suggested in the training, and why it currently fails.
The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. But instead of using a Snowflake Stage, you might find it easier to just copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.
In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.
You already have the https:// source URLs for the three files, so:
Set the source type to HTTPS (no username or password is needed)
Add the source URL
Set the target type to Azure Blob Storage
In the example I used two variables, with defaults set to my storage account and container name
Repeat for all three files
After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
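If you prefer scripting the copy rather than running the Data Transfer component three times, a rough equivalent with the azure-storage-blob SDK is sketched below; the connection string and destination container name are placeholders of my own, not values from the training:

from azure.storage.blob import BlobServiceClient

SOURCE_URLS = [
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz",
    "https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz",
]

service = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
container = service.get_container_client("<your-container>")

for url in SOURCE_URLS:
    blob_name = url.rsplit("/", 1)[-1]
    # Server-side copy from the public URL into your own container.
    container.get_blob_client(blob_name).start_copy_from_url(url)

After the copies complete, the Azure Blob Storage Load Generator can read from your own copies exactly as described above.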

azureml tabular dataset over azure gen2 datalake

What I have tried:
set up an AzureML DataStore using identity-based authentication
set up an AzureML Dataset for a single file under a specific file system
from azureml.core import Workspace, Dataset

workspace = Workspace.from_config("config.json", auth=auth)
dataset = Dataset.get_by_name(workspace, 'engage_event_type')
frame = dataset.to_pandas_dataframe()
I am able to explore the dataset from the Azure portal and it displays the data correctly.
However, when I run the above (where auth is a Service Principal with the same rights as the Azure ML workspace instance), I get a stream of log messages like the following, but no errors, exceptions or completion.
The data underneath is < 10 KB.
Resolving access token for scope "https://datalake.azure.net//.default" using identity of type "SP".
Resolving access token for scope "https://datalake.azure.net//.default" using identity of type "SP".
I have tried running the script on local compute.
I have tried running the script on a compute instance.
Both gave the same issue.
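For context, a minimal sketch of how the Service Principal auth object referenced above is typically constructed with azureml-core; the tenant ID, client ID and secret are placeholders, not values from the question:

from azureml.core.authentication import ServicePrincipalAuthentication

auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>",
)
# This auth object is what gets passed to Workspace.from_config("config.json", auth=auth) above.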

Running query using serverless sql pool (built-in) on CSV file in Azure Data Lake Storage Gen2 failed

I uploaded my CSV file into my Azure Data Lake Storage Gen2 using the Azure Synapse portal. Then I tried "Select TOP 100 rows" and got an error after running the auto-generated SQL.
Auto-generated SQL:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Error:
File 'https://accountname.dfs.core.windows.net/filesystemname/test_file/contract.csv'
cannot be opened because it does not exist or it is used by another process.
This error in Synapse Studio has a link underneath it (which leads to a self-help document) that explains the error itself.
Do you have rights needed on the storage account?
You must have Storage Blob Data Contributor or Storage Blob Data Reader in order for this query to work.
Summary from the docs:
You need to have a Storage Blob Data Owner/Contributor/Reader role to
use your identity to access the data. Even if you are an Owner of a
Storage Account, you still need to add yourself into one of the
Storage Blob Data roles.
Check out the full documentation for Control Storage account access for serverless SQL pool
If your storage account is protected with firewall rules then take a look at this stack overflow answer.
Reference full docs article.
I just took your code, updated the path to one of my own files, and it worked just fine:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://XXX.dfs.core.windows.net/himanshu/NYCTaxi/PassengerCountStats.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0'
) AS [result]
Please check that the path to which you uploaded the file and the one used in the script are the same.
You can check this as follows:
Navigate to the workspace -> Data -> ADLS Gen2 -> go to the file -> right-click -> Properties, copy the URI from there, and paste it into the script.
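One way to script that same check (a sketch assuming the azure-identity and azure-storage-blob packages, with the account, file system and path taken from the question) is to test whether the blob actually exists at the path used in OPENROWSET:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

# ADLS Gen2 content is reachable through the blob endpoint as well.
blob = BlobClient(
    account_url="https://accountname.blob.core.windows.net",
    container_name="filesystemname",
    blob_name="test_file/contract.csv",
    credential=DefaultAzureCredential(),
)

# False means the path in the script does not match where the file was uploaded.
print(blob.exists())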

Get-AzMlWebService command not returning anything while updating existing ML model

I have a trained machine learning model and a web service created and deployed in Azure Machine Learning Studio (Classic). I have retrained the model with new data through Batch Execution, and the output trained model is stored as an .ilearner file in Azure Blob Storage. Now I am trying to update the previous web service with the new model through Azure PowerShell, as described in the documentation. It connects to the Azure account with the Connect-AzAccount cmdlet, but the Get-AzMlWebService cmdlet does not return any output. I have also tried many command variations (with the experiment name and resource group name).
Any help is appreciated. Thank you so much in advance.
Try Get-AmlWebService from the classic Azure ML PowerShell module.
# Get all classic Web Services in Workspace
$webServices = Get-AmlWebService
# Display them in table format
$webServices | Format-Table Id,Name,EndpointCount
# Get metadata of a specific classic Web Service with Id stored in $webSvcId
Get-AmlWebService -WebServiceId $webSvcId

How to access files stored in Azure Data Lake and use them as input to AzureBatchStep in azureml.pipeline.steps?

I registered an Azure Data Lake datastore, as described in the documentation, in order to access the files stored in it.
I used
DataReference(datastore, data_reference_name=None, path_on_datastore=None, mode='mount', path_on_compute=None, overwrite=False)
and used it as an input to an Azure pipeline step in the AzureBatchStep method.
But I got an issue: the datastore name could not be fetched for the input.
Is Azure Data Lake not accessible in Azure ML or am I getting it wrong?
Azure Data Lake is not supported as an input to AzureBatchStep. You should probably use a DataTransferStep to copy the data from ADLS to Blob storage, and then use the output of the DataTransferStep as an input to AzureBatchStep.
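A hedged sketch of that pattern with the azureml pipeline SDK is below; the datastore names, paths and the attached Data Factory compute are my own placeholders, not values from the question:

from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import DataTransferStep

ws = Workspace.from_config()
adls_store = Datastore.get(ws, "my_adls_datastore")   # registered ADLS datastore (assumed name)
blob_store = Datastore.get(ws, "workspaceblobstore")  # the workspace's default blob datastore

# Input sitting on the data lake.
adls_input = DataReference(
    datastore=adls_store,
    data_reference_name="adls_input",
    path_on_datastore="raw/input.csv",  # assumed path
)

# Intermediate location on blob storage that the transfer writes to.
copied_data = PipelineData("copied_data", datastore=blob_store)

adf_compute = ws.compute_targets["adf-compute"]  # assumed attached DataFactoryCompute

transfer = DataTransferStep(
    name="copy_adls_to_blob",
    source_data_reference=adls_input,
    destination_data_reference=copied_data,
    compute_target=adf_compute,
)

# copied_data can then be passed as an input to AzureBatchStep.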
