AzureML create dataset from datastore with multiple files - path not valid - azure-machine-learning-service

I am trying to create a dataset in Azure ML where the data source are multiple files (eg images) in a Blob Storage. How do you do that correctly?
Here is the error I get when following the documented approach in the UI.
When I create the dataset in the UI and select the blob storage and a directory with either just dirname or dirname/**, the files cannot be found in the Explore tab and I get the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get this error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I select just one of the files instead of dirname or dirname/**, everything works. Does AzureML actually support datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container, data. In it there is a directory testdata containing testfile1.txt and testfile2.txt.
In AzureML I created a datastore testdatastore that points to the data container in my storage account.
Then in Azure ML I create a dataset from the datastore, select File dataset and the datastore above. I can then browse the files, select a folder, and choose the option to include files in subdirectories. This creates the path testdata/**, which does not work as described above.
I got the same issue when creating the dataset and datastore in python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")

Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permission.
This would not be a problem if you registered the datastore with an account key, but it can be a limitation of the access token.
The second part of the error, the provided path is not valid or the files could not be accessed, refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the default blob store that was provisioned with your ML workspace.
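If the datastore was indeed created with a SAS token lacking list permission, one option is to re-register it with the storage account key instead. A minimal sketch, with placeholder account and key names that are not taken from the original post:
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Re-register the blob container as a datastore using the account key,
# which grants the list permission needed for folder/** paths.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="<storage_account_name>",
    account_key="<account_key>",
    overwrite=True,
)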

I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)
logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")
logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Related

.jpg file not loading in databricks from blob storage (Azure data lake)

I have the .jpg pictures in the data lake in my blob storage. I am trying to load the pictures and display them for testing purposes but it seems like they can't be loaded properly. I tried a few solutions but none of them showed the pictures.
path = 'abfss://dev@storage.dfs.core.windows.net/username/project_name/test.jpg'
spark.read.format("image").load(path) -- Method 1
display(spark.read.format("binaryFile").load(pic)) -- Method 2
Method 1 produced the output below. It looks like a binary file (converted from jpg to binary), which is why I tried Method 2, but that did not load anything either.
Out[51]: DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>]
For Method 2, I saw this error when I ran it:
SecurityException: Your administrator has forbidden Scala UDFs from being run on this cluster
I cannot install libraries very easily; they need to be reviewed and approved by the administrators first, so please suggest something with Spark and/or Python libraries if possible. Thanks.
Edit:
I added these two lines and it looks like the image has been read, but it cannot be loaded for some reason. I am not sure what's going on. The goal is eventually to read and decode the pictures properly, but that cannot happen until the image is loaded correctly.
df = spark.read.format("image").load(path)
df.show()
df.printSchema()
I tried to reproduce the same thing in my environment by loading the dataset into Databricks, and got the results below.
Mount an Azure Data Lake Storage Gen2 Account in Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "xxxxxxxxx",
"fs.azure.account.oauth2.client.secret": "xxxxxxxxx",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/xxxxxxxxx/oauth2/v2.0/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<folder_name>",
mount_point = "/mnt/<folder_name>",
extra_configs = configs)
Or, mount the storage account in Databricks using an access key:
dbutils.fs.mount(
    source = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/df1/",
    extra_configs = {"fs.azure.account.key.<storage_account_name>.blob.core.windows.net": "<access_key>"})
Now, using the code below, I was able to read the images:
sample_img_dir= "/mnt/df1/"
image_df = spark.read.format("image").load(sample_img_dir)
display(image_df)
Update: you can also set the storage account access key directly on the Spark session instead of mounting:
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net","<access_key>")
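With the key set this way, the files can be read directly over abfss without a mount point; a minimal sketch with hypothetical placeholder names:
# Read images straight from the ADLS Gen2 path; no mount point needed.
direct_path = "abfss://<container>@<storage_account>.dfs.core.windows.net/<folder_name>/"
image_df = spark.read.format("image").load(direct_path)
display(image_df)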
Reference:
Mounting Azure Data Lake Storage Gen2 Account in Databricks By Ron L'Esteve.
Finally, after spending hours on this, I found a solution that was pretty straightforward but drove me crazy. I was on the right path but needed to display the pictures using the "binaryFile" format in Spark. Here is what worked for me.
## importing libraries
import io
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
## Directory
path_dir = 'abfss://dev@container.dfs.core.windows.net/username/projectname/a.jpg'
## Reading the images
df = spark.read.format("binaryFile").load(path_dir)
## Selecting the path and content
df = df.select('path','content')
## Taking out the image
image_list = df.select("content").rdd.flatMap(lambda x: x).collect()
image = mpimg.imread(io.BytesIO(image_list[0]), format='jpg')
plt.imshow(image)
It looks like binaryFile is the right format, at least in this case, and the code above was able to decode the image successfully.
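For a whole folder of pictures, a similar pattern should work by pointing the binaryFile reader at the directory. The sketch below assumes Spark 3+, where the pathGlobFilter option can restrict the load to .jpg files, and reuses the imports from the snippet above.
# Load every .jpg in the directory as binary content and display each image.
dir_path = 'abfss://dev@container.dfs.core.windows.net/username/projectname/'
df_all = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.jpg")
          .load(dir_path))
for raw in df_all.select("content").rdd.flatMap(lambda x: x).collect():
    plt.figure()
    plt.imshow(mpimg.imread(io.BytesIO(raw), format='jpg'))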

Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function

I'd like to hear whether it's possible to improve the code below so it runs faster (and maybe cheaper) as part of an Azure Function that combines multiple CSV files from a source blob storage container into one CSV file in a target blob storage container on Azure, using Python (note that I'd also be fine with using a library other than pandas if need be).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO
# Used for getting access to secrets on Azure key vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')
# Connecting to a source Azure blob storage container where multiple CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)
# Connecting to a target Azure blob storage container to where the CSV files from the source should be combined into one CSV file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)
# Retrieve list of the blob storage names from the source Azure blob storage container, but only those that end with the .csv file extension
blobNames = [name.name for name in blob_block_source.list_blobs()]
only_csv_blob_names = list(filter(lambda x:x.endswith(".csv") , blobNames))
# Creating a list of dataframes - one dataframe from each CSV file found in the source Azure blob storage container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    blob_text = blob_block_source.download_blob(csv_blobname, encoding='utf-8').content_as_text(encoding='utf-8')
    df = pd.read_csv(StringIO(blob_text), encoding='utf-8', header=0, low_memory=False)
    listOfCsvDataframes.append(df)
# Concatenating the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)
# Creating a CSV object from the concatenated dataframe
outputCSV = df_concat.to_csv(index=False, sep = ',', header = True)
# Upload the combined dataframes as a CSV file (i.e. the CSV files have been combined into one CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite = True)
Instead of using an Azure Function, you can use Azure Data Factory to concatenate your files.
It will probably be more efficient with ADF than with an Azure Function running pandas.
Take a look at this blog post: https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to use an Azure Function, try to concatenate the files without using pandas. If all your files have the same columns in the same order, you can concatenate the strings directly and drop the header line, if any, from every file but the first, as in the sketch below.
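For illustration, a minimal sketch of that approach, reusing the blob_block_source, only_csv_blob_names and blob_block_target objects from the question and assuming every file carries a header row with identical columns:
# Concatenate the CSV blobs as plain text, keeping only the first file's header.
combined_parts = []
for i, csv_blobname in enumerate(only_csv_blob_names):
    text = blob_block_source.download_blob(csv_blobname).content_as_text(encoding='utf-8')
    lines = text.splitlines()
    combined_parts.extend(lines if i == 0 else lines[1:])
combined_csv = "\n".join(combined_parts) + "\n"
blob_block_target.upload_blob('combinedCSV.csv', combined_csv, blob_type="BlockBlob", overwrite=True)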

Save and load a spacy model to a google cloud storage bucket

I have a spacy model and I am trying to save it to a gcs bucket using this format
trainer.to_disk('gs://{bucket-name}/model')
But each time I run this I get this error message
FileNotFoundError: [Errno 2] No such file or directory: 'gs:/{bucket-name}/model'
Also, when I create a Kubeflow persistent volume and save the model there so I can load it with trainer.load('model'), I get this error message:
File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 175, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model '/model/'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I don't understand why I am getting these errors, as this works perfectly when I run it locally on my PC with a local path.
Cloud Storage is not a local disk or a physical storage unit that you can save things to directly.
As you say, this works when you "run this on my pc locally and use a local path"; Cloud Storage is not a local path from the point of view of your code or of any other tool in the cloud.
If you are using Python, you will have to create a client using the Storage library and then upload your file using upload_blob, i.e.:
from google.cloud import storage
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
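Since trainer.to_disk writes a directory rather than a single file, one way to tie this together is to save the model locally first and then upload every file under that folder. This is a sketch with placeholder bucket and path names, not a definitive implementation:
import os
from google.cloud import storage

def upload_directory(bucket_name, source_dir, destination_prefix):
    """Upload every file under source_dir to gs://bucket_name/destination_prefix/..."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    for root, _, files in os.walk(source_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, source_dir)
            bucket.blob(f"{destination_prefix}/{relative_path}").upload_from_filename(local_path)

# Save the spaCy pipeline to a local folder, then push the whole folder to the bucket.
trainer.to_disk("model")
upload_directory("your-bucket-name", "model", "model")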
Since you've tagged this question "kubeflow-pipelines", I'll answer from that perspective.
KFP strives to be platform-agnostic. Most good components are cloud-independent.
KFP promotes system-managed artifact passing, where the component code only writes output data to local files and the system picks it up and makes it available to other components.
So it's best to describe your spaCy model trainer that way: it should write its data to local files. Check how other components work, for example Train Keras classifier.
Since you want to upload to GCS, do that explicitly by passing the model output of your trainer to an "Upload to GCS" component:
upload_to_gcs_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/616542ac0f789914f4eb53438da713dd3004fba4/components/google-cloud/storage/upload_to_explicit_uri/component.yaml')
def my_pipeline():
    model = train_spacy_model(...).outputs['model']
    upload_to_gcs_op(
        data=model,
        gcs_path='gs:/.....',
    )
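As a usage note (a sketch assuming the KFP v1 SDK and that a train_spacy_model component exists), the pipeline function above would then be compiled or submitted like any other KFP pipeline:
import kfp

# Compile to a package file, or submit a run directly against a KFP endpoint.
kfp.compiler.Compiler().compile(my_pipeline, 'my_pipeline.yaml')
# kfp.Client(host='<kfp_endpoint>').create_run_from_pipeline_func(my_pipeline, arguments={})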
The following implementation assumes you have gsutil installed on your computer. The spaCy version used was 3.2.4. In my case, I wanted everything to be part of a single (demo) Python file, spacy_import_export.py. To do so, I had to use the subprocess Python library, plus this comment, as follows:
# spacy_import_export.py
import spacy
import subprocess # Will be used later
# spaCy models trained by the user are always stored as LOCAL directories, with more subdirectories and files inside.
PATH_TO_MODEL = "/home/jupyter/" # Use your own path!
# Test-loading your "trainer" (optional step)
trainer = spacy.load(PATH_TO_MODEL+"model")
# Replace 'bucket-name' with the one of your own:
bucket_name = "destination-bucket-name"
GCS_BUCKET = "gs://{}/model".format(bucket_name)
# This does the trick for the UPLOAD to Cloud Storage:
# TIP: Just for security, check Cloud Storage afterwards: "model" should be in GCS_BUCKET
subprocess.run(["gsutil", "-m", "cp", "-r", PATH_TO_MODEL+"model", GCS_BUCKET])
# This does the trick for the DOWNLOAD:
# HINT: By now, in PATH_TO_MODEL, you should have a "model" & "downloaded_model"
subprocess.run(["gsutil", "-m", "cp", "-r", GCS_BUCKET+MODEL_NAME+"/*", PATH_TO_MODEL+"downloaded_model"])
# Test-loading your "GCS downloaded model" (optional step)
nlp_original = spacy.load(PATH_TO_MODEL+"downloaded_model")
I apologize for the excess of comments; I just wanted to make everything clear for "spaCy newcomers". I know it is a bit late, but I hope it helps.

Can't use wildcard with Azure Data Lake Gen2 files

I was able to properly connect my Data Lake Gen2 storage account to my Azure ML workspace. But when I try to read a specific set of Parquet files from the datastore, it takes forever and never loads them.
The code looks like:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath
ws = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(ws, 'my-datastore')
files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet'
dataset = Dataset.Tabular.from_parquet_files(path=[DataPath(datastore, files_path)], validate=False)
df = dataset.take(1000)
df.to_pandas_dataframe()
Each of these Parquet files is approx. 300 kB, and there are 200 of them in the folder, generic and straight out of Databricks. The strange thing is that when I try to read one single Parquet file from the exact same folder, it runs smoothly.
Second, other folders that contain fewer than, say, 20 files also run smoothly, so I eliminated the possibility that this was due to some connectivity issue. Even stranger, I tried a wildcard like the following:
# files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/part-00000-*.parquet'
Theoretically this should point me only to the 00000 file, but it also does not load. Super weird.
To try to overcome this, I have tried to connect to the Data Lake through ADLFS with Dask, and it just works. I know this can be a workaround for processing "large" datasets/files, but it would be super nice to do it straight from the Dataset class methods.
Any thoughts?
EDIT: typo
The issue can be solved if you update some packages with the following command:
pip install --upgrade azureml-dataprep azureml-dataprep-rslex
According to some folks at Microsoft, this fix will also be included in the next azureml.core update.

load csv and set parameters in jupyter notebook on Azure ML

I'm using a Python 3.4 Jupyter notebook to load a dataset that is stored in the cloud in my Azure ML project environment. But using the default template created by Azure ML, I can't load the data due to a mixed datatypes error.
from azureml import Workspace
import pandas as pd
ws = Workspace()
ds = ws.datasets['rossmann-train.csv']
df = ds.to_dataframe()
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/kernel/main.py:6: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
In my local environment I just import the dataset as follows:
df = pd.read_csv('train.csv',low_memory=False)
But I'm not sure how to do this in azure using the ds object.
df = pd.read_csv(ds)
and
pd.DataFrame.from_csv(ds)
raise the error:
OSError: Expected file path name or file-like object, got type
*edit: more info on the ds object:
In [1]: type(ds)
Out [1]: azureml.SourceDataset
In [2]: print (ds)
Out [2]: rossmann-train.csv
First of all, I am not sure from your question what the ds object is, but I'm pretty sure it is not a csv file; if it were, you'd have processed it yourself and you wouldn't be asking this question.
Now, I am not sure whether pandas has a native way of dealing with Azure, but this piece of documentation indicates that you must first download the data from Azure, using their package, and save it to your local file system.
For that, however, they assume the data you downloaded is already in csv format. If not, use the appropriate reader (or parse it by hand) to tabulate the data for a pandas.DataFrame.
According to the docs on the azureml library, one workaround is to import the file as text and then parse it into csv, although this seems unnecessary since the data is already recognised as having a csv structure.
text_data = ds.read_as_text()
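Building on that, a hedged sketch of turning the raw text into a DataFrame with the desired dtype handling (assuming read_as_text returns the full csv contents as a string):
from io import StringIO
import pandas as pd

# Parse the raw CSV text and pass the usual read_csv options.
df = pd.read_csv(StringIO(text_data), low_memory=False)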
