.jpg file not loading in databricks from blob storage (Azure data lake) - python-3.x

I have .jpg pictures in my Azure Data Lake (blob storage). I am trying to load and display them for testing purposes, but they don't load properly. I tried a few approaches and none of them displayed the pictures.
path = 'abfss://dev@storage.dfs.core.windows.net/username/project_name/test.jpg'
spark.read.format("image").load(path) -- Method 1
display(spark.read.format("binaryFile").load(pic)) -- Method 2
Method 1 produced the output below. It looks like the image is read as a binary struct (converted from jpg to binary), which is why I tried Method 2, but that did not display anything either.
Out[51]: DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>]
For Method 2, I got this error when I ran it:
SecurityException: Your administrator has forbidden Scala UDFs from being run on this cluster
I cannot install libraries easily; they need to be reviewed and approved by the administrators first, so please suggest something using Spark and/or standard Python libraries if possible. Thanks.
Edit:
I added these two lines and it looks like the image has been read, but it cannot be displayed for some reason. I am not sure what's going on. The goal is to read and eventually decode the pictures, but that cannot happen until the image loads properly.
df = spark.read.format("image").load(path)
df.show()
df.printSchema()

I tried to reproduce the same in my environment by loading the dataset into Databricks and got the results below:
Mount an Azure Data Lake Storage Gen2 Account in Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "xxxxxxxxx",
"fs.azure.account.oauth2.client.secret": "xxxxxxxxx",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/xxxxxxxxx/oauth2/v2.0/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<folder_name>",
mount_point = "/mnt/<folder_name>",
extra_configs = configs)
Or
Mount storage account with databricks using access key:
dbutils.fs.mount(
    source = "wasbs://<container>@<storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/df1/",
    extra_configs = {"fs.azure.account.key.<storage_account_name>.blob.core.windows.net": "<access_key>"})
Now, using the code below, I got these results.
sample_img_dir = "/mnt/df1/"
image_df = spark.read.format("image").load(sample_img_dir)
display(image_df)
Update: alternatively, you can set the storage account key directly in the Spark configuration and access the storage without mounting:
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net","<access_key>")
Reference:
Mounting Azure Data Lake Storage Gen2 Account in Databricks by Ron L'Esteve.

Finally, after spending hours on this, I found a solution that was pretty straightforward but drove me crazy. I was on the right path but needed to read the pictures using the "binaryFile" format in Spark and then decode them myself. Here is what worked for me.
## importing libraries
import io
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
## Directory
path_dir = 'abfss://dev@container.dfs.core.windows.net/username/projectname/a.jpg'
## Reading the images
df = spark.read.format("binaryFile").load(path_dir)
## Selecting the path and content
df = df.select('path','content')
## Extracting the image bytes and decoding the first one
image_list = df.select("content").rdd.flatMap(lambda x: x).collect()
image = mpimg.imread(io.BytesIO(image_list[0]), format='jpg')
plt.imshow(image)
It looks like binaryFile is the right format, at least in this case, and the code above was able to decode the image successfully.
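If path_dir pointed at a folder of .jpg files instead of a single file, the same approach should also work for previewing several of them; a small sketch building on the code above:

## Previewing the first few images from the collected list
for content in image_list[:4]:
    plt.figure()
    plt.imshow(mpimg.imread(io.BytesIO(content), format='jpg'))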

Related

Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function

I'd like to hear whether it's possible to improve the code below to make it run faster (and maybe cheaper) as part of an Azure Function that combines multiple CSV files from a source blob storage container into one CSV file in a target blob storage container on Azure, using Python (note that it would also be fine for me to use a library other than pandas if need be).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO
# Used for getting access to secrets on Azure key vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')
# Connecting to a source Azure blob storage container where multiple CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)
# Connecting to a target Azure blob storage container to where the CSV files from the source should be combined into one CSV file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)
# Retrieve list of the blob storage names from the source Azure blob storage container, but only those that end with the .csv file extension
blobNames = [name.name for name in blob_block_source.list_blobs()]
only_csv_blob_names = list(filter(lambda x:x.endswith(".csv") , blobNames))
# Creating a list of dataframes - one dataframe from each CSV file found in the source Azure blob storage container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    df = pd.read_csv(
        StringIO(blob_block_source.download_blob(csv_blobname, encoding='utf-8').content_as_text(encoding='utf-8')),
        encoding='utf-8', header=0, low_memory=False)
    listOfCsvDataframes.append(df)
# Concatenating the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)
# Creating a CSV object from the concatenated dataframe
outputCSV = df_concat.to_csv(index=False, sep = ',', header = True)
# Upload the combined dataframes as a CSV file (i.e. the CSV files have been combined into one CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite = True)
Instead of using an Azure Function, you can use Azure Data Factory (ADF) to concatenate your files.
ADF will probably be more efficient than an Azure Function running pandas.
Take a look at this blog post: https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to stick with an Azure Function, try to concatenate the files without using pandas. If all your files have the same columns in the same order, you can concatenate the strings directly and drop the header line, if any, from every file but the first; see the sketch below.
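A rough pandas-free version of the loop could look like this (a sketch only, reusing the ContainerClient objects from your code and assuming every CSV file has the same header):

# Concatenate the CSV blobs as plain text, keeping only the first header
csv_parts = []
for i, csv_blobname in enumerate(only_csv_blob_names):
    text = blob_block_source.download_blob(csv_blobname).content_as_text(encoding='utf-8')
    lines = text.splitlines()
    if i > 0:
        lines = lines[1:]  # drop the header line of every file but the first
    csv_parts.append("\n".join(lines))

blob_block_target.upload_blob('combinedCSV.csv', "\n".join(csv_parts),
                              blob_type="BlockBlob", overwrite=True)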

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file, a lot of files with different names are created in the location I specified, and I am not sure why. Also, another file with the name "new_location" was created after I performed the write operation.
What I want is that, after reading the file from Azure Blob Storage, I write it back to the same location with the same name as the original, but I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage and I am now reading a CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv".
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
Spark saves a partial CSV file for each partition of your dataset. To generate a single CSV file, you can convert it to a pandas DataFrame and then write that out.
Try changing these lines:
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source are multiple files (eg images) in a Blob Storage. How do you do that correctly?
Here is the error I get when following the documented approach in the UI.
When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explorer tab, and I get the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get the following error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I select just one of the files instead of dirname or dirname/**, everything works. Does AzureML actually support datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container, data. In it there is a directory testdata containing testfile1.txt and testfile2.txt.
In AzureML I created a datastore testdatastore and selected the data container of my storage account for it.
Then in Azure ML I create a dataset from the datastore, choosing "file dataset" and the datastore above. I can then browse the files, select a folder, and tick that files in subdirectories should be included. This creates the path testdata/**, which does not work as described above.
I get the same issue when creating the dataset and datastore in Python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions granted when the "mydatastore" datastore was created (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give list permissions to the datastore.
This would not be a problem if you registered the datastore with an account key, but it can be a limitation of the SAS token; see the sketch below.
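For reference, re-registering the datastore with an account key would look roughly like this (a sketch only; the container and account names are taken from your error message and the ws object from your snippet, and may differ in your setup):

from azureml.core import Datastore

# Sketch: register the blob container with an account key so list permissions are available
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="mystoragename",
    account_key="<account_key>",
)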
The second part of the error message, "the provided path is not valid or the files could not be accessed", refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the default blob store that was provisioned with your ML workspace, for example:
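A minimal sketch of that check, assuming a copy of the testdata folder has been uploaded to the default datastore:

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
default_ds = ws.get_default_datastore()

# The folder/** pattern should resolve to every file under testdata/
check_ds = Dataset.File.from_files(path=[(default_ds, 'testdata/**')])
print(check_ds.to_path())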
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)
logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")
logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Can't use wildcard with Azure Data Lake Gen2 files

I was able to connect my Data Lake Gen2 storage account to my Azure ML workspace without problems. But when I try to read a specific set of Parquet files from the datastore, it takes forever and never loads.
The code looks like:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath
ws = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(ws, 'my-datastore')
files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet'
dataset = Dataset.Tabular.from_parquet_files(path=[DataPath(datastore, files_path)], validate=False)
df = dataset.take(1000)
df.to_pandas_dataframe()
Each of these Parquet files is approx. 300 kB, and there are 200 of them in the folder, generic and straight out of Databricks. The strange thing is that when I try to read one single Parquet file from the exact same folder, it runs smoothly.
Also, other folders that contain fewer than, say, 20 files run smoothly as well, so I ruled out a connectivity issue. Even stranger, I tried a narrower wildcard like the following:
# files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/part-00000-*.parquet'
Theoretically this should only point me to the 00000 file, but it does not load either. Super weird.
To try to overcome this, I have tried to connect to the Data Lake through ADLFS with Dask, and it just works. I know this can be a workaround for processing "large" datasets/files, but it would be super nice to do it straight from the Dataset class methods.
Any thoughts?
EDIT: typo
The issue can be solved if you update some packages with the following command:
pip install --upgrade azureml-dataprep azureml-dataprep-rslex
This is something that is expected to be fixed in the next azureml.core update, as I was told by some folks at Microsoft.

reading a csv file from azure blob storage with PySpark

I'm trying to do a machine learning project using a PySpark HDInsight cluster on Microsoft Azure. To work with my cluster I use a Jupyter notebook. My data (a CSV file) is stored in Azure Blob Storage.
According to the documentation, the syntax of the path to my file is:
path = 'wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
However, when I try to read the CSV file with the following command:
csvFile = spark.read.csv(path, header=True, inferSchema=True)
I get the following error:
'java.net.URISyntaxException: Illegal character in scheme name at index 4: wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
Any ideas on how to fix this?
It is either (unencrypted):
wasb://...
or (encrypted):
wasbs://...
not
wasb[s]://...
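For example, with the [s] removed and using the container and account names from your question, the read should look roughly like this (untested sketch):

# 'wasbs' (TLS) variant of the path from the question
path = 'wasbs://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
csvFile = spark.read.csv(path, header=True, inferSchema=True)
csvFile.show(5)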
