Can't use wildcard with Azure Data Lake Gen2 files - azure

I was able to properly connect my Data Lake Gen2 Storage Account to my Azure ML Workspace. However, when I try to read a specific set of Parquet files from the Datastore, it takes forever and never loads.
The code looks like:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath
ws = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(ws, 'my-datastore')
files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet'
dataset = Dataset.Tabular.from_parquet_files(path=[DataPath(datastore, files_path)], validate=False)
df = dataset.take(1000)
df.to_pandas_dataframe()
Each of these Parquet files has approx. 300 kB. There are 200 of them in the folder - generic files, straight out of Databricks. The strange part is that when I try to read one single Parquet file from the exact same folder, it runs smoothly.
Second, other folders that contain fewer than, say, 20 files also run smoothly, so I ruled out a connectivity issue. Even stranger, I tried narrowing the wildcard like the following:
# files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/part-00000-*.parquet'
Theoretically this should match only the 00000 file, but it also fails to load. Super weird.
To work around this, I connected to the Data Lake through ADLFS with Dask, and it just works. I know this can be a workaround for processing "large" datasets/files, but it would be super nice to do it straight from the Dataset class methods.
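For reference, the Dask read looks roughly like this (a minimal sketch - the account name, key and container below are placeholders, not my real values):
import dask.dataframe as dd

# adlfs credentials; replace with your own storage account name/key
storage_options = {
    "account_name": "<storage-account-name>",
    "account_key": "<storage-account-key>",
}

# the same wildcard path that hangs with Dataset.Tabular works here
ddf = dd.read_parquet(
    "abfs://my-container/Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet",
    storage_options=storage_options,
    engine="pyarrow",
)
df = ddf.head(1000)  # first 1000 rows as a pandas DataFrame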
Any thoughts?
EDIT: typo

The issue can be solved if you update some packages with the following command:
pip install --upgrade azureml-dataprep azureml-dataprep-rslex
According to some folks at Microsoft, this is something that will come out fixed in the next azureml.core update.

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes have removed the mount point, and we are now reading files in using a string of the form:
"wasbs://Z@Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why did we previously have to use "/dbfs/" when reading in GeoJSON but not CSV files, before the changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API for accessing files, and you accessed files on DBFS via /dbfs, which provides that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you want to copy the file locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way would be to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?
Here is the error I get when following the documented approach in the UI.
When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the explorer tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet in the consume tab, I get the error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I just select one of the files instead of dirname or dirname/** then everything works. Does AzureML actually support Datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container named data. In there is a directory testdata containing testfile1.txt and testfile2.txt.
In AzureML I created a datastore testdatastore and there I select the data container in my data storage.
Then in Azure ML I create a Dataset from datastore, select file dataset and the datastore above. I can then browse the files, select a folder and choose to include files in subdirectories. This creates the path testdata/**, which does not work as described above.
I get the same issue when creating the dataset and datastore in Python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permissions.
This would not be a problem if you registered the datastore with an account key, but it can be a limitation of the access token.
The second part of the error, "the provided path is not valid or the files could not be accessed", refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the default blob store that was provisioned with your ML workspace.
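For example, re-registering the datastore with the storage account key rather than a SAS token would look roughly like this (a sketch - the container name comes from the question, the account name and key are placeholders):
from azureml.core import Datastore, Workspace

ws = Workspace.from_config()

# an account-key datastore has list permissions, which folder/** paths require
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",
)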
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)
logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")
logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I’ve got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all the .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the data lake store mounted and available, and the code simply iterates over the mount for the data files and loads them up.
How can I use data references for this instead?
What does data references do for me that mounting time stamped data does not?
a. From an audit perspective, I have version control, execution times and time-stamped read-only data. Admittedly, doing a replay on this would require additional coding, but it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. When you get a handle to that folder in your training code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ds = ws.get_default_datastore()
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")
step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
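From there you can treat the folder like any local directory. A rough sketch of what the rest of train.py could do (the glob pattern and loading logic are illustrative, not part of the original answer):
import glob
import json
import os

# find every .json telemetry file under the mounted/downloaded folder
json_files = glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True)
print('Found', len(json_files), 'telemetry files')

records = []
for path in json_files:
    with open(path) as f:
        records.append(json.load(f))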
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, it's pretty easy to schedule the pipeline to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter of the pipeline, and then your step can mount the folder and iterate over the JSON files.
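A rough sketch of that approach, using a DataPath pipeline parameter (names are illustrative; ds, compute and run_config are assumed to exist as in the earlier snippet):
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# default folder on the datastore; can be overridden per pipeline submission
data_path = DataPath(datastore=ds, path_on_datastore='folder1/')
datapath_param = PipelineParameter(name="input_datapath", default_value=data_path)
datapath_input = (datapath_param, DataPathComputeBinding(mode='mount'))

step = PythonScriptStep(name="train_step",
                        script_name="train.py",
                        compute_target=compute,
                        source_directory='./folder1/',
                        arguments=['--data-folder', datapath_input],
                        inputs=[datapath_input],
                        runconfig=run_config,
                        allow_reuse=False)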

Azure Storage Explorer: Properties of type '' are not supported error

I inherited a project that uses an Azure table storage database. I'm using Microsoft Azure Storage Explorer as a tool to query and manage data. I'm attempting to migrate data from my Dev database to my QA database. To do this, I'm exporting a CSV from a Dev database table and then trying to import into the QA database table. For a small number of tables, I get the following error when I try to import the CSV:
Failed: Properties of type '' are not supported.
When I ran into this before, since I exported a "typed" CSV from Dev, I checked to make sure all "@type" columns had values. They did. Then I split the CSV (with thousands of records) up into smaller files to try to determine which record was the issue. When I did this and started importing them, I was ultimately able to import all of the records successfully via the individual files, which is peculiar. It is almost like a constraint violation issue.
I'm also seeing errors with different types, e.g.:
Properties of type 'Double' are not supported.
In this case, there is already a column in the particular table of type "Double".
Anyway, now that I'm seeing it again, I'm having trouble resolving it. Any thoughts?
UPDATE
I was able to track a few of these errors to "bad" data in the CSV. It was a JSON string in an Edm.String field that, for some reason, the import wasn't liking. I minified the JSON using an online tool and it imported fine. There is one data set, though, with over 7,000 records that I'm trying to import (the one I referenced breaking up earlier in this post). I ended up breaking it up into different files and was able to successfully import them individually. When I try to import the entire file after loading all the data through individual files, though, I again get an error.
I split the CSV (with thousands of records) up into smaller files to try to determine which record was the issue. When I did this and started importing them, I was ultimately able to import all of the records successfully by individual files which is peculiar.
Based on your test, the format and data of the source CSV file seem OK. It will be difficult to find out why Azure Storage Explorer returns those unexpected errors while importing a large CSV file. You can try upgrading Azure Storage Explorer and check whether you can export and import the data successfully with the latest version.
Besides, you can try AzCopy (designed for copying data to and from Microsoft Azure Blob, File, and Table storage using simple commands with optimal performance) to export/import the table.
Export table:
AzCopy /Source:https://myaccount.table.core.windows.net/myTable/ /Dest:C:\myfolder\ /SourceKey:key /Manifest:abc.manifest
Import table:
AzCopy /Source:C:\myfolder\ /Dest:https://myaccount.table.core.windows.net/mytable1/ /DestKey:key /Manifest:"abc.manifest" /EntityOperation:InsertOrReplace

How to export data from a dataframe to a file databricks

I'm currently doing the Introduction to Spark course at edX.
Is there a possibility to save DataFrames from Databricks to my computer?
I'm asking because this course provides Databricks notebooks which probably won't work after the course ends.
In the notebook, data is imported using the command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
                                        'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs a cloud VM and does not have any idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the file store and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
The download link in this case (for a Databricks West Europe instance) is:
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume the 1 million row limit that applies when downloading via the answer mentioned by @MrChristine does not apply here.
Try this.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the file onto the Unix server's local filesystem.
If you give only /home/yphani/datacsv, it looks for the path on DBFS/HDFS instead.
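If you go that route and still want the file on your own machine, one option (a sketch combining it with the FileStore approach above) is to copy the locally written output into the FileStore and then download it via the /files/ URL:
# copy the locally written CSV directory into DBFS FileStore so it becomes downloadable
dbutils.fs.cp("file:///home/yphani/datacsv", "dbfs:/FileStore/df/datacsv", recurse=True)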
