I want to list the notebooks in a folder in Databricks.
I tried the utilities, e.g. dbutils.fs.ls("/path") -> it shows the path of the storage folder.
I also checked dbutils.notebook.help() - nothing useful.
Let's say there is a folder - XXYY. This folder contains multiple notebooks. I just want the list of the notebooks in that folder. Is there any command available? Any suggestions?
dbutils.fs.ls works with files on the Databricks File System (DBFS), not with workspace objects such as notebooks, so you can't use it for this.
You can get what you need via the List command of the Workspace REST API - you can call your workspace with something like this (this code doesn't include parsing of the output):
import requests
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
# the workspace list endpoint is a GET request; the path goes in the query parameters
response = requests.get(
    f'https://{host_name}/api/2.0/workspace/list',
    headers={'Authorization': f'Bearer {host_token}'},
    params={'path': '<your-path>'}
).json()
# add parsing of output here
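For example, a minimal sketch of that parsing step, assuming the response contains the usual objects list where each entry has an object_type and a path:
# keep only the notebooks from the listing
notebooks = [
    obj['path']
    for obj in response.get('objects', [])
    if obj.get('object_type') == 'NOTEBOOK'
]
print(notebooks)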
Or you can use the databricks workspace ls command from databricks-cli.
I'm trying to fetch a file from Google Drive using Apache Beam. I tried:
import apache_beam as beam

filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(filenames)
    print(lines)
This returns a string like PCollection[[19]: Create/Map(decode).None]
I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?
If you don't have complex transformations to apply, I think it's better not to use Beam in this case.
Solution 1:
You can instead use Google Colab (a Jupyter notebook on Google servers), mount your Google Drive, and use the gcloud CLI to copy files.
You can check the following links:
google-drive-to-gcs
stackoverflow-copy-file-from-google-drive-to-gcs
Solution 2:
You can also use APIs to retrieve files from Google Drive and copy them to Cloud Storage.
You can, for example, develop a Python script using the Google Python clients and the following packages:
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google-cloud-storage
This article shows an example.
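If you go that route, here is a rough sketch of the copy step - it assumes you have already obtained Drive-scoped credentials (e.g. with google-auth-oauthlib), and the function name, bucket and file id are placeholders:
import io

from google.cloud import storage
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload


def copy_drive_file_to_gcs(file_id, bucket_name, blob_name, credentials):
    # Download the file content from Google Drive into memory.
    drive = build("drive", "v3", credentials=credentials)
    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, drive.files().get_media(fileId=file_id))
    done = False
    while not done:
        _, done = downloader.next_chunk()
    buffer.seek(0)

    # Upload the downloaded bytes to the Cloud Storage bucket.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_file(buffer)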
If you want to use Beam for this, you could write a function
def read_from_gdrive_and_yield_records(path):
...
and then use it like
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')
Though as mentioned, unless you have a lot of files, this may be overkill.
I am trying to open an index.html file through Databricks. Can someone please let me know how to deal with it? I am using GX (Great Expectations) with Databricks, and currently Databricks stores this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html. I want to send the index.html file to a stakeholder.
I suspect that you need to copy the whole folder, as there should be images, etc. The simplest way to do that is to use the Databricks CLI fs cp command to access DBFS and copy files to local storage, like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open the file directly in the notebook you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
    data = f.read()
displayHTML(data)
but this will break links to images. Alternatively you can follow this approach to display Data docs inside the notebook.
We have recently made changes to how we connect to ADLS from Databricks, which have removed mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
1. Why were we previously having to use "/dbfs/" when reading in GeoJSON but not CSV files, pre-changes to our environment?
2. What is the correct way to read GeoJSON files into Databricks without a mount point set?
3. Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas (and GeoPandas) use the local file API for accessing files; you were accessing files on DBFS via /dbfs, which exposes DBFS through that local file API. In your specific case the problem is that, even though you used dbutils.fs.cp, you didn't specify that the destination is local, so by default the file was copied onto DBFS under the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
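To make the path mapping concrete, here is a small check you can run in the notebook (paths taken from the question; it assumes the cluster still exposes the /dbfs FUSE mount):
import os

# dbutils.fs (and Spark) address the copied file through the DBFS scheme:
display(dbutils.fs.ls("dbfs:/dbfs/tmp/"))

# geopandas/pandas go through the local file API, i.e. the /dbfs mount:
print(os.path.exists("/dbfs/dbfs/tmp/temp_nights"))  # where the copy actually landed
print(os.path.exists("/dbfs/tmp/temp_nights"))       # the path the code was reading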
But the better way is to copy the file to local disk - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
              "file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?
Here is the error I get when following the documented approach in the UI.
When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explorer tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet in the Consume tab, I get this error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I just select one of the files instead of dirname or dirname/**, then everything works. Does Azure ML actually support datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container, data. In there is a directory testdata containing testfile1.txt and testfile2.txt.
In Azure ML I created a datastore testdatastore, where I selected the data container of my storage account.
Then in Azure ML I create a dataset from the datastore, select "file dataset" and the datastore above. Then I can browse the files, select a folder, and tick that files in subdirectories should be included. This creates the path testdata/**, which does not work as described above.
I got the same issue when creating the dataset and datastore in Python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give list permissions to the datastore.
This would not be a problem if you registered the datastore with an account key, but it could be a limitation of the access token.
The second part of the error, "the provided path is not valid or the files could not be accessed", refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the default blob store that was provisioned with your ML workspace.
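For example, a quick check along those lines - a sketch that assumes you have uploaded a testdata folder to the workspace's default datastore:
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# The default datastore is registered with the account key, so listing works.
default_store = ws.get_default_datastore()

# The folder/** pattern should resolve to every file under testdata.
check_ds = Dataset.File.from_files(path=[(default_store, "testdata/**")])
print(check_ds.to_path())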
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)
logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")
logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")
I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I've got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all .json files and use all of those files to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted as a data lake and available, and the code just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
a. From an audit perspective, I have version control, execution time, and time-stamped, read-only data. Albeit doing a replay on this would require additional coding, it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. Then when you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file, I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ds = ws.get_default_datastore()

blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
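From there, iterating the folder is plain Python - a minimal sketch (the recursive glob and the json parsing are assumptions based on the .json telemetry files mentioned in the question):
import glob
import json
import os

# Collect every .json file under the data folder passed in by the pipeline.
json_files = glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True)

records = []
for path in json_files:
    with open(path) as f:
        records.append(json.load(f))

print('Loaded', len(records), 'telemetry files')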
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, then it's pretty easy to schedule it to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc", version="1", continue_on_step_failure=True)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder and iterate over the JSON files.