Reading GeoJSON in databricks, no mount point set - databricks

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!

Pandas uses the local file API for accessing files, and you accessed files on DBFS via /dbfs that provides that local file API. In your specific case, the problem is that even if you use dbutils.fs.cp, you didn't specify that you want to copy file locally, and it's by default was copied onto DBFS with path /dbfs/tmp/temp_nights (actually it's dbfs:/dbfs/tmp/temp_nights), and as result local file API doesn't see it - you will need to use /dbfs/dbfs/tmp/temp_nights instead, or copy file into /tmp/temp_nights.
But the better way would be to copy file locally - you just need to specify that destination is local - that's done with file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")

Related

How to open index.html file in databricks or browser?

I am trying to open index.html file through databricks. Can someone please let me know how to deal with it? I am trying to use GX with databricks and currently, data bricks store this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html I want to send index.html file to stakeholder
I suspect that you need to copy the whole folder as there should be images, etc. Simplest way to do that is to use Dataricks CLI fs cp command to access DBFS and copy files to the local storage. Like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open file directly in the notebook you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
data = "".join([l for l in f])
displayHTML(data)
but this will break links to images. Alternatively you can follow this approach to display Data docs inside the notebook.

Downloading S3 files in Google Colab

I am working on a project and it happens that some data is provided in form of S3fileSystem. I can read that data using S3FileSystem.open(path). But there are more than 360 files and it takes atleast 3 minutes to read a single file. I was wondering, is there any way of downloading these files in my system and read them from there, instead of reading it directly from S3fileSystem. There is another reason, although I can read all those files but once my session on colab reconnects I have to re-read all those files again, hence it will take a lot of time. I am using following code to read files
fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine= 'h5netcdf')
Is there any way of downloading those files?
You can use another s3fs to mount the bucket, then copy the files to Colab.
how to mount
After mounting, you can
!cp /s3/yourfile.zip /content/

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I’ve got a stream analytics job dumping telemetry into .json files on blob storage (soon to be adls gen2). Anyways, I want to find all .json files and use all of those files to train with. I could possibly use just new .json files as well (interesting option honestly).
Currently I just have the store mounted to a data lake and available; and it just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What does data references do for me that mounting time stamped data does not?
a. From an audit perspective, I have version control, execution time and time stamped read only data. Albeit, doing a replay on this would require additional coding, but is do-able.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. Then when you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldnt dynamically add steps for each file, I would just read all the files from your storage in a single step.
ds = ws.get_default_datastore()
blob_input_data = DataReference(
datastore=ds,
data_reference_name="data1",
path_on_datastore="folder1/")
step1 = PythonScriptStep(name="1step",
script_name="train.py",
compute_target=compute,
source_directory='./folder1/',
arguments=['--data-folder', blob_input_data],
runconfig=run_config,
inputs=[blob_input_data],
allow_reuse=False)
Then inside your train.py you access the path as
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, then its pretty easy to schedule it to run regularly. For example
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1",description="Some desc", version="1", continue_on_step_failure=True)
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
pipeline_id=pub_pipeline.id,
experiment_name='Schedule_Run_8',
recurrence=recurrence,
wait_for_provisioning=True,
description="Scheduled Run")
You could pass pointer to folder as an input parameter for the pipeline, and then your step can mount the folder to iterate over the json files.

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
cat-1.jpg
dog-1.jpg
cat-2.jpg
..
I've been trying to leverage several related example .py scripts, but most seem to be download data sets already in .rec format or containing special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file
When I try to manually create the .lst it doesn't seem to like it and it also takes too long doing manual work to be a good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another that converts them to .rec but again it has essentially the .lst already as .json and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...
Please note that [...] im2rec.py is running locally,
therefore cannot take input from the S3 bucket. To generate the list
file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
Augmented Manifest and .lst files option are quick to do since they just require you to create an annotation file with a usually quick for loop for example. RecordIO requires you to use im2rec.py tool, which is a little more work.
Using .lst files is another option that is reasonably easy: you just need to create annotation them with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
for index, cl, pic in zip(train_index, train_class, train_pics):
file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')

Spark - Folder with same name as text file automatically created after RDD?

I placed a text file named Linecount2.txt in hdfs and built a simple rdd to count the number of lines using spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I receive the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see a folder was created 'Linecount.txt'.Hence the path for the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after defining the path I was able to run it successfully.
The third time I tried this, I got the same error because input path doesn't exist.
When I went through the path,
Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt compared to /user/root/hdpcd/Linecount2.txt, (or, more simply hdpcd/Linecount2.txt, when you already are the root user)
The leading slash is very important if you want to place a file in an absolute directory other than your current user account, otherwise, that's the default.
You've not given your hdfs put command, but the issue here is simply the difference between the absolute and relative paths. And it's not Spark specifically that's the issue
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact you were able to upload twice should be an indication that your path was incorrect

Resources