Save and load a spacy model to a google cloud storage bucket - python-3.x

I have a spacy model and I am trying to save it to a gcs bucket using this format
trainer.to_disk('gs://{bucket-name}/model')
But each time I run this I get this error message
FileNotFoundError: [Errno 2] No such file or directory: 'gs:/{bucket-name}/model'
Also, when I create a Kubeflow persistent volume, save the model there, and then try to load it back with trainer.load('model'), I get this error message:
File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 175, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model '/model/'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I don't understand why I am having these errors as this works perfectly when I run this on my pc locally and use a local path.

Cloud Storage is not a local disk or a physical storage unit that you can save things to directly.
As you say, this works on your PC because you use a local path; a Cloud Storage bucket is not a local path, neither for your machine nor for any other tool in the cloud.
If you are using Python, you have to create a client with the Cloud Storage library and then upload your file with upload_blob, e.g.:
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
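Note that upload_blob handles a single file, while a spaCy model written with to_disk is a whole directory. A minimal sketch (the helper name upload_model_dir is my own, not part of the library) that saves the model locally first and then uploads every file in it with the same client API:
import os
from google.cloud import storage

def upload_model_dir(bucket_name, local_model_dir, gcs_prefix):
    """Upload every file of a locally saved spaCy model directory to GCS."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    for root, _, files in os.walk(local_model_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            # keep the directory structure relative to the model directory
            rel_path = os.path.relpath(local_path, local_model_dir)
            blob = bucket.blob("{}/{}".format(gcs_prefix, rel_path))
            blob.upload_from_filename(local_path)

# trainer.to_disk("model")                                  # save locally first
# upload_model_dir("your-bucket-name", "model", "model")    # then push to GCS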

Since you've tagged this question "kubeflow-pipelines", I'll answer from that perspective.
KFP strives to be platform-agnostic. Most good components are cloud-independent.
KFP promotes system-managed artifact passing, where the component code only writes output data to local files and the system picks it up and makes it available to other components.
So, it's best to describe your spaCy model trainer that way: it should write data to local files. Check how other components work, for example, Train Keras classifier.
Since you want to upload to GCS, do that explicitly by passing the model output of your trainer to an "Upload to GCS" component:
upload_to_gcs_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/616542ac0f789914f4eb53438da713dd3004fba4/components/google-cloud/storage/upload_to_explicit_uri/component.yaml')

def my_pipeline():
    model = train_spacy_model(...).outputs['model']

    upload_to_gcs_op(
        data=model,
        gcs_path='gs:/.....',
    )
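For the trainer side, here is a minimal sketch of a lightweight component that only writes to the local output path KFP hands it (the training body is a placeholder, not your actual training code); train_spacy_model_op would be what the my_pipeline() sketch above calls, and KFP then moves the artifact to the upload component:
from kfp.components import OutputPath, create_component_from_func

def train_spacy_model(model_path: OutputPath()):
    """Train a spaCy model and write it to the local output path provided by KFP."""
    import spacy
    nlp = spacy.blank("en")   # placeholder for the real training code
    nlp.to_disk(model_path)   # only local file I/O inside the component

train_spacy_model_op = create_component_from_func(
    train_spacy_model, base_image="python:3.7", packages_to_install=["spacy"]
)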

The following implementation assumes you have gsutil installed on your computer. The spaCy version used was 3.2.4. In my case, I wanted everything to be part of a single (demo) Python file, spacy_import_export.py. To do so, I had to use the subprocess Python library, plus this comment, as follows:
# spacy_import_export.py
import spacy
import subprocess # Will be used later
# spaCy models trained by a user are always stored as LOCAL directories, with more subdirectories and files in them.
PATH_TO_MODEL = "/home/jupyter/" # Use your own path!
# Test-loading your "trainer" (optional step)
trainer = spacy.load(PATH_TO_MODEL+"model")
# Replace 'bucket-name' with the one of your own:
bucket_name = "destination-bucket-name"
GCS_BUCKET = "gs://{}/model".format(bucket_name)
# This does the trick for the UPLOAD to Cloud Storage:
# TIP: Just for security, check Cloud Storage afterwards: "model" should be in GCS_BUCKET
subprocess.run(["gsutil", "-m", "cp", "-r", PATH_TO_MODEL+"model", GCS_BUCKET])
# This does the trick for the DOWNLOAD:
# HINT: By now, in PATH_TO_MODEL, you should have a "model" & "downloaded_model"
subprocess.run(["gsutil", "-m", "cp", "-r", GCS_BUCKET+"/*", PATH_TO_MODEL+"downloaded_model"])
# Test-loading your "GCS downloaded model" (optional step)
nlp_original = spacy.load(PATH_TO_MODEL+"downloaded_model")
I apologize for the excess of comments, I just wanted to make everything clear, for "spaCy newcomers". I know it is a bit late, but hope it helps.

Related

Read a large hdf5 file from url by chunk in python

I have a 1.5 terabyte hdf5 file stored in Amazon Simple Storage Service (S3) at the link below. I don't have the disk space to save it, nor the memory to read it. Accordingly, I want to read it chunk by chunk, process each chunk, and discard the part already read. I was hoping to use pandas' read_hdf to read it, but it does not support URLs, and neither, it seems, does the h5py library. It does mention a ros3 driver, but I haven't been able to get that to work yet. I also tried the response to this question, but the chunks cannot be read by h5py, or I have not found a way yet. So I'm rather left with no idea how to process this file. Does anyone have any idea how to do so? The link to the file is this:
https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5
After having this exact same issue, I believe I've cobbled together a working solution for this using fsspec:
import h5py
import fsspec

URL = "..."  # Assuming a publicly accessible url

remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
    remote_f = remote_f.open()
f = h5py.File(remote_f)
# Do regular hdf5 things...
I've confirmed, using your link above, that this does not read the data into memory, just as if it were a local file:
import h5py
import fsspec

URL = "https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5"

remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
    remote_f = remote_f.open()
f = h5py.File(remote_f)
f.visititems(print)
# 1. README <HDF5 dataset "1. README": shape (), type "|O">
# 2. Resources <HDF5 group "/2. Resources" (2 members)>
# 2. Resources/2.1. Building Models <HDF5 group "/2. Resources/2.1. Building Models" (9 members)>
...
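Since h5py only fetches the bytes it actually needs, you can then process the file chunk by chunk with plain slicing and let each slice be garbage-collected. A rough sketch (the dataset name and chunk size are placeholders, not taken from the file):
import h5py
import fsspec

URL = "https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5"

with fsspec.open(URL, mode="rb") as remote_f:
    with h5py.File(remote_f) as f:
        dset = f["some/dataset/path"]                # placeholder dataset name
        chunk_rows = 10000                           # tune to the memory you have
        for start in range(0, dset.shape[0], chunk_rows):
            chunk = dset[start:start + chunk_rows]   # only this slice is downloaded
            # process(chunk) ... then drop the reference and move on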

How to avoid writing tensorboard logs to local directory when using s3 path?

I want to write logs to S3, and TensorBoard does that for me. However, it writes empty folders to a local directory too. I am using PyTorch Lightning and TensorBoard; the code looks like this:
data = DataModule(**dict_args)
model = BaseModel(**dict_args)
tb_logger = TensorBoardLogger("s3://my-bucket/path")
trainer = pl.Trainer.from_argparse_args(
    args,
    logger=tb_logger,
    max_epochs=args.epochs,
)
trainer.fit(model, datamodule=data)
This snippet of code writes logs to S3, but it also creates a local directory literally named s3:/my-bucket/path. There are no files under that directory, which is good, but I wonder if I can do something to avoid writing to the local directory at all.
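One workaround, if the stray local directory bothers you, is to point the logger at a normal local directory and sync it to S3 yourself after training; a rough sketch using the AWS CLI (the log directory and bucket path are placeholders):
import subprocess
from pytorch_lightning.loggers import TensorBoardLogger

tb_logger = TensorBoardLogger("tb_logs/")       # log locally instead of to s3://
# ... build the Trainer with logger=tb_logger and run trainer.fit(...) as above ...

# After (or periodically during) training, copy the logs up to the bucket.
subprocess.run(["aws", "s3", "sync", "tb_logs/", "s3://my-bucket/path"], check=True)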

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?
Here is the error I get following the documented approach in the UI
When I create the dataset in the UI and select the blob storage and directory, with either just dirname or dirname/**, the files cannot be found in the Explorer tab, and I get the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get this error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I just select one of the files instead of dirname or dirname/** then everything works. Does AzureML actually support Datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container, data. In it is a directory testdata containing testfile1.txt and testfile2.txt.
In Azure ML I created a datastore testdatastore and selected the data container of my storage account for it.
Then in Azure ML I create a dataset from the datastore, select "file dataset" and the datastore above. I can then browse the files, select a folder, and tick that files in subdirectories should be included. This creates the path testdata/**, which does not work as described above.
I got the same issue when creating the dataset and datastore in python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permissions.
This would not be a problem if you registered the datastore with an account key, but it could be a limitation of the access token.
The second part of the error, the provided path is not valid or the files could not be accessed, refers to potential permission issues.
You can also verify that folder/** syntax works by creating dataset from defaultblobstore that was provisioned for you with your ml workspace.
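If the datastore was created with a SAS token, re-registering it with the storage account key is an easy way to rule the permission issue out. A minimal sketch (account name, key and container are placeholders based on the names in the question):
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Re-register the blob container using the account key, which allows list operations.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="mystoragename",            # placeholder storage account name
    account_key="<storage-account-key>",     # placeholder key
)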
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)

logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")

logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")
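As a quick check, continuing the same script, the registered dataset can be consumed the same way the question attempted, which should now list and download both files:
# Verify that the multi-file dataset can be consumed.
registered = Dataset.get_by_name(workspace, name=azure_dataset_name)
print(registered.to_path())                    # should list both test files
registered.download(target_path=".", overwrite=False)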

using keras's .flow_from_directory() on mounted s3 bucket in databricks

I am trying to build a convolutional neural network in Databricks using Spark 2.4.4 with a Scala 2.11 backend, in Python. I have built CNNs before, but this is my first time using Spark (Databricks) and AWS S3.
The files in AWS are ordered like this:
train_test_small/(train or test)/(0, 1, 2 or 3)/
and then each directory contains a list of images corresponding to its category (0, 1, 2, 3).
In order to access my files stored in the s3 bucket I mounted the bucket to databricks like this:
# load in the image files
AWS_BUCKET_NAME = "sensored_bucket_name/video_topic_modelling/data/train_test_small"
MOUNT_NAME = "train_test_small"
dbutils.fs.mount("s3a://%s" % AWS_BUCKET_NAME, "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
Upon using: display(dbutils.fs.mounts()) I can see the bucket mounted to:
MountInfo(mountPoint='/mnt/train_test_small', source='sensored_bucket_name/video_topic_modelling/data/train_test_small', encryptionType='')
I then try to access this mounted directory through keras's flow_from_directory() module using the following piece of code:
# create extra partition of the training data as a validation set
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input, validation_split=0)  # included in our dependencies

# set scaling to most common shapes
train_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical',
    subset='training')
    # shuffle=True)

validation_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical',
    subset='validation')
However this gives me the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/train_test_small/train/'
I tried to figure this out using the Keras and Databricks documentation but got no further. My best guess at the moment is that Keras's flow_from_directory() is unable to detect mounted directories, but I am not sure.
Does anyone know how to use .flow_from_directory() on an S3-mounted directory in Databricks, or know a good alternative? Help would be much appreciated!
I think you may be missing one more directory level in the path you pass to flow_from_directory. From the Keras documentation:
directory: string, path to the target directory. It should contain one subdirectory per class. Any PNG, JPG, BMP, PPM or TIF images inside each of the subdirectories directory tree will be included in the generator.
# set scaling to most common shapes
train_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small/train',  # <== add "train" folder
    target_size=(320, 240),
    ...

validation_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small/test',  # <== add "test" folder
    target_size=(320, 240),
    ....
Answer found. To give Keras a direct local path to the mounted folder, prefix it with /dbfs, i.e. /dbfs/mnt/train_test_small/train/.
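Putting the two answers together, a sketch of the corrected generator calls (same parameters as in the question; the subset arguments are dropped here since train and test live in separate folders):
# /dbfs/ exposes the mount to local file APIs such as Keras's flow_from_directory
train_generator = train_datagen.flow_from_directory(
    '/dbfs/mnt/train_test_small/train',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical')

validation_generator = train_datagen.flow_from_directory(
    '/dbfs/mnt/train_test_small/test',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical')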

Importing scripts into a notebook in IBM WATSON STUDIO

I am doing PCA on CIFAR-10 images in the IBM Watson Studio free version, so I uploaded the Python file for downloading CIFAR-10 to the studio (screenshot omitted).
But when I try to import cache, the following error is shown (screenshot omitted).
After spending some time on Google I found a solution, but I can't understand it:
https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/add-script-to-notebook.html
The solution is as follows:
Click the Add Data icon (Shows the Add Data icon), and then browse the script file or drag it into your notebook sidebar.
Click in an empty code cell in your notebook and then click the Insert to code link below the file. Take the returned string, and write to a file in the file system that comes with the runtime session.
To import the classes to access the methods in a script in your notebook, use the following command:
For Python:
from <python file name> import <class name>
I can't understand this line:
and write to a file in the file system that comes with the runtime session.
Where can I find the file that comes with the runtime session? Where is that file system located?
Can anyone please help me with the details of where to find that file?
You have the import error because the script that you are trying to import is not available in your Python runtime's local filesystem. The files (cache.py, cifar10.py, etc.) that you uploaded were uploaded to the object storage bucket associated with the Watson Studio project. To use those files you need to make them available to the Python runtime, for example by downloading the script to the runtime's local filesystem.
UPDATE: In the meantime there is an option to directly insert the StreamingBody objects. This also includes all the required credentials. If you use the insert-StreamingBody-object option, you can skip to the part of this answer about writing it to a file in the local runtime filesystem.
Or,
You can use the code snippet below to read the script in a StreamingBody object:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

os_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<IBM_API_KEY_ID>',
    ibm_auth_endpoint="<IBM_AUTH_ENDPOINT>",
    config=Config(signature_version='oauth'),
    endpoint_url='<ENDPOINT>')

# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about the possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
streaming_body_1 = os_client.get_object(Bucket='<BUCKET>', Key='cifar.py')['Body']

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(streaming_body_1, "__iter__"):
    streaming_body_1.__iter__ = types.MethodType(__iter__, streaming_body_1)
And then write it to a file in the local runtime filesystem.
f = open('cifar.py', 'wb')
f.write(streaming_body_1.read())
This opens a file with write access and calls the write method to write to the file. You should then be able to simply import the script.
import cifar
Note: You can get the credentials like IBM_API_KEY_ID for the file by clicking on the Insert credentials option on the drop-down menu for your file.
The instructions that the OP found miss one crucial line of code. I followed them and was able to import modules, but I wasn't able to use any functions or classes in those modules. This was fixed by closing the files after writing. This part in the instructions:
f = open('<myScript>.py', 'wb')
f.write(streaming_body_1.read())
should instead be (at least this works in my case):
f = open('<myScript>.py', 'wb')
f.write(streaming_body_1.read())
f.close()
Hopefully this helps someone.
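Equivalently, a with block closes the file automatically when it exits, so the close cannot be forgotten:
with open('<myScript>.py', 'wb') as f:
    f.write(streaming_body_1.read())
# The file is closed here, so the module can be imported right away.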
