Use Python to process images in Azure blob storage - python-3.x

I have thousands of images sitting in a container in my blob storage. I want to process these images one by one in Python and write the new images out into a new container (the process is basically detecting and redacting objects). Downloading the images locally is not an option because they take up way too much space.
So far, I have been able to connect to the blob and have created a new container to store the processed images in, but I have no idea how to run the code to process the pictures and save them to the new container. Can anyone help with this?
Code so far is:
from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
# call blob service for the storage acct
block_blob_service = BlockBlobService(account_name = 'mycontainer', account_key = 'HJMEchn')
# create new container to store processed images
container_name = 'new_images'
block_blob_service.create_container(container_name)
Do I need to use get_blob_to_stream or get_blob_to_path from here:
https://azure-storage.readthedocs.io/ref/azure.storage.blob.baseblobservice.html so I don't have to download the images?
Any help would be much appreciated!

As mentioned in the comment, you may need to download or stream your blobs and then upload the results to your new container after processing them.
You can refer to the samples below for downloading and uploading blobs.
Download the blobs:
# Download the blob(s).
# local_path and local_file_name are defined in the upload sample below.
# Append '_DOWNLOADED' to the '.txt' file name so you can see both files in Documents.
full_path_to_file2 = os.path.join(local_path, local_file_name.replace('.txt', '_DOWNLOADED.txt'))
print("\nDownloading blob to " + full_path_to_file2)
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)
Upload blobs to the container:
import os
import uuid

# Create a file in Documents to test the upload and download.
local_path = os.path.expanduser("~\\Documents")
local_file_name = "QuickStart_" + str(uuid.uuid4()) + ".txt"
full_path_to_file = os.path.join(local_path, local_file_name)

# Write text to the file.
file = open(full_path_to_file, 'w')
file.write("Hello, World!")
file.close()

print("Temp file = " + full_path_to_file)
print("\nUploading to Blob storage as blob: " + local_file_name)

# Upload the created file, using local_file_name for the blob name.
block_blob_service.create_blob_from_path(container_name, local_file_name, full_path_to_file)
Update:
Try the streaming approach below; for more details see the two links: link1 and link2 (they cover related issues and are worth reading together).
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')

        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)

        # Rewind the input stream so it is read from the beginning
        input_blob.seek(0)

        # Do whatever you want to do - here I am just copying the input stream to the output stream
        copyfileobj(input_blob, output_blob)
        ...

        # Rewind the output stream before uploading it
        output_blob.seek(0)

        # Create a new blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)

        # Or overwrite the same blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myinputfilename', output_blob)
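For the image-processing case specifically, the "do whatever you want to do" step can happen entirely in memory. Below is a rough sketch of that idea using Pillow; the account name, the container names, and the redact() function are placeholders standing in for your own detection/redaction logic, so treat it as an outline rather than a drop-in solution.

from io import BytesIO

from azure.storage.blob import BlockBlobService
from PIL import Image

block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')

source_container = 'images'       # placeholder: container holding the original images
target_container = 'new_images'   # placeholder: container for the processed images


def redact(image):
    # Hypothetical placeholder for your own detection/redaction logic.
    return image


for blob in block_blob_service.list_blobs(source_container):
    with BytesIO() as input_stream, BytesIO() as output_stream:
        # Stream the blob into memory instead of writing it to disk.
        block_blob_service.get_blob_to_stream(source_container, blob.name, input_stream)
        input_stream.seek(0)

        # Process the image entirely in memory.
        image = Image.open(input_stream)
        processed = redact(image)
        processed.save(output_stream, format=image.format or 'PNG')
        output_stream.seek(0)

        # Upload the processed image under the same name in the new container.
        block_blob_service.create_blob_from_stream(target_container, blob.name, output_stream)

Because only one image is held in memory at a time, this keeps memory usage bounded even with thousands of blobs.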

Related

Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function

I'd like to hear whether it's possible to improve the code below to make it run faster (and maybe cheaper) as part of an Azure Function that combines multiple CSV files from a source blob storage container into one CSV file in a target blob storage container on Azure, using Python (it would also be fine for me to use a library other than pandas if need be).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO
# Used for getting access to secrets on Azure key vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')
# Connecting to a source Azure blob storage container where multiple CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)
# Connecting to a target Azure blob storage container where the CSV files from the source should be combined into one CSV file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)
# Retrieve list of the blob storage names from the source Azure blob storage container, but only those that end with the .csv file extension
blobNames = [name.name for name in blob_block_source.list_blobs()]
only_csv_blob_names = list(filter(lambda x:x.endswith(".csv") , blobNames))
# Creating a list of dataframes - one dataframe from each CSV file found in the source Azure blob storage container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    csv_text = blob_block_source.download_blob(csv_blobname, encoding='utf-8').content_as_text(encoding='utf-8')
    df = pd.read_csv(StringIO(csv_text), encoding='utf-8', header=0, low_memory=False)
    listOfCsvDataframes.append(df)
# Concatenating the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)
# Creating a CSV object from the concatenated dataframe
outputCSV = df_concat.to_csv(index=False, sep = ',', header = True)
# Upload the combined dataframes as a CSV file (i.e. the CSV files have been combined into one CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite = True)
Instead of using an Azure Function, you can use Azure Data Factory (ADF) to concatenate your files.
It will probably be more efficient than an Azure Function running pandas.
Take a look at this blog post https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to use an Azure Function, try to concatenate the files without pandas. If all your files have the same columns in the same order, you can concatenate the text directly and drop the header line, if any, from every file but the first; see the sketch below.
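To make that concrete, here is a minimal sketch of a pandas-free concatenation; the connection string and container names are placeholders, and it assumes every CSV has a header row and identical columns.

from azure.storage.blob import ContainerClient

# Placeholders: supply your own connection string and container names.
source = ContainerClient.from_connection_string(conn_str="...", container_name="sourceContainerName")
target = ContainerClient.from_connection_string(conn_str="...", container_name="targetContainerName")

parts = []
first = True
for blob in source.list_blobs():
    if not blob.name.endswith(".csv"):
        continue
    text = source.download_blob(blob.name).content_as_text(encoding='utf-8')
    lines = text.splitlines()
    # Keep the header line only from the first CSV; drop it from the rest.
    parts.append("\n".join(lines if first else lines[1:]))
    first = False

target.upload_blob('combinedCSV.csv', "\n".join(parts) + "\n", blob_type="BlockBlob", overwrite=True)

This avoids building DataFrames at all, which removes most of the memory and CPU overhead of the pandas version.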

Save and load a spacy model to a google cloud storage bucket

I have a spaCy model and I am trying to save it to a GCS bucket using this format:
trainer.to_disk('gs://{bucket-name}/model')
But each time I run this I get this error message
FileNotFoundError: [Errno 2] No such file or directory: 'gs:/{bucket-name}/model'
Also, when I create a Kubeflow persistent volume, save the model there, and then try to load the model using trainer.load('model'), I get this error message:
File "/usr/local/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/usr/local/lib/python3.7/site-packages/spacy/util.py", line 175, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model '/model/'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I don't understand why I am having these errors as this works perfectly when I run this on my pc locally and use a local path.
Cloud Storage is not a local disk or a physical storage unit that you can save things to directly.
As you say, this works when you run it on your PC locally and use a local path; but from the point of view of a tool running in the cloud, Cloud Storage is not a local path.
If you are using Python, you will have to create a client with the Cloud Storage library and then upload your file using upload_blob, i.e.:
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
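Note that a spaCy pipeline saved with to_disk is a whole directory (config, weights, vocab, and so on), not a single file, so one way to use the helper above is to save the model locally first and then upload every file under that directory. A rough sketch, assuming trainer is your trained pipeline and using placeholder bucket and path names:

import os

local_model_dir = "/tmp/model"        # placeholder: temporary local directory
bucket_name = "your-bucket-name"      # placeholder

# Save the trained pipeline to a local directory first.
trainer.to_disk(local_model_dir)

# Upload every file in the model directory, preserving the relative layout.
for root, _dirs, files in os.walk(local_model_dir):
    for name in files:
        local_path = os.path.join(root, name)
        relative_path = os.path.relpath(local_path, local_model_dir)
        upload_blob(bucket_name, local_path, "model/" + relative_path.replace(os.sep, "/"))

Loading it back later means downloading the same set of objects into a local directory and calling spacy.load on that directory.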
Since you've tagged this question "kubeflow-pipelines", I'll answer from that perspective.
KFP strives to be platform-agnostic. Most good components are cloud-independent.
KFP promotes system-managed artifact passing, where the component code only writes output data to local files and the system takes it and makes it available to other components.
So, it's best to describe your SpaCy model trainer that way - to write data to local files. Check how all other components work, for example, Train Keras classifier.
Since you want to upload to GCS, do that explicitly by passing the model output of your trainer to an "Upload to GCS" component:
upload_to_gcs_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/616542ac0f789914f4eb53438da713dd3004fba4/components/google-cloud/storage/upload_to_explicit_uri/component.yaml')

def my_pipeline():
    model = train_spacy_model(...).outputs['model']
    upload_to_gcs_op(
        data=model,
        gcs_path='gs:/.....',
    )
The following implementation assumes you have gsutil installed on your computer. The spaCy version used was 3.2.4. In my case, I wanted everything to be part of a single (demo) Python file, spacy_import_export.py. To do so, I had to use the subprocess Python library, plus this comment, as follows:
# spacy_import_export.py
import spacy
import subprocess # Will be used later
# spaCy models trained by user, are always stored as LOCAL directories, with more subdirectories and files in it.
PATH_TO_MODEL = "/home/jupyter/" # Use your own path!
# Test-loading your "trainer" (optional step)
trainer = spacy.load(PATH_TO_MODEL+"model")
# Replace 'bucket-name' with the one of your own:
bucket_name = "destination-bucket-name"
GCS_BUCKET = "gs://{}/model".format(bucket_name)
# This does the trick for the UPLOAD to Cloud Storage:
# TIP: Just for security, check Cloud Storage afterwards: "model" should be in GCS_BUCKET
subprocess.run(["gsutil", "-m", "cp", "-r", PATH_TO_MODEL+"model", GCS_BUCKET])
# This does the trick for the DOWNLOAD:
# HINT: By now, in PATH_TO_MODEL, you should have a "model" & "downloaded_model"
subprocess.run(["gsutil", "-m", "cp", "-r", GCS_BUCKET+MODEL_NAME+"/*", PATH_TO_MODEL+"downloaded_model"])
# Test-loading your "GCS downloaded model" (optional step)
nlp_original = spacy.load(PATH_TO_MODEL+"downloaded_model")
I apologize for the excess of comments; I just wanted to make everything clear for spaCy newcomers. I know it is a bit late, but I hope it helps.

Upload images to cloud and then paste the respective link to a respective dataframe

I have PDFs containing tables and an image/diagram related to the content of each table.
Both the table and the image are on a single page.
I've extracted the tables using the Camelot library and the images using the Fitz (PyMuPDF) library, in Python.
Now I want to upload those images (.png) to any possible cloud service and add the web link of the respective image to the DataFrame of the respective table.
Please help.
This is how a single page of the PDF looks.
If any public cloud is an option, you can use Amazon S3 to store the images via boto3 (the AWS SDK for Python).
Sample code to store images in an AWS S3 bucket:
import boto3
s3 = boto3.client('s3')
bucket = 'your-bucket-name'
file_name = 'location-of-your-image'
key_name = 'name-of-image-in-s3'
s3.upload_file(file_name, bucket, key_name)
To obtain the URL of the uploaded object, you can construct it from the bucket name, the bucket's AWS region, and the key name (note that the key, not the local file name, appears in the URL):
s3_url = f"https://{bucket}.s3.{region}.amazonaws.com/{key_name}"
and store s3_url in the dataframe.
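Putting it together with the DataFrames, one rough approach is to upload each extracted .png and record its URL in a column of the matching table's DataFrame. A sketch under those assumptions, with placeholder bucket/region names and hypothetical tables and image_paths lists:

import os

import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket = 'your-bucket-name'   # placeholder
region = 'us-east-1'          # placeholder: your bucket's region

# Hypothetical inputs: one DataFrame per table and the matching extracted image path.
tables = [pd.DataFrame({'col': [1, 2]})]
image_paths = ['page_1.png']

for df, image_path in zip(tables, image_paths):
    key_name = os.path.basename(image_path)
    s3.upload_file(image_path, bucket, key_name)
    # Store the image URL alongside every row of the table.
    # (This URL is only reachable if the bucket/object allows public reads;
    #  otherwise generate a presigned URL instead.)
    df['image_url'] = f"https://{bucket}.s3.{region}.amazonaws.com/{key_name}"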

Read files from Cloud Storage having definite prefix but random postfix

I am using the following code to read the contents of a file in Google Cloud Storage from Cloud Functions. Here the name of the file (filename) is known in advance. I now have files that have a definite prefix, but the suffix can be anything.
Example - ABC-khasvbdjfy7i76.csv
How do I read the contents of such files?
I know there will be an "ABC" prefix, but the suffix can be anything random.
storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blob = bucket.blob(filename)
contents = blob.download_as_string()
print("Contents : ")
print(contents)
You can use the prefix parameter of the list_blobs method to filter objects beginning with your prefix, and then iterate over the matching objects:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blobs = bucket.list_blobs(prefix="ABC")
for blob in blobs:
    contents = blob.download_as_string()
    print("Contents of %s:" % blob.name)
    print(contents)
You need to know the entire path of a file to be able to read it. And since the client can't guess the random suffix, you will first have to list all the files with the non-random prefix.
There is a list operation, to which you can pass a prefix, as shown here: Google Cloud Storage + Python : Any way to list obj in certain folder in GCS?

Store and read a file in a sqlite3 database with SQLAlchemy in Python3

I want to store a file (e.g. myfile.dat) from my hard drive in a sqlite3 database, and read it back, accessing the database through SQLAlchemy with Python 3.
I want to store one picture for each row of the Person table, and show that picture in a GUI when the data of that person are displayed.
Simply read the file from the hard drive in binary mode and store it as a BLOB in the database (a SQLAlchemy sketch follows the snippet below).
with open('image.png', 'rb') as f:
    fcontent = f.read()
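On the SQLAlchemy side, a BLOB column is declared with the LargeBinary type. Here is a minimal sketch of that mapping (SQLAlchemy 1.4+ style), assuming a hypothetical Person model and a local SQLite file named people.db:

from sqlalchemy import Column, Integer, LargeBinary, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = 'person'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    # LargeBinary is rendered as BLOB on SQLite.
    picture = Column(LargeBinary)

engine = create_engine('sqlite:///people.db')
Base.metadata.create_all(engine)

# Store the picture read above in a new Person row.
with open('image.png', 'rb') as f:
    fcontent = f.read()

with Session(engine) as session:
    session.add(Person(name='Alice', picture=fcontent))
    session.commit()

# Reading it back: the BLOB comes out as bytes again.
with Session(engine) as session:
    image = session.query(Person).filter_by(name='Alice').first().picture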
Get the BLOB (the image bytes) back from the database and feed it to a Python byte stream.
Then you can use it, for example, as a stream object in wxPython.
import io
import wx

# read the BLOB as 'image' from the database
# and use it
stream = io.BytesIO(image)
image = wx.Image(stream)
bitmap = wx.Bitmap(image)
