Read files from Cloud Storage having definite prefix but random postfix - python-3.x

I am using the following code to read the contents of a file in Google Cloud Storage from Cloud Functions. Here the file name (filename) is known in advance. I now have files whose names have a fixed prefix but an arbitrary suffix.
Example: ABC-khasvbdjfy7i76.csv
How can I read the contents of such files?
I know the prefix will always be "ABC", but the suffix can be anything.
storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blob = bucket.blob(filename)
contents = blob.download_as_string()
print("Contents : ")
print(contents)

You can use the prefix parameter of the list_blobs method to filter objects whose names begin with your prefix, and then iterate over the results:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blobs = bucket.list_blobs(prefix="ABC")
for blob in blobs:
    contents = blob.download_as_string()
    print("Contents of %s:" % blob.name)
    print(contents)

You need to know the entire path of a file to be able to read it. And since the client can't guess the random suffix, you will first have to list all the files with the non-random prefix.
There is a list operation, to which you can pass a prefix, as shown here: Google Cloud Storage + Python : Any way to list obj in certain folder in GCS?
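As a minimal sketch of the list-then-read approach described in both answers, the helper below collects the contents of every object that matches the prefix and a file extension. The helper name read_blobs_with_prefix and the suffix filter are illustrative assumptions, not part of the library; bucket is assumed to be a google.cloud.storage Bucket such as storage_client.get_bucket('test-bucket'), and download_as_bytes() is assumed to be available (recent versions of google-cloud-storage; older ones only have download_as_string()).

```python
def read_blobs_with_prefix(bucket, prefix, suffix=".csv"):
    """Return {object name: contents as bytes} for objects matching prefix and suffix.

    Hypothetical helper: `bucket` is assumed to be a google.cloud.storage
    Bucket, e.g. storage.Client().get_bucket('test-bucket').
    """
    contents = {}
    # The prefix filter is applied by the service when listing.
    for blob in bucket.list_blobs(prefix=prefix):
        # The suffix check is done on the client, on the returned names.
        if blob.name.endswith(suffix):
            contents[blob.name] = blob.download_as_bytes()
    return contents
```

Passing "ABC-" rather than "ABC" as the prefix avoids also matching names such as "ABCD-....csv".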

Related

Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function

I'd like to hear whether the code below can be improved to run faster (and maybe cheaper) as part of an Azure function. It combines multiple CSV files from a source blob storage container into one CSV file on a target blob storage container on Azure, using Python. Please note that it would also be fine for me to use a library other than pandas if need be.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO
# Used for getting access to secrets on Azure key vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')
# Connecting to a source Azure blob storage container where multiple CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)
# Connecting to a target Azure blob storage container to where the CSV files from the source should be combined into one CSV file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)
# Retrieve list of the blob storage names from the source Azure blob storage container, but only those that end with the .csv file extension
blobNames = [name.name for name in blob_block_source.list_blobs()]
only_csv_blob_names = list(filter(lambda x: x.endswith(".csv"), blobNames))
# Creating a list of dataframes - one dataframe from each CSV file found in the source Azure blob storage container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    csv_text = blob_block_source.download_blob(csv_blobname, encoding='utf-8').content_as_text(encoding='utf-8')
    df = pd.read_csv(StringIO(csv_text), encoding='utf-8', header=0, low_memory=False)
    listOfCsvDataframes.append(df)
# Concatenating the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)
# Creating a CSV object from the concatenated dataframe
outputCSV = df_concat.to_csv(index=False, sep = ',', header = True)
# Upload the combined dataframes as a CSV file (i.e. the CSV files have been combined into one CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite = True)
Instead of using an Azure Function, you can use Azure Data Factory (ADF) to concatenate your files.
ADF will probably be more efficient than an Azure Function running pandas.
Take a look at this blog post: https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to stay with an Azure Function, try to concatenate the files without pandas. If all your files have the same columns in the same order, you can concatenate the text directly, removing the header line (if any) from every file except the first.
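The pandas-free concatenation could look roughly like the sketch below. The function name combine_csv_texts is hypothetical, and the sketch assumes that every file has exactly one header line, the same columns in the same order, and no quoted field that contains a line break.

```python
def combine_csv_texts(csv_texts):
    """Concatenate CSV documents, keeping only the header of the first one.

    Assumes identical columns in every document, a single header line,
    and no line breaks inside quoted fields.
    """
    combined = []
    header_written = False
    for text in csv_texts:
        lines = text.splitlines()
        if not lines:
            continue  # skip empty files
        if header_written:
            lines = lines[1:]  # drop the repeated header line
        header_written = True
        combined.extend(lines)
    return ("\n".join(combined) + "\n") if combined else ""
```

With the code from the question, the inputs would be the strings returned by download_blob(...).content_as_text(...) for each CSV blob, and the return value can be passed to upload_blob in place of outputCSV.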

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password protected zip file stored in Amazon S3 which I need to read from a python program, extract the csv file from it and read to a dataframe. Initially, I was doing it using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<acces_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<acces_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
        print(df)
But due to some security concerns, I won't be able to do this anymore. That is, I won't have the object key and bucket name. And since I won't have the key, I will not have the file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip files using pre-signed URLs? How do I do that?
A pre-signed URL contains all the information required to download the file, so you don't need boto3 for that. Instead, use the regular Python tools for downloading files from the internet, with your pre-signed URL as the URL.
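A minimal sketch of that idea using only the standard library follows. The function names are hypothetical. Since the member name is no longer known, the sketch looks it up with namelist() and takes the first .csv entry, which is an assumption about the archive layout. The password must be passed as bytes, and the standard zipfile module can only decrypt the legacy zip encryption scheme, so an AES-encrypted archive would need a third-party library.

```python
import csv
import io
import urllib.request
import zipfile


def download_bytes(url):
    """Fetch the body of a URL (e.g. a pre-signed S3 URL) with a plain HTTP GET."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_csv_rows_from_zip(zip_bytes, password=None, delimiter="|"):
    """Return the rows of the first .csv member found in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # The member name is not known in advance, so look it up in the archive.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no .csv member in the archive")
        with archive.open(csv_names[0], pwd=password) as member:
            text = member.read().decode("utf-8")  # assumes UTF-8 content
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
```

Usage would be along the lines of rows = read_csv_rows_from_zip(download_bytes(presigned_url), password=b'<password>'). To keep using pandas as in the question, the opened member can be passed to pd.read_csv(..., sep='|') instead of being decoded by hand.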

Use Python to process images in Azure blob storage

I have 1000s of images sitting in a container on my blob storage. I want to process these images one by one in Python and spit out the new images out into a new container (the process is basically detecting and redacting objects). Downloading the images locally is not an option because they take up way too much space.
So far, I have been able to connect to the blob and have created a new container to store the processed images in, but I have no idea how to run the code to process the pictures and save them to the new container. Can anyone help with this?
Code so far is:
from azure.storage.file import FileService
from azure.storage.blob import BlockBlobService
# call blob service for the storage acct
block_blob_service = BlockBlobService(account_name = 'mycontainer', account_key = 'HJMEchn')
# create new container to store processed images
container_name = 'new_images'
block_blob_service.create_container(container_name)
Do I need to use get_blob_to_stream or get_blob_to_path from here:
https://azure-storage.readthedocs.io/ref/azure.storage.blob.baseblobservice.html so I don't have to download the images?
Any help would be much appreciated!
As mentioned in the comment, you may need to download or stream your blobs and then upload the results after processing them to your new container.
You could refer to the samples to download and upload blobs as below.
Download the blobs:
# Download the blob(s).
# Insert '_DOWNLOADED' before '.txt' so you can see both files in Documents.
full_path_to_file2 = os.path.join(local_path, local_file_name.replace('.txt', '_DOWNLOADED.txt'))
print("\nDownloading blob to " + full_path_to_file2)
block_blob_service.get_blob_to_path(container_name, local_file_name, full_path_to_file2)
Upload blobs to the container:
import os
import uuid

# Create a file in Documents to test the upload and download.
local_path = os.path.join(os.path.expanduser("~"), "Documents")
local_file_name = "QuickStart_" + str(uuid.uuid4()) + ".txt"
full_path_to_file = os.path.join(local_path, local_file_name)
# Write text to the file.
with open(full_path_to_file, 'w') as file:
    file.write("Hello, World!")
print("Temp file = " + full_path_to_file)
print("\nUploading to Blob storage as blob " + local_file_name)
# Upload the created file, use local_file_name for the blob name
block_blob_service.create_blob_from_path(container_name, local_file_name, full_path_to_file)
Update:
Try streaming the blobs instead, as in the code below. For more details, see the two links, link1 and link2 (they are related issues, so read them together).
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')
        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        # Rewind so the copy starts at the beginning of the downloaded data
        input_blob.seek(0)
        # Do whatever you want to do - here I am just copying the input stream to the output stream
        copyfileobj(input_blob, output_blob)
        ...
        # Rewind the output stream so the upload reads it from the start
        output_blob.seek(0)
        # Create a new blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
        # Or, instead, update the same blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myinputfilename', output_blob)
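To apply the streaming approach to every image in a container without writing anything to disk, a loop along these lines could work. This is only a sketch: process_container and transform are hypothetical names, service is assumed to be a legacy BlockBlobService instance like the one above, and transform stands in for the detection and redaction step, taking the image bytes and returning the processed bytes.

```python
from io import BytesIO


def process_container(service, source_container, target_container, transform):
    """Run transform(bytes) -> bytes on every blob and upload the results.

    `service` is assumed to be a legacy BlockBlobService instance;
    `transform` is a placeholder for the actual image processing.
    """
    processed = []
    for blob in service.list_blobs(source_container):
        # Download one blob into memory only.
        with BytesIO() as input_blob:
            service.get_blob_to_stream(source_container, blob.name, input_blob)
            result = transform(input_blob.getvalue())
        # A fresh BytesIO starts at position 0, so no rewind is needed.
        with BytesIO(result) as output_blob:
            service.create_blob_from_stream(target_container, blob.name, output_blob)
        processed.append(blob.name)
    return processed
```

Only one image is held in memory at a time, which sidesteps the disk space problem mentioned in the question.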

How to access files within subfolders of a bucket GCS using Python?

from google.cloud import storage
import os
client = storage.Client()
bucket = client.get_bucket('path to bucket')
The above code connects me to my bucket but I am struggling to connect with a specific folder within the bucket.
I am trying variants of this code, but no luck:
blob = bucket.get_blob("training/bad")
blob = bucket.get_blob("/training/bad")
blob = bucket.get_blob("path to bucket/training/bad")
I am hoping to get access to a list of images within the bad subfolder, but I can't seem to do so.
I don't even fully understand what a blob is despite reading the docs, and sort of winging it based on tutorials.
Thank you.
What you missed is the fact that in GCS objects in a bucket aren't organized in a filesystem-like directory structure/hierarchy, but rather in a flat structure.
A more detailed explanation can be found in How Subdirectories Work (in the gsutil context, true, but the fundamental reason is the same - the GCS flat namespace):
gsutil provides the illusion of a hierarchical file tree atop the
"flat" name space supported by the Google Cloud Storage service. To
the service, the object gs://your-bucket/abc/def.txt is just an object
that happens to have "/" characters in its name. There is no "abc"
directory; just a single object with the given name.
Since there are no (sub)directories in GCS, training/bad doesn't really exist as a folder, so you can't fetch it with get_blob and list its content. All you can do is list the objects in the bucket and select the ones whose names start with training/bad/.
If you would like to find blobs (files) that exist under a specific prefix (subdirectory), you can specify the prefix and delimiter arguments to the list_blobs() function.
See the following example, taken from the Google Listing Objects example (also GitHub snippet):
def list_blobs_with_prefix(bucket_name, prefix, delimiter=None):
    """Lists all the blobs in the bucket that begin with the prefix.

    This can be used to list all blobs in a "folder", e.g. "public/".

    The delimiter argument can be used to restrict the results to only the
    "files" in the given "folder". Without the delimiter, the entire tree under
    the prefix is returned. For example, given these blobs:

        /a/1.txt
        /a/b/2.txt

    If you just specify prefix = '/a/', you'll get back:

        /a/1.txt
        /a/b/2.txt

    However, if you specify prefix='/a/' and delimiter='/', you'll get back:

        /a/1.txt

    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    print('Blobs:')
    for blob in blobs:
        print(blob.name)
    if delimiter:
        print('Prefixes:')
        for prefix in blobs.prefixes:
            print(prefix)
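To see how prefix and delimiter interact with the flat namespace, the listing rules can be emulated locally over a plain list of object names. The function below is a local illustration only, not a GCS call, and its name is made up for this sketch; the object names in the usage note are modeled on the training/bad folder from the question.

```python
def list_with_prefix(names, prefix="", delimiter=None):
    """Emulate prefix/delimiter listing over a flat list of object names.

    Returns (objects, prefixes). Local illustration only, not a GCS call.
    """
    objects, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the next delimiter into one "subfolder".
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(name)
    return objects, sorted(prefixes)
```

For names such as training/bad/1.jpg, training/bad/raw/3.jpg and training/good/4.jpg, calling list_with_prefix(names, "training/bad/", "/") returns the first name as an object and training/bad/raw/ as a prefix, while omitting the delimiter returns both objects under training/bad/.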

How do you iterate through objects in a blob on google cloud storage? Python

I am trying to figure out how to iterate over objects in a blob in google cloud storage. The address is similar to this:
gs://project_ID/bucket_name/DIRECTORY/file1
gs://project_ID/bucket_name/DIRECTORY/file2
gs://project_ID/bucket_name/DIRECTORY/file3
gs://project_ID/bucket_name/DIRECTORY/file4
...
The DIRECTORY on the GCS bucket has a bunch of different files that I need to iterate over, so that I can check when it was last updated (to see if it is a new file there) so that I can pull the contents.
Example function
def getNewFiles():
    storage_client = storage.Client(project='project_ID')
    try:
        bucket = storage_client.get_bucket('bucket_name')
    except:
        storage_client.create_bucket(bucket_name)
    for blob in bucket.list_blobs(prefix='DIRECTORY'):
        if blob.name == 'DIRECTORY/':
            **Iterate through this Directory**
            **CODE NEEDED HERE***
            **Figure out how to iterate through all files here**
I have gone through the Python API and the client library, and can't find any working examples of this.
According to Google Cloud Client Library for Python docs, blob.name:
This corresponds to the unique path of the object in the bucket
Therefore blob.name will return something like this:
DIRECTORY/file1
If you are already including the parameter prefix='DIRECTORY' when using the list_blobs() method you can get all your files in your directory by doing:
for blob in bucket.list_blobs(prefix='DIRECTORY'):
    print(blob.name)
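To get only the file name, you can use something like blob.name.split('/')[-1], os.path.basename(blob.name), or the standard library re module. Avoid blob.name.lstrip('DIRECTORY'): lstrip removes any leading characters that appear in the given set rather than a literal prefix, so it leaves the slash in place and, if the slash is added to the set, can eat the start of the file name as well.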
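The file-name extraction and the "last updated" check from the question can be sketched together as below. The function name new_files_since is hypothetical; blobs is assumed to be what bucket.list_blobs(prefix='DIRECTORY/') yields, and each blob's updated attribute is assumed to be a timezone-aware datetime, so last_check has to be timezone-aware too.

```python
import posixpath


def new_files_since(blobs, last_check):
    """Return [(file name, blob)] for blobs updated after last_check.

    `blobs` is assumed to come from bucket.list_blobs(prefix='DIRECTORY/');
    each item needs a `name` and a timezone-aware `updated` datetime.
    """
    fresh = []
    for blob in blobs:
        if blob.name.endswith("/"):
            continue  # "folder" placeholder object, not a file
        if blob.updated > last_check:
            # Object names always use "/", so posixpath gives the last segment.
            fresh.append((posixpath.basename(blob.name), blob))
    return fresh
```

The caller would store the time of the previous run, for example as datetime.now(timezone.utc), and pass it in as last_check on the next run.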
However, you said you need this "so that I can check when it was last updated (to see if it is a new file there)". If you are looking for a function that is triggered whenever new files arrive in your bucket, you can use Google Cloud Functions; the docs describe how to use them with Cloud Storage when new objects are created. Note that as of the current date (Feb/2018) you can only write Cloud Functions in Node.js.
