How to access files within subfolders of a GCS bucket using Python? - python-3.x

from google.cloud import storage
import os

client = storage.Client()
bucket = client.get_bucket('path to bucket')
The above code connects me to my bucket but I am struggling to connect with a specific folder within the bucket.
I am trying variants of this code, but no luck:
blob = bucket.get_blob("training/bad")
blob = bucket.get_blob("/training/bad")
blob = bucket.get_blob("path to bucket/training/bad")
I am hoping to get access to a list of images within the bad subfolder, but I can't seem to do so.
I don't even fully understand what a blob is despite reading the docs, and I'm sort of winging it based on tutorials.
Thank you.

What you missed is that in GCS, objects in a bucket aren't organized in a filesystem-like directory hierarchy, but rather in a flat structure.
A more detailed explanation can be found in How Subdirectories Work (written in the gsutil context, but the fundamental reason is the same: the GCS flat namespace):
gsutil provides the illusion of a hierarchical file tree atop the
"flat" name space supported by the Google Cloud Storage service. To
the service, the object gs://your-bucket/abc/def.txt is just an object
that happens to have "/" characters in its name. There is no "abc"
directory; just a single object with the given name.
Since there are no (sub)directories in GCS, /training/bad doesn't really exist, so you can't list its contents. All you can do is list the objects in the bucket and select the ones whose names/paths start with training/bad.
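For example, a minimal sketch (the bucket name is a placeholder, and it assumes the objects were uploaded without a leading slash):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('your-bucket-name')  # hypothetical bucket name

# Print every object whose name starts with the training/bad/ prefix.
for blob in bucket.list_blobs(prefix='training/bad/'):
    print(blob.name)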

If you would like to find blobs (files) that exist under a specific prefix (subdirectory), you can pass the prefix and delimiter arguments to the list_blobs() function.
See the following example, taken from the Google Listing Objects documentation (also available as a GitHub snippet):
from google.cloud import storage

def list_blobs_with_prefix(bucket_name, prefix, delimiter=None):
    """Lists all the blobs in the bucket that begin with the prefix.

    This can be used to list all blobs in a "folder", e.g. "public/".

    The delimiter argument can be used to restrict the results to only the
    "files" in the given "folder". Without the delimiter, the entire tree under
    the prefix is returned. For example, given these blobs:

        /a/1.txt
        /a/b/2.txt

    If you just specify prefix = '/a', you'll get back:

        /a/1.txt
        /a/b/2.txt

    However, if you specify prefix='/a' and delimiter='/', you'll get back:

        /a/1.txt
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)

    print('Blobs:')
    for blob in blobs:
        print(blob.name)

    if delimiter:
        print('Prefixes:')
        for prefix in blobs.prefixes:
            print(prefix)
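For the layout in the question, a call along these lines (with a hypothetical bucket name, no leading slash on the prefix, and a trailing slash) would list the images under the bad "folder":

# Hypothetical bucket name; note that GCS object names do not start with '/'.
list_blobs_with_prefix('your-bucket-name', prefix='training/bad/', delimiter='/')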

Related

check if a directory or sub directory exists in a bucket in s3 using boto3

I have an S3 structure like s3://my-bucket/data/2020/03/23/01/data.csv
I want to check if s3://my-bucket/data/2020/03/23 exists.
I can check if the CSV file exists, but I can't use that because the file name might change, so I want to check if the folder exists.
This might not be possible, depending on what you are expecting.
First, it's worth mentioning that folders do not actually exist in Amazon S3.
For example, you could run this command to copy a file to S3:
aws s3 cp foo.txt s3://my-bucket/data/2020/03/23/
This would put the file in the data/2020/03/23/ path and those four directories would "appear" in the console, but they don't actually exist. Rather, the Key (filename) of the object contains the full path.
If you were then to delete the object:
aws s3 rm s3://my-bucket/data/2020/03/23/foo.txt
then the four directories would "disappear" (because they didn't exist anyway).
It is possible to cheat by clicking "Create Folder" in the S3 management console. This will create a zero-length object with the name of the folder (actually, with the name of the full path). This causes the directory to appear in the bucket listing, but that is purely because an object exists in that path.
Within S3, directories are referred to as CommonPrefixes and commands can be used that reference a prefix, rather than referencing a directory.
So, you could list the bucket, providing the path as a prefix. This would then return a list of any objects that are within that path.
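For example, a quick check of the hypothetical path from the question could look like this (any non-empty Contents in the response means objects exist under that prefix):

aws s3api list-objects-v2 --bucket my-bucket --prefix data/2020/03/23/ --max-items 1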
However, the best answer is: Just pretend that it exists and everything will work fine.
What I would do is what John Rotenstein mentioned in a comment: list the contents of a bucket specifying a prefix, which in this case is the path to the directory you are interested in (data/2020/03/23/01).
import boto3
from botocore.exceptions import ClientError

def folder_exists(bucket_name, path_to_folder):
    try:
        s3 = boto3.client('s3')
        res = s3.list_objects_v2(
            Bucket=bucket_name,
            Prefix=path_to_folder
        )
        return 'Contents' in res
    except ClientError as e:
        # Logic to handle errors.
        raise e
As of now, the response dictionary from list_objects_v2 will not have a 'Contents' key if the prefix is not found (see the docs).
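A quick usage sketch (the bucket and path are the ones from the question; the trailing slash keeps, e.g., data/2020/03/230 from matching):

if folder_exists('my-bucket', 'data/2020/03/23/01/'):
    print('Prefix exists')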

How to rename files, as they are copied to a new location, with ADF

I am performing a copy activity from Cosmos DB to Blob Storage; collections will be copied to storage as files. I want those files to be renamed to "collectionname-date", i.e. the collection name followed by the date and time as a suffix. How can I achieve this?
I have to say I can't find any way to get the collection name dynamically, but I implemented your other requirements. Please see my configuration:
1. Cosmos DB dataset: set up as normal.
2. Blob Storage dataset: configure a fileName parameter for it, then configure a dynamic file path that appends the date and time to that parameter (see the expression sketch after this list). Pass the collection's static name (for me it is coll) for the fileName parameter.
3. Output in Blob Storage: the copied file is named with the collection name followed by the date and time.
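For reference, the dynamic file name on the Blob Storage dataset can be built with an expression along these lines (the fileName parameter name, the timestamp format, and the .json extension are assumptions, not taken from the original configuration):

@concat(dataset().fileName, '-', formatDateTime(utcnow(), 'yyyy-MM-dd-HHmmss'), '.json')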

Read files from Cloud Storage having definite prefix but random postfix

I am using the following code to read the contents of a file in Google Cloud Storage from Cloud Functions. Here the name of the file (filename) is defined. I now have files that will have a definite prefix, but the postfix can be anything.
Example - ABC-khasvbdjfy7i76.csv
How to read the contents of such files?
I know there will be "ABC" as a prefix. But the postfix can be anything random.
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blob = bucket.blob(filename)
contents = blob.download_as_string()
print("Contents : ")
print(contents)
You can use the prefix parameter of the list_blobs method to filter objects beginning with your prefix, and iterate over the objects:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('test-bucket')
blobs = bucket.list_blobs(prefix="ABC")
for blob in blobs:
    contents = blob.download_as_string()
    print("Contents of %s:" % blob.name)
    print(contents)
You need to know the entire path of a file to be able to read it. And since the client can't guess the random suffix, you will first have to list all the files with the non-random prefix.
There is a list operation, to which you can pass a prefix, as shown here: Google Cloud Storage + Python : Any way to list obj in certain folder in GCS?

How to pull only certain csv's and concat the data from s3?

I have a bucket with various files. I am only interested in pulling files that begin with the word 'member' and storing each member file in a list to be concatenated further into a dataframe.
Currently I am pulling data like this:
import boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
obj = s3.Object('my-bucket', 'member')
file_content = obj.get()['Body'].read().decode('utf-8')
df = pd.read_csv(file_content)
However, this only pulls the single 'member' object. I have member files that look like 'member_1229013', 'member_2321903', etc.
How can I read in all the 'member' files and save the data in a list so I can concat them later? All column names are the same in all the CSVs.
You can only download/access one object per API call.
I normally recommend downloading the objects to a local directory, and then accessing them as normal local files. Here is an example of how to download an object from Amazon S3:
import boto3
s3 = boto3.client('s3')
s3.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
See: download_file() documentation
If you want to read multiple files, you will first need to obtain a listing of the files (eg with list_objects_v2()), and then access each object individually.
One tip for boto3... There are two ways to make calls: via a Resource (eg using s3.Object() or s3.Bucket()) or via a Client, which passes everything as parameters.
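Putting those pieces together for this particular case, a sketch along these lines (the bucket name is hypothetical) lists the member* objects, reads each CSV, and concatenates them:

import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

# List every object whose key starts with 'member'.
# (For more than 1000 keys you would need a paginator.)
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='member')

frames = []
for obj in response.get('Contents', []):
    body = s3.get_object(Bucket='my-bucket', Key=obj['Key'])['Body'].read()
    frames.append(pd.read_csv(io.BytesIO(body)))

df = pd.concat(frames, ignore_index=True)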

How do you iterate through objects in a blob on google cloud storage? Python

I am trying to figure out how to iterate over objects in a blob in google cloud storage. The address is similar to this:
gs://project_ID/bucket_name/DIRECTORY/file1
gs://project_ID/bucket_name/DIRECTORY/file2
gs://project_ID/bucket_name/DIRECTORY/file3
gs://project_ID/bucket_name/DIRECTORY/file4
...
The DIRECTORY in the GCS bucket has a bunch of different files that I need to iterate over, so that I can check when each was last updated (to see if there is a new file there) and pull its contents.
Example function
def getNewFiles():
    storage_client = storage.Client(project='project_ID')
    try:
        bucket = storage_client.get_bucket('bucket_name')
    except:
        storage_client.create_bucket(bucket_name)
    for blob in bucket.list_blobs(prefix='DIRECTORY'):
        if blob.name == 'DIRECTORY/':
            # **Iterate through this Directory**
            # **CODE NEEDED HERE**
            # **Figure out how to iterate through all files here**
I have gone through the Python API and the client library, and can't find any examples of this working.
According to Google Cloud Client Library for Python docs, blob.name:
This corresponds to the unique path of the object in the bucket
Therefore blob.name will return something like this:
DIRECTORY/file1
If you are already including the parameter prefix='DIRECTORY' when using the list_blobs() method you can get all your files in your directory by doing:
for blob in bucket.list_blobs(prefix='DIRECTORY'):
    print(blob.name)
You can use something like blob.name[len('DIRECTORY/'):] (note that str.lstrip() strips a set of characters rather than a prefix, so it isn't a reliable way to remove it) or the standard library re module to clean the string and get only the file name.
However, given that you said "so that I can check when it was last updated (to see if it is a new file there)", if you are looking for a function to be triggered when new files arrive in your bucket, you can use Google Cloud Functions. The docs explain how to use them with Cloud Storage when new objects are created, although as of the current date (Feb 2018) you can only write Cloud Functions in Node.js.
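If you just want to poll for recently changed files instead, a minimal sketch (the one-hour window is an arbitrary assumption) could compare each blob's updated timestamp against a cutoff:

from datetime import datetime, timedelta, timezone

from google.cloud import storage

client = storage.Client(project='project_ID')
bucket = client.get_bucket('bucket_name')

# Treat anything updated within the last hour as "new"; adjust the window as needed.
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

for blob in bucket.list_blobs(prefix='DIRECTORY/'):
    if blob.updated and blob.updated > cutoff:
        contents = blob.download_as_string()
        print('New file %s (updated %s)' % (blob.name, blob.updated))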
