GCP - get full information about bucket - python-3.x

I need to get information about the files stored in a Google Cloud Storage bucket: file size, storage class, last modified time, and content type. I searched the Google docs, but they only show how to do it with curl or the console. I need to get that information from the Python API, the same way I download or upload blobs. Sample code or any help is appreciated!

To get the object metadata you can use the following code:
from google.cloud import storage

def object_metadata(bucket_name, blob_name):
    """Prints out a blob's metadata."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(blob_name)

    print('Blob: {}'.format(blob.name))
    print('Bucket: {}'.format(blob.bucket.name))
    print('Storage class: {}'.format(blob.storage_class))
    print('ID: {}'.format(blob.id))
    print('Size: {} bytes'.format(blob.size))
    print('Updated: {}'.format(blob.updated))
    print('Generation: {}'.format(blob.generation))
    print('Metageneration: {}'.format(blob.metageneration))
    print('Etag: {}'.format(blob.etag))
    print('Owner: {}'.format(blob.owner))
    print('Component count: {}'.format(blob.component_count))
    print('Crc32c: {}'.format(blob.crc32c))
    print('md5_hash: {}'.format(blob.md5_hash))
    print('Cache-control: {}'.format(blob.cache_control))
    print('Content-type: {}'.format(blob.content_type))
    print('Content-disposition: {}'.format(blob.content_disposition))
    print('Content-encoding: {}'.format(blob.content_encoding))
    print('Content-language: {}'.format(blob.content_language))
    print('Metadata: {}'.format(blob.metadata))

object_metadata('bucketName', 'objectName')
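Note that bucket.get_blob() returns None when the object does not exist, so in practice you may want to check the return value before printing these attributes.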

Using the Cloud Storage client library, and checking the docs for buckets, you can do this to get the storage class:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('YOUR_BUCKET')
print(bucket.storage_class)
As for the size and last modified time (at least that's what I understood from your question), those belong to the files themselves. You can iterate over the blobs in your bucket and check them:
for blob in bucket.list_blobs():
    print(blob.size)
    print(blob.updated)
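For reference, a minimal combined sketch (with a placeholder bucket name) that prints the specific fields the question asks about, i.e. size, storage class, last modified time and content type:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('your-bucket')  # placeholder bucket name

for blob in bucket.list_blobs():
    # These attributes are populated from the object metadata returned by list_blobs().
    print(blob.name, blob.size, blob.storage_class, blob.updated, blob.content_type)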

Related

Loading multiple files from Cloud Storage to BigQuery into different tables

I am new to GCP. I am able to get one file from my VM into GCS and then transfer it to BigQuery.
How do I transfer multiple files from GCS to BigQuery? I know a wildcard URI is the solution, but what other changes are also needed in the code below?
def hello_gcs(event, context):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {file['name']}.")
Since there could be multiple uploads, I cannot define a specific table name or file name. Is it possible to do this task automatically?
This function is triggered by Pub/Sub whenever there is a new file in the GCS bucket.
Thanks
To transfer multiple files from GCS to BigQuery, you can simply loop through all the files. A sample of working code, with comments, is below.
I believe event and context (the function arguments) are handled by Cloud Functions by default, so there is no need to modify that part. Alternatively, you can simplify the code by leveraging event instead of a loop.
def hello_gcs(event, context):
    import re
    from google.cloud import storage
    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound

    bq_client = bigquery.Client()
    bucket = storage.Client().bucket("bucket-name")

    for blob in bucket.list_blobs(prefix="folder-name/"):
        if ".csv" in blob.name:  # Check for csv blobs, as list_blobs also returns the folder name
            job_config = bigquery.LoadJobConfig(
                autodetect=True,
                skip_leading_rows=1,
                source_format=bigquery.SourceFormat.CSV,
            )
            csv_filename = re.findall(r".*/(.*).csv", blob.name)  # Extract the file name for BQ's table id
            bq_table_id = "project-name.dataset-name." + csv_filename[0]  # Determine the table name
            try:  # Check if the table already exists and skip uploading it.
                bq_client.get_table(bq_table_id)
                print("Table {} already exists. Not uploaded.".format(bq_table_id))
            except NotFound:  # If the table is not found, upload it.
                uri = "gs://bucket-name/" + blob.name
                print(uri)
                load_job = bq_client.load_table_from_uri(
                    uri, bq_table_id, job_config=job_config
                )  # Make an API request.
                load_job.result()  # Waits for the job to complete.
                destination_table = bq_client.get_table(bq_table_id)  # Make an API request.
                print("Table {} uploaded.".format(bq_table_id))
Correct me if I am wrong: I understand that your Cloud Function is triggered by a finalize event (Google Cloud Storage Triggers) when a new file (or object) appears in a storage bucket. That means there is one event for each new object in the bucket, and thus at least one invocation of the cloud function for every object.
The link above has an example of the data that comes in the event dictionary. There is plenty of information there, including details of the object (file) to be loaded.
You might like to have some configuration with a mapping between a file name pattern and a target BigQuery table for data loading, for example (see the sketch below). Using that map, you can decide which table should be used for loading, or you may have some other mechanism for choosing the target table.
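As an illustration only (the patterns and table IDs below are hypothetical), such a mapping could look like this inside the function:
import re

# Hypothetical mapping between file-name patterns and target BigQuery tables.
TABLE_MAP = {
    r"^sales_.*\.csv$": "my-project.my_dataset.sales",
    r"^users_.*\.csv$": "my-project.my_dataset.users",
}

def resolve_table(object_name):
    """Return the target table for an object name, or None if no pattern matches."""
    for pattern, table_id in TABLE_MAP.items():
        if re.match(pattern, object_name):
            return table_id
    return None

def hello_gcs(event, context):
    table_id = resolve_table(event["name"])
    if table_id is None:
        print("No table mapping for {}; skipping.".format(event["name"]))
        return
    # ...create a load job targeting table_id, as in the other answers...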
Some other things to think about:
- Exception handling: what are you going to do with the file if the data is not loaded (for any reason)? Who is to be informed, and how? What is to be done to (correct the source data or the target table and) repeat the loading, etc.?
- What happens if the loading takes more time than the Cloud Function timeout (a maximum of 540 seconds at the present moment)?
- What happens if there is more than one cloud function invocation from one finalize event, or from different events but from semantically the same source file (repeated data, duplications, etc.)?
You don't need to answer me; just think about such cases if you have not done so already.
If your data source is GCS and your destination is BQ, you can use the BigQuery Data Transfer Service to ETL your data into BQ. Every transfer job is for a certain table, and you can select whether you want to append or overwrite data in that table.
You can schedule this job as well: daily, weekly, etc.
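For reference, a minimal sketch of creating such a transfer with the google-cloud-bigquery-datatransfer client might look like the following (the project, dataset, bucket and table names are placeholders, and the exact parameters should be checked against the current documentation):
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder names; replace with your own project, dataset, bucket and table.
project_id = "my-project"
dataset_id = "my_dataset"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=dataset_id,
    display_name="Daily GCS load",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://my-bucket/*.csv",
        "destination_table_name_template": "my_table",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print("Created transfer config: {}".format(transfer_config.name))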
To load multiple GCS files into multiple BQ tables in a single Cloud Function invocation, you would need to list those files and then iterate over them, creating a load job for each file, just as you have done for one. But doing all of that work inside a single function call somewhat defeats the purpose of using Cloud Functions.
If your requirements do not force you to do so, you can leverage the power of Cloud Functions and let a single function be triggered by each of those files as they are added to the bucket, since it is an event-driven function. Please refer to https://cloud.google.com/functions/docs/writing/background#cloud-storage-example. It is triggered every time there is a specified activity, for which there is event metadata.
So, in your application, rather than taking the entire bucket contents in the URI, we can take the name of the file which triggered the event and load only that file into a BigQuery table, as shown in the code sample below.
Here is how you can resolve the issue in your code. Try the following changes:
You can extract the details about the event, and about the file which triggered it, from the Cloud Function event dictionary. In your case, we can get the file name as event['name'] and update the uri variable.
Generate a new unique table_id (here, as an example, the table_id is the same as the file name). You can use other schemes to generate unique table names as required.
Refer to the code below:
def hello_gcs(event, context):
    from google.cloud import bigquery

    client = bigquery.Client()  # Construct a BigQuery client object.
    print(f"Processing file: {event['name']}.")  # Name of the file which triggered the function

    if ".csv" in event['name']:
        # BQ job config
        job_config = bigquery.LoadJobConfig(
            autodetect=True,
            skip_leading_rows=1,
            source_format=bigquery.SourceFormat.CSV,
        )
        file_name = event['name'].split('.')
        table_id = "<project_id>.<dataset_name>." + file_name[0]  # Generating a new id for each table
        uri = "gs://<bucket_name>/" + event['name']

        load_job = client.load_table_from_uri(
            uri, table_id, job_config=job_config
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.

        destination_table = client.get_table(table_id)  # Make an API request.
        print("Table {} uploaded.".format(table_id))

Firebase Storage remove custom metadata key

I couldn't remove a custom metadata key from a file in Firebase storage.
This is what I tried so far:
blob = bucket.get_blob("dir/file")
metadata = blob.metadata
metadata.pop('custom_key', None) # or del metadata['custom_key']
blob.metadata = metadata
blob.patch()
I also tried to set its value to None but it didn't help.
It seems there are a couple of reasons that could be preventing you from deleting the custom metadata. I will address them individually, so it's easier to understand.
First, it seems that when you read the metadata with blob.metadata, it is returned as a read-only copy - as clarified here. So your updates will not work the way you are trying them. Second, it seems that saving the metadata back to the blob follows a different order than what you are trying - as shown here.
You can give it a try using the below code:
blob = bucket.get_blob("dir/file")
metadata = blob.metadata
metadata.pop('custom_key', None)
blob.patch()
blob.metadata = metadata
While this code is untested, I believe it might help you by changing the order and avoiding the blob.metadata read-only situation.
In case this doesn't help, I would recommend raising an issue in the official GitHub repository for the Python Cloud Storage library, for further clarification from the developers.
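As another alternative worth trying (untested here): as far as I know, in the google-cloud-storage library you can ask the server to drop a custom metadata key by assigning a mapping in which that key is explicitly set to None, then patching:
blob = bucket.get_blob("dir/file")
# Explicitly setting the key to None should tell the server to remove it.
blob.metadata = {'custom_key': None}
blob.patch()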

How to trigger a python script whenever new data is loaded into a s3 bucket?

I am trying to pull down data from an S3 bucket that gets new records every second. Data comes in at 250+ GB per hour. I am creating a Python script that will run continuously to collect new data loads in real time, by the second.
Here is the structure of the s3 bucket keys:
o_key=7111/year=2020/month=8/day=11/hour=16/minute=46/second=9/ee9.jsonl.gz
o_key=7111/year=2020/month=8/day=11/hour=16/minute=40/second=1/ee99999.jsonl.gz
I am using Boto3 to attempt this, and here is what I have so far:
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, verify=False)
s3_bucket = s3_resource.Bucket(BUCKET_NAME)

files = s3_bucket.objects.filter()
files = [obj.key for obj in sorted(files, key=lambda x: x.last_modified, reverse=True)]

for x in files:
    print(x)
This outputs all the keys in the bucket, sorted by last_modified date. However, is there a way to pause the script until new data is loaded and then process that data, and so on, by the second? There could be 20-second delays when new data is loaded, so that is another thing that is giving me trouble when forming the logic. Any ideas or suggestions would help.
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, verify=False)
s3_bucket = s3_resource.Bucket(BUCKET_NAME)
files = s3_bucket.objects.filter()

while list(files):  # check if the key exists
    if len(objs) > 0 and objs[0].key == key:
        print("Exists!")
    else:
        time.sleep(.1)  # sleep until the next key is there
        continue
This is another approach I tried, but it isn't working too well. I am trying to sleep whenever there is no new data and then process the new data once it is loaded.
The Amazon S3 notification feature enables you to receive notifications when certain events happen in your bucket. To enable notifications, you must first add a notification configuration that identifies the events you want Amazon S3 to publish and the destinations where you want Amazon S3 to send the notifications.
You store this configuration in the notification subresource that is associated with the bucket. The notification destination is typically an AWS Lambda function.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
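For illustration, if the notification target is a Lambda function, a minimal handler sketch (the processing logic is up to you) could look like this:
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # One record per object-created event delivered by the S3 notification.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()
        print("New object {} ({} bytes) in bucket {}".format(key, len(body), bucket))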
Hope this helps
r0ck

Python Script to Return Blob File URI

I am looking for help with a Python script that will access a pre-existing blob in Azure and return the entire URI of a single file that will be in this blob storage.
I've tried a few things, but they are either too old for the current libraries or don't connect when run.
I'm a beginner, so any help walking me through this would be much appreciated. This is what I was trying, but I'm sure this is all wrong:
import os, uuid
import azure.storage.blob

self.blob_type = _BlobTypes.BlockBlob

super(BlockBlobService, self).test(
    account_name=Account_Name,
    account_key='Account key',
    sas_token=None,
    is_emulated=False,
    protocol='https',
    endpoint_suffix='core.windows.net',
    custom_domain=None,
    request_session=None,
    connection_string=None,
    socket_timeout=None,
    token_credential=None)

get_block_list(
    ContainerName,
    BlobName,
    snapshot=None,
    block_list_type=None,
    lease_id=None,
    timeout=None)
From your description, I suppose you want to get the blob URL while the block list is uncommitted. If so, you could use the make_blob_url method to implement it; this can retrieve the blob URL even if the blob doesn't exist.
Below is my test code. First I create a block list but leave it uncommitted; this still gets the blob URL, but even though you can get the URL, it is not accessible because the blob doesn't exist yet.
I am using azure-storage-blob==2.1.0.
from azure.storage.blob import BlockBlobService, PublicAccess,ContentSettings,BlockListType,BlobBlock
connect_str ='connection string'
block_blob_service = BlockBlobService(connection_string=connect_str)
containername='test'
blobname='abc-test.txt'
block_blob_service.put_block(container_name=containername,blob_name=blobname,block=b'AAA',block_id=1)
block_blob_service.put_block(container_name=containername,blob_name=blobname,block=b'BBB',block_id=2)
block_blob_service.put_block(container_name=containername,blob_name=blobname,block=b'CCC',block_id=3)
block_list=block_blob_service.get_block_list(container_name=containername,blob_name=blobname,block_list_type=BlockListType.All)
uncommitted = len(block_list.uncommitted_blocks)
print(uncommitted)
exists=block_blob_service.exists(container_name=containername,blob_name=blobname)
print(exists)
blob_url=block_blob_service.make_blob_url(container_name=containername,blob_name=blobname)
print(blob_url)
block_list = [BlobBlock(id='1'), BlobBlock(id='2'), BlobBlock(id='3')]
block_blob_service.put_block_list(container_name=containername,blob_name=blobname,block_list=block_list)
exists=block_blob_service.exists(container_name=containername,blob_name=blobname)
print(exists)
blob_url=block_blob_service.make_blob_url(container_name=containername,blob_name=blobname)
print(blob_url)
Hope this is what you want. If you still have any other problems, please feel free to let me know.
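As a side note, if you are able to use the newer azure-storage-blob package (v12+), the full URI is exposed directly as the url property of a BlobClient. A rough sketch, reusing the container and blob names from the example above:
from azure.storage.blob import BlobServiceClient

connect_str = 'connection string'
service = BlobServiceClient.from_connection_string(connect_str)
blob_client = service.get_blob_client(container='test', blob='abc-test.txt')
print(blob_client.url)  # e.g. https://<account>.blob.core.windows.net/test/abc-test.txt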

Is it possible to filter AWS S3 objects based on certain metadata entry?

I am using Python 3.6 and the boto3 library to work with some objects in an S3 bucket. I have created some S3 objects with metadata entries. For example:
bucketName = 'Boto3'
objectKey = 'HelloBoto.txt'
metadataDic = {'MetadataCreator':"Ehxn"}
Now I am wondering if it is possible to filter and get only those objects which have a certain metadata entry, for example,
for obj in s3Resource.Bucket(bucketName).objects.filter(Metadata="Ehsan ul haq"):
    print('{0}'.format(obj.key))
No. The list_objects() command does not accept a filter.
You would need to call head_object() to obtain the metadata for each individual object.
Alternatively, you could activate Amazon S3 Inventory - Amazon Simple Storage Service, which can provide a daily listing of all objects with metadata.
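A rough sketch of that head_object() approach (the bucket name and metadata value reuse the question's examples; note that user-defined metadata keys come back lower-cased and without the x-amz-meta- prefix):
import boto3

s3 = boto3.client('s3')
bucket_name = 'Boto3'

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        head = s3.head_object(Bucket=bucket_name, Key=obj['Key'])
        # head['Metadata'] holds the user-defined metadata for this object.
        if head['Metadata'].get('metadatacreator') == 'Ehxn':
            print(obj['Key'])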
