Python Script to Return Blob File URI - azure

I am looking for help with a Python script that will access a preexisting blob in Azure and return the full URI of a single file in that blob.
I've tried a few things, but they are either too old for the current libraries or don't connect when run.
I'm a beginner, so any help walking me through this would be much appreciated. This is what I was trying, but I'm sure this is all wrong:
import os, uuid
import azure.storage.blob

self.blob_type = _BlobTypes.BlockBlob
super(BlockBlobService, self).test(
    account_name=Account_Name,
    account_key='Account key',
    sas_token=None,
    is_emulated=False,
    protocol='https',
    endpoint_suffix='core.windows.net',
    custom_domain=None,
    request_session=None,
    connection_string=None,
    socket_timeout=None,
    token_credential=None)

get_block_list(
    ContainerName,
    BlobName,
    snapshot=None,
    block_list_type=None,
    lease_id=None,
    timeout=None)

From your description, it sounds like you want to get the blob URL while the block list is still uncommitted. If so, you can use the make_blob_url method; it returns the blob URL even if the blob doesn't exist.
Below is my test code. First I stage the blocks but leave them uncommitted; make_blob_url still returns the blob URL, but that URL is not accessible because the blob doesn't exist yet. After the block list is committed, the blob exists and the URL works.
I am using azure-storage-blob==2.1.0.
from azure.storage.blob import BlockBlobService, PublicAccess, ContentSettings, BlockListType, BlobBlock

connect_str = 'connection string'
block_blob_service = BlockBlobService(connection_string=connect_str)
containername = 'test'
blobname = 'abc-test.txt'

# Stage three blocks without committing them.
block_blob_service.put_block(container_name=containername, blob_name=blobname, block=b'AAA', block_id=1)
block_blob_service.put_block(container_name=containername, blob_name=blobname, block=b'BBB', block_id=2)
block_blob_service.put_block(container_name=containername, blob_name=blobname, block=b'CCC', block_id=3)

block_list = block_blob_service.get_block_list(container_name=containername, blob_name=blobname, block_list_type=BlockListType.All)
uncommitted = len(block_list.uncommitted_blocks)
print(uncommitted)

# The blob does not exist yet, but make_blob_url still returns its URL.
exists = block_blob_service.exists(container_name=containername, blob_name=blobname)
print(exists)
blob_url = block_blob_service.make_blob_url(container_name=containername, blob_name=blobname)
print(blob_url)

# Commit the block list; now the blob exists and the URL is accessible.
block_list = [BlobBlock(id='1'), BlobBlock(id='2'), BlobBlock(id='3')]
block_blob_service.put_block_list(container_name=containername, blob_name=blobname, block_list=block_list)
exists = block_blob_service.exists(container_name=containername, blob_name=blobname)
print(exists)
blob_url = block_blob_service.make_blob_url(container_name=containername, blob_name=blobname)
print(blob_url)
Hope this is what you want. If you still have other problems, please feel free to let me know.
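As a side note, azure-storage-blob 2.1.0 is the legacy SDK. If you end up on the current v12 SDK instead, the blob URI is exposed directly as the url property of a BlobClient. A minimal sketch, assuming a connection string and placeholder container/blob names:

from azure.storage.blob import BlobServiceClient

# Placeholders: replace with your own values.
connect_str = 'connection string'
container_name = 'test'
blob_name = 'abc-test.txt'

# Building the client makes no network call, so the URL is available
# even if the blob has not been uploaded yet.
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

print(blob_client.url)  # e.g. https://<account>.blob.core.windows.net/test/abc-test.txt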

Related

How to launch a Cloud Dataflow pipeline from a Google Cloud Function when a particular set of files reaches Cloud Storage

I have a requirement to create a Cloud Function which should check for a set of files in a GCS bucket, and only launch the Dataflow templates for all those files once every file in the set has arrived in the bucket.
My existing Cloud Function code launches a Dataflow job for each file that comes into the GCS bucket. It runs different Dataflow templates for different files based on a naming convention. This existing code is working fine, but my intention is not to trigger Dataflow for each uploaded file directly.
It should check for the set of files, and only when all the files have arrived should it launch the Dataflow jobs for those files.
Is there a way to do this using Cloud Functions, or is there an alternative way of achieving the desired result?
from googleapiclient.discovery import build
import time

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address',
        'Customer_service_ticket'
    ]

    # Check the uploaded file and run related dataflow jobs.
    for i in filesnames:
        if 'inbound/{}'.format(i) in file['name']:
            print("Processing file: {filename}".format(filename=file['name']))

            project = 'xxx'
            inputfile = 'gs://xxx/inbound/' + file['name']
            job = 'df_load_wave1_{}'.format(i)
            template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
            location = 'asia-south1'

            dataflow = build('dataflow', 'v1b3', cache_discovery=False)
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=location,
                body={
                    'jobName': job,
                    "environment": {
                        "workerRegion": "asia-south1",
                        "tempLocation": "gs://xxx/temp"
                    }
                }
            )

            # Execute the dataflow job
            response = request.execute()
            job_id = response["job"]["id"]
I've written the below code for the above functionality. The Cloud Function runs without any error, but it is not triggering any Dataflow job. Not sure what is happening, as the logs show no errors.
from googleapiclient.discovery import build
import time
import os

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address_',
        'Customer_service_ticket_'
    ]
    paths = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']
    for path in paths:
        if os.path.exists('gs://xxx/inbound/') == True:
            # Check the uploaded file and run related dataflow jobs.
            for i in filesnames:
                if 'inbound/{}'.format(i) in file['name']:
                    print("Processing file: {filename}".format(filename=file['name']))

                    project = 'xxx'
                    inputfile = 'gs://xxx/inbound/' + file['name']
                    job = 'df_load_wave1_{}'.format(i)
                    template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
                    location = 'asia-south1'

                    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
                    request = dataflow.projects().locations().templates().launch(
                        projectId=project,
                        gcsPath=template,
                        location=location,
                        body={
                            'jobName': job,
                            "environment": {
                                "workerRegion": "asia-south1",
                                "tempLocation": "gs://xxx/temp"
                            }
                        }
                    )

                    # Execute the dataflow job
                    response = request.execute()
                    job_id = response["job"]["id"]
        else:
            exit()
Could someone please help me with the above Python code?
Also, my file names contain the current date at the end, as these are incremental files which I get from different source teams.
If I'm understanding your question correctly, the easiest thing to do is to write basic logic in your function that determines if the entire set of files is present. If not, exit the function. If yes, run the appropriate Dataflow pipeline. Basically implementing what you wrote in your first paragraph as Python code.
If it's a small set of files it shouldn't be an issue to have a function run on each upload to check set completeness. Even if it's, for example, 10,000 files a month the cost is extremely small for this service assuming:
Your function isn't using lots of bandwidth to transfer data
The code for each function invocation doesn't take a long time to run.
Even in scenarios where you can't meet these requirements, Cloud Functions is still pretty cheap to run.
If you're worried about costs I would recommend checking out the Google Cloud Pricing Calculator to get an estimate.
Edit with updated code:
I would highly recommend using the Google Cloud Storage Python client library for this. Using os.path won't work, since gs:// paths are not local filesystem paths; checking a bucket requires calls to the Cloud Storage API, which the client library handles for you.
To use the Python client library, add google-cloud-storage to your requirements.txt. Then, use something like the following code to check the existence of an object. This example is based off an HTTP trigger, but the gist of the code to check object existence is the same.
from google.cloud import storage
import os

def hello_world(request):
    # Instantiate GCS client
    client = storage.client.Client()

    # Instantiate bucket definition
    bucket = storage.bucket.Bucket(client, name="bucket-name")

    # Search for each expected object (filenames is assumed to be your list of object names)
    for file in filenames:
        if storage.blob.Blob(file, bucket).exists() and "name_modifier" in file:
            # Run name_modifier Dataflow job
            pass
        elif storage.blob.Blob(file, bucket).exists() and "name_modifier_2" in file:
            # Run name_modifier_2 Dataflow job
            pass
        else:
            return "File not found"
This code isn't exactly what you want from a logic standpoint, but it should get you started. You'll probably want to first make sure all of the objects can be found, and then move to another step where you run the corresponding Dataflow job for each file if they were all found in the previous step.
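To make that two-step idea concrete, here is a rough, untested sketch that combines the existence check with the template launch from your original function. The file-name prefixes, bucket, project, and template paths are placeholders copied from your snippets; in practice you would append the expected date suffix before checking.

from google.cloud import storage
from googleapiclient.discovery import build

EXPECTED_PREFIXES = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']

def df_load_function(file, context):
    client = storage.Client()
    bucket = client.bucket('xxx')  # placeholder bucket name

    # Step 1: only continue if every expected file is present under inbound/.
    # (Here the expected object names are just the prefixes; add the date suffix as needed.)
    expected_objects = ['inbound/{}'.format(p) for p in EXPECTED_PREFIXES]
    if not all(storage.Blob(name, bucket).exists(client) for name in expected_objects):
        print('File set not complete yet, exiting.')
        return

    # Step 2: the set is complete, so launch one Dataflow template per file.
    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    for p in EXPECTED_PREFIXES:
        request = dataflow.projects().locations().templates().launch(
            projectId='xxx',
            gcsPath='gs://xxx/template/df_load_wave1_{}'.format(p),
            location='asia-south1',
            body={
                'jobName': 'df_load_wave1_{}'.format(p),
                'environment': {
                    'workerRegion': 'asia-south1',
                    'tempLocation': 'gs://xxx/temp'
                }
            }
        )
        response = request.execute()
        print('Launched job', response['job']['id'])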

Can't list bucket objects on Scaleway using boto3

I saw a few similar posts, but unfortunately none helped me.
I have an S3 bucket (on Scaleway), and I'm trying to simply list all objects contained in that bucket, using the boto3 S3 client as follows:
s3 = boto3.client(
    's3',
    region_name=AWS_S3_REGION_NAME,
    endpoint_url=AWS_S3_ENDPOINT_URL,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

all_objects = s3.list_objects_v2(Bucket=AWS_STORAGE_BUCKET_NAME)
This simple piece of code responds with an error:
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjects operation: The specified key does not exist.
First, the error seems inappropriate to me since I'm not specifying any key to search for. I also tried to pass a Prefix argument to this method to narrow down the search to a specific subdirectory, and got the same error.
Second, I tried to achieve the same thing using a boto3 Resource rather than a Client, as follows:
session = boto3.Session(
    region_name=AWS_S3_REGION_NAME,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)
resource = session.resource(
    's3',
    endpoint_url=AWS_S3_ENDPOINT_URL,
)

for bucket in resource.buckets.all():
    print(bucket.name)
That code produces absolutely nothing. One weird thing that strikes me is that I don't pass the bucket_name anywhere here, which seems to be normal according to the AWS documentation.
There's no chance that I misconfigured the client, since I'm able to use the put_object method perfectly well with that same client. One strange thing, though: when I want to put a file, I pass the whole path to put_object as Key (as I found that to be the way to go), but the object is inserted with the bucket name prepended to it. So if I call put_object(Key='/path/to/myfile.ext'), the object ends up as /bucket-name/path/to/myfile.ext.
Is this strange behavior the key to my problem? How can I investigate what's happening, or is there another way I could try to list bucket files?
Thank you
EDIT: After logging the request that the boto3 client sends, I noticed that the bucket name is appended to the URL, so instead of requesting https://<bucket_name>.s3.<region>.<provider>/, it requests https://<bucket_name>.s3.<region>.<provider>/<bucket-name>/, which leads to the NoSuchKey error.
I took a look into the botocore library, and I found this:
url = _urljoin(endpoint_url, r['url_path'], host_prefix)
in botocore.awsrequest line 252, where r['url_path'] contains /skichic-bucket?list-type=2. So from here, I should be able to easily patch the library core to make it work for me.
Plus, the Prefix argument is not working; whatever I pass into it, I always receive the whole bucket content, but I guess I can easily patch this too.
Still, this is not satisfying, since there's no issue related to this on GitHub; I can't believe the library contains such a bug and that I'm the first one to encounter it.
Can anyone explain this whole mess? >.<
For those who are facing the same issue, try changing your endpoint_url parameter in your boto3 client or resource instantiation from https://<bucket_name>.s3.<region>.<provider> to https://s3.<region>.<provider>, i.e. for Scaleway: https://s3.<region>.scw.cloud.
You can then set the Bucket parameter to select the bucket you want.
list_objects_v2(Bucket=<bucket_name>)
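Putting that together, a minimal sketch of the corrected client (the fr-par region is just an example; keep your own credentials, region, and bucket name):

import boto3

s3 = boto3.client(
    's3',
    region_name='fr-par',                        # example Scaleway region
    endpoint_url='https://s3.fr-par.scw.cloud',  # regional endpoint, no bucket name in it
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

# The bucket is now selected per call instead of being baked into the endpoint.
response = s3.list_objects_v2(Bucket='your-bucket-name', Prefix='path/to/')
for obj in response.get('Contents', []):
    print(obj['Key'])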
You can try this. You'll have to use your own resource instead of my s3sr.
from boto3 import resource

s3sr = resource('s3')
bucket = 'your-bucket'
prefix = 'your-prefix/'  # if no prefix, pass ''

def get_keys_from_prefix(bucket, prefix):
    '''gets list of keys for given bucket and prefix'''
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    # use Delimiter to limit search to that level of hierarchy
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        # Contents is absent on empty pages, so default to an empty list
        keys = [content['Key'] for content in page.get('Contents', [])]
        print('keys in page: ', len(keys))
        keys_list.extend(keys)
    return keys_list

keys_list = get_keys_from_prefix(bucket, prefix)
After looking more closely into things, I've found out that a lot of botocore service endpoint patterns start with the bucket name. For example, here's the definition of the ListObjectsV2 operation:
"ListObjectsV2":{
"name":"ListObjectsV2",
"http":{
"method":"GET",
"requestUri":"/{Bucket}?list-type=2"
},
My guess is that in the standard implementation of AWS S3, there's a generic endpoint_url (which explains #jordanm's comment) and the targeted bucket is reached through the request path appended to that endpoint.
Now, in the case of Scaleway, there's an endpoint_url for each bucket, with the bucket name contained in that URL (e.g. https://<bucket_name>.s3.<region>.<provider>), so any request path should start directly with the object key rather than the bucket name.
I made a fork of botocore where I rewrote every endpoint to remove the bucket name, in case that helps someone in the future.
Thanks again to all contributors!

Firebase Storage remove custom metadata key

I couldn't remove a custom metadata key from a file in Firebase storage.
This is what I tried so far:
blob = bucket.get_blob("dir/file")
metadata = blob.metadata
metadata.pop('custom_key', None) # or del metadata['custom_key']
blob.metadata = metadata
blob.patch()
I also tried to set its value to None but it didn't help.
It seems that there are a couple of reasons that could be preventing you from deleting the custom metadata. I will address them individually, so it's easier to understand.
First, it seems that when you read the metadata with blob.metadata, it is returned as read-only - as clarified here. So your updates will not take effect the way you are applying them. Second, saving the metadata back to the blob seems to follow a different order than the one you are using - as shown here.
You can give it a try using the below code:
blob = bucket.get_blob("dir/file")
metadata = blob.metadata
metadata.pop('custom_key', None)
blob.patch()
blob.metadata = metadata
While this code is untested, I believe it might help you by changing the order and avoiding the blob.metadata read-only situation.
In case this doesn't help you, I would recommend raising an issue in the official GitHub repository for the Python Cloud Storage client library, for further clarification from the developers.
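One more thing worth trying, though I haven't verified it against Firebase Storage specifically: with recent google-cloud-storage releases, assigning the key an explicit None value (or setting the whole metadata property to None to clear everything) before calling patch() should send an explicit null for that field, which the API treats as a delete. A minimal, untested sketch:

blob = bucket.get_blob("dir/file")

# Explicitly null out the key you want removed, then patch.
blob.metadata = {'custom_key': None}
blob.patch()

# Or drop all custom metadata at once.
blob.metadata = None
blob.patch()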

Function service_account.Credentials.from_service_account_info() not working

I'm writing an application based on GCP services and I need to access an external project. I stored the authentication file's information for the other project I need to access in my Firestore database. I read this documentation and tried to apply it, but my code does not work. As the documentation says, what I pass to the authentication method is a dictionary[str, str].
This is my code:
from googleapiclient import discovery
from google.oauth2 import service_account
from google.cloud import firestore

project_id = body['project_id']
user = body['user']
snap_id = body['snapshot_id']
debuggee_id = body['debuggee_id']

db = firestore.Client()
ref = db.collection(u'users').document(user).collection(u'projects').document(project_id)
if ref.get().exists:
    service_account_info = ref.get().to_dict()
else:
    return None, 411

credentials = service_account.Credentials.from_service_account_info(
    service_account_info,
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

service = discovery.build('clouddebugger', 'v2', credentials=credentials)
body is just a dictionary containing all the information about the other project. What I can't understand is why this doesn't work, while the method from_service_account_file works.
The following code gives that method the same information as the previous code, but from a JSON file instead of a dictionary. Maybe the order of the elements is different, but I think that doesn't matter at all.
credentials = service_account.Credentials.from_service_account_file(
    [PATH_TO_PROJECT_KEY_FILE],
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
Can you tell me what I'm doing wrong with the method from_service_account_info?
Problem solved. When I posted the question, I had manually inserted all the info about the other project from the GCP Firestore console. Then I wrote code to insert it automatically, and it worked. Honestly, I don't know why it didn't work before; the information put into Firestore was the same, and so was the format.
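For anyone who hits something similar: from_service_account_info simply expects the dictionary to contain the same fields as the downloaded service-account JSON key file. A minimal sketch of the expected shape (all values below are placeholders):

from google.oauth2 import service_account

# Same fields as the service-account JSON key file (placeholder values).
service_account_info = {
    "type": "service_account",
    "project_id": "other-project-id",
    "private_key_id": "abc123",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "sa-name@other-project-id.iam.gserviceaccount.com",
    "client_id": "1234567890",
    "token_uri": "https://oauth2.googleapis.com/token",
}

credentials = service_account.Credentials.from_service_account_info(
    service_account_info,
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

A common gotcha when storing this in Firestore is the private_key field: it has to keep its real newline characters rather than escaped \n sequences, otherwise the credentials can fail to load.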

GCP - get full information about bucket

I need to get the information for files stored in a Google Cloud Storage bucket: file size, storage class, last modified, type. I searched the Google docs, but they only show how to do this with curl or the console. I need to get that information from the Python API, just like downloading or uploading a blob to the bucket. Sample code or any help is appreciated!
To get the object metadata you can use the following code:
from google.cloud import storage

def object_metadata(bucket_name, blob_name):
    """Prints out a blob's metadata."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(blob_name)

    print('Blob: {}'.format(blob.name))
    print('Bucket: {}'.format(blob.bucket.name))
    print('Storage class: {}'.format(blob.storage_class))
    print('ID: {}'.format(blob.id))
    print('Size: {} bytes'.format(blob.size))
    print('Updated: {}'.format(blob.updated))
    print('Generation: {}'.format(blob.generation))
    print('Metageneration: {}'.format(blob.metageneration))
    print('Etag: {}'.format(blob.etag))
    print('Owner: {}'.format(blob.owner))
    print('Component count: {}'.format(blob.component_count))
    print('Crc32c: {}'.format(blob.crc32c))
    print('md5_hash: {}'.format(blob.md5_hash))
    print('Cache-control: {}'.format(blob.cache_control))
    print('Content-type: {}'.format(blob.content_type))
    print('Content-disposition: {}'.format(blob.content_disposition))
    print('Content-encoding: {}'.format(blob.content_encoding))
    print('Content-language: {}'.format(blob.content_language))
    print('Metadata: {}'.format(blob.metadata))

object_metadata('bucketName', 'objectName')
Using the Cloud Storage client library, and checking at the docs for buckets, you can do this to get the Storage Class:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('YOUR_BUCKET')
print(bucket.storage_class)
As for the size and last-modified time (at least, that's what I understood from your question), those belong to the files themselves. You could iterate over the list of blobs in your bucket and check them:
for blob in bucket.list_blobs():
    print(blob.size)
    print(blob.updated)
