Writing a new file to a Google Cloud Storage bucket from a Google Cloud Function (Python) - python-3.x

I am trying to write a new file (not upload an existing file) to a Google Cloud Storage bucket from inside a Python Google Cloud Function.
I tried using google-cloud-storage but it does not have the "open" attribute for the bucket.
I tried to use the App Engine library GoogleAppEngineCloudStorageClient but the function cannot be deployed with this dependency.
I tried to use gcs-client but I cannot pass the credentials inside the function as it requires a JSON file.
Any ideas would be much appreciated.
Thanks.

from google.cloud import storage
import io
# bucket name
bucket = "my_bucket_name"
# Get the bucket that the file will be uploaded to.
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket)
# Create a new blob and upload the file's content.
my_file = bucket.blob('media/teste_file01.txt')
# create in memory file
output = io.StringIO("This is a test \n")
# upload from string
my_file.upload_from_string(output.read(), content_type="text/plain")
output.close()
# list created files
blobs = storage_client.list_blobs(bucket)
for blob in blobs:
    print(blob.name)
# Make the blob publicly viewable.
my_file.make_public()

You can now write files directly to Google Cloud Storage. It is no longer necessary to create a file locally and then upload it.
You can use blob.open() as follows:
from google.cloud import storage

def write_file(lines):
    # lines: any iterable of strings to write to the new object
    client = storage.Client()
    bucket = client.get_bucket('bucket-name')
    blob = bucket.blob('path/to/new-blob.txt')
    with blob.open(mode='w') as f:
        for line in lines:
            f.write(line)
You can find more examples and snippets here:
https://github.com/googleapis/python-storage/tree/main/samples/snippets
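As a further illustration of the blob.open() approach, here is a minimal sketch that streams CSV rows into a new object. The helper name write_rows_as_csv and the sample rows are hypothetical. The function only assumes it receives an object whose open() method behaves like the one on google.cloud.storage.Blob in recent versions of google-cloud-storage.

```python
import csv


def write_rows_as_csv(blob, rows):
    """Write an iterable of rows to `blob` as CSV text and return the row count.

    `blob` is expected to behave like google.cloud.storage.Blob, i.e. to
    offer open(mode) returning a writable text file object.
    """
    count = 0
    with blob.open("w") as f:
        writer = csv.writer(f, lineterminator="\n")
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# Usage inside a Cloud Function (bucket and object names are placeholders):
#   from google.cloud import storage
#   blob = storage.Client().bucket("bucket-name").blob("path/to/report.csv")
#   write_rows_as_csv(blob, [["id", "name"], [1, "alice"]])
```

Passing the blob in, rather than creating the client inside the helper, keeps the function easy to exercise locally with a stand-in object.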

You have to create your file locally and then push it to GCS. You can't create a file dynamically in GCS by using open.
For this, you can write to the /tmp directory, which is an in-memory file system. As a consequence, you can never create a file bigger than the memory allocated to your function minus the memory footprint of your code. With a 2 GB function, you can expect a maximum file size of about 1.5 GB.
Note: GCS is not a file system, and you should not use it like one.
EDIT 1
Things have changed since my answer:
It's now possible to write to any directory in the container (not only /tmp).
You can stream-write a file to GCS, just as you can receive it in streaming mode on Cloud Run. Here is a sample to stream write to GCS.
Note: stream writing deactivates checksum validation. Therefore, you won't have integrity checks at the end of the file stream write.
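The original "write locally, then push" approach from the top of this answer could look roughly like the sketch below. The helper name upload_via_tmp is hypothetical, and `bucket` is assumed to be a google.cloud.storage.Bucket (or anything exposing blob(name).upload_from_filename(path)).

```python
import os
import tempfile


def upload_via_tmp(bucket, blob_name, text):
    """Write `text` to a temporary local file, upload it, then delete it."""
    # mkstemp uses the default temp directory, typically /tmp on Linux
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        bucket.blob(blob_name).upload_from_filename(path)
    finally:
        # When /tmp is memory-backed, remove the file to free that memory
        os.remove(path)
    return path


# Usage (bucket name is a placeholder):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my_bucket_name")
#   upload_via_tmp(bucket, "media/test_file.txt", "This is a test\n")
```

Deleting the temporary file in a finally block matters here because, as noted above, files in /tmp count against the function's memory.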

Related

Give direct access of local files to Document AI

I know there is a way by which we can call Document AI from python environment in local system. In that process one needs to upload the local file to GCS bucket so that Document AI can access the file from there. Is there any way by which we can give direct access of local files to Document AI (i.e., without uploading the file to GCS bucket) using python? [Note that it's a mandatory requirement for me to run python code in local system, not in GCP.]
Document AI cannot "open" files from your local filesystem by itself.
If you don't want to (or cannot) upload the documents to a bucket, you can send them inline as part of the REST API request. However, in this case you cannot use batch processing: you must process the files one by one and wait for each response.
The relevant REST API documentation is here: https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/process
The quickstart documentation for Python has this sample code, which reads a file and sends it inline as part of the request:
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processors/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
# Read the file into memory
with open(file_path, "rb") as image:
    image_content = image.read()
document = {"content": image_content, "mime_type": "application/pdf"}
# Configure the process request
request = {"name": name, "raw_document": document}
result = client.process_document(request=request)
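Since inline requests have to be sent one file at a time, a small loop around the quickstart snippet may be all that is needed. The following is only a sketch: the helper name process_local_files is hypothetical, `client` is assumed to be an already configured Document AI client exposing process_document(request=...), and falling back to "application/pdf" when the type cannot be guessed is an assumption made for illustration.

```python
import mimetypes


def process_local_files(client, processor_name, file_paths):
    """Send each local file inline to a processor and collect the results."""
    results = []
    for path in file_paths:
        # Guess the MIME type from the file extension
        mime_type, _ = mimetypes.guess_type(path)
        with open(path, "rb") as f:
            content = f.read()
        request = {
            "name": processor_name,
            "raw_document": {
                "content": content,
                # ASSUMPTION: treat unknown types as PDF
                "mime_type": mime_type or "application/pdf",
            },
        }
        # Each call blocks until the processor responds
        results.append(client.process_document(request=request))
    return results
```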

Azure Blob Using Python

I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the step of downloading the file locally. Is there a way to achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating path is not found.
Any help is appreciated. Thanks!
The file_path argument of create_blob_from_path is the path of a local file, something like "C:\xxx\xxx". The value you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a blob URL.
You could download your file to a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
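A minimal sketch of the "download to a byte array, then upload" suggestion, using only the standard library for the download. The helper name copy_url_to_blob and its `opener` parameter are hypothetical (the latter exists only so the download step can be swapped out); `block_blob_service` is assumed to be the legacy BlockBlobService object from the question.

```python
from urllib.request import urlopen


def copy_url_to_blob(block_blob_service, url, container_name, blob_name, opener=urlopen):
    """Download `url` into memory and store the bytes as a block blob."""
    with opener(url) as response:
        data = response.read()  # whole file is held in memory, never on disk
    block_blob_service.create_blob_from_bytes(container_name, blob_name, data)
    return len(data)


# Usage (URL and names are placeholders):
#   copy_url_to_blob(block_blob_service, "https://example.com/export.csv",
#                    "containername", "blobname")
```

For very large files, create_blob_from_stream would avoid holding the whole payload in memory.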
The other answer uses the so-called "Azure SDK for Python legacy".
For a fresh implementation, I recommend using a Gen2 storage account (instead of Gen1 or Blob storage).
For a Gen2 storage account, see the example here:
from azure.storage.filedatalake import DataLakeFileClient

data = b"abc"
file = DataLakeFileClient.from_connection_string("my_connection_string",
                                                 file_system_name="myfilesystem", file_path="myfile")
# Create the file before appending to it
file.create_file()
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you append multiple times, you have to keep track of the offset on the client side.
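One way to take the pain out of the offset bookkeeping is a thin wrapper that remembers how many bytes have been appended so far. This is only a sketch: the class name OffsetTrackingAppender is hypothetical, and `file_client` is assumed to be a DataLakeFileClient (or any object with the same append_data and flush_data methods) whose file has already been created.

```python
class OffsetTrackingAppender:
    """Keep track of the append offset for a Data Lake file client."""

    def __init__(self, file_client, start_offset=0):
        self._file = file_client
        self.offset = start_offset

    def append(self, data):
        # Each chunk is appended at the current end of the uncommitted data
        self._file.append_data(data, offset=self.offset, length=len(data))
        self.offset += len(data)

    def flush(self):
        # flush_data takes the total length of the file after the appends
        self._file.flush_data(self.offset)


# Usage (assuming `file` is the DataLakeFileClient from the answer above):
#   appender = OffsetTrackingAppender(file)
#   appender.append(b"abc")
#   appender.append(b"defg")
#   appender.flush()
```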

upload time very slow with multiple file as blob uploads on Azure storage containers using Python

I want to store some user images from a feedback section of an app I am creating for which I am using Azure containers.
Although I am able to store the images, the process takes ~150 to 200 seconds for 3 files of ~190 KB each.
Is there a better way to upload the files, or am I missing something here (the file type used for the upload, maybe)?
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, ContentSettings
connect_str = 'my_connection_string'
# Create the BlobServiceClient object which will be used to create a container client
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
photos_object_list = []
my_content_settings = ContentSettings(content_type='image/png')
# Files to upload
file_list = [
    '/Users/vikasnair/Documents/feedbackimages/reference_0.jpg',
    '/Users/vikasnair/Documents/feedbackimages/reference_1.jpeg',
    '/Users/vikasnair/Documents/feedbackimages/reference_2.jpeg',
]
# Creating list of file objects as I would be taking list of file objects from the front end as an input
for i in range(0, len(file_list)):
    photos_object_list.append(open(file_list[i], 'rb'))
import timeit
start = timeit.default_timer()
if photos_object_list is not None:
    for u in range(0, len(photos_object_list)):
        blob_client = blob_service_client.get_blob_client(container="container/folder", blob=loc_id+'_'+str(u)+'.jpg')
        blob_client.upload_blob(photos_object_list[u], overwrite=True, content_settings=my_content_settings)
stop = timeit.default_timer()
print('Time: ', stop - start)
A couple of things you can do to decrease the duration of the upload:
Reuse blob_service_client for all the uploads rather than creating a new blob_client for each file.
Use the async blob methods to upload all the files in parallel rather than sequentially.
If the target blob parent folder is static, you can also use an Azure file share with your container instance. This will allow you to use regular file system operations to persist files to and retrieve files from a storage account.
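The parallel-upload suggestion could be sketched as follows. The helper name upload_all is hypothetical; `container_client` is assumed to be an async container client such as azure.storage.blob.aio.ContainerClient, whose upload_blob(name, data, overwrite=...) is awaitable. The async clients need an async HTTP transport such as aiohttp to be installed.

```python
import asyncio


async def upload_all(container_client, named_payloads):
    """Upload every (blob_name, data) pair concurrently and return the names."""
    named_payloads = list(named_payloads)
    # Create all upload coroutines first, then run them concurrently
    tasks = [
        container_client.upload_blob(name, data, overwrite=True)
        for name, data in named_payloads
    ]
    await asyncio.gather(*tasks)
    return [name for name, _ in named_payloads]


# Usage (connection string, container and blob names are placeholders):
#   from azure.storage.blob.aio import BlobServiceClient
#
#   async def main():
#       async with BlobServiceClient.from_connection_string(connect_str) as service:
#           container = service.get_container_client("container")
#           await upload_all(container, [("folder/id_0.jpg", data_0),
#                                        ("folder/id_1.jpg", data_1)])
#
#   asyncio.run(main())
```

A single container client is shared by all uploads, so connection setup is paid only once.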

I am not able to read dat file from S3 bucket using lambda function

I have been trying to read a .dat file from one S3 bucket, convert it into CSV, compress it, and put it into another bucket.
To open and read the file I am using the code below, but it throws the error "No such file or directory":
with open(f's3://{my_bucket}/{filenames}', 'rb') as dat_file:
    print(dat_file)
The Python language does not natively know how to access Amazon S3.
Instead, you can use the boto3 AWS SDK for Python. See: S3 — Boto 3 documentation
You also have two choices about how to access the content of the file:
Download the file to your local disk using download_file(), then use open() to access the local file, or
Use get_object() to obtain a StreamingBody of the file contents
See also: Amazon S3 Examples — Boto 3 documentation
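The get_object() route, combined with the conversion and compression steps from the question, might look like the sketch below. The helper name dat_to_gzipped_csv is hypothetical, and the layout of the .dat file is not given in the question, so the code assumes UTF-8 text with whitespace-separated fields. `s3_client` is assumed to be a boto3 S3 client, i.e. boto3.client("s3").

```python
import csv
import gzip
import io


def dat_to_gzipped_csv(s3_client, src_bucket, src_key, dst_bucket, dst_key):
    """Read a .dat object, convert it to CSV, gzip it and store the result."""
    # get_object returns a dict whose "Body" entry is a StreamingBody
    body = s3_client.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    text = body.read().decode("utf-8")

    # ASSUMPTION: each non-empty line holds whitespace-separated fields
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():
            writer.writerow(line.split())

    compressed = gzip.compress(out.getvalue().encode("utf-8"))
    s3_client.put_object(Bucket=dst_bucket, Key=dst_key, Body=compressed)
    return compressed


# Usage inside a Lambda handler (bucket names and keys are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   dat_to_gzipped_csv(s3, "source-bucket", "in/data.dat",
#                      "target-bucket", "out/data.csv.gz")
```

Everything happens in memory, so this only suits files that fit comfortably within the Lambda's memory limit.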

Unable to use data from Google Cloud Storage in App Engine using Python 3

How can I read the data stored in my Cloud Storage bucket of my project and use it in my Python code that I am writing in App Engine?
I tried using:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(source_blob_name)
But I am unable to figure out how to extract actual data from the code to get it in a usable form.
Any help would be appreciated.
Getting a file from a Google Cloud Storage bucket means that you are just getting an object. This concept abstracts the file itself from your code. You will either need to store the file locally to perform any operation on it or, depending on the file type, pass the object to a stream reader or whatever method you need to read the file.
Here you can see a code example of how to read a file from App Engine:
# Assumes the legacy App Engine client library: import cloudstorage as gcs
def read_file(self, filename):
    self.response.write('Reading the full file contents:\n')
    gcs_file = gcs.open(filename)
    contents = gcs_file.read()
    gcs_file.close()
    self.response.write(contents)
You have a couple of options.
content = blob.download_as_string() --> Returns the content of your Cloud Storage object as bytes (newer library versions offer download_as_bytes() and download_as_text() instead).
blob.download_to_file(file_obj) --> Writes the Cloud Storage object content into an existing file object.
blob.download_to_filename(filename) --> Saves the object in a file. In the App Engine standard environment, you can store files in the /tmp/ directory.
Refer to this link for more information.
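To show how the downloaded content becomes usable data, here is a minimal sketch that reads an object and parses it as CSV. The helper name read_csv_blob is hypothetical, the object is assumed to be UTF-8 encoded CSV with a header row, and `storage_client` is assumed to be a google.cloud.storage.Client. download_as_bytes() is available in recent versions of google-cloud-storage; older versions expose the same behavior as download_as_string().

```python
import csv
import io


def read_csv_blob(storage_client, bucket_name, blob_name):
    """Download an object and parse it as CSV into a list of dicts."""
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    # The download returns raw bytes, so decode before parsing
    text = blob.download_as_bytes().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Usage (bucket and object names are placeholders):
#   from google.cloud import storage
#   rows = read_csv_blob(storage.Client(), "my-bucket", "data/users.csv")
#   for row in rows:
#       print(row["name"])
```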
