How do I use a cloud function to unzip a large file in cloud storage? - python-3.x

I have a cloud function which is triggered when a zip file is uploaded to Cloud Storage and is supposed to unpack it. However, the function runs out of memory, presumably because the unzipped file is too large (~2.2 GB).
I was wondering what my options are for dealing with this problem? I read that it's possible to stream large files into Cloud Storage, but I don't know how to do that from a cloud function, or while unzipping. Any help would be appreciated.
Here is the code of the cloud function so far:
import io
from zipfile import ZipFile, is_zipfile
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucket-name")
destination_blob_filename = "large_file.zip"
blob = bucket.blob(destination_blob_filename)

# The whole archive is downloaded into memory here, which is what exhausts
# the function's memory for large zips.
zipbytes = io.BytesIO(blob.download_as_string())

if is_zipfile(zipbytes):
    with ZipFile(zipbytes, 'r') as myzip:
        for contentfilename in myzip.namelist():
            contentfile = myzip.read(contentfilename)
            blob = bucket.blob(contentfilename)
            blob.upload_from_string(contentfile)

Your target process is risky:
If you stream the file without fully unzipping it, you can't validate the checksum of the zip.
If you stream data into GCS, file integrity is not guaranteed.
Thus, you have two "successful" operations without any checksum validation!
Instead of a Cloud Function or Cloud Run service with more memory, you can use a Dataflow template to unzip your files.
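If you accept that trade-off, a streaming approach is possible because the Python client can expose a blob as a seekable, file-like reader. The following is only a minimal sketch, assuming google-cloud-storage 1.38+ (for blob.open()) and a background Cloud Storage trigger; the bucket name is a placeholder:
import zipfile
from google.cloud import storage

def unzip_gcs_zip(event, context):
    client = storage.Client()
    bucket = client.get_bucket("bucket-name")   # placeholder
    zip_blob = bucket.blob(event["name"])       # the uploaded .zip object

    # blob.open("rb") returns a seekable, file-like reader, so ZipFile can read
    # the central directory via range requests instead of a full download.
    with zip_blob.open("rb") as zip_stream:
        with zipfile.ZipFile(zip_stream) as archive:
            for member in archive.namelist():
                # Stream each member straight back into Cloud Storage; only one
                # chunk of decompressed data is held in memory at a time.
                with archive.open(member) as member_stream:
                    bucket.blob(member).upload_from_file(member_stream)
Peak memory then depends on the read and upload chunk sizes rather than on the archive size, though, as noted above, you give up whole-archive checksum validation.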

Related

Python code to transfer file from GCS to SFTP server using GCP's Cloud Function

Hi, I am new to Python and I was wondering whether someone can help me with the following:
I need to write code in a cloud function to copy a .csv file from a bucket in GCS to an SFTP server.
My bucket is called 001b and the file is called test.csv, and I have the hostname, username, port number and password of the SFTP server: username=uid, password=mypassword, port=22, host https://....
I am trying to create a cloud function with a trigger so that every time the file is created in the above bucket it is transferred to the SFTP server. There will always be only one file in the bucket, as the csv is overwritten daily.
I am using the 2nd gen environment and have my trigger set to Cloud Storage with event type google.cloud.storage.object.v1.finalized.
I really need help with the code for main.py and requirements.txt for Python 3.8.
Any help is appreciated.
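A minimal sketch of what main.py could look like, assuming the paramiko library for SFTP and the 2nd gen CloudEvent function signature; the host, credentials, and remote path are placeholders:
import io

import functions_framework
import paramiko
from google.cloud import storage


@functions_framework.cloud_event
def gcs_to_sftp(cloud_event):
    data = cloud_event.data  # google.cloud.storage.object.v1.finalized payload

    # Download the triggering object (e.g. test.csv in bucket 001b) into memory.
    client = storage.Client()
    blob = client.bucket(data["bucket"]).blob(data["name"])
    buffer = io.BytesIO(blob.download_as_bytes())

    # Connect to the SFTP server and upload the file under the same name.
    transport = paramiko.Transport(("sftp.example.com", 22))      # placeholder host
    try:
        transport.connect(username="uid", password="mypassword")  # placeholders
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.putfo(buffer, f"/upload/{data['name']}")              # placeholder path
        sftp.close()
    finally:
        transport.close()
With this sketch, requirements.txt would list functions-framework, google-cloud-storage, and paramiko.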

How to set output as azure blob in FFMPEG command

In my Queue trigger Azure Function, I am resizing a video using an FFMPEG command.
subprocess.run(["ffmpeg", "-i",
"input.mp4",
"-filter_complex", "scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:-1:-1,setsar=1,fps=25",
"-c:v","libx264",
"-c:a","aac",
"-preset:v", "ultrafast",
"output.mp4"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
Currently the output is written locally, using the Azure Function's local space. What I want is for the output file to be written directly to an Azure container as a blob, not saved locally, since the function has limited local space and resizing needs more space than is available, which is causing the issue.
How can I achieve this?
FFMPEG supports in-memory pipes for both input and output, and the piped data can be stored directly as a blob in Blob Storage.
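That approach might look like the following minimal sketch, assuming the azure-storage-blob v12 SDK; the connection string, container name, and blob name are placeholders:
import subprocess
from azure.storage.blob import BlobClient

def resize_to_blob(input_path: str) -> None:
    # Write the resized video to stdout instead of a local file; MP4 needs
    # fragmented-MP4 movflags to be writable to a non-seekable pipe.
    proc = subprocess.Popen(
        ["ffmpeg", "-i", input_path,
         "-filter_complex",
         "scale=1920:1080:force_original_aspect_ratio=decrease,"
         "pad=1920:1080:-1:-1,setsar=1,fps=25",
         "-c:v", "libx264", "-c:a", "aac", "-preset:v", "ultrafast",
         "-movflags", "frag_keyframe+empty_moov", "-f", "mp4", "pipe:1"],
        stdout=subprocess.PIPE,
    )

    # upload_blob accepts a file-like stream, so ffmpeg's stdout is streamed
    # to the container without touching the function's local disk.
    blob = BlobClient.from_connection_string(
        "connection-string",        # placeholder
        container_name="videos",    # placeholder
        blob_name="output.mp4",     # placeholder
    )
    blob.upload_blob(proc.stdout, overwrite=True)
    proc.wait()
If the source video is itself a blob, it can be fed to ffmpeg's stdin in the same way, although MP4 input over a pipe generally only works when the moov atom is at the front of the file.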

How to abort uploading a stream to google storage in Node.js

Interaction with Cloud Storage is performed using the official Node.js Client library.
Output of an external executable (ffmpeg) through fluent-ffmpeg is piped to a writable stream of a Google Cloud Storage object using createWriteStream (https://googleapis.dev/nodejs/storage/latest/File.html#createWriteStream).
The executable (ffmpeg) can end with an error. In this case a file of zero length is created on Cloud Storage.
I want to abort uploading on the command error to avoid finalizing an empty storage object.
What is the proper way of aborting the upload stream?
Current code (just an excerpt):
ffmpeg()
.input(sourceFile.createReadStream())
.output(destinationFile.createWriteStream())
.run();
Files are instances of File (https://cloud.google.com/nodejs/docs/reference/storage/latest/storage/file).

ASP.NET Core Higher memory use uploading files to Azure Blob Storage SDK using v12 compared to v11

I am building a service with an endpoint that images and other files will be uploaded to, and I need to stream the file directly to Blob Storage. This service will handle hundreds of images per second, so I cannot buffer the images into memory before sending it to Blob Storage.
I was following the article here and ran into this comment
Next, using the latest version (v12) of the Azure Blob Storage libraries and a Stream upload method. Notice that it’s not much better than IFormFile! Although BlobStorageClient is the latest way to interact with blob storage, when I look at the memory snapshots of this operation it has internal buffers (at least, at the time of this writing) that cause it to not perform too well when used in this way.
But, using almost identical code and the previous library version that uses CloudBlockBlob instead of BlobClient, we can see a much better memory performance. The same file uploads result in a small increase (due to resource consumption that eventually goes back down with garbage collection), but nothing near the ~600MB consumption like above
I tried this and found that yes, v11 has considerably lower memory usage compared to v12! When I ran my tests with a ~10MB file, each new upload (after the initial POST) increased memory usage by 40MB on v12, while v11 increased it by only 20MB.
I then tried a 100MB file. On v12, memory jumped by roughly 100MB almost instantly on each request and slowly climbed after that, reaching over 700MB after my second upload. Meanwhile, v11 didn't really jump in memory, though it still climbed slowly, and ended at around 430MB after the second upload.
I tried experimenting with the BlobUploadOptions properties InitialTransferSize, MaximumConcurrency, etc., but that only seemed to make things worse.
It seems unlikely that v12 would be straight up worse in performance than v11, so I am wondering what I could be missing or misunderstanding.
Thanks!
Sometimes this issue may occur due to the Azure Blob Storage (v12) libraries.
Try uploading large files in chunks (a technique called file chunking, which breaks the large file into smaller chunks for each upload) instead of uploading the whole file. Please refer to this link.
I tried reproducing the scenario in my lab:
public void uploadfile()
{
    string connectionString = "connection string";
    string containerName = "fileuploaded";
    string blobName = "test";
    string filePath = "filepath";

    BlobContainerClient container = new BlobContainerClient(connectionString, containerName);
    container.CreateIfNotExists();

    // Get a reference to a blob named "sample-file" in a container named "sample-container"
    BlobClient blob = container.GetBlobClient(blobName);

    // Upload local file
    blob.Upload(filePath);
}
The output after uploading the file.

Azure Function App copy blob from one container to another using startCopy in java

I am using Java to write an Azure Function App with an Event Grid trigger, where the trigger event is blob created. So whenever a blob is created the function is triggered, and it copies the blob from one container to another. I am using the startCopy function from com.microsoft.azure.storage.blob. It was working fine, but sometimes it copies files of zero bytes even though they actually contain data at the source location. So at the destination it sometimes dumps zero-byte files. I would like a little help on this so that I can understand how to handle this situation.
CloudBlockBlob cloudBlockBlob = container.getBlockBlobReference(blobFileName);
CloudStorageAccount storageAccountdest = CloudStorageAccount.parse("something");
CloudBlobClient blobClientdest = storageAccountdest.createCloudBlobClient();
CloudBlobContainer destcontainer = blobClientdest.getContainerReference("something");
CloudBlockBlob destcloudBlockBlob = destcontainer.getBlockBlobReference(blobFileName);
destcloudBlockBlob.startCopy(cloudBlockBlob);
Copying blobs across storage accounts is an asynchronous operation. When you call the startCopy method, it just signals Azure Storage to copy a file. The actual copy happens asynchronously and may take some time depending on how large a file you're transferring.
I would suggest that you check the copy operation progress on the target blob to see how many bytes have been copied and if there's a failure in the copy operation. You can do so by fetching the properties of the target blob. A copy operation could potentially fail if the source blob is modified after the copy operation has started by Azure Storage.
I had the same problem, and later figured out from the docs:
Event Grid isn't a data pipeline, and doesn't deliver the actual object that was updated.
Event Grid will tell you that something has changed. The actual message has a size limit, and as long as the data you are copying is within that limit the copy will be successful; if not, the result will be 0 bytes. I was able to copy up to 1 MB, and beyond that it resulted in 0 bytes. You can check whether Azure has increased the size limit recently.
However, if you want to copy the complete data, you need to use Event Hubs or Service Bus. For mine, I went with Service Bus.
