Python Boto3 - upload images to S3 in one put request - python-3.x

I have a script built out in Python that uploads images to S3, one image (one PUT request) at a time. Is it possible to upload all the images to S3 at the same time, using a single PUT request, to save $$ on requests?
for image_id in list_of_images:
    # upload each image with its own PUT request
    filename = id_prefix + "/" + '{0}.jpg'.format(image_id)
    s3.upload_fileobj(buffer, bucket_name, filename,
                      ExtraArgs={"ContentType": "image/jpeg"})

No.
The Amazon S3 API only allows creation of one object per API call.
Your options are to loop through each file (as you have done) or, if you want to make it faster, to use multi-threading to upload multiple files simultaneously and take advantage of more network bandwidth.
If your desire is simply to reduce request costs, do not panic: it is only $0.005 per 1,000 PUT requests.
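A minimal sketch of that multi-threaded approach with Python's concurrent.futures (the bucket name, prefix, image IDs and local file paths below are placeholders):

import concurrent.futures
import boto3

s3 = boto3.client("s3")
bucket_name = "my-bucket"          # placeholder
id_prefix = "account-123"          # placeholder
list_of_images = ["img1", "img2"]  # placeholder image ids

def upload_one(image_id):
    # still one PutObject request per image, but many run at the same time
    key = id_prefix + "/" + "{0}.jpg".format(image_id)
    with open("{0}.jpg".format(image_id), "rb") as body:  # hypothetical local path
        s3.upload_fileobj(body, bucket_name, key,
                          ExtraArgs={"ContentType": "image/jpeg"})

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload_one, list_of_images))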

Good news: it looks like several providers now offer S3-compatible APIs that work with boto3, with much better storage and download (per GB) pricing.
As of a few days ago, Backblaze offers S3-compatible storage and an S3-compatible API.
We ran a few tests on our application, and everything seems to work as advertised!
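If you try that route, switching an existing boto3 script over is mostly a matter of pointing the client at the provider's endpoint. A rough sketch, where the endpoint URL, credentials, bucket and key are placeholders rather than any provider's real values:

import boto3

# same boto3 calls as before, just a different endpoint and credentials
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-provider.com",  # placeholder endpoint
    aws_access_key_id="KEY_ID",                      # placeholder credentials
    aws_secret_access_key="APPLICATION_KEY",
)
s3.upload_file("photo.jpg", "my-bucket", "prefix/photo.jpg",
               ExtraArgs={"ContentType": "image/jpeg"})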

Related

Stream tiny files directly to S3 into a zip with multipart upload utilizing multiple parallel Lambdas using NodeJS

I'll have 100s or 1000s of tiny PDF files that I need to zip into one big zip file and upload to S3. My current solution is as follows:
A NodeJS service sends a request with JSON data describing all the PDF files to create and zip to a Lambda function.
The Lambda function processes the data, creates each PDF file as a buffer, pushes the buffers into a zip archiver, finalizes the archive, and finally streams the zip archive to S3 in chunks through a PassThrough stream.
I basically copied the below solution.
https://gist.github.com/amiantos/16bacc9ed742c91151fcf1a41012445e?permalink_comment_id=3804034#gistcomment-3804034
Although this is a working solution, it's not scalable: creating the PDF buffers, archiving the zip, and uploading to S3 all happen in a single Lambda execution, which takes 20-30 seconds or more depending on the size of the final zip file. I have configured the Lambda with 10 GB of memory and the maximum 15-minute timeout, because roughly every 100 MB of zip requires 1 GB of memory, otherwise the function fails for lack of resources. My zip can sometimes be 800 MB or more, which means it needs 8 GB of memory or more.
I want to use AWS multipart upload and somehow invoke multiple parallel Lambda functions to achieve this. It's fine if I have to split creating the PDF buffers, zipping, and uploading to S3 into separate Lambdas, but I need to optimize this somehow and make it run in parallel.
I've seen this post's answer, which has some nice details and an example, but it seems to be for a single large file.
Stream and zip to S3 from AWS Lambda Node.JS
https://gist.github.com/vsetka/6504d03bfedc91d4e4903f5229ab358c
Any way I can optimize this? Any ideas and suggestions would be great. Keep in mind the end result needs to be one big zip file. Thanks
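For reference, the multipart-upload flow the question is asking about boils down to three calls, sketched here with boto3 (the question itself is NodeJS, but the API is the same): a coordinator starts the upload, each worker (for example a separate Lambda invocation) uploads one part, and the coordinator completes the upload from the collected ETags. The bucket, key and chunks variable are placeholders, and every part except the last must be at least 5 MB.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "archives/big.zip"   # placeholders
chunks = [b"...", b"..."]                       # hypothetical >= 5 MB byte chunks

# 1. coordinator: start the multipart upload
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2. workers: each one (e.g. a separate Lambda) uploads a single part
def upload_part(part_number, data):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=data)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

parts = [upload_part(i + 1, chunk) for i, chunk in enumerate(chunks)]

# 3. coordinator: stitch the parts together into a single S3 object
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})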

Streaming upload to S3 using a presigned URL

I'm trying to upload a file into a customer's S3 bucket. I'm given a presigned URL that allows me to do a PUT request. I have no access to their access key and secret key, so using the AWS SDK is out of the question.
The use case is that I am consuming a gRPC server-streaming call and transforming it into a CSV with some field changes. As the calls come in, I want to stream the transformed gRPC responses into S3. I need to do it via streaming because the response can get rather large, upwards of 100 MB, so loading everything into memory before uploading it to S3 is not ideal. Any ideas?
This is an open issue with pre-signed S3 upload URL:
https://github.com/aws/aws-sdk-js/issues/1603
Currently, the only working solution for large uploads through S3 pre-signed URLs is to use multipart upload. The biggest drawback of that approach is that you need to let the server that signs the upload know the size of your file, because it has to pre-sign each part individually. For example, a 100 MB file split into 5 MB parts (the minimum part size, except for the last part) means 20 parts to pre-sign.
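For what it's worth, the signing side of that workaround is short in boto3: the server starts the multipart upload and pre-signs one upload_part URL per part; the client then PUTs each chunk to its URL without any AWS credentials and reports back the returned ETags. A sketch, with the bucket, key and part count as placeholders:

import boto3

s3 = boto3.client("s3")
bucket, key = "customer-bucket", "exports/data.csv"   # placeholders
part_count = 20                                       # client must report its size up front

upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# one pre-signed URL per part; the client PUTs its chunks to these
urls = [
    s3.generate_presigned_url(
        "upload_part",
        Params={"Bucket": bucket, "Key": key,
                "UploadId": upload_id, "PartNumber": n},
        ExpiresIn=3600,
    )
    for n in range(1, part_count + 1)
]

# once the client reports each part's ETag, the server completes the upload with
# s3.complete_multipart_upload(...), as in the multipart sketch above.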

Refresh every 5 seconds - how to cache s3 files?

I store image files for my user model on S3. My frontend fetches new data from the backend (NodeJS) every 5 seconds. Each of those fetches retrieves all users, which involves getting the image files from S3. Once the application scales, this results in a huge number of requests to S3 and high costs, so I think caching the files on the backend makes sense, since they rarely change once uploaded.
How would I do that? Cache each file on the server's local file system once it has been downloaded from S3, and only download it again if a new upload happened? Or is there a better mechanism for this?
Alternatively, if I set the cache headers on the S3 files, are they still fetched every time I call s3.getObject, or does that already achieve what I'm trying to do?
You were right about the cost, which CloudFront would not improve; my earlier suggestion was misleading.
Back to your problem: you can have the files cached by adding Cache-Control metadata to the objects in the S3 bucket.
For example:
Cache-Control: max-age=172800
You can do that in the console, or through the AWS CLI, for instance.
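With boto3, for instance, you can set that header on an object that is already in the bucket by copying it onto itself with replaced metadata (the bucket, key and content type below are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-user-images", "avatars/user-1.jpg"   # placeholders

# S3 object metadata can't be edited in place, so copy the object over itself
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    CacheControl="max-age=172800",
    ContentType="image/jpeg",
    MetadataDirective="REPLACE",   # required so the new Cache-Control is applied
)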
If you request the files directly and they carry that header, the browser will do a check on the ETag:
Validating cached responses with ETags (TL;DR): The server uses the ETag HTTP header to communicate a validation token. The validation token enables efficient resource update checks: no data is transferred if the resource has not changed.
If you request the files with the s3.getObject method, the request is made anyway, so the file is downloaded again.
Pushing instead of requesting:
If you can't do this, you might want to consider having the backend push only new data to the frontend, instead of the frontend requesting new data every 5 seconds, which would make the load significantly lower.
---
Not so cost-effective, more speed-focused:
You could use CloudFront as a CDN in front of your S3 bucket. This lets you get the files faster, and CloudFront also handles the caching for you.
You would need to set the TTL according to your needs, and you can also invalidate the cache every time you upload a file if necessary.
From the docs:
Storing your static content with S3 provides a lot of advantages. But to help optimize your application’s performance and security while effectively managing cost, we recommend that you also set up Amazon CloudFront to work with your S3 bucket to serve and protect the content. CloudFront is a content delivery network (CDN) service that delivers static and dynamic web content, video streams, and APIs around the world, securely and at scale. By design, delivering data out of CloudFront can be more cost effective than delivering it from S3 directly to your users.

AWS Lambda Function - Image Upload - Process Review

I'm trying to better understand how the overall flow should work with AWS Lambda and my Web App.
I would like the client to upload a file to a public bucket (completely bypassing my API resources), with the client UI putting it into a folder for their account based on a GUID. From there, I have a Lambda function that runs when it detects a change to the public bucket, resizes the file, and places it into the processed bucket.
However, I need to update a row in my RDS Database.
Issue
I'm struggling to understand the best practice for identifying the row to update. Should I upload another file with the necessary details (so that every image upload really consists of two files: an image and a JSON config)? Should the image be processed, and then the client receives some data and makes an API request to update the row in the database? What is the right flow for this step?
Thanks.
You should use a pre-signed URL for the upload. This allows your application to put restrictions on the upload, such as file type, directory and size. It means that, when the file is uploaded, you already know who did the upload. It also prevents people from uploading randomly to the bucket, since it does not need to be public.
The upload can then use an Amazon S3 Event to trigger the Lambda function. The filename/location can be used to identify the user, so the database can be updated at the time that the file is processed.
See: Uploading Objects Using Presigned URLs - Amazon Simple Storage Service
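A minimal sketch of that flow in Python with boto3, where the bucket name, GUID, size limit and content type are placeholders: the pre-signed POST restricts what the browser can upload, and the object key it produces already encodes the account GUID, so the Lambda triggered by the S3 event knows which database row to update.

import boto3

s3 = boto3.client("s3")
user_guid = "3f1c9e2a"                                 # placeholder account GUID

post = s3.generate_presigned_post(
    Bucket="my-upload-bucket",                         # placeholder bucket
    Key=user_guid + "/profile.jpg",                    # key inside the account's folder
    Fields={"Content-Type": "image/jpeg"},
    Conditions=[
        {"Content-Type": "image/jpeg"},                # restrict the file type
        ["content-length-range", 1, 10 * 1024 * 1024], # restrict the size (max 10 MB)
    ],
    ExpiresIn=300,
)
# hand post["url"] and post["fields"] to the browser, which POSTs the file straight to S3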
I'd avoid uploading a file directly to S3, bypassing the API. Uploading the file through your API lets you control the file type, size, etc., and you know exactly who is uploading the file (API auth ID or user ID in the API body). Opening a bucket to public writes is also a security risk.
Your API clients can then upload the file via the API, which can store the file on S3 (triggering another Lambda for processing) and then update your RDS database with the appropriate metadata for that user.

Transferring a very large image from the web to S3 using AWS Lambda

I have access to a 20 GB image file from the web that we'd like to save on S3.
Is it possible to do this with AWS Lambda? From how I understand the situation, the limitations seem to be the following:
The Lambda memory limit (we can't load the whole image into memory)
Now, if we decide to stream from the web to S3 (say using requests.get(image_url, stream=True) or smart_open), there is:
the Lambda reaching its timeout limit, along with
S3 not supporting appends to existing objects. Thus, subsequent Lambda runs that continue "assembling" the image on S3 (where the preceding ones left off) would have to load the partial image that is already on S3 into memory before they could append more data and upload the resulting larger partial image back to S3.
I've also heard others suggest multipart uploads, but I'd be happy to know how that is different from streaming and how it would overcome the limitations listed above.
Thank you!
Things are much simpler with S3.
Create a Lambda to generate pre-signed URLs for a multipart upload.
Create Multipart Upload:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#createMultipartUpload-property
Create Signed URL with the above Multipart Upload Key:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getSignedUrl-property
Use those URLs to upload the parts of your file in parallel.
You can also use S3 Transfer Acceleration for high-speed uploads.
Hope it helps.
EDIT1:
You can split the file into between 1 and 10,000 parts and upload them in parallel.
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
If you are only doing a one-off file upload, you can generate the signed URLs and the multipart upload from the CLI rather than from Lambda.
If you are doing this regularly, you can generate them via Lambda.
When you read the file to upload, if you are reading it via HTTP, read it in chunks and upload each chunk as a part of the multipart upload.
If you are reading the file locally, you can track the starting offset of each chunk and upload the chunks as multipart parts.
Hope it helps.
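Putting those suggestions together, a rough Python sketch of the HTTP-to-multipart case might look like the following (the source URL, bucket and key are placeholders; every part except the last must be at least 5 MB, and a 20 GB transfer may still need the byte ranges split across several Lambda invocations to stay inside the 15-minute limit):

import boto3
import requests

s3 = boto3.client("s3")
url = "https://example.com/huge-image.tif"             # placeholder source
bucket, key = "my-bucket", "images/huge-image.tif"      # placeholders
chunk_size = 100 * 1024 * 1024                          # 100 MB parts (min 5 MB, max 10,000 parts)

upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
parts = []

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    for number, chunk in enumerate(resp.iter_content(chunk_size=chunk_size), start=1):
        # only one chunk is held in memory at a time
        etag = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=number, Body=chunk)["ETag"]
        parts.append({"PartNumber": number, "ETag": etag})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})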
