The current AWS Lambda limit on POST/PUT request payload size is 6 MB.
The current S3 multipart upload limit is a minimum part (chunk) size of 5 MB.
A 5 MB file, once encoded in the upload request, actually occupies more than 6 MB, so the straightforward solution of uploading chunks through a Lambda function doesn't work.
How do I properly implement uploading of large files (more than 5 MB)?
After some struggle I arrived at a solution that uses Object Expiration rules for temporary files: sub-chunks are uploaded first as temporary objects, and once a sub-chunk is uploaded I upload the normal chunk along with the sub-chunk's id; the Lambda function fetches the sub-chunk from the temporary files and concatenates it with the uploaded chunk to form a part of at least 5 MB. This solution works, but I still have to wait for the sub-chunk to finish uploading, so a fully parallel multipart upload will not work this way (we still have to wait for the sub-chunks to be uploaded).
Any better ideas on how to solve this problem given the limits of S3 and AWS Lambda?
Related
I am trying to upload files to an S3 bucket using the Node.js aws-sdk V3.
I know I am supposed to be using the commands CreateMultipartUploadCommand, UploadPartCommand, and so forth, but I can't find any working example of a full multipart upload.
Can anyone share any code samples?
Thanks in advance
I put together a bit of code for this: https://gist.github.com/kbanman/0aa36ffe415cdc6c44293bc3ddb6448e
The idea is to upload a part to S3 whenever we receive a chunk of data in the stream, and then finalize the upload when the stream is finished.
Complicating things is S3's minimum part size of 5MB on all but the last part in the series. This means that we need to buffer data until we can form those 5MB chunks. I accomplished this using a transformer that adds back-pressure on the content stream between each chunk upload.
Parallelization is also made difficult by the fact that S3 insists on receiving the parts in order (despite asking for the parts to be numbered).
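For reference, here is a minimal (non-streaming) sketch of the full CreateMultipartUpload / UploadPart / CompleteMultipartUpload flow with the v3 SDK; the bucket, key, and file path are placeholders, and a real implementation should also add retries:

```javascript
const fs = require("fs");
const {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
  AbortMultipartUploadCommand,
} = require("@aws-sdk/client-s3");

const s3 = new S3Client({});
const PART_SIZE = 5 * 1024 * 1024; // S3 minimum for every part except the last

async function multipartUpload(bucket, key, filePath) {
  // 1. Start the upload and remember the UploadId.
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );
  const parts = [];
  try {
    // 2. Upload the file in >= 5 MB parts, collecting each part's ETag.
    const buffer = fs.readFileSync(filePath); // non-streaming, for simplicity
    for (let offset = 0, partNumber = 1; offset < buffer.length; offset += PART_SIZE, partNumber++) {
      const { ETag } = await s3.send(
        new UploadPartCommand({
          Bucket: bucket,
          Key: key,
          UploadId,
          PartNumber: partNumber,
          Body: buffer.subarray(offset, offset + PART_SIZE),
        })
      );
      parts.push({ ETag, PartNumber: partNumber });
    }
    // 3. Finalize the upload with the ordered part list.
    await s3.send(
      new CompleteMultipartUploadCommand({
        Bucket: bucket,
        Key: key,
        UploadId,
        MultipartUpload: { Parts: parts },
      })
    );
  } catch (err) {
    // Abort so the unfinished parts don't keep accruing storage charges.
    await s3.send(new AbortMultipartUploadCommand({ Bucket: bucket, Key: key, UploadId }));
    throw err;
  }
}
```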
I just started learning about Azure blob storage. I have come across various ways to upload and download the data. One thing that puzzles me is when to use what.
I am mainly interested in PutBlockAsync in conjunction with PutBlockListAsync and UploadFromStreamAsync.
As far as I understand when using PutBlockAsync it is up to the user to break the data into chunks and making sure each chunk is within the Azure block blob size limits. There is an id associated with each chunk that is uploaded. At the end, all the ids are committed.
When using UploadFromStreamAsync, how does this work? Who handles chunking the data and uploading it?
Why not convert the data into a Stream and use UploadFromStreamAsync all the time, avoiding the separate commit step?
You can use Fiddler and observe what happens when you use UploadFromStreamAsync.
If the file is large (more than 256 MB), such as 500 MB, the Put Block and Put Block List APIs are called in the background (they are also the APIs called when you use the PutBlockAsync and PutBlockListAsync methods).
If the file is smaller than 256 MB, then UploadFromStreamAsync will call the Put Blob API in the background.
I used UploadFromStreamAsync to upload a file whose size is 600 MB, then opened Fiddler.
Here are some findings from Fiddler:
1. The large file is broken into small (4 MB) blocks one by one, and the Put Block API is called in the background:
2. At the end, the Put Block List API is called:
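The question is about the .NET client, but for illustration here is a hypothetical sketch of the same explicit Put Block / Put Block List flow using the JavaScript @azure/storage-blob package (BlockBlobClient.stageBlock and commitBlockList map to those two REST calls); the connection string, container, blob, and file path are placeholders:

```javascript
// Hypothetical sketch of explicit block staging, mirroring PutBlockAsync/PutBlockListAsync.
const fs = require("fs");
const { BlobServiceClient } = require("@azure/storage-blob");

const BLOCK_SIZE = 4 * 1024 * 1024; // 4 MB, matching the block size seen in Fiddler

async function uploadInBlocks(connectionString, containerName, blobName, filePath) {
  const blobClient = BlobServiceClient.fromConnectionString(connectionString)
    .getContainerClient(containerName)
    .getBlockBlobClient(blobName);

  const buffer = fs.readFileSync(filePath);
  const blockIds = [];

  // Stage each block under its own id (Put Block). All ids must have the same length.
  for (let offset = 0, i = 0; offset < buffer.length; offset += BLOCK_SIZE, i++) {
    const blockId = Buffer.from(String(i).padStart(6, "0")).toString("base64");
    const block = buffer.subarray(offset, offset + BLOCK_SIZE);
    await blobClient.stageBlock(blockId, block, block.length);
    blockIds.push(blockId);
  }

  // Commit the ids in order (Put Block List) to assemble the blob.
  await blobClient.commitBlockList(blockIds);
}
```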
I am running an AWS Lambda which fetches data from MariaDB and returns the fetched rows as a JSON object.
The total number of items in the JSON array is 64K.
I am getting this error:
{ "error": "body size is too long" }
Is there a way I can send all 64K rows by making a configuration change to the Lambda?
You cannot send the 64K rows (which exceed the 6 MB body payload size limit) by making configuration changes to Lambda. A few alternative options are:
Query the data and build a JSON file with all the rows in the /tmp directory (up to 512 MB) inside Lambda, upload it to S3, and return a CloudFront signed URL to access the data (see the sketch below).
Split the dataset into multiple pages and do multiple queries.
Use an EC2 instance or ECS instead of Lambda.
Note: Depending on the purpose of the queried data, its size, etc., different mechanisms can be used, efficiently combining other AWS services.
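A hypothetical sketch of the first option (Node.js, AWS SDK v3), returning an S3 pre-signed URL rather than a CloudFront signed URL for brevity; the bucket, table, and mysql2 connection settings are placeholders:

```javascript
// Stage the rows in /tmp, upload the file to S3, and return a time-limited URL
// instead of the raw rows, sidestepping the 6 MB response limit.
const fs = require("fs");
const mysql = require("mysql2/promise");
const { S3Client, PutObjectCommand, GetObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");

const s3 = new S3Client({});

exports.handler = async () => {
  // Query all the rows from MariaDB.
  const db = await mysql.createConnection({
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASS,
    database: process.env.DB_NAME,
  });
  const [rows] = await db.query("SELECT * FROM my_table");
  await db.end();

  // Write the full result set to /tmp (limited to 512 MB).
  const filePath = "/tmp/result.json";
  fs.writeFileSync(filePath, JSON.stringify(rows));

  // Upload to S3 and hand back a pre-signed URL.
  const key = `exports/result-${Date.now()}.json`;
  await s3.send(new PutObjectCommand({
    Bucket: process.env.BUCKET,
    Key: key,
    Body: fs.readFileSync(filePath),
    ContentType: "application/json",
  }));
  const url = await getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: process.env.BUCKET, Key: key }),
    { expiresIn: 3600 }
  );

  return { statusCode: 200, body: JSON.stringify({ url }) };
};
```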
This error indicates that your response exceeds the maximum (6 MB), which is the maximum amount of data AWS Lambda can return in a response.
http://docs.aws.amazon.com/lambda/latest/dg/limits.html
It seems that you're hitting the hard limit of a maximum 6 MB response size. As it's a hard limit there's unfortunately no way to increase this.
You'll need to set up your Lambda to send at most 6 MB per response and paginate through the rows across multiple invocations until you've fetched all 64K.
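For example, a hypothetical paginated handler might look like the following; the table name, page size, and mysql2 connection settings are placeholders:

```javascript
// The caller passes a page number and receives only one slice of the 64K rows per invocation.
const mysql = require("mysql2/promise");

const PAGE_SIZE = 5000; // tune so one page stays safely under the 6 MB limit

exports.handler = async (event) => {
  const page = Number(event.page) || 0;
  const db = await mysql.createConnection({
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASS,
    database: process.env.DB_NAME,
  });
  const [rows] = await db.query(
    "SELECT * FROM my_table ORDER BY id LIMIT ? OFFSET ?",
    [PAGE_SIZE, page * PAGE_SIZE]
  );
  await db.end();
  // hasMore tells the caller whether to request the next page.
  return { page, count: rows.length, hasMore: rows.length === PAGE_SIZE, rows };
};
```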
Sources:
https://docs.aws.amazon.com/lambda/latest/dg/limits.html#limits-list
https://forums.aws.amazon.com/thread.jspa?threadID=230229
I'm working with the Node.js AWS SDK for S3. I have a zip file output stream that I'd like to upload to an S3 bucket. It seems simple enough reading the docs, but I noticed there are optional part size and queue size parameters, and I was wondering what exactly these are. Should I use them? If so, how do I determine appropriate values? Much appreciated.
This is a late response.
Multiple parts can be queued and sent in parallel; the size of those parts is the partSize parameter.
The queueSize parameter is how many parts can be processed (uploaded) concurrently.
The maximum memory usage is partSize * queueSize, so I think the values you are looking for depend on the memory available on your machine.
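As an illustration, here is a minimal sketch using the v3 managed uploader (@aws-sdk/lib-storage), which exposes the same partSize and queueSize options as the v2 s3.upload() helper; the bucket and key are placeholders:

```javascript
// Stream a zip to S3 with the managed uploader and tune partSize/queueSize.
// Memory held in flight is roughly partSize * queueSize.
const { S3Client } = require("@aws-sdk/client-s3");
const { Upload } = require("@aws-sdk/lib-storage");

async function uploadZip(zipStream) {
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "archive.zip", Body: zipStream }, // placeholders
    partSize: 10 * 1024 * 1024, // 10 MB parts (minimum allowed is 5 MB)
    queueSize: 4,               // up to 4 parts in flight => ~40 MB buffered
  });
  await upload.done();
}
```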
In a memory-constrained environment (AWS Lambda) I'm trying to do the following:
read JSONs from a queue (AWS SQS). It's not known how many JSONs exist on the queue
consequently content-length isn't known beforehand
each json is an array of objects
combine these jsons in one big array (basically concatting the arrays)
stream the combined json file to S3 while it's still being made.
This is with the goal of keeping memory usage low, even though the entire output file might end up being several GBs.
Although I'm pretty sure S3 can be streamed to using S3 Multipart Upload, it's not clear whether the entire setup would work. Any pointers, or streaming libraries that take care of the plumbing, are highly appreciated.
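One way the setup could be wired together is sketched below, assuming the v3 packages @aws-sdk/client-sqs and @aws-sdk/lib-storage; the queue URL, bucket, and key are placeholders, and treating an empty receive as "queue drained" is a simplification:

```javascript
// Hypothetical sketch: pull JSON arrays from SQS, concatenate them into one big
// JSON array, and stream that to S3 via multipart upload while it is being built.
const { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } = require("@aws-sdk/client-sqs");
const { S3Client } = require("@aws-sdk/client-s3");
const { Upload } = require("@aws-sdk/lib-storage");
const { PassThrough } = require("stream");

// Respect back-pressure so memory stays bounded by the uploader's buffers.
function write(stream, chunk) {
  return new Promise((resolve) =>
    stream.write(chunk) ? resolve() : stream.once("drain", resolve)
  );
}

async function combineQueueToS3(queueUrl, bucket, key) {
  const sqs = new SQSClient({});
  const body = new PassThrough();

  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: bucket, Key: key, Body: body, ContentType: "application/json" },
    partSize: 5 * 1024 * 1024, // minimum part size
    queueSize: 1,              // keep at most one part buffered to stay memory-friendly
  });
  const done = upload.done(); // start consuming the stream right away

  await write(body, "[");
  let first = true;
  for (;;) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl, MaxNumberOfMessages: 10, WaitTimeSeconds: 1,
    }));
    if (!Messages || Messages.length === 0) break; // simplification: empty receive = drained
    for (const msg of Messages) {
      for (const item of JSON.parse(msg.Body)) { // each message body is a JSON array
        await write(body, (first ? "" : ",") + JSON.stringify(item));
        first = false;
      }
      await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle }));
    }
  }
  await write(body, "]");
  body.end();

  await done;
}
```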