I am trying to upload files to an S3 bucket using the Node.js AWS SDK v3.
I know I am supposed to be using the commands CreateMultipartUploadCommand, UploadPartCommand, and so forth, but I can't find any working example of a full multipart upload.
Can anyone share any code samples?
Thanks in advance
I put together a bit of code for this: https://gist.github.com/kbanman/0aa36ffe415cdc6c44293bc3ddb6448e
The idea is to upload a part to S3 whenever we receive a chunk of data in the stream, and then finalize the upload when the stream is finished.
Complicating things is S3's minimum part size of 5MB on all but the last part in the series. This means that we need to buffer data until we can form those 5MB chunks. I accomplished this using a transformer that adds back-pressure on the content stream between each chunk upload.
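Roughly, the shape of such a buffering transform (not the exact code from the gist, just a sketch with my own names and a hard-coded 5 MB minimum) is:

```typescript
import { Transform, TransformCallback } from "stream";

// S3 requires every part except the last one to be at least 5 MB.
const MIN_PART_SIZE = 5 * 1024 * 1024;

// Accumulates incoming chunks and only pushes buffers of at least 5 MB.
// Node's stream machinery applies back-pressure automatically: if the
// consumer is slow to take parts, this transform's buffers fill up and
// the source stream gets paused.
class PartBuffer extends Transform {
  private chunks: Buffer[] = [];
  private size = 0;

  _transform(chunk: Buffer, _enc: BufferEncoding, callback: TransformCallback): void {
    this.chunks.push(chunk);
    this.size += chunk.length;
    if (this.size >= MIN_PART_SIZE) {
      this.push(Buffer.concat(this.chunks));
      this.chunks = [];
      this.size = 0;
    }
    callback();
  }

  _flush(callback: TransformCallback): void {
    // Whatever is left over becomes the (possibly smaller) final part.
    if (this.size > 0) {
      this.push(Buffer.concat(this.chunks));
    }
    callback();
  }
}
```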
Parallelization is also tricky: the parts are numbered, but since they come out of a single stream that has to be read sequentially, they end up being produced (and uploaded) one after another.
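For reference, a minimal sketch of the full v3 command sequence (the bucket, key, and the parts iterable are placeholders, and error handling is reduced to aborting the upload) could look like:

```typescript
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
  AbortMultipartUploadCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// `parts` is assumed to be an async iterable of Buffers of at least 5 MB
// (except possibly the last one), e.g. the output of the buffering
// transform sketched above.
async function multipartUpload(
  bucket: string,
  key: string,
  parts: AsyncIterable<Buffer>
): Promise<void> {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );

  const completed: { ETag?: string; PartNumber: number }[] = [];
  let partNumber = 1;

  try {
    for await (const body of parts) {
      const { ETag } = await s3.send(
        new UploadPartCommand({
          Bucket: bucket,
          Key: key,
          UploadId,
          PartNumber: partNumber,
          Body: body,
        })
      );
      completed.push({ ETag, PartNumber: partNumber });
      partNumber += 1;
    }

    await s3.send(
      new CompleteMultipartUploadCommand({
        Bucket: bucket,
        Key: key,
        UploadId,
        MultipartUpload: { Parts: completed },
      })
    );
  } catch (err) {
    // Clean up the incomplete upload so the orphaned parts don't keep accruing storage.
    await s3.send(
      new AbortMultipartUploadCommand({ Bucket: bucket, Key: key, UploadId })
    );
    throw err;
  }
}
```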
I just started learning about Azure blob storage. I have come across various ways to upload and download the data. One thing that puzzles me is when to use what.
I am mainly interested in PutBlockAsync in conjunction with PutBlockListAsync and UploadFromStreamAsync.
As far as I understand, when using PutBlockAsync it is up to the user to break the data into chunks and to make sure each chunk is within the Azure block blob size limits. There is an id associated with each chunk that is uploaded, and at the end all of the ids are committed.
When using UploadFromStreamAsync, how does this work? Who handles chunking the data and uploading it?
Why not convert the data into a Stream and use UploadFromStreamAsync all the time, avoiding the two-step commit?
You can use Fiddler and observe what happens when you use UploadFromStreamAsync.
If the file is large (more than 256 MB), such as 500 MB, the Put Block and Put Block List APIs are called in the background (they are also the APIs called when you use the PutBlockAsync and PutBlockListAsync methods).
If the file is smaller than 256 MB, UploadFromStreamAsync calls the Put Blob API in the background.
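To make the two-step path concrete: the question is about the .NET SDK, but the same flow in the JavaScript/TypeScript SDK (@azure/storage-blob), where stageBlock and commitBlockList correspond to PutBlockAsync and PutBlockListAsync, looks roughly like this (the container and blob names are placeholders of mine):

```typescript
import { BlobServiceClient } from "@azure/storage-blob";

// Equivalent of PutBlockAsync + PutBlockListAsync: the caller does the
// chunking, assigns an id to every block, and commits the id list at the end.
async function uploadInBlocks(connectionString: string, data: Buffer): Promise<void> {
  const blob = BlobServiceClient.fromConnectionString(connectionString)
    .getContainerClient("my-container")
    .getBlockBlobClient("my-blob.bin");

  const blockSize = 4 * 1024 * 1024; // stay well under the per-block size limit
  const blockIds: string[] = [];

  for (let offset = 0, i = 0; offset < data.length; offset += blockSize, i++) {
    // Block ids must be base64 strings of equal length.
    const blockId = Buffer.from(String(i).padStart(6, "0")).toString("base64");
    const chunk = data.subarray(offset, offset + blockSize);
    await blob.stageBlock(blockId, chunk, chunk.length); // Put Block
    blockIds.push(blockId);
  }

  await blob.commitBlockList(blockIds); // Put Block List
}
```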
I used UploadFromStreamAsync to upload a 600 MB file and watched the traffic in Fiddler.
Here are some findings from Fiddler:
1. The large file is broken into small (4 MB) blocks, which are uploaded one by one via the Put Block API in the background.
2. At the end, the Put Block List API is called.
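For comparison, the single-call path in the TypeScript SDK (an analogue of UploadFromStreamAsync; the 4 MB buffer size and the names below are just illustrative) is simply:

```typescript
import { createReadStream } from "fs";
import { BlobServiceClient } from "@azure/storage-blob";

async function uploadAsStream(connectionString: string, path: string): Promise<void> {
  const blob = BlobServiceClient.fromConnectionString(connectionString)
    .getContainerClient("my-container")
    .getBlockBlobClient("my-blob.bin");

  // uploadStream stages blocks of the given buffer size and commits a block
  // list at the end, the same Put Block / Put Block List sequence seen in Fiddler.
  await blob.uploadStream(createReadStream(path), 4 * 1024 * 1024, 5);
}
```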
The current AWS Lambda limit on a POST/PUT request payload is 6 MB.
The current S3 multipart upload limit is a minimum part size of 5 MB.
A 5 MB chunk, once base64-encoded into the upload request, actually occupies more than 6 MB (base64 grows the payload by roughly a third), so the straightforward solution of uploading chunks to a Lambda function doesn't work.
How do I properly implement uploading of large files (more than 5 MB)?
After some struggling I came up with a solution that uses an Object Expiration rule as a store for temporary files: I first upload a "sub-chunk" to a temporary key; once it is uploaded, I upload the normal chunk along with the sub-chunk's id, fetch the sub-chunk from the temporary files inside the Lambda function, and concatenate the two so that the part being uploaded is at least 5 MB. This solution works, but I still have to wait for the sub-chunk upload to finish, so a fully parallel multipart upload is not possible this way.
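To make that workaround concrete, here is a rough sketch of the Lambda handler side (the event shape, the bucket name, the temp-key convention, and the function names are all assumptions of mine, and transformToByteArray needs a reasonably recent v3 SDK):

```typescript
import { S3Client, GetObjectCommand, UploadPartCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "my-bucket"; // placeholder

// The client has already uploaded a "sub-chunk" to a temporary key that an
// Object Expiration rule will clean up later. This handler concatenates that
// sub-chunk with the chunk in the current request so the resulting part is
// at least 5 MB, then uploads it as one multipart-upload part.
export async function handler(event: {
  uploadId: string;
  key: string;
  partNumber: number;
  subChunkKey: string; // temp key holding the previously uploaded sub-chunk
  chunkBase64: string; // chunk carried in this (< 6 MB) request
}): Promise<{ etag?: string }> {
  const temp = await s3.send(
    new GetObjectCommand({ Bucket: BUCKET, Key: event.subChunkKey })
  );
  const subChunk = Buffer.from(await temp.Body!.transformToByteArray());
  const chunk = Buffer.from(event.chunkBase64, "base64");

  const { ETag } = await s3.send(
    new UploadPartCommand({
      Bucket: BUCKET,
      Key: event.key,
      UploadId: event.uploadId,
      PartNumber: event.partNumber,
      Body: Buffer.concat([subChunk, chunk]),
    })
  );
  return { etag: ETag };
}
```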
Any better ideas for how to work around these S3 and AWS Lambda limits?
I'm working with the Node.js AWS SDK for S3. I have a zip file output stream that I'd like to upload to an S3 bucket. It seems simple enough reading the docs, but I noticed there are optional part size and queue size parameters, and I was wondering what exactly they are. Should I use them? If so, how do I determine appropriate values? Much appreciated.
This is a late response.
Multiple parts can be queued and sent in parallel; the size of these parts is controlled by the partSize parameter.
The queueSize parameter is how many parts can be queued and uploaded at the same time.
The maximum memory usage is partSize * queueSize, so I think the values you are looking for depend on the memory available on your machine.
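Assuming you're on the v3 lib-storage Upload helper (the v2 s3.upload options work the same way), a sketch of how the two parameters are passed, with placeholder bucket/key names:

```typescript
import { createReadStream } from "fs";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

async function uploadZip(): Promise<void> {
  const upload = new Upload({
    client: new S3Client({}),
    params: {
      Bucket: "my-bucket",                    // placeholder
      Key: "archive.zip",                     // placeholder
      Body: createReadStream("archive.zip"),  // or your zip output stream
    },
    partSize: 10 * 1024 * 1024, // each part buffers up to 10 MB (minimum 5 MB)
    queueSize: 4,               // at most 4 parts uploading in parallel
  });

  // Worst-case buffering is roughly partSize * queueSize = 40 MB here.
  await upload.done();
}
```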
In a mem-constrained environment (AWS Lambda) I'm trying to do the following:
read JSONs from a queue (AWS SQS); it's not known how many JSONs exist on the queue
consequently the content-length isn't known beforehand
each JSON is an array of objects
combine these JSONs into one big array (basically concatenating the arrays)
stream the combined JSON file to S3 while it's still being built.
This is with the goal of keeping memory usage low, even though the entire output file might end up being several GB.
Although I'm pretty sure S3 can be streamed to using S3 Multipart Upload, it's not clear whether the entire setup would work. Any pointers, or streaming libraries that take care of the plumbing, are highly appreciated.
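One possible shape for this, purely as a sketch (the queue URL, the naive bracket-stripping concatenation, and the stop condition are all assumptions of mine), is to write into a PassThrough stream while the v3 lib-storage Upload helper drains it into a multipart upload:

```typescript
import { PassThrough } from "stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

const QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

// Write to the stream and wait for 'drain' when its buffer is full, so the
// producer never outruns the upload and memory stays bounded.
async function write(stream: PassThrough, data: string): Promise<void> {
  if (!stream.write(data)) {
    await new Promise<void>((resolve) => stream.once("drain", () => resolve()));
  }
}

export async function combineToS3(bucket: string, key: string): Promise<void> {
  const sqs = new SQSClient({});
  const body = new PassThrough();

  // The Upload helper runs the multipart upload while we keep writing.
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: bucket, Key: key, Body: body },
    partSize: 5 * 1024 * 1024,
    queueSize: 2,
  });

  const produce = (async () => {
    await write(body, "[");
    let first = true;
    for (;;) {
      const { Messages } = await sqs.send(
        new ReceiveMessageCommand({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 10, WaitTimeSeconds: 1 })
      );
      if (!Messages || Messages.length === 0) break; // treat an empty poll as "queue drained"
      for (const msg of Messages) {
        // Each message body is a JSON array; naively strip its outer brackets
        // and append its elements to the combined array.
        const elements = msg.Body!.trim().slice(1, -1).trim();
        if (elements.length > 0) {
          if (!first) await write(body, ",");
          await write(body, elements);
          first = false;
        }
        await sqs.send(
          new DeleteMessageCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle! })
        );
      }
    }
    await write(body, "]");
    body.end();
  })();

  await Promise.all([produce, upload.done()]);
}
```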
In Haskell, I'm processing some data via conduits. During that processing, I want to conditionally store that data in S3. Are there any S3 libraries that will allow me to do this? Effectively, what I want to do is "tee" the pipeline created by the conduit and put the data it contains on S3 while continuing to process it.
I've found the aws library (https://hackage.haskell.org/package/aws), but the functions like multipartUpload take a Source as an argument. Given that I'm already inside the conduit, this doesn't seem like something I can use.
There is now a package, amazonka-s3-streaming, that exposes multipart upload to S3 as a conduit Sink.
This is not really an answer, but merely a hint. amazonka seems to expose the RequestBody of requests from http-client, so in theory it's possible to pipe data there from conduits. Yet it seems that you have to know the digest of the data beforehand.
The question "Can I stream a file upload to S3 without a content-length header?" suggests the same.