How do I read and upload a large file to s3? - node.js

I'm using Node.js 0.10.22 and q-fs.
I'm trying to upload objects to S3, which stopped working once the objects grew past a certain size in MB.
Besides taking up all the memory on my machine, it gives me this error
RangeError: length > kMaxLength
at new Buffer (buffer.js:194:21)
When I try to use fs.read on the file.
Normally, when this works, I do s3.upload, and put the buffer in the Body field.
How do I handle large objects?

You'll want to use the streaming version of the API so you can pipe your readable filesystem stream directly into the HTTP request body of the S3 upload provided by the SDK you are using. Here's an example straight from the aws-sdk documentation:
var AWS = require('aws-sdk');
var fs = require('fs');
var body = fs.createReadStream('bigfile');
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
  .on('httpUploadProgress', function(evt) { console.log(evt); })
  .send(function(err, data) { console.log(err, data); });
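If you need more control over how the managed upload chunks the stream, upload() also accepts an options argument; here's a minimal sketch (the bucket, key, and part sizes are placeholder values):
var fs = require('fs');
var AWS = require('aws-sdk');

var s3 = new AWS.S3();
var body = fs.createReadStream('bigfile');

// partSize: size of each multipart chunk (5 MB minimum)
// queueSize: how many parts are uploaded concurrently
s3.upload(
  {Bucket: 'myBucket', Key: 'myKey', Body: body},
  {partSize: 10 * 1024 * 1024, queueSize: 4},
  function(err, data) { console.log(err, data); }
);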

Related

Use streams in a multipart upload to S3

The current project I'm working on requires that multiple processes upload data to a single file in S3. This data comes from multiple sources in parallel, so in order to process all the sources as fast as possible we'll use multiple Node.js instances to listen to them. There are memory and storage constraints, so loading all ingested data into memory, or storing it on disk and then performing a single upload, is out of the question.
To respect those constraints I implemented a streamed upload: it buffers a small portion of the data from a single source and pipes the buffer to an upload stream. This works really well when using a single Node.js process but, as I mentioned, the goal is to process all sources in parallel. My first try was to open multiple streams to the same object key in the bucket, but this simply overwrites the file with the data from the last stream to close, so I discarded this option.
// code for the scenario above, where each process opens a separate stream to
// the same key and performs its own data ingestion and upload
function openStreamingUpload() {
  const stream = require('stream');
  const AWS = require('aws-sdk');
  const s3 = new AWS.S3(/* s3 config */);
  const passThrough = new stream.PassThrough();
  const params = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    Body: passThrough
  };
  s3.upload(params).promise();
  return passThrough;
}

async function main() { // simulating a "never ending" flow of data
  const upload = openStreamingUpload();
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end();
}
main();
Next I tried the multipart upload that the S3 API offers. First, I create a multipart upload, obtain its ID and store it in a shared space. After that I try to open a part upload on each of the Node.js processes the cluster will be using, all with the same UploadId obtained beforehand. Each of those part uploads should have a stream that pipes the data received. The problem I came across is that the multipart upload needs to know the part length beforehand, and since I'm piping a stream whose end and total amount of data I can't know in advance, it's not possible to calculate its size. Code inspired by this implementation:
const AWS = require('aws-sdk');
const s3 = new AWS.S3(/* s3 config */);

async function startMultipartUpload() {
  const multiPartParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket'
  };
  const multipart = await s3.createMultipartUpload(multiPartParams).promise();
  return multipart.UploadId;
}

async function finishMultipartUpload(multipartUploadId) {
  const finishingParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId
  };
  const data = await s3.completeMultipartUpload(finishingParams).promise();
  return data;
}

function openMultipartStream(multipartUploadId) {
  const stream = require('stream');
  const passThrough = new stream.PassThrough();
  const params = {
    Body: passThrough,
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId,
    PartNumber: undefined // how do I know this part number when it's, in principle, unbounded?
  };
  s3.uploadPart(params).promise();
  return passThrough;
}

// a single process will start the multipart upload
const uploadId = await startMultipartUpload();

async function main() { // simulating a "never ending" flow of data
  const upload = openMultipartStream(uploadId);
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end();
}
main(); // all the processes will receive and upload to the same UploadId
finishMultipartUpload(uploadId); // only the last process to close will finish the multipart upload
Searching around, I came across the article from AWS that presents the upload() API method and says that it abstracts the multipart API to allow piped data streams to be used for uploading large files. So I wonder if it's possible to obtain the UploadId from a streamed 'simple' upload, so I can pass this ID around the cluster, upload to the same object, and still maintain the streaming characteristic. Has anyone ever tried this kind of 'streamed multipart' upload scenario?

Reading tab-separated files in gzip archive on S3 using Lambda (NodeJS)

I have the following use case to solve. I need to ingest data from a S3 bucket using a Lambda function (NodeJS 12). The Lambda function will be triggered when a new file is created. The file is a gz archive and can contain multiple TSV (tab-separated) files. For each row an API call will be triggered from the Lambda function. Questions:
1 - Does it have to be a two-step process: uncompress the archive into a /tmp folder and then read the TSV files? Or can you stream the content of the archive file directly?
2 - Do you have a snippet of code you could share that shows how to stream a GZ file from an S3 bucket and read its content (TSV)? I've found a few examples, but only for pure NodeJS, not for Lambda/S3.
Thanks a lot for your help.
Adding a snippet of code from my first test; it doesn't work. No data is logged to the console:
const csv = require('csv-parser');
const aws = require('aws-sdk');
const s3 = new aws.S3();

exports.handler = async (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const objectKey = event.Records[0].s3.object.key;
  const params = { Bucket: bucket, Key: objectKey };
  var results = [];

  console.log("My File: " + objectKey + "\n");
  console.log("My Bucket: " + bucket + "\n");

  var otherOptions = {
    columns: true,
    auto_parse: true,
    escape: '\\',
    trim: true,
  };

  s3.getObject(params).createReadStream()
    .pipe(csv({ separator: '|' }))
    .on('data', (data) => results.push(data))
    .on('end', () => {
      console.log("My data: " + results);
    });

  return await results;
};
You may want to take a look at the wiki:
Although its file format also allows for multiple [compressed files / data streams] to be concatenated (gzipped files are simply decompressed concatenated as if they were originally one file[5]), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive (also called tarball), and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.
What this means is that by itself, gzip (or a Node package to use it) is not powerful enough to decompress a single .gz file into multiple files. I hope that if a single .gz item in S3 contains more than one file, it's actually a .tar.gz or similar compressed collection. To deal with these, check out
Simplest way to download and unzip files in NodeJS
You may also be interested in node-tar.
In terms of getting just one file out of the archive at a time, this depends on what the compressed collection actually is. Some compression schemes allow extracting just one file at a time; others don't (they require you to decompress the whole thing in one go). Tar does the former.
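As an illustration (not tested against your bucket), here is roughly how you could stream a .tar.gz object out of S3 and walk its entries without touching /tmp, using node-tar's parse stream; the bucket and key names are placeholders, and this assumes the object really is a .tar.gz containing TSV files:
// a minimal sketch: stream S3 -> gunzip -> tar entries
const aws = require('aws-sdk');
const zlib = require('zlib');
const tar = require('tar');

const s3 = new aws.S3();

s3.getObject({ Bucket: 'my-bucket', Key: 'my-archive.tar.gz' })
  .createReadStream()
  .pipe(zlib.createGunzip())   // undo the gzip layer
  .pipe(new tar.Parse())       // walk the tar entries one by one
  .on('entry', (entry) => {
    console.log('found entry:', entry.path);
    let raw = '';
    entry.on('data', (chunk) => { raw += chunk; });
    entry.on('end', () => {
      // raw now holds one TSV file; split into rows/columns as needed
      const rows = raw.split('\n').map((line) => line.split('\t'));
      console.log(entry.path, 'has', rows.length, 'rows');
    });
  });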
The first step should be to decompress the .tar.gz file, using the decompress package:
// typescript code for decompressing a .tar.gz file
const decompress = require("decompress");

try {
  const targzSrc = await s3
    .getObject({
      Bucket: BUCKET_NAME,
      Key: fileRequest.key
    })
    .promise();

  const filesPromise = decompress(targzSrc.Body);
  const outputFileAsString = await filesPromise.then((files: any) => {
    console.log("inflated file:", files["0"].data.toString("utf-8"));
    return files["0"].data.toString("utf-8");
  });

  console.log("And here goes the file content:", outputFileAsString);
  // here should be the code that parses the CSV content using the outputFileAsString
} catch (err) {
  console.log("G_ERR:", err);
}
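To fill in that parsing step at the end, my own sketch (using the csv-parser package from your snippet) would feed the decompressed string back through the parser with a tab separator:
// hypothetical continuation: parse the decompressed TSV string with csv-parser
const csv = require('csv-parser');
const { Readable } = require('stream');

const rows = [];
Readable.from([outputFileAsString])  // wrap the string in a readable stream (Node 12+)
  .pipe(csv({ separator: '\t' }))    // tab-separated values
  .on('data', (row) => rows.push(row))
  .on('end', () => {
    console.log('parsed rows:', rows.length);
    // trigger your API call per row here
  });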

S3 fails to unzip uploaded file

I'm following this example
// Load the stream
var fs = require('fs'), zlib = require('zlib');
var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());

// Upload the stream
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body}, function(err, data) {
  if (err) console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});
And it "works" in that it does everything exactly as expected, EXCEPT that the file arrives on S3 compressed and stays that way.
As far as I can tell there's no automatic facility for unzipping it on S3, so, if your intention is to upload a publicly available image or video (or anything else that the end user is meant to simply consume), the solution appears to be to leave the uploaded file unzipped, like so...
// Load the stream
var fs = require('fs'), zlib = require('zlib');
var body = fs.createReadStream('bigfile'); //.pipe(zlib.createGzip()); <-- removing the zipping part

// Upload the stream
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body}, function(err, data) {
  if (err) console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});
I'm curious if I'm doing something wrong and if there IS an automatic way to have S3 recognize that the file is arriving zipped and unzip it?
The way this works is that S3 has no way of knowing that the file is gzipped without a bit of help. You need to set the metadata on the file when uploading, telling S3 that it's gzipped, and it will do the right thing if this is set.
You need to set Content-Encoding: gzip and Content-Type: <your file type> in the object metadata when uploading.
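With the aws-sdk upload shown above, that would look roughly like this (text/plain is just an assumed example content type):
// Upload the gzipped stream and tell S3 how it is encoded
var fs = require('fs'), zlib = require('zlib');
var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());

var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({
  Body: body,
  ContentEncoding: 'gzip',   // maps to the Content-Encoding header
  ContentType: 'text/plain'  // assumed example; use your real file type
}, function(err, data) {
  if (err) console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});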
Later edit:
Found these, which explain how to do it for CloudFront, but it's basically the same: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#CompressedS3
http://www.cameronstokes.com/2011/07/20/compressed-content-amazon-s3-and-cloudfront/
However, note that as per this blog post, S3 will serve the file gzipped and rely on the browser to unzip it. This works fine in many cases, but as the blogger notes it will fail in curl (since curl will have no idea what to do with the gzipped file). So if your intention is simply to upload a file for raw consumption by the user, your best bet is to skip the gzipping and upload the file in its uncompressed state.

Is it possible to identify the length / size of a readable stream?

I'm trying to resize a file in node using gm.js and use the .stream() to create a readable stream of the resized image. Now I want to upload it using knox.js .putStream() but Content-Length is a required header. Is it possible to identify the size of the readable stream so I can use it in the Content-Length header?
Thanks in advance guys.
If your files aren't too large, you could bufferize your stream using raw-body module before uploading it to S3:
var rawBody = require('raw-body');
var knox = require('knox');

// assumes a knox client configured elsewhere, e.g.:
// var client = knox.createClient({ key: '...', secret: '...', bucket: 'myBucket' });
function putStream(stream, filepath, headers, next) {
  rawBody(stream, function(err, buffer) {
    if (err) return next(err);
    headers['Content-Length'] = buffer.length;
    client.putBuffer(buffer, filepath, headers, next);
  });
}
If your files are extremely large, it may be better to use mscdex's solution with knox-mpu module.

S3 file upload stream using node js

I am trying to find a solution to stream a file to Amazon S3 from a Node.js server, with these requirements:
Don't store a temp file on the server or hold the complete file in memory; buffering up to some limit (but not the complete file) can be used for uploading.
No restriction on uploaded file size.
Don't block the server until the complete file is uploaded, because with a heavy file upload other requests' waiting times would unexpectedly increase.
I don't want to upload directly from the browser because the S3 credentials would need to be shared in that case. Another reason to upload the file from the Node.js server is that some authentication may also need to be applied before uploading the file.
I tried to achieve this using node-multiparty, but it was not working as expected. You can see my solution and issue at https://github.com/andrewrk/node-multiparty/issues/49. It works fine for small files but fails for a file of size 15MB.
Any solution or alternative?
You can now use streaming with the official Amazon SDK for nodejs in the section "Uploading a File to an Amazon S3 Bucket" or see their example on GitHub.
What's even more awesome, you can finally do so without knowing the file size in advance. Simply pass the stream as the Body:
var fs = require('fs');
var zlib = require('zlib');
var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
.on('httpUploadProgress', function(evt) { console.log(evt); })
.send(function(err, data) { console.log(err, data) });
For your information, the v3 SDK was published with a dedicated module to handle this use case: https://www.npmjs.com/package/@aws-sdk/lib-storage
Took me a while to find it.
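For reference, a minimal sketch of that module (bucket, key, and file name are placeholders):
// streaming upload with the v3 SDK's @aws-sdk/lib-storage
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');
const fs = require('fs');

async function uploadBigFile() {
  const upload = new Upload({
    client: new S3Client({}),
    params: {
      Bucket: 'myBucket',
      Key: 'myKey',
      Body: fs.createReadStream('bigfile')  // any readable stream works
    },
    queueSize: 4,               // concurrent part uploads
    partSize: 5 * 1024 * 1024   // 5 MB parts (the minimum)
  });

  upload.on('httpUploadProgress', (progress) => console.log(progress));
  await upload.done();
}
uploadBigFile().catch(console.error);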
Give https://www.npmjs.org/package/streaming-s3 a try.
I used it for uploading several big files in parallel (>500Mb), and it worked very well.
It's very configurable and also allows you to track uploading statistics.
You don't need to know the total size of the object, and nothing is written to disk.
If it helps anyone, I was able to stream from the client to S3 successfully (without memory or disk storage):
https://gist.github.com/mattlockyer/532291b6194f6d9ca40cb82564db9d2a
The server endpoint assumes req is a stream object; I sent a File object from the client, which modern browsers can send as binary data, with the file info set in the headers.
const fileUploadStream = (req, res) => {
  // get "body" args from the header
  const { id, fn } = JSON.parse(req.get('body'));
  const Key = id + '/' + fn; // upload to s3 folder "id" with filename === fn
  const params = {
    Key,
    Bucket: bucketName, // set somewhere
    Body: req, // req is a stream
  };
  s3.upload(params, (err, data) => {
    if (err) {
      res.send('Error Uploading Data: ' + JSON.stringify(err) + '\n' + JSON.stringify(err.stack));
    } else {
      res.send(Key);
    }
  });
};
Yes putting the file info in the headers breaks convention but if you look at the gist it's much cleaner than anything else I found using streaming libraries or multer, busboy etc...
+1 for pragmatism and thanks to @SalehenRahman for his help.
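For completeness, the browser side of that approach amounts to roughly this sketch of my own (the /upload endpoint and the header shape are assumptions mirroring the server code above):
// hypothetical browser-side counterpart: send the File as the raw request body
async function uploadFile(file) {
  const res = await fetch('/upload', {  // assumed endpoint
    method: 'POST',
    headers: {
      // file info travels in a header, matching req.get('body') on the server
      body: JSON.stringify({ id: 'some-folder', fn: file.name }),
    },
    body: file,  // the File object is sent as binary data
  });
  return res.text();  // the server responds with the S3 Key
}

// usage: uploadFile(document.querySelector('input[type=file]').files[0]);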
I'm using the s3-upload-stream module in a working project here.
There are also some good examples from @raynos in his http-framework repository.
Alternatively you can look at https://github.com/minio/minio-js. It has a minimal set of abstracted APIs implementing the most commonly used S3 calls.
Here is an example of streaming upload.
$ npm install minio
$ cat >> put-object.js << EOF
var Minio = require('minio')
var fs = require('fs')

// find out your s3 end point here:
// http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
var s3Client = new Minio({
  url: 'https://<your-s3-endpoint>',
  accessKey: 'YOUR-ACCESSKEYID',
  secretKey: 'YOUR-SECRETACCESSKEY'
})

var fileStream = fs.createReadStream('your_localfile.zip')
fs.stat('your_localfile.zip', function(e, stat) {
  if (e) {
    return console.log(e)
  }
  s3Client.putObject('mybucket', 'hello/remote_file.zip', 'application/octet-stream', stat.size, fileStream, function(e) {
    return console.log(e) // should be null
  })
})
EOF
putObject() here is a fully managed single function call; for file sizes over 5MB it automatically does a multipart upload internally. You can resume a failed upload as well, and it will start from where it left off by verifying the previously uploaded parts.
Additionally, this library is isomorphic and can be used in browsers as well.
