How does createReadStream() work in aws-sdk? - node.js

I'd like to use s3.getObject(params).createReadStream().pipe(writableStream) to write an object from S3 directly to the filesystem (the aim is to lower RAM use).
My concern is really about the price: I have no idea how it works under the hood, and I want to avoid multiple GET calls against S3, which could be expensive.
My guess is that HTTP responses already arrive as a stream of raw data packets and that aws-sdk just wraps that in a Node stream. But another possibility would be that it requests consecutive parts of the object, and therefore makes several GET calls.
My searches were unsuccessful; do you have an idea?

The use of streams doesn't affect the price: getObject() issues a single GET request either way, and createReadStream() just lets you consume that one response incrementally instead of buffering it. It only changes the way you handle the incoming data.
const AWS = require('aws-sdk');
const fs = require('fs');

const s3 = new AWS.S3();
const source = s3.getObject({ Key: 'key', Bucket: 'bucket-name' }).createReadStream();
const target = fs.createWriteStream('/your/local/path/to/store/the/object');

source.on('error', err => {
  console.log('Something went wrong while reading from S3...');
  // Custom code here...
});

source.pipe(target)
  .on('finish', () => { // fs write streams emit 'finish', not 'end'
    console.log('Object stored...');
    // Custom code here...
  })
  .on('error', err => {
    console.log('Something went wrong while writing...');
    // Custom code here...
  });
Hope this helps...
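To address the "several GET calls" worry directly, here is a sketch (reusing the s3 client and fs from the snippet above) of what fetching parts would actually require: you would have to pass an explicit Range yourself, and each such ranged request is billed as its own GET. createReadStream() never does this on its own; it simply streams the body of the single GET response.
// For contrast: an explicitly ranged GET, which createReadStream() does NOT do by itself.
// Each ranged getObject() like this is a separate, separately billed GET request.
const firstMiB = s3.getObject({
  Key: 'key',
  Bucket: 'bucket-name',
  Range: 'bytes=0-1048575' // first MiB of the object only
}).createReadStream();

firstMiB.pipe(fs.createWriteStream('/your/local/path/first-mib-only'));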

Related

Use streams in a multipart upload to S3

The current project I'm working on requires that multiple processes upload data to a single file in S3. This data comes from multiple sources in parallel, so in order to process all the sources as fast as possible we'll use multiple Node.js instances to listen to them. There are memory and storage constraints, so loading all ingested data into memory, or storing it on disk and then performing a single upload, is out of the question.
To respect those constraints I implemented a streamed upload: it buffers a small portion of the data from a single source and pipes the buffer into an upload stream. This works really well when using a single Node.js process, but, as I mentioned, the goal is to process all sources in parallel. My first try was to open multiple streams to the same object key in the bucket. That simply overwrites the file with the data from the last stream to close, so I discarded this option.
// Code for the scenario above, where each process opens a separate stream to
// the same key and performs its own data ingestion and upload.
function openStreamingUpload() {
  const stream = require('stream');
  const AWS = require('aws-sdk');
  const s3 = new AWS.S3(/* s3 config */);
  const passThrough = new stream.PassThrough();
  const params = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    Body: passThrough // upload() consumes whatever is written to the PassThrough
  };
  s3
    .upload(params)
    .promise();
  return passThrough;
}
async function main() { // simulating a "never ending" flow of data
  const stream = openStreamingUpload();
  let data = await receiveData();
  do {
    stream.write(data);
    data = await receiveData();
  } while (data);
  stream.end(); // close the PassThrough so upload() can finish
}
main();
Next I tried the multipart upload that the S3 API offers. First, I create a multipart upload, obtain its ID and store it in a shared space. After that I try to open multiple multipart uploads on all the Node.js processes the cluster will be using, with the same UploadId obtained beforehand. Each one of those multipart uploads should have a stream that pipes the data received. The problem I came across was that a multipart upload needs to know the part length beforehand, and since I'm piping a stream whose end time and data volume I don't know in advance, it's not possible to calculate its size. Code inspired by this implementation.
const AWS = require('aws-sdk');
const s3 = new AWS.S3(/* s3 config */);

async function startMultipartUpload() {
  const multiPartParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket'
  };
  const multipart = await s3.createMultipartUpload(multiPartParams).promise();
  return multipart.UploadId;
}

async function finishMultipartUpload(multipartUploadId) {
  const finishingParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId
  };
  const data = await s3.completeMultipartUpload(finishingParams).promise();
  return data;
}

async function openMultipartStream(multipartUploadId) {
  const stream = require('stream');
  const passThrough = new stream.PassThrough();
  const params = {
    Body: passThrough,
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId,
    PartNumber: 1 // placeholder - how do I know this part number when it's, in principle, unbounded?
  };
  s3
    .uploadPart(params)
    .promise();
  return passThrough;
}

// a single process will start the multipart upload
const uploadId = await startMultipartUpload();
async function main() { // simulating a "never ending" flow of data
  const stream = await openMultipartStream(uploadId);
  let data = await receiveData();
  do {
    stream.write(data);
    data = await receiveData();
  } while (data);
  stream.end();
}
main(); // all the processes will receive and upload to the same UploadId
finishMultipartUpload(uploadId); // only the last process to close will finish the multipart upload.
Searching around, I came across an article from AWS that presents the upload() API method and says that it abstracts the multipart API to allow piped data streams to be used to upload large files. So I wonder if it's possible to obtain the UploadId from a streamed 'simple' upload, so I can pass this ID around the cluster, upload to the same object, and still maintain the streaming characteristic. Has anyone ever tried this type of 'streamed multipart' upload scenario?
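For reference, a minimal sketch of how that managed upload() call is usually wired to a stream (aws-sdk v2; the partSize and queueSize values are only illustrative): the SDK buffers chunks of the PassThrough, assigns the part numbers itself, and runs the part uploads concurrently under a multipart upload that it creates and completes internally.
const stream = require('stream');
const AWS = require('aws-sdk');

const s3 = new AWS.S3(/* s3 config */);
const passThrough = new stream.PassThrough();

const managedUpload = s3.upload(
  {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    Body: passThrough
  },
  {
    partSize: 10 * 1024 * 1024, // buffer roughly 10 MB of the stream per part
    queueSize: 4                // upload up to 4 parts concurrently
  }
);

managedUpload.promise()
  .then(data => console.log('Upload finished:', data.Location))
  .catch(err => console.error('Upload failed:', err));

// Everything written to passThrough is split into numbered parts by the SDK,
// under a multipart upload it creates and completes on its own.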

Node, Express, and parsing streamed JSON in endpoint without blocking thread

I'd like to provide an endpoint in my API to allow third-parties to send large batches of JSON data. I'm free to define the format of the JSON objects, but my initial thought is a simple array of objects:
[{"id":1, "name":"Larry"}, {"id":2, "name":"Curly"}, {"id":3, "name":"Moe"}]
As there could be any number of these objects in the array, I'd need to stream this data in, read each of these objects as they're streamed in, and persist them somewhere.
TL;DR: Stream a large array of JSON objects from the body of an Express POST request.
It's easy to get the most basic of examples out there working as all of them seem to demonstrate this idea using "fs" and working w/ the filesystem.
What I've been struggling with is the Express implementation of this. At this point, I think I've got this working using the "stream-json" package:
const express = require("express");
const router = express.Router();
const StreamArray = require("stream-json/streamers/StreamArray");

router.post("/filestream", (req, res, next) => {
  const stream = StreamArray.withParser();
  req.pipe(stream).on("data", ({ key, value }) => {
    console.log(key, value);
  }).on("finish", () => {
    console.log("FINISH!");
  }).on("error", e => {
    console.log("Stream error :(");
  });
  res.status(200).send("Finished successfully!");
});
I end up with a proper readout of each object as it's parsed by stream-json. The problem seems to be that the thread gets blocked while the processing is happening. I can hit this endpoint once and immediately get the 200 response, but a second hit blocks until the first batch finishes, even though the second one also starts processing.
Is there any way to do something like this without spawning a child process or similar? I'm unsure what to do here so that the endpoint can continue to receive requests while streaming/parsing the individual JSON objects.
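Not a definitive answer, just a minimal sketch of one way to keep the event loop breathing while parsing, assuming a hypothetical async saveRecord() persistence helper: iterating the parsed stream with for await...of and awaiting each persistence call yields control back between items (note that, unlike the snippet above, the 200 is only sent once the whole stream has been consumed).
const express = require('express');
const StreamArray = require('stream-json/streamers/StreamArray');

const router = express.Router();

router.post('/filestream', async (req, res, next) => {
  try {
    const pipeline = req.pipe(StreamArray.withParser());
    for await (const { key, value } of pipeline) {
      await saveRecord(value); // hypothetical async persistence call (DB insert, queue, ...)
    }
    res.status(200).send('Finished successfully!');
  } catch (e) {
    next(e);
  }
});

module.exports = router;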

How to close an AWS S3 read stream (AWSJavaScriptSDK)

I have an AWS S3 object and a read stream created on it like this:
const s3 = new AWS.S3();
const readStream = s3
  .getObject(params)
  .createReadStream()
  .on('error', err => {
    // do something
  });
Now when the stream is not read to the end (e.g. the streaming is aborted by the client), after 120 seconds the error event is triggered with: TimeoutError: Connection timed out after 120000ms
How can I close the stream (or the entire S3 object)?
I tried readStream.destroy(), which is documented here, but it does not work.
I was looking for a solution to a similar case and bumped into this thread.
There is an AWS Request abort method, documented here, which allows you to cancel the request without receiving all the data (a similar concept to aborting a Node http request).
Your code should look somewhat like this:
const s3 = new AWS.S3();
const request = s3.getObject(params);
const readStream = request.createReadStream()
  .on('error', err => {
    request.abort(); // and do something else also...
  });
It may be on error, but in my case I'm fetching data and I want to stop streaming when I've reached a certain point (i.e. I found specific data in the file and it's only a matter of checking whether it exists; I don't need anything else).
The same approach works with the request and node-fetch modules as well.
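For that use case, a minimal sketch (reusing the params from the snippet above, with a hypothetical 'needle' search term) of aborting as soon as the data you are scanning for shows up:
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const request = s3.getObject(params); // same params as above
const readStream = request.createReadStream();

let found = false;
readStream.on('data', chunk => {
  // naive check: does not handle the needle spanning two chunks
  if (!found && chunk.includes('needle')) { // 'needle' is a hypothetical search term
    found = true;
    request.abort(); // stop downloading the rest of the object
  }
});
readStream.on('error', err => {
  if (!found) console.error(err); // abort() itself surfaces an error here; ignore it when intentional
});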

S3 file upload stream using node js

I am trying to find a solution to stream a file to Amazon S3 using a Node.js server, with these requirements:
Don't store a temp file on the server or in memory. Buffering up to some limit (but not the complete file) is acceptable during the upload.
No restriction on uploaded file size.
Don't freeze the server until the complete file is uploaded, because with a heavy file upload other requests' waiting times would unexpectedly increase.
I don't want to use direct file upload from the browser because the S3 credentials would need to be shared in that case. Another reason to upload the file from the Node.js server is that some authentication may also need to be applied before uploading the file.
I tried to achieve this using node-multiparty, but it was not working as expected. You can see my solution and issue at https://github.com/andrewrk/node-multiparty/issues/49. It works fine for small files but fails for a file of size 15MB.
Any solution or alternative?
You can now use streaming with the official Amazon SDK for Node.js: see the section "Uploading a File to an Amazon S3 Bucket" or their example on GitHub.
What's even more awesome, you can finally do so without knowing the file size in advance. Simply pass the stream as the Body:
var AWS = require('aws-sdk');
var fs = require('fs');
var zlib = require('zlib');

var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
  .on('httpUploadProgress', function(evt) { console.log(evt); })
  .send(function(err, data) { console.log(err, data) });
For your information, the v3 SDK was published with a dedicated module to handle this use case: https://www.npmjs.com/package/@aws-sdk/lib-storage
Took me a while to find it.
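For anyone on SDK v3, a minimal sketch of that module in use (bucket and key names are placeholders, mirroring the v2 example above); the Upload helper takes a stream of unknown length as Body and manages the multipart parts itself.
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');
const fs = require('fs');
const zlib = require('zlib');

// Stream of unknown length, same idea as the v2 example above
const body = fs.createReadStream('bigfile').pipe(zlib.createGzip());

const upload = new Upload({
  client: new S3Client({}),
  params: { Bucket: 'myBucket', Key: 'myKey', Body: body },
});

upload.on('httpUploadProgress', progress => console.log(progress));

upload.done()
  .then(data => console.log('Upload finished', data))
  .catch(err => console.error(err));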
Give https://www.npmjs.org/package/streaming-s3 a try.
I used it for uploading several big files in parallel (>500MB), and it worked very well.
It's very configurable and also allows you to track upload statistics.
You don't need to know the total size of the object, and nothing is written to disk.
If it helps anyone, I was able to stream from the client to S3 successfully (without memory or disk storage):
https://gist.github.com/mattlockyer/532291b6194f6d9ca40cb82564db9d2a
The server endpoint assumes req is a stream object. I sent a File object from the client, which modern browsers can send as binary data, with the file info set in the headers.
const fileUploadStream = (req, res) => {
  // get "body" args from header
  const { id, fn } = JSON.parse(req.get('body'));
  const Key = id + '/' + fn; // upload to s3 folder "id" with filename === fn
  const params = {
    Key,
    Bucket: bucketName, // set somewhere
    Body: req, // req is a stream
  };
  s3.upload(params, (err, data) => {
    if (err) {
      res.send('Error Uploading Data: ' + JSON.stringify(err) + '\n' + JSON.stringify(err.stack));
    } else {
      res.send(Key);
    }
  });
};
Yes, putting the file info in the headers breaks convention, but if you look at the gist it's much cleaner than anything else I found using streaming libraries, multer, busboy, etc.
+1 for pragmatism and thanks to @SalehenRahman for his help.
I'm using the s3-upload-stream module in a working project here.
There are also some good examples from @raynos in his http-framework repository.
Alternatively you can look at https://github.com/minio/minio-js. It has a minimal set of abstracted APIs implementing the most commonly used S3 calls.
Here is an example of streaming upload.
$ npm install minio
$ cat >> put-object.js << EOF
var Minio = require('minio')
var fs = require('fs')
// find out your s3 end point here:
// http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
var s3Client = new Minio({
  url: 'https://<your-s3-endpoint>',
  accessKey: 'YOUR-ACCESSKEYID',
  secretKey: 'YOUR-SECRETACCESSKEY'
})
var file = 'your_localfile.zip';
var fileStream = fs.createReadStream(file);
fs.stat(file, function(e, stat) {
  if (e) {
    return console.log(e)
  }
  s3Client.putObject('mybucket', 'hello/remote_file.zip', 'application/octet-stream', stat.size, fileStream, function(e) {
    return console.log(e) // should be null
  })
})
EOF
putObject() here is a fully managed single function call: for file sizes over 5MB it automatically does a multipart upload internally. You can resume a failed upload as well, and it will start from where it left off by verifying the previously uploaded parts.
Additionally, this library is isomorphic and can be used in browsers as well.

Advice: flatiron, formidable and aws s3

I'm new to server-side programming with Node.js. I'm sticking together a tiny webapp with it right now and doing the usual startup learning. The following piece of code WORKS. But I would love to know if it's more or less the right way to do a simple file upload from a form and throw it into AWS S3:
app.router.post('/form', { stream: true }, function () {
  var req = this.req,
      res = this.res,
      form = new formidable.IncomingForm();
  form
    .parse(req, function(err, fields, files) {
      console.log('Parsed file upload' + err);
      if (err) {
        res.end('error: Upload failed: ' + err);
      } else {
        var img = fs.readFileSync(files.image.path);
        var data = {
          Bucket: 'le-bucket',
          Key: files.image.name,
          Body: img
        };
        s3.client.putObject(data, function() {
          console.log("Successfully uploaded data to myBucket/myKey");
        });
        res.end('success: Uploaded file(s)');
      }
    });
});
Note: I had to turn buffer off in union / flatiron.plugins.http.
What I would like to learn is when to stream-load a file and when to sync-load it. It will be a really tiny webapp with little traffic.
If it's more or less good, then please consider this a token of working code, which I would also throw into a gist. It's not that easy to find documentation and working examples of this kind of stuff. I like flatiron a lot, but its small-module approach leads to lots of scattered docs and examples all over the net, to say nothing of tutorials.
You should use a module other than formidable because, as far as I know, formidable does not have an S3 storage option, so you would have to save the files on your server before uploading them.
I would recommend using multiparty.
Use this example in order to upload directly to S3 without saving the file locally on your server.
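As a rough illustration of that approach, here is a minimal sketch (assumptions: the flatiron-style route and bucket name are copied from the question above, and an AWS.S3 client is already configured); each multiparty part is itself a readable stream, so it can go straight into s3.upload() as Body without touching the local disk.
const multiparty = require('multiparty');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

app.router.post('/form-s3', { stream: true }, function () {
  const req = this.req, res = this.res;
  const form = new multiparty.Form();

  form.on('part', part => {
    if (!part.filename) return part.resume(); // ignore non-file fields
    s3.upload({
      Bucket: 'le-bucket',   // bucket name taken from the question
      Key: part.filename,
      Body: part             // the part is a readable stream
    }, (err, data) => {
      if (err) return res.end('error: Upload failed: ' + err);
      res.end('success: Uploaded ' + data.Location);
    });
  });

  form.on('error', err => res.end('error: ' + err));

  form.parse(req);
});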
