I have an AWS S3 object and a read stream created on it like this:
const s3 = new AWS.S3();
const readStream = s3
.getObject(params)
.createReadStream()
.on('error', err => {
// do something
});
Now, when the stream is not read to the end (e.g. because the client aborts the download), the error event is triggered after 120 seconds with: TimeoutError: Connection timed out after 120000ms
How can I close the stream (or the entire S3 object)?
I tried readStream.destroy(), which is documented here, but it does not work.
I was looking for a solution to a similar case and bumped into this thread.
There is an AWS Request abort() method, documented here, which allows you to cancel the request without receiving all of the data (it's a similar concept to aborting Node's http request).
Your code should look somewhat like this:
const s3 = new AWS.S3();
const request = s3.getObject(params);
const readStream = request.createReadStream()
.on('error', err => {
request.abort(); // and do something else also...
});
You can call it in the error handler, but in my case I call it once I've read up to a certain point, i.e. I've found the specific data I was looking for in the file and only needed to check that it exists, so I don't need anything else.
The same idea works with the request and node-fetch modules as well; a sketch for node-fetch follows below.
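As a hedged illustration (my own sketch, not from the original answer), the node-fetch equivalent of aborting early could look roughly like this, assuming Node 15+ where AbortController is a global and node-fetch v2, where res.body is a Node readable stream:
const fetch = require('node-fetch');

async function containsMarker(url, marker) {
  const controller = new AbortController();
  const res = await fetch(url, { signal: controller.signal });
  let buffered = '';
  try {
    for await (const chunk of res.body) {
      buffered += chunk.toString();
      if (buffered.includes(marker)) {
        controller.abort(); // the next read fails with an AbortError and ends the loop
      }
    }
  } catch (err) {
    // node-fetch reports the abort as an AbortError on the body stream
    if (err.name !== 'AbortError') throw err;
  }
  return buffered.includes(marker);
}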
The current project I'm working on requires that multiple processes upload data to a single file in S3. This data comes from multiple sources in parallel, so to process all the sources as fast as possible we'll use multiple Node.js instances to listen to them. There are memory and storage constraints, so loading all the ingested data into memory, or storing it on disk and then performing a single upload, is out of the question.
To respect those constraints I implemented a streamed upload: it buffers a small portion of the data from a single source and pipes the buffer into an upload stream. This works really well with a single Node.js process but, as I mentioned, the goal is to process all sources in parallel. My first try was to open multiple streams to the same object key in the bucket, but this simply overwrites the file with the data from the last stream to close, so I discarded that option.
// code for the scenario above, where each process opens a separate stream to
// the same key and performs its own data ingestion and upload.
const stream = require('stream');
const AWS = require('aws-sdk');

function openStreamingUpload() {
  const s3 = new AWS.S3(/* s3 config */);
  const passThrough = new stream.PassThrough();
  const params = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    Body: passThrough
  };
  s3
    .upload(params)
    .promise();
  return passThrough;
}

async function main() { // simulating a "never ending" flow of data
  const upload = openStreamingUpload();
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end(); // end the PassThrough so the upload can finish
}

main();
Next I tried the multipart upload that the S3 API offers. First I create a multipart upload, obtain its ID and store it in a shared space. After that, each Node.js process in the cluster opens a part upload against that same UploadId, and each of those part uploads gets a stream that pipes the data it receives. The problem I came across is that uploading a part requires knowing the part length beforehand, and since I'm piping a stream whose end (and therefore total size) I don't know in advance, it's not possible to calculate it. Code inspired by this implementation.
const stream = require('stream');
const AWS = require('aws-sdk');
const s3 = new AWS.S3(/* s3 config */);

async function startMultipartUpload() {
  const multiPartParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket'
  };
  const multipart = await s3.createMultipartUpload(multiPartParams).promise();
  return multipart.UploadId;
}

async function finishMultipartUpload(multipartUploadId) {
  const finishingParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId
  };
  const data = await s3.completeMultipartUpload(finishingParams).promise();
  return data;
}

function openMultipartStream(multipartUploadId) {
  const passThrough = new stream.PassThrough();
  const params = {
    Body: passThrough,
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId,
    PartNumber: 1 // how do I know this part number when it's, in principle, unbounded?
  };
  s3
    .uploadPart(params)
    .promise();
  return passThrough;
}

// a single process will start the multipart upload
const uploadId = await startMultipartUpload();

async function main() { // simulating a "never ending" flow of data
  const upload = openMultipartStream(uploadId);
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end();
}

main(); // all the processes will receive and upload to the same UploadId
finishMultipartUpload(uploadId); // only the last process to close will finish the multipart upload
Searching around, I came across an article from AWS that presents the upload() API method and says it abstracts the multipart API to allow piped data streams to be used for uploading large files. So I wonder if it's possible to obtain the UploadId from a streamed 'simple' upload, so I can pass this ID around the cluster, upload to the same object and still keep the streaming characteristics. Has anyone ever tried this kind of 'streamed multipart' upload scenario?
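For reference, here is a hedged sketch of the upload() abstraction mentioned above, as a single process would use it. partSize and queueSize are the managed uploader's documented options in the v2 SDK; receiveData() stands in for the same hypothetical data source as in the snippets above, and as far as I can tell the managed uploader does not expose its UploadId for other processes to join.
const stream = require('stream');
const AWS = require('aws-sdk');

const s3 = new AWS.S3(/* s3 config */);
const passThrough = new stream.PassThrough();

// The managed uploader splits the piped stream into parts internally,
// so no PartNumber bookkeeping is needed.
const managedUpload = s3.upload(
  { Bucket: 'my-bucket', Key: 'final-s3-file.txt', Body: passThrough },
  { partSize: 5 * 1024 * 1024, queueSize: 4 } // 5 MB parts, 4 concurrent part uploads
);

managedUpload.promise().then(
  data => console.log('upload finished:', data.Location),
  err => console.error('upload failed:', err)
);

async function main() { // same "never ending" flow of data as before
  let data = await receiveData();
  do {
    passThrough.write(data);
    data = await receiveData();
  } while (data);
  passThrough.end(); // ending the stream lets the managed upload complete
}

main();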
I am working on getting files from an SFTP server and piping the data to Box.com using their SDK. The Box SDK takes a readable stream as a parameter for uploading a file. The code I have written to fetch the files from the SFTP server uses the npm module ssh2-sftp-client.
The issue I am having is that a writable stream is "the end of the line" for stream data, unless you use something like a Transform, which is a Duplex and implements both the readable and writable sides. Below is the code I am using. Because I am working on this for a client, I am intentionally leaving out some things that are not necessary.
Below is the method on the SFTP class:
async getFile(filepath: string): Promise<Readable> {
logger.info(`Fetching file: ${filepath}`);
const writable = new Writable();
const stream = new PassThrough();
await this.client.get(filepath, writable);
return writable.pipe(stream);
}
The implementation that gets a file and attempts to pipe it to Box, where box is an instance of an authorized BoxSDK client:
try {
for (const filename of filenames) {
const stream: Readable = await tmsClient.getFile(
'redacted' + filename,
);
logger.info(`Piping ${filename} to Box...`);
await box.createFile(filename, 'redacted', stream);
logger.info(`${filename} successfully downloaded`);
}
} catch (error) {
logger.error(`Failed to move files: ${error}`);
}
I am not super well versed in streams but based on my research I think this should work in theory.
I have also tried an implementation where the SSH client returns a buffer and I then wrap that buffer in a readable stream. With this implementation, though, I keep getting errors from the Box SDK that the stream ended unexpectedly.
async getFile(filepath: string): Promise<Readable> {
logger.info(`Fetching file: ${filepath}`);
const stream = new Readable();
const buffer = (await this.client.get(filepath)) as Buffer;
stream._read = (): void => {
stream.push(buffer);
stream.push(null);
};
return stream;
}
And the error message: 2020-02-06 15:24:57 error: Failed to move files: Error: Unexpected API Response [400 Bad Request] bad_request - Stream ended unexpectedly.
Any insight is greatly appreciated!
So after doing some more research into this, it turns out that the issue is actually with the Box SDK for Node. The SDK terminates the body of the stream before it is actually done, because under the hood it uses the request library, which requires a content-length header to send large payloads. Without that header in place it will keep terminating the stream before the payload is sent.
On the Box community forum they suggest adding properties to the stream prototype to pass values through to the underlying request library. I STRONGLY disagree with this because it is not the correct way to go about it: the Box SDK needs to provide a way to pass in the length of the content in bytes. As the user of their API I should not have to manipulate their underlying dependencies. I am going to open an issue with their SDK and hopefully get this fixed.
Hope this is useful to someone else!
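As a hedged aside (an editor's sketch, not part of the answer above): if the missing content length is the root cause, ssh2-sftp-client exposes a stat() call that reports the remote file's attributes, including its size, so the byte length can be known before the transfer starts and handed to whatever upload path eventually accepts it. A minimal TypeScript sketch, reusing the getFile shape from the question:
// assumes: import { PassThrough, Readable } from 'stream';
async getFileWithSize(filepath: string): Promise<{ stream: Readable; size: number }> {
  const { size } = await this.client.stat(filepath); // remote file attributes
  const stream = new PassThrough();
  // not awaited, so the download and the upstream consumer can overlap
  this.client.get(filepath, stream).catch(err => stream.destroy(err));
  return { stream, size };
}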
I'd like to use s3.getObject(params).createReadStream().pipe(writableStream) to write an object from S3 directly to the filesystem (the aim is to lower RAM usage).
My concern is really about the price: I have no idea how this works under the hood and I want to avoid multiple GET calls against S3, which may be expensive.
My guess is that the HTTP response already arrives as a stream of raw data packets and aws-sdk just wraps it in a Node stream. But another possibility would be that it requests consecutive parts of the object and therefore issues several GET calls.
My searches were unsuccessful, do you have an idea?
The use of streams doesn't affect the price, it just changes the way you are going to handle the incoming data.
const s3 = new AWS.S3();
const source = s3.getObject({ Key: 'key', Bucket: 'bucket-name' }).createReadStream();
const target = fs.createWriteStream('/your/local/path/to/store/the/object');
source.pipe(target).on('finish', () => { // write streams emit 'finish' when done, not 'end'
console.log('Object stored...')
// Custom code here...
})
.on('error', err => {
console.log('Something went wrong...')
// Custom code here...
})
Hope this helps...
I have a Firebase cloud function that uses Express to stream a zip file of images to the client. When I test the cloud function locally it works fine. When I deploy it to Firebase I get this error:
Error: Can't set headers after they are sent.
What could be causing this error? Memory limit?
export const zipFiles = async(name, params, response) => {
const zip = archiver('zip', {zlib: { level: 9 }});
const [files] = await storage.bucket(bucketName).getFiles({prefix:`${params.agent}/${params.id}/deliverables`});
if(files.length){
response.attachment(`${name}.zip`);
response.setHeader('Content-Type', 'application/zip');
response.setHeader('Access-Control-Allow-Origin', '*')
zip.pipe(output);
response.on('close', function() {
return output.send('OK').end(); // <--this is the line that fails
});
files.forEach((file, i) => {
const reader = storage.bucket(bucketName).file(file.name).createReadStream();
zip.append(reader, {name: `${name}-${i+1}.jpg`});
});
zip.finalize();
}else{
output.status(404).send('Not Found');
}
What Frank said in the comments is true. You need to decide all your headers, including the HTTP response status, before you start sending any of the content body.
If you intend to express that you're sending a successful response, simply say output.status(200) in the same way that you did for your 404 error. Do that up front. When you're piping a response, you don't need to do anything to close the response in the end. When the pipe is done, the response will automatically be flushed and finalized. You're only supposed to call end() when you want to bail out early without sending a response at all.
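A minimal sketch of that ordering, reusing the names from the question (and treating output and response as the same Express response object, since the original snippet mixes the two):
// Decide the status and headers first, then pipe; no manual end() needed.
if (files.length) {
  response.status(200);
  response.attachment(`${name}.zip`);
  response.setHeader('Content-Type', 'application/zip');
  response.setHeader('Access-Control-Allow-Origin', '*');
  zip.pipe(response); // the response ends when the zip stream finishes
  files.forEach((file, i) => {
    const reader = storage.bucket(bucketName).file(file.name).createReadStream();
    zip.append(reader, { name: `${name}-${i + 1}.jpg` });
  });
  zip.finalize();
} else {
  response.status(404).send('Not Found');
}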
Bear in mind that Cloud Functions only supports a maximum payload of 10MB (read more about limits), so if you're trying to zip up more than that in total, it won't work. In fact, there is no "streaming" or chunked response at all: the entire payload is built in memory and transferred out as a unit.
I am trying to find a solution for streaming file uploads to Amazon S3 through a Node.js server, with these requirements:
Don't store a temp file on the server or in memory. Buffering up to some limit is fine, but never the complete file.
No restriction on uploaded file size.
Don't block the server until the file upload is complete, because with a heavy file upload other requests' waiting time would unexpectedly increase.
I don't want to upload the file directly from the browser, because the S3 credentials would need to be shared in that case. Another reason to upload from the Node.js server is that some authentication may also need to be applied before uploading the file.
I tried to achieve this using node-multiparty, but it was not working as expected. You can see my attempt and the issue at https://github.com/andrewrk/node-multiparty/issues/49. It works fine for small files but fails for a file of around 15MB.
Any solution or alternative?
You can now use streaming with the official Amazon SDK for Node.js: see the section "Uploading a File to an Amazon S3 Bucket" in the docs, or their example on GitHub.
What's even better: you can finally do so without knowing the file size in advance. Simply pass the stream as the Body:
var AWS = require('aws-sdk');
var fs = require('fs');
var zlib = require('zlib');
var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
.on('httpUploadProgress', function(evt) { console.log(evt); })
.send(function(err, data) { console.log(err, data) });
For your information, the v3 SDK was published with a dedicated module to handle this use case: https://www.npmjs.com/package/@aws-sdk/lib-storage
Took me a while to find it.
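A hedged sketch of what that can look like with the v3 module, based on the package's documented Upload class (the bucket, key and file names here are placeholders):
// npm install @aws-sdk/client-s3 @aws-sdk/lib-storage
const fs = require('fs');
const zlib = require('zlib');
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');

const body = fs.createReadStream('bigfile').pipe(zlib.createGzip());

const upload = new Upload({
  client: new S3Client({}),
  params: { Bucket: 'myBucket', Key: 'myKey', Body: body },
  // optional tuning, analogous to the v2 managed uploader
  partSize: 5 * 1024 * 1024,
  queueSize: 4,
});

upload.on('httpUploadProgress', progress => console.log(progress));

upload.done()
  .then(result => console.log('done', result))
  .catch(err => console.error(err));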
Give https://www.npmjs.org/package/streaming-s3 a try.
I used it to upload several big files (>500 MB) in parallel, and it worked very well.
It is very configurable and also lets you track upload statistics.
You don't need to know the total size of the object, and nothing is written to disk.
If it helps anyone, I was able to stream from the client to S3 successfully (without memory or disk storage):
https://gist.github.com/mattlockyer/532291b6194f6d9ca40cb82564db9d2a
The server endpoint assumes req is a stream object. I sent a File object from the client, which modern browsers can send as binary data, with the file info set in the headers.
const fileUploadStream = (req, res) => {
//get "body" args from header
const { id, fn } = JSON.parse(req.get('body'));
const Key = id + '/' + fn; //upload to s3 folder "id" with filename === fn
const params = {
Key,
Bucket: bucketName, //set somewhere
Body: req, //req is a stream
};
s3.upload(params, (err, data) => {
if (err) {
res.send('Error Uploading Data: ' + JSON.stringify(err) + '\n' + JSON.stringify(err.stack));
} else {
res.send(Key);
}
});
};
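For context, a hedged sketch of the matching client-side call (my reconstruction, not taken from the gist): the File object goes straight into the request body, and the metadata travels in the 'body' header that the endpoint above reads.
// in the browser: file is a File taken from an <input type="file">
async function uploadFile(file, id) {
  const res = await fetch('/upload', { // '/upload' is a placeholder route
    method: 'POST',
    headers: { body: JSON.stringify({ id, fn: file.name }) },
    body: file, // sent as binary data
  });
  return res.text(); // the endpoint responds with the S3 Key
}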
Yes, putting the file info in the headers breaks convention, but if you look at the gist it's much cleaner than anything else I found using streaming libraries, multer, busboy, etc.
+1 for pragmatism, and thanks to @SalehenRahman for his help.
I'm using the s3-upload-stream module in a working project here.
There are also some good examples from @raynos in his http-framework repository.
Alternatively you can look at https://github.com/minio/minio-js. It has a minimal set of abstracted APIs implementing the most commonly used S3 calls.
Here is an example of a streaming upload.
$ npm install minio
$ cat >> put-object.js << EOF
var Minio = require('minio')
var fs = require('fs')
// find out your s3 end point here:
// http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
var s3Client = new Minio({
url: 'https://<your-s3-endpoint>',
accessKey: 'YOUR-ACCESSKEYID',
secretKey: 'YOUR-SECRETACCESSKEY'
})
var fileStream = fs.createReadStream('your_localfile.zip');
fs.stat('your_localfile.zip', function(e, stat) {
  if (e) {
    return console.log(e)
  }
  s3Client.putObject('mybucket', 'hello/remote_file.zip', 'application/octet-stream', stat.size, fileStream, function(e) {
    return console.log(e) // should be null
  })
})
EOF
putObject() here is a fully managed single function call: for file sizes over 5MB it automatically performs a multipart upload internally. You can also resume a failed upload, and it will start from where it left off by verifying the previously uploaded parts.
Additionally, this library is isomorphic, so it can be used in browsers as well.