The current project I'm working on requires that multiple processes upload data to a single file in S3. This data arrives from multiple sources in parallel, so to process all the sources as fast as possible we'll run multiple Node.js instances listening to them. There are memory and storage constraints, so loading all the ingested data into memory, or storing it on disk and then performing a single upload, is out of the question.
To respect those constraints I implemented a streamed upload: it buffers a small portion of the data from a single source and pipes the buffer into an upload stream. This works really well with a single Node.js process but, as I mentioned, the goal is to process all sources in parallel. My first attempt was to open multiple streams to the same object key in the bucket. That simply overwrites the object with the data from the last stream to close, so I discarded this option.
// code for the scenario above, where each process will open a separate stream to
// the same key and perform its data ingestion and upload
const stream = require('stream');
const AWS = require('aws-sdk');

function openStreamingUpload() {
  const s3 = new AWS.S3(/* s3 config */);
  const passThrough = new stream.PassThrough();
  const params = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    Body: passThrough
  };
  s3
    .upload(params)
    .promise();
  return passThrough;
}

async function main() { // simulating a "never ending" flow of data
  const upload = openStreamingUpload();
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end(); // a PassThrough stream is finished with end(), not close()
}
main();
Next I tried the multipart upload that the S3 API offers. First, I create a multipart upload, obtain its ID and store it in a shared space. After that, each of the Node.js processes in the cluster opens a part upload against that same UploadId obtained beforehand. Each of those part uploads has a stream that pipes the data received. The problem I came across is that a part upload needs to know the part length beforehand, and since I'm piping a stream that I don't know when will close or how much data it will carry, it's not possible to calculate its size. Code inspired by this implementation.
const AWS = require('aws-sdk');
const s3 = new AWS.S3(/* s3 config */);

async function startMultipartUpload() {
  const multiPartParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket'
  };
  const multipart = await s3.createMultipartUpload(multiPartParams).promise();
  return multipart.UploadId;
}

async function finishMultipartUpload(multipartUploadId) {
  const finishingParams = {
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId
  };
  const data = await s3.completeMultipartUpload(finishingParams).promise();
  return data;
}
async function openMultipartStream(multipartUploadId) {
  const stream = require('stream');
  const passThrough = new stream.PassThrough();
  const params = {
    Body: passThrough,
    Key: 'final-s3-file.txt',
    Bucket: 'my-bucket',
    UploadId: multipartUploadId,
    PartNumber: ??? // how do I know this part number when the stream is, in principle, unbounded?
  };
  s3
    .uploadPart(params)
    .promise();
  return passThrough;
}
// a single process will start the multipart upload
const uploadId = await startMultipartUpload();

async function main() { // simulating a "never ending" flow of data
  const upload = await openMultipartStream(uploadId);
  let data = await receiveData();
  do {
    upload.write(data);
    data = await receiveData();
  } while (data);
  upload.end();
}
main(); // all the processes will receive and upload to the same UploadId
finishMultipartUpload(uploadId); // only the last process to close will finish the multipart upload
Searching around, I came across an article from AWS that presents the upload() API method and says it abstracts the multipart API to allow piping data streams when uploading large files. So I wonder if it's possible to obtain the UploadId from a streamed 'simple' upload, so I can pass that ID around the cluster, upload to the same object, and still keep the streaming characteristic. Has anyone tried this kind of 'streamed multipart' upload scenario?
Related
I was practicing with this tutorial
https://www.youtube.com/watch?v=NZElg91l_ms&t=1234s
It works absolutely like a charm for me, but the thing is I am storing images of products in a bucket; let's say I upload 4 images, they are all uploaded.
But when I display them I get an access denied error, as I am displaying the list and the repeated requests are maybe being detected as spam.
This is how I am trying to fetch them in my React app:
//rest of the data comes from a mysql database (product name, price)
//100+ products
{products.map((row) => (
  <React.Fragment key={row.imgurl}>
    <div className="product-hero"><img src={`http://localhost:3909/images/${row.imgurl}`} /></div>
    <div className="text-center">{row.productName}</div>
  </React.Fragment>
))}
As it fetches 100+ products from the DB and 100 images from AWS, it fails.
Sorry for such a detailed question, but in short: how can I fetch all the product images from my bucket?
Note: I am aware that I can get only one image per call, so how can I get all the images one by one in my scenario?
//download code in my app.js
const express = require('express')
const { uploadFile, getFileStream } = require('./s3')
const app = express()
app.get('/images/:key', (req, res) => {
console.log(req.params)
const key = req.params.key
const readStream = getFileStream(key)
readStream.pipe(res)
})
//s3 file
const fs = require('fs')
const AWS = require('aws-sdk')
const s3 = new AWS.S3(/* region + credentials config */)
const bucketName = 'my-bucket' // your bucket name

// uploads a file to s3
function uploadFile(file) {
  const fileStream = fs.createReadStream(file.path)
  const uploadParams = {
    Bucket: bucketName,
    Body: fileStream,
    Key: file.filename
  }
  return s3.upload(uploadParams).promise()
}
exports.uploadFile = uploadFile

// downloads a file from s3
function getFileStream(fileKey) {
  const downloadParams = {
    Key: fileKey,
    Bucket: bucketName
  }
  return s3.getObject(downloadParams).createReadStream()
}
exports.getFileStream = getFileStream
It appears that your code is sending image requests to your back-end, which retrieves the objects from Amazon S3 and then serves the images in response to the request.
A much better method would be to have the URLs in the HTML page point directly to the images stored in Amazon S3. This would be highly scalable and would reduce the load on your web server.
This would require the images to be public so that the user's web browser can retrieve the images. The easiest way to do this would be to add a Bucket Policy that grants GetObject access to all users.
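For illustration, a minimal bucket policy of that shape might look like the following (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-product-images/*"
    }
  ]
}
```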
Alternatively, if you do not wish to make the bucket public, you can instead generate Amazon S3 pre-signed URLs, which are time-limited URLs that provide temporary access to a private object. Your back-end can calculate the pre-signed URL with a couple of lines of code, and the user's web browser will then be able to retrieve private objects from S3 for display on the page.
I did similar S3 image handling while building my blog's image upload functionality, but I did not use getFileStream() to upload my image.
Because nothing should be done until the image file is fully processed, I used fs.readFile(path, callback) instead to read the data.
My way generates Buffer data, but AWS S3 is smart enough to interpret this as an image. (I have only added a suffix to my filename; I don't know how to apply image headers...)
This is my part of code for reference:
fs.readFile(imgPath, (err, data) => {
  if (err) { throw err }
  // Once the file is read, upload to AWS S3
  const objectParams = {
    Bucket: 'yuyuichiu-personal',
    Key: req.file.filename,
    Body: data
  }
  S3.putObject(objectParams, (err, data) => {
    // store image link and read image with link
  })
})
I need to write a Lambda function that takes a number in the API request field, generates that number of QR codes and stores them in an S3 bucket. I am using the Serverless Framework with the aws-nodejs template.
To describe the task briefly: I get a number in the API request pathParameters, and based on that number I have to generate that many QRs using the qr npm package and then store the generated QRs in the S3 bucket.
This is what I have been able to do so far.
module.exports.CreateQR = (event, context) => {
  const numberOfQR = JSON.parse(event.pathParameters.number);
  for (let i = 0; i < numberOfQR; i++) {
    var d = new Date();
    async function createQr() {
      let unique, cipher, raw, qrbase64;
      unique = randomize('0', 16);
      cipher = key.encrypt(unique);
      raw = { 'version': '1.0', data: cipher, type: 'EC_LOAD' }
      // linkArray.forEach( async (element,index) =>{
      let qrcode = await qr.toDataURL(JSON.stringify(raw));
      console.log(qrcode);
      // fs.writeFileSync('./qr.html', `<img src="${qrcode}">`)
      const params = {
        Bucket: BUCKET_NAME,
        Key: `QR/${d}/img${i + 1}.jpg`,
        Body: qrcode
      };
      s3.upload(params, function (err, data) {
        if (err) {
          throw err
        }
        console.log(`File uploaded successfully. ${data.Location}`);
      });
    }
    createQr();
  }
};
I have been able to upload the given number of images to the bucket, but the issue I am facing is that the images are not uploaded in order. I think the problem is the asynchronous code. Any idea how to solve this?
That's because you're not awaiting the S3 upload; instead you pass a callback.
You should use s3's .promise() and then await it, so you wait for each file to be uploaded before moving on to the next one.
I changed your code as an example.
See docs:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#upload-property
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3/ManagedUpload.html#promise-property
// see the async keyword before the lambda function
// we need it to use the await keyword and wait for a task to complete before continuing
module.exports.CreateQR = async (event, context) => {
  const numberOfQR = JSON.parse(event.pathParameters.number);

  // moved your function out of the loop
  function createQr() {
    // ...
    // here we call .promise and return it, so we get a Task back from s3
    return s3.upload(params).promise();
  }

  for (let i = 0; i < numberOfQR; i++) {
    // ....
    // here we await the task, so the images are created and uploaded in order
    await createQr();
  }
};
Hope it guides you towards a solution.
I have the following use case to solve. I need to ingest data from an S3 bucket using a Lambda function (Node.js 12). The Lambda function will be triggered when a new file is created. The file is a gz archive and can contain multiple TSV (tab-separated) files. For each row an API call will be triggered from the Lambda function. Questions:
1 - Does it have to be a two-step process: uncompress the archive in a /tmp folder and then read the TSV files? Or can you directly stream the content of the archive file?
2 - Do you have a snippet of code that you could share that shows how to stream a GZ file from an S3 bucket and its content (TSV)? I've found a few examples, but only for pure Node.js, not from Lambda/S3.
Thanks a lot for your help.
Adding a snippet of code for my first test; it doesn't work. No data is logged in the console.
const csv = require('csv-parser')
const aws = require('aws-sdk');
const s3 = new aws.S3();

exports.handler = async (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const objectKey = event.Records[0].s3.object.key;
  const params = { Bucket: bucket, Key: objectKey };
  var results = [];

  console.log("My File: " + objectKey + "\n")
  console.log("My Bucket: " + bucket + "\n")

  var otherOptions = {
    columns: true,
    auto_parse: true,
    escape: '\\',
    trim: true,
  };

  s3.getObject(params).createReadStream()
    .pipe(csv({ separator: '|' }))
    .on('data', (data) => results.push(data))
    .on('end', () => {
      console.log("My data: " + results);
    });

  return await results
};
You may want to take a look at the wiki:
Although its file format also allows for multiple [compressed files / data streams] to be concatenated (gzipped files are simply decompressed concatenated as if they were originally one file[5]), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive (also called tarball), and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.
What this means is that by itself, gzip (or a Node package to use it) is not powerful enough to decompress a single .gz file into multiple files. I hope that if a single .gz item in S3 contains more than one file, it's actually a .tar.gz or similar compressed collection. To deal with these, check out
Simplest way to download and unzip files in NodeJS
You may also be interested in node-tar.
In terms of getting just one file out of the archive at a time, this depends on what the compressed collection actually is. Some compression schemes allow extracting just one file at a time; others don't (they require you to decompress the whole thing in one go). Tar allows the former.
The first step should be to decompress the .tar.gz file using the decompress package.
// typescript code for decompressing a .tar.gz file
const decompress = require("decompress");

try {
  const targzSrc = await s3
    .getObject({
      Bucket: BUCKET_NAME,
      Key: fileRequest.key
    })
    .promise();
  const filesPromise = decompress(targzSrc.Body);
  const outputFileAsString = await filesPromise.then((files: any) => {
    console.log("inflated file:", files["0"].data.toString("utf-8"));
    return files["0"].data.toString("utf-8");
  });
  console.log("And here goes the file content:", outputFileAsString);
  // here should be the code that parses the CSV content using outputFileAsString
} catch (err) {
  console.log("G_ERR:", err);
}
I'm using Node.js 0.10.22 and q-fs.
I'm trying to upload objects to S3, which stopped working once the objects were over a certain MB size.
Besides taking up all the memory on my machine, it gives me this error
RangeError: length > kMaxLength
at new Buffer (buffer.js:194:21)
when I try to use fs.read on the file.
Normally, when this works, I do s3.upload and put the buffer in the Body field.
How do I handle large objects?
You'll want to use a streaming version of the API to pipe your readable filesystem stream directly to the S3 upload http request body stream provided by the s3 module you are using. Here's an example straight from the aws-sdk documentation
var AWS = require('aws-sdk');
var fs = require('fs');

var body = fs.createReadStream('bigfile');
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body}).
  on('httpUploadProgress', function(evt) { console.log(evt); }).
  send(function(err, data) { console.log(err, data); });
I am trying to find a solution to stream files to Amazon S3 from a Node.js server, with these requirements:
Don't store a temp file on the server or in memory. Up to some limit, though (not the complete file), buffering can be used for uploading.
No restriction on uploaded file size.
Don't freeze the server until the file upload completes, because with a heavy file upload other requests' waiting time would unexpectedly increase.
I don't want to use direct file upload from the browser, because the S3 credentials would need to be shared in that case. Another reason to upload from the Node.js server is that some authentication may also need to be applied before uploading.
I tried to achieve this using node-multiparty, but it was not working as expected. You can see my solution and the issue at https://github.com/andrewrk/node-multiparty/issues/49. It works fine for small files but fails for a file of size 15MB.
Any solution or alternative?
You can now use streaming with the official Amazon SDK for nodejs in the section "Uploading a File to an Amazon S3 Bucket" or see their example on GitHub.
What's even more awesome, you can finally do so without knowing the file size in advance. Simply pass the stream as the Body:
var AWS = require('aws-sdk');
var fs = require('fs');
var zlib = require('zlib');

var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
  .on('httpUploadProgress', function(evt) { console.log(evt); })
  .send(function(err, data) { console.log(err, data); });
For your information, the v3 SDK was published with a dedicated module to handle this use case: https://www.npmjs.com/package/@aws-sdk/lib-storage
Took me a while to find it.
Give https://www.npmjs.org/package/streaming-s3 a try.
I used it for uploading several big files in parallel (>500Mb), and it worked very well.
It's very configurable and also lets you track upload statistics.
You don't need to know the total size of the object, and nothing is written to disk.
If it helps anyone, I was able to stream from the client to S3 successfully (without memory or disk storage):
https://gist.github.com/mattlockyer/532291b6194f6d9ca40cb82564db9d2a
The server endpoint assumes req is a stream object; I sent a File object from the client, which modern browsers can send as binary data, with the file info set in the headers.
const fileUploadStream = (req, res) => {
  //get "body" args from header
  const { id, fn } = JSON.parse(req.get('body'));
  const Key = id + '/' + fn; //upload to s3 folder "id" with filename === fn
  const params = {
    Key,
    Bucket: bucketName, //set somewhere
    Body: req, //req is a stream
  };
  s3.upload(params, (err, data) => {
    if (err) {
      res.send('Error Uploading Data: ' + JSON.stringify(err) + '\n' + JSON.stringify(err.stack));
    } else {
      res.send(Key);
    }
  });
};
Yes, putting the file info in the headers breaks convention, but if you look at the gist it's much cleaner than anything else I found using streaming libraries, multer, busboy, etc...
+1 for pragmatism and thanks to @SalehenRahman for his help.
I'm using the s3-upload-stream module in a working project here.
There are also some good examples from @raynos in his http-framework repository.
Alternatively you can look at https://github.com/minio/minio-js. It has a minimal set of abstracted APIs implementing the most commonly used S3 calls.
Here is an example of streaming upload.
$ npm install minio
$ cat >> put-object.js << EOF
var Minio = require('minio')
var fs = require('fs')
// find out your s3 end point here:
// http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
var s3Client = new Minio({
  url: 'https://<your-s3-endpoint>',
  accessKey: 'YOUR-ACCESSKEYID',
  secretKey: 'YOUR-SECRETACCESSKEY'
})

var fileStream = fs.createReadStream('your_localfile.zip')
fs.stat('your_localfile.zip', function(e, stat) {
  if (e) {
    return console.log(e)
  }
  s3Client.putObject('mybucket', 'hello/remote_file.zip', 'application/octet-stream', stat.size, fileStream, function(e) {
    return console.log(e) // should be null
  })
})
EOF
putObject() here is a fully managed single function call: for file sizes over 5MB it automatically does multipart internally. You can resume a failed upload as well, and it will start from where it left off by verifying the previously uploaded parts.
Additionally, this library is isomorphic and can be used in browsers as well.