S3 fails to unzip uploaded file - node.js

I'm following this example
// Load the stream
var AWS = require('aws-sdk');
var fs = require('fs'), zlib = require('zlib');
var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());

// Upload the stream
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body}, function(err, data) {
  if (err) return console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});
And it "works" in that it does everything exactly as expected, EXCEPT that the file arrives on S3 compressed and stays that way.
As far as I can tell there's no auto facility for it to unzip it on S3, so, if your intention is to upload a publicly available image or video (or anything else that the end user is meant to simply consume) the solution appears to leave the uploaded file unzipped like so...
// Load the stream
var AWS = require('aws-sdk');
var fs = require('fs'), zlib = require('zlib');
var body = fs.createReadStream('bigfile'); //.pipe(zlib.createGzip()); <-- removing the zipping part

// Upload the stream
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body}, function(err, data) {
  if (err) return console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});
I'm curious if I'm doing something wrong and if there IS an automatic way to have S3 recognize that the file is arriving zipped and unzip it?

The way this works is that S3 has no way of knowing that the file is gzipped without a bit of help. You need to set metadata on the object when uploading, telling S3 that it's gzipped; it will do the right thing if this is set.
Specifically, you need to set Content-Encoding: gzip and Content-Type: <<your file type>> in the object metadata when uploading.
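As a rough sketch of how that looks with the v2 SDK's upload() call (bucket, key, file name and content type below are placeholders, and the body is the gzip stream from the question):

// Minimal sketch: upload a gzipped stream and tell S3 it is gzip-encoded.
// Bucket/Key/file names are placeholders; set ContentType to your actual file type.
var AWS = require('aws-sdk');
var fs = require('fs'), zlib = require('zlib');

var body = fs.createReadStream('bigfile.html').pipe(zlib.createGzip());
var s3 = new AWS.S3();

s3.upload({
  Bucket: 'myBucket',
  Key: 'myKey',
  Body: body,
  ContentEncoding: 'gzip',   // tells S3 to serve the object with Content-Encoding: gzip
  ContentType: 'text/html'   // the type of the *uncompressed* content
}, function(err, data) {
  if (err) return console.log("An error occurred", err);
  console.log("Uploaded the file at", data.Location);
});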
Later edit:
Found these, which explain how to do it for CloudFront, but it's basically the same for S3: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#CompressedS3
http://www.cameronstokes.com/2011/07/20/compressed-content-amazon-s3-and-cloudfront/
However, note that as per this blog post, S3 will serve the file gzipped and rely on the browser to unzip it. This works fine in many cases, but as the blogger notes it will fail in curl (since curl will have no idea what to do with the gzipped file). So if your intention is simply to upload a file for raw consumption by the user, your best bet is to skip the gzipping and upload the file in its uncompressed state.

Related

How to upload modified PDF file to AWS s3 from AWS Lambda

I have a requirement to:
Download a PDF file from AWS S3 storage. (Key1)
Do some modifications.
Upload the modified PDF file back to S3 storage. (Key2)
The uploaded file is a new file (Key2), not overwriting the existing file (Key1).
Library used for modifying PDFs: pdf-lib
All the steps (downloading, modifying, and uploading the PDF) are executed in AWS Lambda; the runtime is Node.js 14.x.
The objects in the S3 bucket can be accessed through a CDN, as public access is blocked.
I'm able to download the file, do the modifications, and upload it to S3. But when I open the file using the CDN URL for the object, it shows encoded text (garbage), not a PDF preview of the file.
Downloading PDF file from S3.
const params = {
  Bucket: bucket_name,
  Key: key
};

// GET FILE AND RETURN PROMISE.
return new Promise((resolve, reject) => {
  s3.getObject(params, (err, data) => {
    if (err) {
      return reject(err);
    }
    try {
      const obj = data.Body; // <<-- getting Uint8Array
      resolve(obj);
    } catch (e) {
      reject(e);
    }
  });
});
Doing Modification on PDF file
const { PDFDocument } = require('pdf-lib');

const modificationFunction = async (opts) => {
  const { fileData } = opts; // <<---- Uint8Array data from the snippet above.
  const pdfDoc = await PDFDocument.load(fileData);
  // Do some modification, like drawing lines.
  const modifiedPDFData = await pdfDoc.saveAsBase64({ dataUri: true });
  return modifiedPDFData; // <<--- Base64 data URI of the modified document.
};
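For reference, the "drawing lines" modification could look roughly like this with pdf-lib (the coordinates, thickness, and color below are made-up values purely to illustrate the API; this is a sketch, not the actual modification code):

// Hypothetical example of drawing a line on the first page with pdf-lib.
const { PDFDocument, rgb } = require('pdf-lib');

const drawLineOnFirstPage = async (fileData) => {
  const pdfDoc = await PDFDocument.load(fileData);
  const firstPage = pdfDoc.getPages()[0];
  // Coordinates, thickness and color are illustrative only.
  firstPage.drawLine({
    start: { x: 50, y: 50 },
    end: { x: 250, y: 50 },
    thickness: 2,
    color: rgb(1, 0, 0),
  });
  return pdfDoc.save(); // returns a Uint8Array of the modified PDF
};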
Uploading PDF file
const params = {
  Bucket: bucket_name,
  Key: key,
  Body: data, // <<--- Base64 data of the modification from the snippet above
};

try {
  await s3.upload(params).promise();
  console.log('File uploaded:', `s3://${bucket_name}/${key}`);
} catch (err) {
  console.log('Upload failed:', err);
}
The content of the PDF when viewed using the CDN URL is attached; it is encoded/garbage content.
The same PDF, when downloaded to my laptop manually from the S3 bucket, shows its contents properly like a normal PDF file.
Referenced many online resources/Stack Overflow threads:
link1
link2 (Using the AWS SDK in JavaScript)
Tried both the save() and saveAsBase64() methods of the pdf-lib Node.js library.
Tried saving the modified file locally, uploading it manually to AWS S3, and accessing it through the CDN; the PDF is viewable properly this way. So there is some issue with how the file is uploaded to S3.
The issue was not with the PDF download, modification, or upload operations. The CDN had a caching policy, due to which the initially generated garbage-content files kept being served on subsequent requests. After clearing the cache and trying again, the files were properly viewable via the CDN URL.
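If the CDN in question is CloudFront (an assumption; the note above only says "CDN"), the cache can also be cleared programmatically with an invalidation, roughly like this:

// Hypothetical sketch: clear the CDN cache for the regenerated object.
// Assumes the CDN is CloudFront; the distribution ID and path are placeholders.
const AWS = require('aws-sdk');

const invalidatePdfCache = async () => {
  const cloudfront = new AWS.CloudFront();
  await cloudfront.createInvalidation({
    DistributionId: 'YOUR_DISTRIBUTION_ID',
    InvalidationBatch: {
      CallerReference: `invalidate-${Date.now()}`, // must be unique per request
      Paths: {
        Quantity: 1,
        Items: ['/path/to/modified.pdf'], // CDN path of the re-generated object
      },
    },
  }).promise();
};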

Reading tab-separated files in gzip archive on S3 using Lambda (NodeJS)

I have the following use case to solve. I need to ingest data from a S3 bucket using a Lambda function (NodeJS 12). The Lambda function will be triggered when a new file is created. The file is a gz archive and can contain multiple TSV (tab-separated) files. For each row an API call will be triggered from the Lambda function. Questions:
1 - Does it have to be a two-step process (uncompress the archive into the /tmp folder and then read the TSV files), or can you stream the content of the archive file directly?
2 - Do you have a snippet of code you could share that shows how to stream a GZ file from an S3 bucket and its content (TSV)? I've found a few examples, but only for pure Node.js, not from Lambda/S3.
Thanks a lot for your help.
Adding a snippet of code from my first test; it doesn't work. No data is logged in the console:
const csv = require('csv-parser')
const aws = require('aws-sdk');
const s3 = new aws.S3();

exports.handler = async (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const objectKey = event.Records[0].s3.object.key;
  const params = { Bucket: bucket, Key: objectKey };
  var results = [];

  console.log("My File: " + objectKey + "\n")
  console.log("My Bucket: " + bucket + "\n")

  var otherOptions = {
    columns: true,
    auto_parse: true,
    escape: '\\',
    trim: true,
  };

  s3.getObject(params).createReadStream()
    .pipe(csv({ separator: '|' }))
    .on('data', (data) => results.push(data))
    .on('end', () => {
      console.log("My data: " + results);
    });

  return await results
};
You may want to take a look at the wiki:
Although its file format also allows for multiple [compressed files / data streams] to be concatenated (gzipped files are simply decompressed concatenated as if they were originally one file[5]), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive (also called tarball), and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.
What this means is that by itself, gzip (or a Node package to use it) is not powerful enough to decompress a single .gz file into multiple files. I hope that if a single .gz item in S3 contains more than one file, it's actually a .tar.gz or similar compressed collection. To deal with these, check out
Simplest way to download and unzip files in NodeJS
You may also be interested in node-tar.
In terms of getting just one file out of the archive at a time, this depends on what the compressed collection actually is. Some compression schemes allow extracting just one file at a time, others don't (they require you to decompress the whole thing in one go). Tar does the former.
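If the object turns out to be a plain .gz wrapping a single TSV stream (an assumption; adjust if it is really a .tar.gz), a streaming approach inside the Lambda handler could look roughly like this, with the pipeline wrapped in a Promise so the async handler waits for it to finish:

// Sketch only: stream a gzipped TSV from S3 through gunzip and csv-parser.
// Assumes a single TSV stream inside the .gz (not a tar archive).
const aws = require('aws-sdk');
const zlib = require('zlib');
const csv = require('csv-parser');
const s3 = new aws.S3();

exports.handler = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const objectKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  const rows = await new Promise((resolve, reject) => {
    const results = [];
    s3.getObject({ Bucket: bucket, Key: objectKey })
      .createReadStream()
      .pipe(zlib.createGunzip())        // decompress on the fly
      .pipe(csv({ separator: '\t' }))   // tab-separated rows
      .on('data', (row) => results.push(row))
      .on('error', reject)
      .on('end', () => resolve(results));
  });

  console.log('Parsed rows:', rows.length);
  // ... trigger an API call per row here ...
  return rows.length;
};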
The first step should be to decompress the .tar.gz file, using the decompress package:
// TypeScript code for decompressing a .tar.gz file
const decompress = require("decompress");

try {
  const targzSrc = await s3
    .getObject({
      Bucket: BUCKET_NAME,
      Key: fileRequest.key
    })
    .promise();

  const filesPromise = decompress(targzSrc.Body);
  const outputFileAsString = await filesPromise.then((files: any) => {
    console.log("inflated file:", files["0"].data.toString("utf-8"));
    return files["0"].data.toString("utf-8");
  });

  console.log("And here goes the file content:", outputFileAsString);
  // here should be the code that parses the CSV content using the outputFileAsString
} catch (err) {
  console.log("G_ERR:", err);
}

Serverless framework - uploading binary files to S3 become corrupted

I have an endpoint that takes in form data including a file. This file can be a text file, image, or pdf. I'm using busboy (v0.2.14) to parse the form data. That code looks like this:
let buffers = [];
file.on('data', data => buffers.push(data));
file.on('end', () => {
  result.filename = filename;
  result.contentType = mimetype;
  // Concat the chunks into a Buffer
  result.file = Buffer.concat(buffers);
});
// ...
busboy.write(event.body, event.isBase64Encoded ? 'base64' : 'binary');
busboy.end();
However, when I push the file data up to S3 using the AWS SDK (v2.97.0), all the binary files are corrupted when I go to view them. This does not happen to text files. The S3 upload code looks like this:
static myPutObject(bucketName, fileName, data, contentType, acl) {
  const params = {
    Bucket: bucketName,
    Key: fileName,
    Body: data,
    ACL: acl,
    ContentType: contentType,
    ContentEncoding: 'base64'
  };
  return new AWS.S3().putObject(params).promise();
}
I've tried everything that I can find on Stack Overflow or GitHub with no luck.
If you're using API Gateway in front, it will mangle the incoming binary data unless you specifically enable binary media types.
If you're using Serverless (SLS) to deploy, you can just add the following in the provider section:
apiGateway:
  binaryMediaTypes:
    - '*/*'
Read here: https://serverless.com/framework/docs/providers/aws/events/apigateway#binary-media-types
S3 is an "object in" and "object out" store. It does not know whether your content is binary or text or utf-16 encoding. It stores all the bytes as it receives and serves them when requested.
Here is how we validated whether the problem is on S3 or with our code.
Write the binary file locally
Send the same file to S3
Download from S3
Verify local file hash and download file hash for file integrity
That will help you to verify binary file contents.
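For example, a rough Node.js version of that integrity check (the bucket, key, and file path below are placeholders) might look like:

// Sketch: compare the SHA-256 hash of the local file with the object downloaded from S3.
// Bucket, key and file path are placeholders.
const AWS = require('aws-sdk');
const fs = require('fs');
const crypto = require('crypto');

const sha256 = (buf) => crypto.createHash('sha256').update(buf).digest('hex');

async function verifyUpload() {
  const localHash = sha256(fs.readFileSync('/tmp/original.pdf'));

  const s3 = new AWS.S3();
  const obj = await s3.getObject({ Bucket: 'my-bucket', Key: 'uploads/original.pdf' }).promise();
  const remoteHash = sha256(obj.Body);

  console.log(localHash === remoteHash ? 'File intact' : 'File corrupted in transit');
}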
Hope it helps.

How do I read and upload a large file to s3?

I'm using Node.js 0.10.22 and q-fs.
I'm trying to upload objects to S3, which stopped working once the objects were over a certain MB size.
Besides taking up all the memory on my machine, it gives me this error
RangeError: length > kMaxLength
at new Buffer (buffer.js:194:21)
When I try to use fs.read on the file.
Normally, when this works, I do s3.upload, and put the buffer in the Body field.
How do I handle large objects?
You'll want to use a streaming version of the API to pipe your readable filesystem stream directly into the S3 upload HTTP request body stream provided by the S3 module you are using. Here's an example straight from the aws-sdk documentation:
var AWS = require('aws-sdk');
var fs = require('fs');

var body = fs.createReadStream('bigfile');
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
  .on('httpUploadProgress', function(evt) { console.log(evt); })
  .send(function(err, data) { console.log(err, data); });

S3 file upload stream using node js

I am trying to find a solution to stream files to Amazon S3 from a Node.js server, with these requirements:
Don't store a temp file on the server or keep the complete file in memory; buffering up to some limit (but not the complete file) is acceptable.
No restriction on uploaded file size.
Don't block the server until the file upload completes, because with a heavy file upload the waiting time of other requests will increase unexpectedly.
I don't want to use direct file upload from the browser, because the S3 credentials would need to be shared in that case. Another reason to upload the file from the Node.js server is that some authentication may also need to be applied before uploading the file.
I tried to achieve this using node-multiparty, but it was not working as expected. You can see my solution and the issue at https://github.com/andrewrk/node-multiparty/issues/49. It works fine for small files but fails for a file of size 15MB.
Any solution or alternative?
You can now use streaming with the official Amazon SDK for Node.js; see the section "Uploading a File to an Amazon S3 Bucket" or their example on GitHub.
What's even more awesome, you can finally do so without knowing the file size in advance. Simply pass the stream as the Body:
var AWS = require('aws-sdk');
var fs = require('fs');
var zlib = require('zlib');

var body = fs.createReadStream('bigfile').pipe(zlib.createGzip());
var s3obj = new AWS.S3({params: {Bucket: 'myBucket', Key: 'myKey'}});
s3obj.upload({Body: body})
  .on('httpUploadProgress', function(evt) { console.log(evt); })
  .send(function(err, data) { console.log(err, data); });
For your information, the v3 SDK was published with a dedicated module to handle this use case: https://www.npmjs.com/package/@aws-sdk/lib-storage
Took me a while to find it.
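Roughly, a streamed upload with that v3 module looks like this (the bucket, key, and file path are placeholders):

// Sketch of a streamed upload with the v3 SDK's lib-storage module.
// Bucket, key and file path are placeholders.
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');
const fs = require('fs');

const upload = new Upload({
  client: new S3Client({}),
  params: {
    Bucket: 'myBucket',
    Key: 'myKey',
    Body: fs.createReadStream('bigfile'), // size does not need to be known in advance
  },
});

upload.on('httpUploadProgress', (progress) => console.log(progress));

upload.done().then(
  (result) => console.log('Upload complete:', result),
  (err) => console.error('Upload failed:', err)
);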
Give https://www.npmjs.org/package/streaming-s3 a try.
I used it for uploading several big files in parallel (>500MB), and it worked very well.
It's very configurable and also allows you to track upload statistics.
You don't need to know the total size of the object, and nothing is written to disk.
If it helps anyone, I was able to stream from the client to S3 successfully (without memory or disk storage):
https://gist.github.com/mattlockyer/532291b6194f6d9ca40cb82564db9d2a
The server endpoint assumes req is a stream object; I sent a File object from the client, which modern browsers can send as binary data, and added the file info in the headers.
const fileUploadStream = (req, res) => {
  // get "body" args from header
  const { id, fn } = JSON.parse(req.get('body'));
  const Key = id + '/' + fn; // upload to s3 folder "id" with filename === fn
  const params = {
    Key,
    Bucket: bucketName, // set somewhere
    Body: req, // req is a stream
  };
  s3.upload(params, (err, data) => {
    if (err) {
      res.send('Error Uploading Data: ' + JSON.stringify(err) + '\n' + JSON.stringify(err.stack));
    } else {
      res.send(Key);
    }
  });
};
Yes, putting the file info in the headers breaks convention, but if you look at the gist it's much cleaner than anything else I found using streaming libraries, multer, busboy, etc...
+1 for pragmatism, and thanks to @SalehenRahman for his help.
I'm using the s3-upload-stream module in a working project here.
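In case it's useful, basic usage of that module is roughly the following; this is a sketch of its pipe-based interface rather than code lifted from my project, and the bucket, key, and file path are placeholders:

// Sketch: pipe a file stream into S3 via the s3-upload-stream module.
// Bucket, key and file path are placeholders.
var AWS = require('aws-sdk');
var fs = require('fs');
var s3Stream = require('s3-upload-stream')(new AWS.S3());

var read = fs.createReadStream('/path/to/bigfile');
var upload = s3Stream.upload({
  Bucket: 'myBucket',
  Key: 'myKey'
});

upload.on('error', function(error) { console.log(error); });
upload.on('part', function(details) { console.log('part uploaded:', details); });
upload.on('uploaded', function(details) { console.log('done:', details); });

read.pipe(upload);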
There are also some good examples from @raynos in his http-framework repository.
Alternatively, you can look at https://github.com/minio/minio-js. It has a minimal set of abstracted APIs implementing the most commonly used S3 calls.
Here is an example of a streaming upload:
$ npm install minio
$ cat >> put-object.js << EOF
var Minio = require('minio')
var fs = require('fs')

// find out your s3 end point here:
// http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
var s3Client = new Minio({
  url: 'https://<your-s3-endpoint>',
  accessKey: 'YOUR-ACCESSKEYID',
  secretKey: 'YOUR-SECRETACCESSKEY'
})

var file = 'your_localfile.zip'
var fileStream = fs.createReadStream(file)

fs.stat(file, function(e, stat) {
  if (e) {
    return console.log(e)
  }
  s3Client.putObject('mybucket', 'hello/remote_file.zip', 'application/octet-stream', stat.size, fileStream, function(e) {
    return console.log(e) // should be null
  })
})
EOF
putObject() here is a fully managed single function call: for file sizes over 5MB it automatically does a multipart upload internally. You can resume a failed upload as well, and it will start from where it left off by verifying previously uploaded parts.
Additionally, this library is isomorphic and can be used in browsers as well.
