Slow upload speed to Amazon S3 - Node.js

I'm trying to upload 50 GB of data to an S3 bucket using NestJS (Node.js) and Angular 13.
I have thousands of marriage images as raw files, and uploading them to the S3 bucket takes a long time.
I have also enabled transfer acceleration on the bucket, and every file is uploaded as multipart data.
Angular process:
the user selects thousands of files using a file input
then I call one API per file in a loop; for example, if there are 5 files, it calls the API 5 times, one by one
Backend process:
each file is uploaded as multipart data (using multipart upload)
after the file is uploaded to the bucket, I store its details in the database
Can anyone tell me how I can improve the upload speed to the S3 bucket?
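For context, the Angular side roughly follows the pattern below. This is a hypothetical sketch of the loop described above; uploadService.uploadPhoto is an assumed service method that POSTs one file to the NestJS endpoint, and firstValueFrom comes from rxjs 7 (shipped with Angular 13).

// Hypothetical Angular 13 sketch of the flow described above (component method).
// Assumes `this.uploadService.uploadPhoto(file)` returns an Observable that
// POSTs a single file to the NestJS endpoint.
async onFilesSelected(event: Event): Promise<void> {
  const files = Array.from((event.target as HTMLInputElement).files ?? []);
  for (const file of files) {
    // One API call per file, awaited one at a time (sequential).
    await firstValueFrom(this.uploadService.uploadPhoto(file));
  }
}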
NestJs code
async create(request, photos) {
  try {
    let allPromises = [];
    photos.forEach(async (photo) => {
      let promise = new Promise<void>((resolve, reject) => {
        let file = new ChangeFileName().changeName(photo);
        this.s3fileUploadService.upload(file, `event-gallery-photos/${request.event_id}`).then(async (response: any) => {
          console.log(response);
          if (response.Location) {
            await this.eventPhotoEntity.save({
              studio_id: request.studio_id,
              client_id: request.client_id,
              event_id: request.event_id,
              file_name: file.originalname,
              original_name: file.userFileName,
              file_size: file.size
            });
          }
          resolve();
        }).catch((error) => {
          console.log(error);
          this.logger.error(`s3 file upload error : ${error.message}`);
          reject();
        });
      });
      allPromises.push(promise);
    });
    return Promise.all(allPromises).then(() => {
      return new ResponseFormatter(HttpStatus.OK, "Created successfully");
    }).catch(() => {
      return new ResponseFormatter(HttpStatus.INTERNAL_SERVER_ERROR, "Something went wrong", {});
    });
  } catch (error) {
    console.log(error);
    this.logger.error(`event photo create : ${error.message}`);
    return new ResponseFormatter(HttpStatus.INTERNAL_SERVER_ERROR, "Something went wrong", {});
  }
}
Upload function
async upload(file, bucket) {
  return new Promise(async (resolve, reject) => {
    bucket = 'photo-swipes/' + bucket;
    const chunkSize = 1024 * 1024 * 5; // chunk size is set to 5 MB
    const iterations = Math.ceil(file.buffer.length / chunkSize); // number of chunks the file is broken into
    let arr = [];
    for (let i = 1; i <= iterations; i++) {
      arr.push(i);
    }
    try {
      let uploadId: any = await this.startUpload(file, bucket);
      uploadId = uploadId.UploadId;
      const parts = await Promise.allSettled(
        arr.map(async (item, index) => {
          return await this.uploadPart(
            file.originalname,
            file.buffer.slice((item - 1) * chunkSize, item * chunkSize),
            uploadId,
            item,
            bucket
          );
        })
      );
      const failedParts = parts
        .filter((part) => part.status === "rejected")
        .map((part: any) => part.reason);
      const succeededParts = parts
        .filter((part) => part.status === "fulfilled")
        .map((part: any) => part.value);
      let retriedParts = [];
      if (!failedParts.length) // if some parts got failed then retry
        retriedParts = await Promise.all(
          failedParts.map((item, index) => {
            this.uploadPart(
              file.originalname,
              file.buffer.slice((item) * chunkSize, item * chunkSize),
              uploadId,
              item,
              bucket
            );
          })
        );
      const data = await this.completeUpload(
        file.originalname,
        uploadId,
        succeededParts, // needs sorted array
        bucket
      );
      resolve(data);
    } catch (err) {
      console.error(err);
      reject(err);
    }
  });
}
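As an aside, the managed uploader in aws-sdk v2 (s3.upload) already splits a large body into parts and sends them in parallel, so a much shorter version of this function is possible. A minimal sketch, assuming this.s3 is an AWS.S3 instance and that the bucket name and key prefix are split out of the current 'photo-swipes/...' string:

// Sketch only: let ManagedUpload handle the multipart mechanics.
// The bucket name and key layout are assumptions based on the code above.
async upload(file, keyPrefix) {
  return this.s3
    .upload(
      {
        Bucket: 'photo-swipes',
        Key: `${keyPrefix}/${file.originalname}`,
        Body: file.buffer,
      },
      { partSize: 5 * 1024 * 1024, queueSize: 4 } // 5 MB parts, 4 parts in flight
    )
    .promise();
}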
Bandwidth

This is the typical slowness you see when uploading a large number of small files to S3.
One large 50 GB file will upload much faster than many small files totalling 50 GB.
The best approach for you is to parallelize the upload; if you can use the AWS CLI, that would be very fast thanks to its multiple parallel connections.
Multipart upload is for a single large file.
Transfer acceleration is also not going to help much here.
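To make the parallel approach concrete on the Node side, here is a minimal sketch of a bounded upload pool with aws-sdk v2 (the file list shape and pool size are assumptions):

// Minimal sketch: upload many files in parallel with a bounded worker pool.
// Assumes aws-sdk v2 and that `files` is an array of { key, buffer } objects.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function uploadAllWithPool(files, bucket, poolSize = 10) {
  const queue = [...files];
  const workers = Array.from({ length: poolSize }, async () => {
    while (queue.length) {
      const file = queue.shift();
      // s3.upload switches to multipart automatically for large bodies.
      await s3
        .upload({ Bucket: bucket, Key: file.key, Body: file.buffer })
        .promise();
    }
  });
  await Promise.all(workers);
}

With the AWS CLI, aws s3 cp --recursive combined with a higher s3.max_concurrent_requests setting achieves the same effect.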

Related

Import large pdf files to be indexed to Elastic Search

I am trying to index large PDF files into Elasticsearch.
uploadPDFDocument: async (req, res, next) => {
  try {
    let data = req.body;
    let client = await cloudSearchController.getElasticSearchClient();
    const documentData = await fs.readFile("./large.pdf");
    const encodedData = Buffer.from(documentData).toString('base64');
    let document = {
      id: 'my_id_7',
      index: 'my-index-000001',
      pipeline: 'attachment',
      timeout: '5m',
      body: {
        data: encodedData
      }
    };
    let response = await client.create(document);
    console.log(response);
    return res.status(200).send(response);
    return true;
  } catch (error) {
    console.log(error.stack);
    return next(error);
  }
},
The above code works for small PDF files and I am able to extract data from them and index them.
But for large PDF files I get a timeout exception.
Is there any other way to do this without running into the timeout?
I have read about fscrawler, filebeats and logstash, but they all deal with logs, not PDF files.
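If the timeout turns out to be thrown by the JavaScript client rather than by Elasticsearch itself, the official @elastic/elasticsearch client accepts a request timeout both at client level and per call (the default is 30 seconds). A sketch, assuming getElasticSearchClient() wraps that client and the node URL is a placeholder:

// Sketch: raise the client-side request timeout; the `timeout: '5m'` in the
// document params above is an Elasticsearch-side setting, not the client's.
const { Client } = require('@elastic/elasticsearch');

const client = new Client({
  node: 'http://localhost:9200',  // placeholder node URL
  requestTimeout: 10 * 60 * 1000, // client-wide default: 10 minutes
});

async function indexLargePdf(document) {
  // Per-request override via the second (transport options) argument.
  return client.create(document, { requestTimeout: 10 * 60 * 1000 });
}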

generated sitemaps are corrupted using sitemap library for node/js

I'm using a library called sitemap to generate files from an array of objects constructed during runtime. My goal is to upload these generated sitemaps to an S3 bucket.
So far, the function is hosted on AWS Lambda and uploads the generated files to the bucket correctly.
My problem is that the generated sitemaps are corrupted. When I run the function locally, they are generated correctly without any issues.
Here's my handler:
module.exports.handler = async () => {
  try {
    console.log("inside handler....");
    await clearGeneratedSitemapsFromTmpDir();
    const sms = new SitemapAndIndexStream({
      limit: 10000,
      getSitemapStream: (i) => {
        const sitemapStream = new SitemapStream({
          lastmodDateOnly: true,
        });
        const linkPath = `/sitemap-${i + 1}.xml`;
        const writePath = `/tmp/${linkPath}`;
        sitemapStream.pipe(createWriteStream(resolve(writePath)));
        return [new URL(linkPath, hostName).toString(), sitemapStream];
      },
    });
    const data = await generateSiteMap();
    sms.pipe(createWriteStream(resolve("/tmp/sitemap-index.xml")));
    // data.forEach((item) => sms.write(item));
    Readable.from(data).pipe(sms);
    sms.end();
    await uploadToS3();
    await clearGeneratedSitemapsFromTmpDir();
  } catch (error) {
    console.log("🚀 ~ file: index.js ~ line 228 ~ exec ~ error", error);
    Sentry.captureException(error);
  }
};
The data variable holds an array of around 11k items, so according to the code above two sitemap files should be generated (the first 10k items in one, the rest in the second), in addition to a sitemap index that lists the two generated sitemaps.
Here's my uploadToS3 function:
const uploadToS3 = async () => {
  try {
    console.log("uploading to s3....");
    const files = await getGeneratedXmlFilesNames();
    for (let i = 0; i < files.length; i += 1) {
      const file = files[i];
      const filePath = `/tmp/${file}`;
      // const stream = createReadStream(resolve(filePath));
      const fileRead = await readFileAsync(filePath, { encoding: "utf-8" });
      const params = {
        Body: fileRead,
        Key: `${file}`,
        ACL: "public-read",
        ContentType: "application/xml",
        ContentDisposition: "inline",
      };
      // const result = await s3Client.upload(params).promise();
      const result = await s3Client.putObject(params).promise();
      console.log(
        "🚀 ~ file: index.js ~ line 228 ~ uploadToS3 ~ result",
        result
      );
    }
  } catch (error) {
    console.log("uploadToS3 => error", error);
    // Sentry.captureException(error);
  }
};
And here's the function that cleans up the generated files from lambda's /tmp directory after upload to S3:
const clearGeneratedSitemapsFromTmpDir = async () => {
  try {
    console.log("cleaning up....");
    const readLocalTempDirDir = await readDirAsync("/tmp");
    const xmlFiles = readLocalTempDirDir.filter((file) =>
      file.includes(".xml")
    );
    for (const file of xmlFiles) {
      await unlinkAsync(`/tmp/${file}`);
      console.log("deleting file....");
    }
  } catch (error) {
    console.log(
      "🚀 ~ file: index.js ~ line 207 ~ clearGeneratedSitemapsFromTmpDir ~ error",
      error
    );
  }
};
My hunch is that the issue is related to streams as I haven't fully understood them yet.
Any help here is highly appreciated.
Side note: I tried to sleep for 10s before uploading, but that didn't work either.
As a workaround, I did this:
const data = await generateSiteMap();
const logger = createWriteStream(resolve("/tmp/all-urls.json.txt"), {
  flags: "a",
});
data.forEach((el) => {
  logger.write(JSON.stringify(el));
  logger.write("\n");
});
logger.end();
const stream = lineSeparatedURLsToSitemapOptions(
  createReadStream(resolve("/tmp/all-urls.json.txt"))
)
  .pipe(sms)
  .pipe(createWriteStream(resolve("/tmp/sitemap-index.xml")));
await new Promise((fulfill) => stream.on("finish", fulfill));
await uploadToS3();
await clearGeneratedSitemapsFromTmpDir();
I'll keep the question open in case somebody posts a proper answer.

Unable to upload multiple images to AWS S3 if I don't first upload one image through a AWS NodeJS Lambda endpoint using Promises

I have the code below on AWS Lambda as an endpoint exposed through API Gateway. The point of this endpoint is to upload images to an S3 bucket. I've been experiencing an interesting bug and could use some help. This code is unable to upload multiple images to S3 if it does not first upload one image. I've listed the scenarios below. The reason I want to use Promises is because I intend to insert data into a mysql table in the same endpoint. Any advice or feedback will be greatly appreciated!
Code successfully uploads multiple images:
Pass one image to the endpoint to upload to S3 first
Pass several images to the endpoint to upload to S3 after uploading one image first
Code fails to upload images:
Pass several images to the endpoint to upload to S3 first. A random number of images might be uploaded, but it consistently fails to upload all of them. A 502 error code is returned because it failed to upload all the images.
Code
const AWS = require('aws-sdk');
const s3 = new AWS.S3({});

function uploadAllImagesToS3(imageMap) {
  console.log('in uploadAllImagesToS3')
  return new Promise((resolve, reject) => {
    awaitAll(imageMap, uploadToS3)
      .then(results => {
        console.log('awaitAllFinished. results: ' + results)
        resolve(results)
      })
      .catch(e => {
        console.log("awaitAllFinished error: " + e)
        reject(e)
      })
  })
}

function awaitAll(imageMap, asyncFn) {
  const promises = [];
  imageMap.forEach((value, key) => {
    promises.push(asyncFn(key, value));
  })
  console.log('promises length: ' + promises.length)
  return Promise.all(promises)
}

function uploadToS3(key, value) {
  return new Promise((resolve, reject) => {
    console.log('Promise uploadToS3 | key: ' + key)
    // [key, value] = [filePath, Image]
    var params = {
      "Body": value,
      "Bucket": "userpicturebucket",
      "Key": key
    };
    s3.upload(params, function (err, data) {
      console.log('uploadToS3. s3.upload. data: ' + JSON.stringify(data))
      if (err) {
        console.log('error when uploading to s3 | error: ' + err)
        reject(JSON.stringify(["Error when uploading data to S3", err]))
      } else {
        let response = {
          "statusCode": 200,
          "headers": {
            "Access-Control-Allow-Origin": "http://localhost:3000"
          },
          "body": JSON.stringify(data),
          "isBase64Encoded": false
        };
        resolve(JSON.stringify(["Successfully Uploaded data to S3", response]))
      }
    });
  })
}

exports.handler = (event, context, callback) => {
  if (event !== undefined) {
    let jsonObject = JSON.parse(event.body)
    let pictures = jsonObject.pictures
    let location = jsonObject.pictureLocation
    let imageMap = new Map()
    for (let i = 0; i < pictures.length; i++) {
      let base64Image = pictures[i].split('base64,', 2)
      let decodedImage = Buffer.from(base64Image[1], 'base64'); // image string is after 'base64'
      let base64Metadata = base64Image[0].split(';', 3) // data:image/jpeg,name=coffee.jpg,
      let imageNameData = base64Metadata[1].split('=', 2)
      let imageName = imageNameData[1]
      var filePath = "test/" + imageName
      imageMap.set(filePath, decodedImage)
    }
    const promises = [uploadAllImagesToS3(imageMap)]
    Promise.all(promises)
      .then(([uploadS3Response]) => {
        console.log('return promise!! | uploadS3Response: ' + JSON.stringify([uploadS3Response]))
        let res = {
          body: JSON.stringify(uploadS3Response),
          headers: {
            "Access-Control-Allow-Origin": "http://localhost:3000"
          }
        };
        callback(null, res);
      })
      .catch((err) => {
        callback(err);
      });
  } else {
    callback("No pictures were uploaded")
  }
};
Reason for the problem and solution:
After several hours of debugging this issue I realized what the error was! My Lambda endpoint was timing out early. The reason I was able to upload multiple images after first uploading one image was because the Lambda was running from a warm start, as it was already up and running. The scenario where I was unable to upload multiple images only occurred when I hit the endpoint after 10+ minutes of inactivity, i.e. from a cold start. Therefore, the solution was to increase the timeout from the default of 3 seconds. I increased it to 20 seconds, but you might need to play around with that value.
How to increase the Lambda timeout?
Open the Lambda function
Scroll down to Basic settings and select Edit
Increase the Timeout value
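If you prefer the command line, the same change can be made with the AWS CLI (the function name below is a placeholder):

aws lambda update-function-configuration --function-name my-upload-function --timeout 20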
TL;DR
This error was occurring because the Lambda function was timing out. The solution is to increase the Lambda timeout.

Reading a ZIP archive from S3, and writing uncompressed version to new bucket

I have an app where a user can upload a ZIP archive of resources. My app handles the upload and saves it to S3. At some point I want to run a transformation that reads the object from S3, unzips it, and writes the contents to a new S3 bucket. This is all happening in a Node service.
I am using the unzipper library to handle unzipping. Here is my initial code.
async function downloadFromS3() {
  let s3 = new AWS.S3();
  try {
    const object = s3
      .getObject({
        Bucket: "zip-bucket",
        Key: "Archive.zip"
      })
      .createReadStream();
    object.on("error", err => {
      console.log(err);
    });
    await streaming_unzipper(object, s3);
  } catch (e) {
    console.log(e);
  }
}

async function streaming_unzipper(s3ObjectStream, s3) {
  await s3.createBucket({ Bucket: "unzip-bucket" }).promise();
  const unzipStream = s3ObjectStream.pipe(unzipper.Parse());
  unzipStream.pipe(
    stream.Transform({
      objectMode: true,
      transform: function(entry, e, next) {
        const fileName = entry.path;
        const type = entry.type; // 'Directory' or 'File'
        const size = entry.vars.uncompressedSize; // There is also compressedSize;
        if (type === "File") {
          s3.upload(
            { Bucket: "unzip-bucket", Body: entry, Key: entry.path },
            {},
            function(err, data) {
              if (err) console.error(err);
              console.log(data);
              entry.autodrain();
            }
          );
          next();
        } else {
          entry.autodrain();
          next();
        }
      }
    })
  );
}
This code works, but I feel like it could be optimized. Ideally I would like to pipe the download stream -> unzipper stream -> uploader stream, so that chunks are uploaded to S3 as they get unzipped, instead of waiting for the full unzip to finish and then uploading.
The problem I am running into is that I need the file name (to set as the S3 key), which I only know after unzipping, before I can start the upload.
Is there a good way to create a streaming upload to S3 that is initiated with a temporary id and gets renamed to the final name after the full stream is finished?
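For what it's worth, unzipper exposes entry.path before the entry's content is consumed, and s3.upload accepts a readable stream as Body (doing multipart under the hood), so each file can in principle be streamed straight through without a temporary id. A minimal sketch under those assumptions, reusing the bucket name from the code above (s3 is an AWS.S3 instance from aws-sdk v2):

// Sketch: stream each zip entry to S3 as it is parsed.
const unzipper = require("unzipper");

async function unzipToBucket(s3, sourceStream) {
  const uploads = [];
  await new Promise((done, fail) => {
    sourceStream
      .pipe(unzipper.Parse())
      .on("entry", (entry) => {
        if (entry.type === "File") {
          // entry.path is known here, before any content is read,
          // and s3.upload streams the entry as it arrives.
          uploads.push(
            s3
              .upload({ Bucket: "unzip-bucket", Key: entry.path, Body: entry })
              .promise()
          );
        } else {
          entry.autodrain();
        }
      })
      .on("error", fail)
      .on("finish", done);
  });
  return Promise.all(uploads);
}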

multiple Images are not fully uploaded to S3 upon first lambda call

I have an issue with uploading multiple files to S3.
What my Lambda is doing:
1. upload a single file to S3 (this always works).
2. resize the file to 4 new sizes (using sharp).
3. upload the resized files to S3.
The problem: sometimes only 2 or 3 out of the 4 resized files are uploaded.
The surprising thing is that on the next upload, the missing files from the previous upload show up.
There are no errors. I was thinking this could be an async issue, so I awaited in the right places to make it synchronous.
I will appreciate any help.
async function uploadImageArrToS3(resizeImagesResponse) {
  return new Promise(async function (resolve, reject) {
    var params = {
      Bucket: bucketName,
      ACL: 'public-read'
    };
    let uploadImgArr = resizeImagesResponse.map(async (buffer) => {
      params.Key = buffer.imgParamsArray.Key;
      params.Body = buffer.imgParamsArray.Body;
      params.ContentType = buffer.imgParamsArray.ContentType;
      let filenamePath = await s3.putObject(params, (e, d) => {
        if (e) {
          reject(e);
        } else {
          d.name = params.ContentType;
          return (d.name);
        }
      }).params.Key
      let parts = filenamePath.split("/");
      let fileName = parts[parts.length - 1];
      return {
        fileName: fileName,
        width: buffer.width
      };
    });
    await Promise.all(uploadImgArr).then(function (resizedFiles) {
      console.log('succesfully resized the image!');
      resolve(resizedFiles);
    });
  })
}
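Since the question mentions awaiting "the right places": in the code above the putObject calls are started via callback but their completion is never awaited (and all of them mutate one shared params object), so the invocation can return before every request has finished. For reference, a minimal sketch of the same upload with each call actually awaited, using the same aws-sdk v2 API and the field names from the code above:

// Sketch: await each putObject and build per-file params, so the Lambda
// does not return before every request has completed.
// `bucketName`, `s3`, and the `imgParamsArray` fields come from the code above.
async function uploadImageArrToS3(resizeImagesResponse) {
  return Promise.all(
    resizeImagesResponse.map(async (buffer) => {
      const params = {
        Bucket: bucketName,
        ACL: 'public-read',
        Key: buffer.imgParamsArray.Key,
        Body: buffer.imgParamsArray.Body,
        ContentType: buffer.imgParamsArray.ContentType,
      };
      await s3.putObject(params).promise(); // resolves only when S3 has stored the object
      const parts = params.Key.split('/');
      return { fileName: parts[parts.length - 1], width: buffer.width };
    })
  );
}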
