Import large PDF files to be indexed in Elasticsearch - Node.js

I am trying to import large PDF files into Elasticsearch to index them.
uploadPDFDocument: async (req, res, next) => {
    try {
        let data = req.body;
        let client = await cloudSearchController.getElasticSearchClient();
        const documentData = await fs.readFile("./large.pdf");
        const encodedData = Buffer.from(documentData).toString('base64');
        let document = {
            id: 'my_id_7',
            index: 'my-index-000001',
            pipeline: 'attachment',
            timeout: '5m',
            body: {
                data: encodedData
            }
        };
        let response = await client.create(document);
        console.log(response);
        return res.status(200).send(response);
    } catch (error) {
        console.log(error.stack);
        return next(error);
    }
},
The above code works for small PDF files: I am able to extract data from them and index it.
But for large PDF files I get a timeout exception.
Is there another way to do this without the timeout issue?
I have read about FSCrawler, Filebeat and Logstash, but they all seem to deal with logs rather than PDF files.
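If the failure is the client request timing out while the attachment pipeline parses the document, one thing worth trying before switching tools is raising the timeout on the client itself as well as per request. A minimal sketch, assuming the official @elastic/elasticsearch client; the node URL and the five-minute values are placeholders, and encodedData is the base64 string built in the code above:

const { Client } = require('@elastic/elasticsearch');

// Client-level timeout: applies to every request made through this client.
const client = new Client({
    node: 'http://localhost:9200',   // placeholder node URL
    requestTimeout: 5 * 60 * 1000    // 5 minutes, in milliseconds
});

// Per-request timeout: passed in the second (options) argument,
// inside an async function such as the controller above.
const response = await client.create(
    {
        id: 'my_id_7',
        index: 'my-index-000001',
        pipeline: 'attachment',
        body: { data: encodedData }  // encodedData as built in the question
    },
    { requestTimeout: 5 * 60 * 1000 }
);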

Related

Slow upload speed when uploading to Amazon S3

I'm trying to upload 50 GB of data into an S3 bucket using NestJS (Node.js) and Angular 13.
I have thousands of marriage images along with raw files, and uploading them to the S3 bucket takes a lot of time.
I have also enabled transfer acceleration in the S3 configuration, and all files are uploaded as multipart data.
Angular process:
the user selects thousands of files using a file input
then I call one upload API per file in a loop; for example, if I have 5 files it will call 5 APIs one by one
Backend process:
each file is uploaded as multipart data (using multipart upload)
after the file is uploaded to the bucket, I store the bucket data in the database
Can anyone tell me how I can improve the upload speed into the S3 bucket?
NestJs code
async create(request, photos) {
    try {
        let allPromises = [];
        photos.forEach((photo) => {
            let promise = new Promise<void>((resolve, reject) => {
                let file = new ChangeFileName().changeName(photo);
                this.s3fileUploadService.upload(file, `event-gallery-photos/${request.event_id}`).then(async (response: any) => {
                    console.log(response);
                    if (response.Location) {
                        await this.eventPhotoEntity.save({
                            studio_id: request.studio_id,
                            client_id: request.client_id,
                            event_id: request.event_id,
                            file_name: file.originalname,
                            original_name: file.userFileName,
                            file_size: file.size
                        });
                    }
                    resolve();
                }).catch((error) => {
                    console.log(error);
                    this.logger.error(`s3 file upload error : ${error.message}`);
                    reject();
                });
            });
            allPromises.push(promise);
        });
        return Promise.all(allPromises).then(() => {
            return new ResponseFormatter(HttpStatus.OK, "Created successfully");
        }).catch(() => {
            return new ResponseFormatter(HttpStatus.INTERNAL_SERVER_ERROR, "Something went wrong", {});
        });
    } catch (error) {
        console.log(error);
        this.logger.error(`event photo create : ${error.message}`);
        return new ResponseFormatter(HttpStatus.INTERNAL_SERVER_ERROR, "Something went wrong", {});
    }
}
Upload function
async upload(file, bucket) {
    return new Promise(async (resolve, reject) => {
        bucket = 'photo-swipes/' + bucket;
        const chunkSize = 1024 * 1024 * 5; // chunk size is set to 5MB
        const iterations = Math.ceil(file.buffer.length / chunkSize); // number of chunks the file is broken into
        let arr = [];
        for (let i = 1; i <= iterations; i++) {
            arr.push(i);
        }
        try {
            let uploadId: any = await this.startUpload(file, bucket);
            uploadId = uploadId.UploadId;
            const parts = await Promise.allSettled(
                arr.map(async (item, index) => {
                    return await this.uploadPart(
                        file.originalname,
                        file.buffer.slice((item - 1) * chunkSize, item * chunkSize),
                        uploadId,
                        item,
                        bucket
                    );
                })
            );
            const failedParts = parts
                .filter((part) => part.status === "rejected")
                .map((part: any) => part.reason);
            const succeededParts = parts
                .filter((part) => part.status === "fulfilled")
                .map((part: any) => part.value);
            let retriedParts = [];
            if (failedParts.length) // if some parts failed, retry them
                retriedParts = await Promise.all(
                    failedParts.map((item, index) => {
                        return this.uploadPart(
                            file.originalname,
                            file.buffer.slice((item - 1) * chunkSize, item * chunkSize),
                            uploadId,
                            item,
                            bucket
                        );
                    })
                );
            const data = await this.completeUpload(
                file.originalname,
                uploadId,
                succeededParts, // needs sorted array
                bucket
            );
            resolve(data);
        } catch (err) {
            console.error(err);
            reject(err);
        }
    });
}
Bandwidth
This is the typical slowness issue you see when uploading a large number of small files to S3.
One large 50 GB file will upload much faster than many small files with a total size of 50 GB.
The best approach for you is to parallelize the uploads; if you can use the AWS CLI, that will be very fast with multiple parallel connections.
Multipart upload is for a single large file.
Transfer acceleration is also not going to help much here.
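To illustrate the parallelization advice on the Node.js side, here is a minimal sketch, assuming the AWS SDK v3 @aws-sdk/client-s3 package; the region, bucket, key scheme and concurrency value are placeholders, and files is assumed to hold objects with buffer and originalname as in the question:

const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' }); // placeholder region

// Upload `files` with at most `concurrency` requests in flight at a time,
// instead of one API call per file, one after the other.
async function uploadAll(files, concurrency = 10) {
    const results = [];
    for (let i = 0; i < files.length; i += concurrency) {
        const batch = files.slice(i, i + concurrency).map((file) =>
            s3.send(new PutObjectCommand({
                Bucket: 'photo-swipes',                           // placeholder bucket
                Key: `event-gallery-photos/${file.originalname}`, // placeholder key scheme
                Body: file.buffer
            }))
        );
        results.push(...await Promise.allSettled(batch));
    }
    return results;
}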

Retrieve an image from MongoDB and display it on the client side

Now I'm using Express, Node.js, and MongoDB. I just saw that images can be stored in MongoDB with multer and GridFS storage, and it works.
Now I need to get them back to the client side. I guess the image can be converted from that binary chunk back into an image, but I'm not really sure how to do so. My ultimate purpose is to display a menu with the name, price, and picture from MongoDB that I uploaded.
Does anyone know how to retrieve the image and send the image file from the controller to the boundary class?
Additional resources:
// this is the entity class, which obtains information about image files
static async getImages(menu) {
    try {
        let filter = Object.values(menu.image);
        const files = await this.files.find({ filename: { $in: filter } }).toArray();
        let fileInfos = [];
        for (const file of files) {
            let chunk = await this.chunks.find({ files_id: file._id }).toArray();
            console.log(chunk.data);
            fileInfos.push(chunk.data);
        }
        return fileInfos;
    } catch (err) {
        console.log(`Unable to get files: ${err.message}`);
    }
}
So a chunk document contains this:
{
_id: new ObjectId("627a28cda6d7935899174cd4"),
files_id: new ObjectId("627a28cda6d7935899174cd3"),
n: 0,
data: new Binary(Buffer.from("89504e470d0a1a0a0000000d49484452000000180000001808020000006f15aaaf0000000674524e530000000000006ea607910000009449444154789cad944b12c0200843a5e3fdaf9c2e3a636d093f95a586f004b5b5e30100c0b2f8daac6d1a25a144e4b74288325e5a23d6b6aea965b3e643e4243b2cc428f472908f35bb572dace8d4652e485bab83f4c84a0030b6347e3cb5cc28dbb84721ff23704c17a7661ad1ee96dc5f22ff5061f458e29621447e4ec8557ba585a99152b97bb4f5d5d68c92532b10f967bc015ce051246ff76d8b0000000049454e44ae426082", "hex"), 0)
}
// this is the controller class
static async apiViewMenu(_req, res) {
    try {
        let menus = await MenusDAO.getAllMenus();
        for (const menu of menus) {
            menu.images = await ImagesDAO.getImages(menu);
        }
        // return the menus list
        res.json(menus);
    } catch (err) {
        res.status(400).json({ error: err.message });
    }
}
I did not handle converting this buffer data into an image because I do not know how...
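For what it's worth, one common way to serve GridFS files is to skip reassembling the chunks by hand and stream them through a dedicated image route instead. A minimal sketch, assuming the official mongodb driver's GridFSBucket and an already-connected Express app and db handle (the route path and bucket name are placeholders); the menu JSON can then simply reference an image URL such as /images/<filename>:

const { GridFSBucket } = require('mongodb');

// db is an already-connected mongodb Db instance
const bucket = new GridFSBucket(db, { bucketName: 'fs' }); // 'fs' is the default bucket name

app.get('/images/:filename', (req, res) => {
    bucket
        .openDownloadStreamByName(req.params.filename) // reads fs.files + fs.chunks for you
        .on('error', () => res.sendStatus(404))
        .pipe(res);
});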

Why does my React front end not download the file sent from my Express back end?

Hope you can help me with this one!
Here is the situation: I want to download a file from my React front end by sending a request to a certain endpoint on my Express back end.
Here is my controller for this route.
I build a query, parse the results to generate a CSV file and send that file back.
When I console log the response on the front-end side, the data is there, it goes through; however, no dialog opens allowing the client to download the file to the local disk.
module.exports.downloadFile = async (req, res) => {
    const sql = await buildQuery(req.query, 'members', connection);
    // Select the wanted data from the database
    connection.query(sql, (err, results, fields) => {
        if (err) throw err;
        // Convert the JSON into CSV
        try {
            const csv = parse(results);
            // Save the file on the server
            fs.writeFileSync(__dirname + '/export.csv', csv);
            res.setHeader('Content-disposition', 'attachment; filename=export.csv');
            res.download(__dirname + '/export.csv');
        } catch (err) {
            console.error(err);
        }
        // Reply with the csv file
        // Delete the file
    });
}
Follow one of these functions as an example for your client-side code.
Using promise chaining:
export const download = (url, filename) => {
    fetch(url, {
        mode: 'no-cors'
        /*
         * ALTERNATIVE MODE {
             mode: 'cors'
           }
         */
    }).then((transfer) => {
        return transfer.blob();                 // RETURN THE TRANSFERRED DATA AS A BLOB
    }).then((bytes) => {
        let elm = document.createElement('a');  // CREATE A LINK ELEMENT IN THE DOM
        elm.href = URL.createObjectURL(bytes);  // SET THE LINK'S CONTENTS
        elm.setAttribute('download', filename); // SET THE 'download' ATTRIBUTE TO THE FILENAME PARAM
        elm.click();                            // TRIGGER THE ELEMENT TO DOWNLOAD
        elm.remove();
    }).catch((error) => {
        console.log(error);                     // OUTPUT ERRORS, SUCH AS CORS WHEN TESTING NON-LOCALLY
    });
}
Using async/await:
export const download = async (url, filename) => {
    let response = await fetch(url, {
        mode: 'no-cors'
        /*
         * ALTERNATIVE MODE {
             mode: 'cors'
           }
         */
    });
    try {
        let data = await response.blob();
        let elm = document.createElement('a');  // CREATE A LINK ELEMENT IN THE DOM
        elm.href = URL.createObjectURL(data);   // SET THE LINK'S CONTENTS
        elm.setAttribute('download', filename); // SET THE 'download' ATTRIBUTE TO THE FILENAME PARAM
        elm.click();                            // TRIGGER THE ELEMENT TO DOWNLOAD
        elm.remove();
    } catch (err) {
        console.log(err);
    }
}
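A usage example (the endpoint path here is hypothetical; use whatever route is mapped to downloadFile above):

// e.g. from a button's onClick handler in the React component
download('/api/members/export', 'export.csv');

Note that with mode: 'no-cors' a cross-origin response comes back opaque and its body cannot be read into a usable blob, so for a cross-origin API you will want mode: 'cors' together with the matching CORS headers on the Express side.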

How to make formidable not save to /var/folders in a Node.js and Express app

I'm using formidable to parse incoming files and store them on AWS S3.
While debugging the code I found out that formidable first saves the files to disk at /var/folders/, and over time unnecessary files stack up on disk, which could lead to a big problem later.
It was very silly of me to use code without fully understanding it, and now I have to figure out how to either remove the parsed file after saving it to S3 or save it to S3 without storing it on disk.
But the question is: how do I do it?
I would appreciate it if someone could point me in the right direction.
This is how I handle the files:
import formidable, { Files, Fields } from 'formidable';

const form = new formidable.IncomingForm();
form.parse(req, async (err: any, fields: Fields, files: Files) => {
    let uploadUrl = await util
        .uploadToS3({
            file: files.uploadFile,
            pathName: 'myPathName/inS3',
            fileKeyName: 'file',
        })
        .catch((err) => console.log('S3 error =>', err));
});
This is how I solved the problem:
When I parse the incoming multipart form data I have access to all the details of the files, because they are already parsed and saved to the local disk on the server/my computer. So, using the path variable formidable gives me, I unlink/remove each file with Node's built-in fs.unlink function. Of course, I remove the file only after saving it to AWS S3.
This is the code:
import fs from 'fs';
import formidable, { Files, Fields } from 'formidable';

const form = new formidable.IncomingForm();
form.multiples = true;
form.parse(req, async (err: any, fields: Fields, files: Files) => {
    const pathArray = [];
    try {
        const s3Url = await util.uploadToS3(files);
        // do something with the s3Url
        pathArray.push(files.uploadFileName.path);
    } catch (error) {
        console.log(error);
    } finally {
        pathArray.forEach((element: string) => {
            fs.unlink(element, (err: any) => {
                if (err) console.error('error:', err);
            });
        });
    }
});
I also found another solution, which you can take a look at here, but due to the architecture I found it slightly hard to implement without changing my original code (or, let's just say, I didn't fully understand the given implementation).
I think I found it. According to the docs, see options.fileWriteStreamHandler: "you need to have a function that will return an instance of a Writable stream that will receive the uploaded file data. With this option, you can have any custom behavior regarding where the uploaded file data will be streamed for. If you are looking to write the file uploaded in other types of cloud storages (AWS S3, Azure blob storage, Google cloud storage) or private file storage, this is the option you're looking for. When this option is defined the default behavior of writing the file in the host machine file system is lost."
const form = formidable({
    fileWriteStreamHandler: someFunction,
});
EDIT: My whole code
import formidable from "formidable";
import { Writable } from "stream";
import { Buffer } from "buffer";
import { v4 as uuidv4 } from "uuid";

export const config = {
    api: {
        bodyParser: false,
    },
};

const formidableConfig = {
    keepExtensions: true,
    maxFileSize: 10_000_000,
    maxFieldsSize: 10_000_000,
    maxFields: 2,
    allowEmptyFiles: false,
    multiples: false,
};

// promisify formidable
function formidablePromise(req, opts) {
    return new Promise((accept, reject) => {
        const form = formidable(opts);
        form.parse(req, (err, fields, files) => {
            if (err) {
                return reject(err);
            }
            return accept({ fields, files });
        });
    });
}

const fileConsumer = (acc) => {
    const writable = new Writable({
        write: (chunk, _enc, next) => {
            acc.push(chunk);
            next();
        },
    });
    return writable;
};

// inside the handler
export default async function handler(req, res) {
    const token = uuidv4();
    try {
        const chunks = [];
        const { fields, files } = await formidablePromise(req, {
            ...formidableConfig,
            // consume this, otherwise formidable tries to save the file to disk
            fileWriteStreamHandler: () => fileConsumer(chunks),
        });
        // do something with the files
        const contents = Buffer.concat(chunks);
        const bucketRef = storage.bucket("your bucket"); // storage: Cloud Storage client initialized elsewhere
        const file = bucketRef.file(files.mediaFile.originalFilename);
        await file
            .save(contents, {
                public: true,
                metadata: {
                    contentType: files.mediaFile.mimetype,
                    metadata: { firebaseStorageDownloadTokens: token },
                },
            })
            .then(() => {
                file.getMetadata().then((data) => {
                    const fileName = data[0].name;
                    const media_path = `https://firebasestorage.googleapis.com/v0/b/${bucketRef?.id}/o/${fileName}?alt=media&token=${token}`;
                    console.log("File link", media_path);
                });
            });
    } catch (e) {
        // handle errors
        console.log("ERR PREJ ...", e);
    }
}
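Since the original goal was S3 rather than Cloud Storage, here is a comparable sketch (an assumption on my part, not the author's code) that uses the same fileWriteStreamHandler hook together with a PassThrough stream and the Upload helper from @aws-sdk/lib-storage (AWS SDK v3), so the file goes straight from the request to S3 without touching the disk; the region, bucket and key are placeholders:

import formidable from "formidable";
import { PassThrough } from "stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Single-file sketch: parses the request and streams the upload straight to S3.
export function parseAndUploadToS3(req) {
    let s3Done;
    const form = formidable({
        fileWriteStreamHandler: () => {
            // formidable pipes the incoming file data into this stream
            const pass = new PassThrough();
            s3Done = new Upload({
                client: s3,
                params: {
                    Bucket: "my-bucket",          // placeholder bucket
                    Key: `uploads/${Date.now()}`, // placeholder key; derive a real one from the parsed fields/files
                    Body: pass,
                },
            }).done();
            return pass;
        },
    });

    return new Promise((resolve, reject) => {
        form.parse(req, async (err, fields, files) => {
            if (err) return reject(err);
            // wait for the S3 upload started in the handler to finish
            resolve(await s3Done);
        });
    });
}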

Receive stream of text from Node.js in AngularJS

I have an application built with MongoDB as the database, Node.js on the back end and AngularJS on the front end.
I need to export a large amount of data from MongoDB to a CSV file and make it possible for the user to download it. My first implementation retrieved the data from MongoDB using a stream (Mongoose), saved the file using a writable stream and then returned the path so AngularJS could download it.
This approach fails for bigger files, as they take longer to write and the original AngularJS request times out. I changed the back end so that, instead of writing the file to disk, it streams the data into the response, so AngularJS can receive it in chunks and download the file.
Node.js
router.get('/log/export', async (req: bloo.BlooRequest, res: Response) => {
    res.setHeader('Content-Type', 'text/csv');
    res.setHeader('Content-Disposition', 'attachment; filename="download-' + Date.now() + '.csv"');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Pragma', 'no-cache');
    const query = req.query;
    const service = new LogService();
    service
        .exportUsageLog(req.user, query)
        .pipe(res);
});

public exportUsageLog(user: User, query: any) {
    let criteria = this.buildQuery(query);
    let stream = this.dao.getLogs(criteria);
    return stream
        .pipe(treeTransform)
        .pipe(csvTransform);
}

public getLogs(criteria: any): mongoose.QueryStream {
    return this.logs
        .aggregate([
            {
                '$match': criteria
            }, {
                '$lookup': {
                    'from': 'users',
                    'localField': 'username',
                    'foreignField': '_id',
                    'as': 'user'
                }
            }
        ])
        .cursor({})
        .exec()
        .stream();
}
This Node.js implementation works, as I can see the data using the Postman application.
My issue is how to receive these chunks in AngularJS, write them to a file and start the download on the client side, now that the back end streams the data instead of just returning a file path.
My current AngularJS code:
cntrl.export = () => {
    LogService
        .exportLog(cntrl.query)
        .then((res: any) => {
            if (res.data) {
                console.dir(res.data);
                let file = res.data;
                let host = location.protocol + "//" + window.location.host.replace("www.", "");
                let path = file.slice(7, 32);
                let fileName = file.slice(32);
                let URI = host + "/" + path + fileName;
                let link = document.createElement("a");
                link.href = URI;
                link.click();
            }
        })
        .catch((err: any) => console.error(err));
}
angular
    .module('log', [])
    .factory('LogService', LogService);

function LogService(constants, $resource, $http) {
    const ExportLog = $resource(constants.api + 'modules/log/export');

    return {
        exportLog(query: any) {
            return ExportLog.get(query, res => res).$promise;
        },
    };
}
With the above code, "res" is undefined.
How could I achieve that?
Thanks in advance.
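For the AngularJS side, a minimal sketch of a pattern that is commonly used here (an assumption, not tested against this exact service: it relies on $http and constants being injected into the controller, on the /log/export route above, and on AngularJS's standard responseType config option): request the endpoint as a Blob, wrap the received data in an object URL and trigger the download from a temporary link.

cntrl.export = () => {
    $http.get(constants.api + 'modules/log/export', {
        params: cntrl.query,
        responseType: 'blob'   // collect the streamed CSV as a Blob
    }).then((res) => {
        let url = URL.createObjectURL(res.data);
        let link = document.createElement("a");
        link.href = url;
        link.download = "download-" + Date.now() + ".csv";
        link.click();
        URL.revokeObjectURL(url);
    }).catch((err) => console.error(err));
};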
