Chromium pdf renderer generates large pdf - node.js

I am trying to generate PDFs from Node.js using Puppeteer. We have a large amount of data, so we generate small HTML files (approx. 3 MB each), convert these HTML files to PDFs, and then merge the PDFs to produce the final PDF. The issue is that the generated PDF is very large. For example, the PDF generated using the paid pdflib is 60 MB, but a PDF with similar content generated using Puppeteer is almost double the size (120 MB). If we remove the image from the header (which is just 3 KB in size), that also reduces the size considerably. Are there any flags or tricks to optimize the size of PDFs generated using Puppeteer (Chromium)?
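
For reference, the pipeline described above could look roughly like the sketch below. It is not the asker's actual code: the merge step uses the open-source pdf-lib package as an assumption (the question does not say which tool does the merging), `pdfPathFor`, `renderChunks` and `mergePdfs` are hypothetical names, and the `page.pdf` options shown are ordinary layout options, not size-reduction flags. Both third-party packages are loaded lazily inside the functions that use them.

```javascript
const path = require('path');
const fs = require('fs/promises');
const { pathToFileURL } = require('url');

// Hypothetical helper: derive the output PDF path for one HTML chunk.
function pdfPathFor(htmlPath) {
  const { dir, name } = path.parse(htmlPath);
  return path.join(dir, `${name}.pdf`);
}

// Render each HTML chunk to its own PDF, reusing a single browser instance.
async function renderChunks(htmlPaths) {
  const puppeteer = require('puppeteer'); // third-party, assumed installed
  const browser = await puppeteer.launch();
  const pdfPaths = [];
  try {
    for (const htmlPath of htmlPaths) {
      const page = await browser.newPage();
      const url = pathToFileURL(path.resolve(htmlPath)).href;
      await page.goto(url, { waitUntil: 'networkidle0' });
      const outPath = pdfPathFor(htmlPath);
      await page.pdf({ path: outPath, format: 'A4', printBackground: true });
      await page.close();
      pdfPaths.push(outPath);
    }
  } finally {
    await browser.close();
  }
  return pdfPaths;
}

// Merge the per-chunk PDFs into one file (pdf-lib is an assumption here).
async function mergePdfs(pdfPaths, mergedPath) {
  const { PDFDocument } = require('pdf-lib'); // third-party, assumed installed
  const merged = await PDFDocument.create();
  for (const pdfPath of pdfPaths) {
    const src = await PDFDocument.load(await fs.readFile(pdfPath));
    const pages = await merged.copyPages(src, src.getPageIndices());
    pages.forEach((p) => merged.addPage(p));
  }
  await fs.writeFile(mergedPath, await merged.save());
}
```

A call such as `mergePdfs(await renderChunks(['chunk-1.html', 'chunk-2.html']), 'final.pdf')` would run both steps. Because every chunk is rendered as a separate document, each chunk PDF may carry its own copy of shared resources such as fonts and the header image, which is one possible (unconfirmed) contributor to the size difference described in the question.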

Related

Extracting Text from multiple pages of PDF using tesseract OCR in node.js

I am currently working on a project to extract text from multi-page PDFs (these PDFs are generally circulars or application forms) using Tesseract OCR in Node.js. Since Tesseract only takes images as input, I am not able to pass it the PDF directly. I need code that passes each page of the PDF to Tesseract and returns the result (it is not a problem if the results of the pages are appended together).
I tried using pdf-poppler, but I don't necessarily need to use pdf-poppler.
Technologies used: Tesseract OCR for JS, Node.js
Additional/optional info: Can I get suggestions for a free, open-source OCR to use, and for how to parse the text I get?
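
One way the page-by-page flow asked about here might be sketched: convert every page to a PNG with pdf-poppler, then feed the images to tesseract.js in page order and append the text. This assumes tesseract.js v5 (where `createWorker('eng')` returns a ready worker) and that pdf-poppler writes files named `<prefix>-<page>.png`; `sortPageImages` and `ocrPdf` are hypothetical names.

```javascript
const path = require('path');
const fs = require('fs/promises');

// Keep only "<something>-<number>.png" files and sort them by page number,
// so that page 10 does not end up before page 2.
function sortPageImages(fileNames) {
  const pageNo = (f) => Number(f.match(/-(\d+)\.png$/)[1]);
  return fileNames
    .filter((f) => /-\d+\.png$/.test(f))
    .sort((a, b) => pageNo(a) - pageNo(b));
}

// Convert all pages to PNG, OCR each one, and return the appended text.
async function ocrPdf(pdfPath, outDir) {
  const pdf = require('pdf-poppler'); // third-party, assumed installed
  const { createWorker } = require('tesseract.js'); // assumes tesseract.js v5
  const prefix = path.basename(pdfPath, path.extname(pdfPath));

  await fs.mkdir(outDir, { recursive: true });
  // page: null asks pdf-poppler to convert every page.
  await pdf.convert(pdfPath, { format: 'png', out_dir: outDir, out_prefix: prefix, page: null });

  const candidates = (await fs.readdir(outDir)).filter((f) => f.startsWith(`${prefix}-`));
  const images = sortPageImages(candidates);

  const worker = await createWorker('eng');
  const texts = [];
  try {
    for (const image of images) {
      const { data } = await worker.recognize(path.join(outDir, image));
      texts.push(data.text);
    }
  } finally {
    await worker.terminate();
  }
  return texts.join('\n');
}
```

Reusing one worker for all pages avoids reloading the language data for every image.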

How to redact texts in a pdf file in NodeJs

I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in Node.js. Here is a list of libraries that I have tried with no success:
pdf-lib: This library fulfils almost all the requirements, except that it cannot redact the text permanently, which is one of its limitations: https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, I tried to convert the PDF to an image so that the black redaction boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this library is not reliable, as it cannot extract the contents of most PDFs during the process.
Finally, Pdf2Pic: This library helps to overcome the limitation of the first library (pdf-lib) by converting the PDF into images. But it internally uses two non-Node libraries (graphicsmagick and ghostscript), which I am trying to avoid.
Is there a Node.js-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to overcome the limitation of pdf-lib?

Node js - converting pdf to valid version

I have various PDF files that fail a certain processing step because they are invalid.
I use https://www.pdf-online.com/osa/validate.aspx
and when I validate a PDF, I get a message that says the PDF does not conform to the PDF 1.3 or 1.4 standard.
I'm familiar with converting the PDF to text/JSON/buffer and then rebuilding it and saving it as a new PDF file, but I was wondering whether there is an alternative, because each PDF is different and is basically user input, so rebuilding it using jsPDF, for example, would be different for every file.
Is it possible to convert such a PDF document so that it conforms to the PDF 1.3/1.4 standards?
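
As a small aid for triaging such files, the sketch below reads the version a PDF declares in its `%PDF-x.y` header using only Node's standard library. It does not validate or convert anything: the header only states the declared version and says nothing about whether the file actually conforms. `declaredPdfVersion` and `declaredPdfVersionOfFile` are hypothetical names.

```javascript
const fs = require('fs/promises');

// Return the version declared in the "%PDF-x.y" header, or null if the
// header is not found within the first 1024 bytes.
function declaredPdfVersion(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  const match = head.match(/%PDF-(\d)\.(\d)/);
  return match ? `${match[1]}.${match[2]}` : null;
}

// Read only the start of the file instead of loading the whole PDF.
async function declaredPdfVersionOfFile(filePath) {
  const handle = await fs.open(filePath, 'r');
  try {
    const { buffer, bytesRead } = await handle.read(Buffer.alloc(1024), 0, 1024, 0);
    return declaredPdfVersion(buffer.subarray(0, bytesRead));
  } finally {
    await handle.close();
  }
}
```

This can help separate files that merely declare a newer version from files whose header is missing or damaged, before deciding how to repair them.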

Convert any document, image, text file into PDF

I want to convert any document, image, or text file to PDF on any OS.
I tried the approach with node-msoffice-pdf, and it works fine on Windows but not on other operating systems.
Question:
How can I convert docs, images, and text files to PDF in Node.js?
I have used wkhtmltopdf for years to manage PDF conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an HTML file and pass it to the module, or render a PDF directly from a URL.
If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/#nativedocuments/docx-wasm which performs the conversion locally (i.e. where Node is running), without the need for LibreOffice etc.

node.js read images from PDF

I need to use PDF in a way similar to ZIP/RAR: to hold many images (ancient Tibetan Buddhist literature), ideally 60,000, although splitting them into 10-100 volumes is OK.
Anything can be used for packing, but for unpacking we need Node.js, because the same PDF file must be served on the web, while some users will need the whole PDF.
So the question is: what Node module can I use to read any single arbitrary image from a huge PDF? An example would really help.
Every image is a single page (or, in other words, every page is a single image).
We have been using https://github.com/mirkokiefer/Node-Magick for this....
But the PNGs we get out are sometimes of fairly low quality.
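
Since every page holds exactly one image, one alternative to re-rendering the page is to pull the embedded image out of a single page. The sketch below is not a pure Node module: it shells out to the `pdfimages` command-line tool from poppler-utils, which is assumed to be installed and on the PATH. `pdfimagesArgs` and `extractPageImage` are hypothetical names.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

// Build the pdfimages arguments that restrict extraction to one page
// (-f first page, -l last page) and write PNG output.
function pdfimagesArgs(pdfPath, pageNumber, outPrefix) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  return ['-f', String(pageNumber), '-l', String(pageNumber), '-png', pdfPath, outPrefix];
}

// Extract the image(s) embedded on one page; pdfimages writes files whose
// names start with outPrefix followed by a running image number.
async function extractPageImage(pdfPath, pageNumber, outPrefix) {
  await promisify(execFile)('pdfimages', pdfimagesArgs(pdfPath, pageNumber, outPrefix));
}
```

For example, `extractPageImage('volume-01.pdf', 42, '/tmp/page42')` would extract the image stored on page 42. Extracting the stored image rather than rasterizing the page avoids resampling, which may help with the quality loss mentioned above.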
