How to redact text in a PDF file in Node.js

I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in Node.js. Here is a list of libraries that I have tried without success:
pdf-lib: This library fulfils almost all the requirements, except that it doesn't redact the text permanently, which is one of its limitations: https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, I tried to convert the PDF to an image so that the black redaction boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this library was not reliable for me, as it could not extract the contents of most PDFs during the process.
Finally, Pdf2Pic: This library helps to overcome the limitation of the first library (pdf-lib) by converting the PDF into images. But it internally uses two non-Node libraries (GraphicsMagick and Ghostscript), which I am trying to avoid.
Is there a Node.js-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to overcome the limitation of pdf-lib?

Related

Extracting Text from multiple pages of PDF using tesseract OCR in node.js

I am currently working on a project for extracting text from multi-page PDFs (these PDFs are generally circulars or application forms) using Tesseract OCR (in Node.js). Since Tesseract only takes images as input, I am not able to pass it the PDF. I need code that passes each page of the PDF to Tesseract and gets the result back (it's not a problem if the results of the pages are appended).
I tried using pdf-poppler, but I don't necessarily need to use pdf-poppler.
Technologies used: Tesseract OCR for JS, Node.js
Additional/optional info: Can I get suggestions for a free open source OCR to use, and for how to parse the text I get?

Convert any document, image, text file into PDF

I want to convert any document, image, or text file into PDF on all operating systems.
I tried the approach with node-msoffice-pdf, and it works fine on Windows but not on other operating systems.
Question:
How to convert docs, images, and text files to PDF in Node.js?
I have used wkhtmltopdf for years to manage PDF conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an HTML file and pass it to the module, or render a PDF directly from a URL.
If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/@nativedocuments/docx-wasm which will perform the conversion locally (i.e. where Node is running), without the need for LibreOffice etc.

Easily differentiate video files from image files in Node

I'm building a project where people can upload files. I would then like to display those files in a browser where people can interact with them (vote, comment, etc.).
However, this means I need to programmatically build the HTML depending on the format of the video or image. Is there a way to feed a file (or filename) into a library and determine whether I need to display it in a video element or an image element? Even a list of video formats vs image formats would help, but I haven't seen anything in regards to that.
No module can reliably determine the file type. The user could change either the extension or even the magic number of the file to obfuscate it. The only reliable way is to pass the file to an image/video transcoder and let it decide, or error out if the format is invalid. This way you know you are working with known formats, since all files are transcoded to your specific extensions. That could be mp4 or png. I recommend using handbrake for videos and sharp for images. Leaving the NPM links down below:
https://www.npmjs.com/package/handbrake-js
https://www.npmjs.com/package/sharp
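As a cheap first pass before transcoding, the leading bytes of an upload can hint at whether it belongs in a video or an image element. The following is only a minimal sketch: the function names (startsWith, sniffMediaKind) are made up for illustration, the signature list is a small assumed subset of common formats, and, as noted above, the magic number can be forged, so this does not replace letting a transcoder validate the file.

```javascript
'use strict';

// Hypothetical helper: true if `buf` contains `bytes` at `offset`.
function startsWith(buf, bytes, offset = 0) {
  if (buf.length < offset + bytes.length) return false;
  return bytes.every((b, i) => buf[offset + i] === b);
}

// ISO BMFF brands that denote still images (assumed, non-exhaustive list).
const IMAGE_BRANDS = new Set(['avif', 'heic', 'heix', 'mif1']);

// Hypothetical classifier: returns 'image', 'video', or 'unknown'
// based only on the first bytes of the file. A hint, not validation.
function sniffMediaKind(buf) {
  if (startsWith(buf, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image'; // PNG
  if (startsWith(buf, [0xff, 0xd8, 0xff])) return 'image'; // JPEG
  if (startsWith(buf, [0x47, 0x49, 0x46, 0x38])) return 'image'; // "GIF8"
  // WebP: "RIFF" at offset 0 and "WEBP" at offset 8
  if (startsWith(buf, [0x52, 0x49, 0x46, 0x46]) &&
      startsWith(buf, [0x57, 0x45, 0x42, 0x50], 8)) return 'image';
  // ISO BMFF (MP4, MOV, but also HEIC/AVIF): "ftyp" at offset 4, brand at 8..11
  if (startsWith(buf, [0x66, 0x74, 0x79, 0x70], 4)) {
    const brand = buf.toString('latin1', 8, 12);
    return IMAGE_BRANDS.has(brand) ? 'image' : 'video';
  }
  // EBML header used by WebM/Matroska (could also be audio-only)
  if (startsWith(buf, [0x1a, 0x45, 0xdf, 0xa3])) return 'video';
  return 'unknown';
}
```

In practice only the first dozen or so bytes of the upload need to be read and passed in; anything reported as 'unknown' would still go to the transcoder to be accepted or rejected.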

Is it possible to add an image to a PDF without rendering the PDF?

I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).
If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?
Specifically, I'm asking because I've previously looked into writing a text extraction library, but in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work to get around server side processing in this case.
To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).
To perform even a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the things you would need to handle include:
basic parsing of PDF syntax
indexing via the cross reference index, and/or cross reference index and object streams
objects (num, byte-string, hex string, dictionary, arrays, booleans...)
filters and variants (LZW, Flate, RunLength, Predictors)
encryption (RC4, AES, Custom security handlers)
page tree traversal
basic handling of page content streams
image handling
serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.
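To give a feel for just the first item on that list, here is a minimal sketch of locating the cross-reference section in a PDF held in a Node.js Buffer. The function name findStartXref and the 1024-byte tail window are assumptions made for illustration. It only reads the offset stored after the final "startxref" keyword; it does not parse the table itself, and it does nothing about cross-reference streams, filters, encryption, or incremental updates.

```javascript
'use strict';

// Hypothetical helper: returns the byte offset written after the last
// "startxref" keyword near the end of the file, or null if none is found.
// Assumes the keyword sits within the last 1024 bytes of the file.
function findStartXref(pdfBytes) {
  const tailStart = Math.max(0, pdfBytes.length - 1024);
  // latin1 maps each byte to one character, so string indexes match bytes
  const tail = pdfBytes.toString('latin1', tailStart);
  const idx = tail.lastIndexOf('startxref');
  if (idx === -1) return null;
  const match = /^startxref\s+(\d+)/.exec(tail.slice(idx));
  return match ? Number(match[1]) : null;
}
```

Everything after this step (reading the table or stream at that offset, resolving objects, walking the page tree, writing the file back) is where the bulk of the work listed above lies.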

Attach an image to an existing PDF at the right position

So, here's the thing: we have a bunch of PDF forms for users on our website to fill out, and every PDF form has a submit button that sends the filled-in data through an HTTP POST. We already pre-fill the forms: the user enters general information in an HTML form, and that data is used to generate multiple partially filled PDF files, so whichever PDF form the user selects, all of them are regenerated with the information just typed into the HTML form. I accomplish this with pdftk (http://www.pdflabs.com/docs/pdftk-man-page/).
To make this more efficient, the user must be able to draw their signature on the HTML side at the pre-fill stage, so we are using signature-pad (http://thomasjbradley.ca/lab/signature-pad/) and then creating an image from the canvas that the plugin generates. But as each PDF has its own position for the signature, we must insert a placeholder into the PDF that will be replaced by the signature image. So I came up with the idea of creating a disabled text field named "signature" in the PDF, so that a Node.js application using the pdf2json module (https://github.com/modesty/pdf2json) can parse the PDF form and get the position (x, y, w, h) of that particular field, which lets me place an image over the text field placeholder.
The problem is that pdftk doesn't support attaching an image to an existing PDF, let alone at a certain position. I tried to find a Node.js module that would let me do that, but the only worthwhile one I found was pdfkit (http://pdfkit.org/), which only creates new PDFs and cannot edit an existing one. I looked into the pdfkit source code and discarded it because I realized that it won't work for my case of an existing PDF.
So I have come a long way and reached the last step needed to get this working, and I'm just stuck.
This is the output that I get from the pdf2json module for Node.js, which is helping with the placeholder approach.
{
page: 7,
index: 317,
name: 'signature',
type: 'alpha',
x: 43.806640625,
y: 14.64195833333333,
w: 30.546828125000005,
h: 1.9339166666666756
}
If someone knows of any server application that I could run through a Unix command on my server to attach an image over an existing PDF document, it will fit my needs; it doesn't need to be exclusively a Node.js module.
Note: I already checked out the Adobe EchoSign product, but it doesn't fit our needs; it's not free and doesn't solve our problem of attaching a signature to multiple PDF files from a single HTML form.
I realized that I could use the Node.js module pdfkit to generate a new blank PDF with the signature in the right position and then overlay the two PDFs, with the blank PDF carrying the signature image on top like a stamp. I can do this with pdftk on the command line:
pdftk form.pdf stamp signature.pdf output form-signed.pdf
There's another free tool like pdftk that I just found out about: pdfjam. Also, if you can't use the Node.js module pdfkit (not to be confused with the application pdftk) to generate a new PDF with an image in the right spot, there is the StampTK tool (http://www.pdflabs.com/tools/stamptk-the-pdf-stamp-maker/), which lets you pass an image on the command line to be used as a stamp on an existing PDF. That tool is paid (not much, and it's worth it), but since the pdfkit module for Node.js lets me do just that alongside the pdftk application for free, I'm using that, and it also gives me more control over multiple signatures. Hope this answer helps someone.
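For anyone driving this from Node.js, the pdftk stamp command from this answer could be wrapped roughly as follows. This is a sketch under assumptions: the helper names buildStampArgs and stampPdf are invented for illustration, and the pdftk binary is assumed to be installed and on the PATH.

```javascript
'use strict';
const { execFile } = require('node:child_process');

// Hypothetical helper: argument list for
//   pdftk <form> stamp <stamp> output <out>
function buildStampArgs(formPath, stampPath, outputPath) {
  return [formPath, 'stamp', stampPath, 'output', outputPath];
}

// Hypothetical wrapper: runs pdftk and resolves with the output path.
// execFile passes arguments without a shell, so file names containing
// spaces or shell metacharacters need no extra quoting.
function stampPdf(formPath, stampPath, outputPath) {
  return new Promise((resolve, reject) => {
    const args = buildStampArgs(formPath, stampPath, outputPath);
    execFile('pdftk', args, (err, stdout, stderr) => {
      if (err) reject(new Error(`pdftk failed: ${stderr || err.message}`));
      else resolve(outputPath);
    });
  });
}
```

Calling stampPdf('form.pdf', 'signature.pdf', 'form-signed.pdf') would run the same command as above. Note that pdftk's stamp operation applies the first page of the stamp PDF to every page of the input; when the signature belongs on a single page only, the multistamp operation with a stamp PDF that has one page per input page may be the better fit.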
