How to extract text from Aadhar pdf without OCR? - python-3.x

I need to extract text from Aadhar PDFs without using OCR, using only Python modules such as PyPDF2 and PDFMiner.
I tried both modules but was not able to extract any text.
How can I do this, and why don't these text extraction modules work on Aadhar PDFs?
Are Aadhar PDFs protected against text extraction somehow?
Can PDFs be protected against text extraction at all?
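
One thing worth ruling out before blaming the libraries is whether the file declares encryption, or whether its pages are just scanned images with no text layer. The following is a minimal, stdlib-only sketch of such a check. The function name `inspect_pdf_bytes` and the report keys are made up for illustration, and the byte-pattern matching is a rough heuristic rather than a real PDF parser: names stored inside compressed object streams will not be seen, so a `False` for fonts or images is not conclusive.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return rough hints about a PDF by scanning its raw bytes.

    Heuristic only: this does not parse the PDF structure, and it cannot
    see dictionaries that live inside compressed object streams.
    """
    return {
        # Every PDF file starts with a "%PDF-" header.
        "looks_like_pdf": data.startswith(b"%PDF-"),
        # An /Encrypt entry in the trailer means the content is encrypted
        # and usually needs a password before text can be read.
        "declares_encryption": re.search(rb"/Encrypt\b", data) is not None,
        # Font resources suggest there is a real text layer to extract.
        "mentions_fonts": re.search(rb"/Font\b", data) is not None,
        # Image XObjects with no fonts suggest scanned pages (OCR needed).
        "mentions_images": re.search(rb"/Subtype\s*/Image\b", data) is not None,
    }


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage: python inspect_pdf.py some_file.pdf
    if len(sys.argv) > 1:
        print(inspect_pdf_bytes(Path(sys.argv[1]).read_bytes()))
```

If `declares_encryption` comes back `True`, the file probably has to be decrypted with its password before extraction can work (PyPDF2's `PdfReader`, for example, exposes `is_encrypted` and `decrypt(password)`). If only images are reported, the text is likely not stored as text at all.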

Related

Extracting Text from multiple pages of PDF using tesseract OCR in node.js

I am currently working on a project that extracts text from multi-page PDFs (generally circulars or application forms) using Tesseract OCR in Node.js. Since Tesseract only takes images as input, I cannot pass it the PDF directly. I need code that passes each page of the PDF to Tesseract and returns the result (it is fine if the results of all pages are appended together).
I tried using pdf-poppler, but I don't necessarily need to use it.
Technologies used: Tesseract OCR for JS, Node.js
Additional/optional info: Can I get suggestions for a free, open-source OCR engine to use, and for how to parse the text I get back?
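
The page-by-page flow asked about here (rasterize each page, OCR each image, append the results) can be sketched as below. This is only an illustration in Python rather than Node.js, and it assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed and on the PATH. The helper names `rasterize_cmd`, `ocr_cmd`, and `ocr_pdf` are invented for this sketch.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that writes one PNG per page."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize every page, OCR each image, and append the page texts."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ocr_cmd(image, lang), check=True, capture_output=True, text=True
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

Calling `ocr_pdf("circular.pdf")` (a placeholder file name) would return the text of all pages joined together. The same two-step structure should carry over to Node.js with whichever PDF-to-image and Tesseract bindings are in use.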

How to redact texts in a pdf file in NodeJs

I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in Node.js. Here is a list of libraries that I have tried with no success:
pdf-lib: This library fulfils almost all the requirements, except that it cannot redact the text permanently, which is one of its limitations: https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, I tried to convert the PDF to an image so that the black redaction boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this library is not reliable, as it cannot extract the contents of most PDFs during the process.
Finally, Pdf2Pic: This library helps to overcome the limitation of the first library (pdf-lib) by converting the PDF into images. But it internally uses two non-Node libraries (GraphicsMagick and Ghostscript), which I am trying to avoid.
Is there a Node.js-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to work around the limitation of pdf-lib?

Extract embedded pdf from word document(docx) file

I am able to extract the embedded images using XWPF (https://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFDocument.html), but I am unable to extract an embedded PDF from a docx file.
Can anyone please suggest how to do this?

Convert any document, image, text file into PDF

I want to convert any document, image, or text file into PDF on any OS.
I tried node-msoffice-pdf, and it works fine on Windows but not on other operating systems.
Question:
How can I convert documents, images, and text files to PDF in Node.js?
I have used wkhtmltopdf for years to manage PDF conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an HTML file and pass it to the module, or render a PDF directly from a URL.
If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/#nativedocuments/docx-wasm which performs the conversion locally (i.e. where Node is running), without the need for LibreOffice etc.

IPTC metadata to TIFF from EXCEL readable in Bridge

I have an Excel sheet with fields such as [name][url in folder][keywords] ... I am trying to find the best way to write IPTC metadata keywords from this Excel file to my 60,000 TIFF images, so that I can search through them with Adobe Bridge. I have tried exiftool.exe, but Adobe Bridge cannot read the resulting keywords. I have seen that it may be possible in PHP, but I would like to know whether code or software for this already exists.
Any IPTC library can do it for you. I use Python, so for example http://tilloy.net/dev/pyexiv2/ would be my tool. Look at the tutorial at http://tilloy.net/dev/pyexiv2/tutorial.html
