How can I resize an image from an existing PDF file in Node.js? - node.js

I have a PDF file that has an image and some text. I want to read that file and then resize the image, and delete the text.
I tried taking a screenshot of the whole PDF with pdf-poppler and then doing some image processing with Jimp. It worked, but the program takes too long to finish because the images are quite big.

Adnane, you can try to use pdf-lib.

I don't use Node.js, but the Poppler library comes with a binary, pdfimages, which extracts all images from a PDF, and there is a Node.js wrapper for Poppler.
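As a rough sketch of the Poppler route, the snippet below builds and runs the pdfimages command line. It is written in Python only for illustration; the same argument list can be passed to any process launcher. The helper names, the `img` output prefix and the choice of PNG output are assumptions for this example, and it requires the Poppler utilities to be installed and on PATH.

```python
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_prefix, as_png=True):
    """Return the argument list for Poppler's pdfimages tool (hypothetical helper)."""
    cmd = ["pdfimages"]
    if as_png:
        cmd.append("-png")  # write PNG files instead of the default PPM/PBM
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the files it wrote (needs Poppler on PATH)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output <prefix>-NNN.<ext>, e.g. img-000.png
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir / "img"), check=True)
    return sorted(out_dir.glob("img-*"))
```

Extracting the embedded image directly avoids rendering the whole page, so the later resize step works on the original image data rather than on a large screenshot.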

Related

How can I take high-quality screenshots of a PDF without ImageMagick using Python?

I would like to automate the process of taking screenshots of a PDF file's pages. I want to be able to specify the zoom (optional) so that the overall image size can be controlled. I would also like to be able to specify the dpi of the screenshots being saved.
A sample PDF file can be found at this link.
I have already tried opening the file with the Selenium web driver (Firefox), but scrolling is apparently not supported for rendered PDF files.
Is there a way to render this PDF file and then use an image processing module like Pillow or OpenCV to take the screenshots, or a module that does it directly?

Image preprocessing in Python for OCR

I'm pre-processing images for OCR in Python. I converted the PDF to binary images. The output I get is like this
I want the output to be something like this
Any idea how to go about this?
You have to use the Tesseract library to extract text from the given image.
I am using a Windows system, so I downloaded it from https://sourceforge.net/projects/tesseract-ocr-alt/files/.
Suppose you have installed it at "E:\w\Tesseract-OCR".
Then put your image in the same location. Let's call your image question.png.
Now go to the command prompt and run:
E:\w\Tesseract-OCR>tesseract.exe question.png answer
Here question.png is your image and answer is the output base name; Tesseract appends .txt to it and creates answer.txt. You can use any other name instead of answer.
Once the command has executed successfully, check the output in answer.txt.
For your image I got the following output:
Investment Type: Customer Owned
System Information
Fire III
Video I]
So in this case it recognizes only the text correctly.
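Since the question is about Python, the command-line step above could be wrapped roughly as follows. This is a minimal sketch, assuming the usual CLI behaviour where Tesseract appends `.txt` to the output base name; the function names and the default executable name `tesseract` are assumptions, and the executable must be installed and reachable.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, exe="tesseract"):
    """Return the argument list for the Tesseract CLI (hypothetical helper)."""
    return [exe, str(image_path), str(output_base)]


def ocr_to_text(image_path, output_base, exe="tesseract"):
    """Run Tesseract on an image and return the recognized text."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, exe), check=True)
    # Tesseract writes its result to <output_base>.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For the example above, `ocr_to_text("question.png", "answer")` would run the same command and read answer.txt back in.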

PDF display garbled in Chrome

I see this when clicking a link to a PDF stored on Amazon S3 in Chrome:
If I download the same URL using wget or follow the same link in Firefox the PDF displays normally.
It looks like Chrome is not interpreting the file as a PDF. Is the problem with the PDF file or with Chrome? The PDF file was generated by wkhtmltopdf 0.12.3 (with patched qt) on Arch Linux.
Edit: it seems like a problem with the PDF because when I use file to identify the format it returns "data" whereas a normal PDF returns something like "PDF document, version 1.6".
I figured it out. I was using PDFKit to generate PDFs with the verbose option on. The verbose option somehow put all of stdout inside the PDF itself which caused Chrome to not detect the file as a PDF.
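A quick way to catch this kind of corruption before uploading is to look for the `%PDF-` header at the start of the file. The sketch below is one possible check, assuming the stray output ended up in front of the header; the function names and the 1024-byte search window are choices made for this example.

```python
PDF_MAGIC = b"%PDF-"


def pdf_header_offset(data: bytes, search_window: int = 1024) -> int:
    """Return the offset of the PDF header in the first bytes, or -1 if absent."""
    return data[:search_window].find(PDF_MAGIC)


def describe(data: bytes) -> str:
    """Summarize whether the bytes look like a clean PDF."""
    offset = pdf_header_offset(data)
    if offset == 0:
        return "starts with a PDF header"
    if offset > 0:
        return f"{offset} stray bytes before the PDF header"
    return "no PDF header found"
```

Calling `describe(open("file.pdf", "rb").read(2048))` on a generated file (the filename is a placeholder) reports whether anything was written ahead of the header.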

How to generate pdf file of text and image in linux?

I am generating a logfile on one of my servers.
It stores a lot of data, which I then send to my email once a month as a PDF file.
The process I am using is to 'cat' the output of a lot of commands into a text file, then convert it and send it.
Are there any Linux programs, or some easy way, to do something similar and add an image I have stored on the server to the PDF file?
This answer assumes that you just want to put the image at the end of the PDF.
You could first convert the image to a PDF using ImageMagick like this (this also works with other file types):
convert image.jpg image.pdf
Then, you can use a tool like stapler or pdftk to combine your generated text PDF and the image.pdf (you can add multiple images):
stapler cat text.pdf image.pdf combined.pdf
pdftk text.pdf image.pdf output combined.pdf
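For a monthly job, the two steps above can be scripted. The following is a sketch that builds the same `convert` and `pdftk` commands for any number of images and runs them in order; the helper names are hypothetical, and it assumes ImageMagick and pdftk are installed.

```python
import subprocess
from pathlib import Path


def build_merge_cmds(text_pdf, images, output_pdf):
    """Return the convert/pdftk commands that append images to a text PDF."""
    cmds = []
    image_pdfs = []
    for image in images:
        # image.jpg -> image.pdf, written next to the source image
        image_pdf = str(Path(image).with_suffix(".pdf"))
        cmds.append(["convert", str(image), image_pdf])
        image_pdfs.append(image_pdf)
    cmds.append(["pdftk", str(text_pdf), *image_pdfs, "output", str(output_pdf)])
    return cmds


def merge(text_pdf, images, output_pdf):
    """Run the commands in order, stopping at the first failure."""
    for cmd in build_merge_cmds(text_pdf, images, output_pdf):
        subprocess.run(cmd, check=True)
```

Calling `merge("text.pdf", ["image.jpg"], "combined.pdf")` runs exactly the two commands shown above.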

Remove images (with transparency/alpha channel) from PDF

How to remove images with alpha channel (transparency) in a PDF file?
I need to remove all images with transparency from a PDF file because it needs to be optimized with pdf2ps and ps2pdf (to reduce file size). PostScript doesn't work properly when the PDF contains images with transparency, and the PDF is converted to one big image.
I have not managed to reproduce your problem.
However, I applied the same treatment to compress my PDF, except that I used pdftops instead of pdf2ps.
I hope it helps.
Sorry for my English (translate.google)
Clark,
It sounds like www.pstill.com will do everything you need and more in one tool. There is a Linux command-line version available for a very reasonable price. I have used the tool on a few different PDFs for different reasons and it has always worked as advertised.
From their website:
Putting the 'Portable' back in PDF - PDF to PDF Transcoding
Your PDF cannot be printed on some printers or processed with some applications? PStill can sanitize, simplify, reprocess, flatten transparency and recompress PDF-Files, this process also known as 'transcoding' create a new PDF that has better compatibility, is often smaller in file size, can be optional encrypted/secured and contain only a uniform set of font types. Fonts can be normalized to plain PostScript Type 1 formats, can be subsetted, missing fonts included and bad fonts repaired/replaced. PStill can detect and remove duplicate elements in the PDF. Text can be converted to outlines which makes it perfect for creating 'fontless' PDF. Transcoding can be used to repair bad PDF or simplify the PDF structure so more limited output devices can process it.
Andrew.
