How can I take high-quality screenshots of a PDF without ImageMagick using Python? - python-3.x

I would like to automate the process of taking screenshots of a PDF file's pages. I want to be able to specify the zoom (optional) so that the overall image size can be controlled. I would also like to be able to specify the dpi of the screenshots being saved.
Sample PDF file can be found at this link.
I have already tried opening the file with selenium web driver (Firefox), but the scrolling is not supported for rendered PDF files, apparently.
Is there a way to render this PDF file and then use any image processing module like Pillow or Open-CV to take the screenshots, or any module that does it directly?
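One way to think about the zoom and DPI requirements: PDF page dimensions are expressed in points (1/72 inch by default), so the pixel size of a rendered page follows from the DPI and zoom. The sketch below shows that arithmetic and builds, without running it, a command line for Poppler's pdftoppm tool. The function names are hypothetical, and folding the zoom into the resolution passed to `-r` is an assumption of this sketch, not something stated in the question.

```python
POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def rendered_size(width_pt, height_pt, dpi=72, zoom=1.0):
    """Return the (width, height) in pixels of a page rendered at dpi and zoom."""
    scale = dpi / POINTS_PER_INCH * zoom
    return round(width_pt * scale), round(height_pt * scale)


def pdftoppm_command(pdf_path, out_prefix, dpi=150, zoom=1.0):
    """Build (but do not run) a pdftoppm call that writes one PNG per page.

    Assumption: zoom is folded into the resolution, so zoom=2.0 at 150 DPI
    renders at 300 DPI.
    """
    effective_dpi = round(dpi * zoom)
    return ["pdftoppm", "-r", str(effective_dpi), "-png", pdf_path, out_prefix]


# A US Letter page is 612 x 792 points.
print(rendered_size(612, 792, dpi=144))
print(pdftoppm_command("input.pdf", "page", dpi=150, zoom=2.0))
```

The command list could be passed to `subprocess.run(..., check=True)` on a machine where Poppler is installed; `input.pdf` and `page` are placeholder names.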

Related

How can I resize an image from an existing PDF file in Node.js?

I have a PDF file that has an image and some text. I want to read that file and then resize the image, and delete the text.
I tried taking a screenshot of the whole PDF with pdf-poppler and then doing some image processing with Jimp. It worked, but the program takes too long to finish because the images are quite big.
Adnane, you can try to use pdf-lib.
I don't use Node.js, but the Poppler library comes with a binary, pdfimages, which extracts all images from a PDF, and there is a Node.js wrapper for Poppler.

PDF display garbled in Chrome

I see this when clicking a link to a PDF stored on Amazon S3 in Chrome:
If I download the same URL using wget or follow the same link in Firefox the PDF displays normally.
It looks like Chrome is not interpreting the file as a PDF. Is the problem with the PDF file or with Chrome? The PDF file was generated by wkhtmltopdf 0.12.3 (with patched qt) on Arch Linux.
Edit: it seems like a problem with the PDF because when I use file to identify the format it returns "data" whereas a normal PDF returns something like "PDF document, version 1.6".
I figured it out. I was using PDFKit to generate PDFs with the verbose option on. The verbose option somehow put all of stdout inside the PDF itself which caused Chrome to not detect the file as a PDF.
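A quick way to spot this kind of corruption is to check where the `%PDF-` header sits in the file, since a well-formed PDF begins with it. The following is a minimal sketch with hypothetical function names, and the sample bytes are made up for illustration.

```python
PDF_MAGIC = b"%PDF-"


def pdf_header_offset(data):
    """Return the byte offset of the first PDF header, or -1 if there is none."""
    return data.find(PDF_MAGIC)


def describe(data):
    """Summarise whether the data starts with a PDF header."""
    offset = pdf_header_offset(data)
    if offset == 0:
        return "starts with a PDF header"
    if offset > 0:
        return f"PDF header found after {offset} bytes of leading data"
    return "no PDF header found"


# Made-up example: log output written in front of the real PDF content.
print(describe(b"Loading page\n%PDF-1.4\n"))
print(describe(b"%PDF-1.6\n"))
```

For a real file, the first few kilobytes read with `open(path, "rb").read(4096)` would normally be enough to pass to `describe`.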

Photo not loading in markdown python

I've recently begun coding for my degree, and for a project I am submitting a PDF created in Jupyter so that my code can be seen. It all works within Jupyter, but when I export to PDF the image that I have embedded in Markdown doesn't load. In Microsoft Edge all that loads is a small black box with a white cross in it, and in Chrome there is a small image of mountains in two pieces. I am not sure where I'm going wrong. My image is written in like this:
<img src="files/masterbiaspic.png" />
And I don't know how to fix it.
I really don't have a wide knowledge of code so please be simple with your answers.
Kind regards and happy new year,
E
You appear to be using raw HTML to insert your images into your document. What you may not know is that most Markdown parsers do not look at the contents of raw HTML, they simply pass it through unaltered. However, raw HTML is not understood by the PDF file format, and in fact, when converting to PDF, there is no clean way to convert raw HTML to PDF without also parsing the HTML (which is beyond the scope of Markdown parsers). Therefore, if you want to output to PDF, you should only use pure Markdown (without any raw HTML). That way the parser can easily convert everything to a proper format for PDF output.
As it turns out, Markdown includes its own syntax for images (see the documentation for details). Try this:
![alt text](files/masterbiaspic.png)
By doing that, Jupyter Notebook will know about the image and should import it into the PDF properly.
It could be that the above will not resolve the problem. It depends on which method is used to convert to PDF. Some tools may take the HTML output of Markdown and convert that to PDF, which would mean you have a different problem entirely.
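If a notebook contains many such tags, the rewrite from raw HTML to Markdown image syntax could be automated. This is only a rough sketch: the function name is hypothetical, and it assumes simple tags with double-quoted `src` and optional `alt` attributes rather than handling arbitrary HTML.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r'\bsrc="([^"]*)"', re.IGNORECASE)
ALT_ATTR = re.compile(r'\balt="([^"]*)"', re.IGNORECASE)


def html_images_to_markdown(text):
    """Replace simple <img> tags with Markdown image syntax."""

    def replace(match):
        tag = match.group(0)
        src = SRC_ATTR.search(tag)
        if src is None:
            return tag  # leave tags without a src attribute untouched
        alt = ALT_ATTR.search(tag)
        alt_text = alt.group(1) if alt else ""
        return f"![{alt_text}]({src.group(1)})"

    return IMG_TAG.sub(replace, text)


print(html_images_to_markdown('<img src="files/masterbiaspic.png" />'))
```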

Remove images (with transparency/alpha channel) from PDF

How to remove images with alpha channel (transparency) in a PDF file?
I need to remove all images with transparency from a PDF file because it needs to be optimized with pdf2ps and ps2pdf (to reduce file size). PostScript does not handle PDFs that contain images with transparency properly, and the PDF ends up converted to one big image.
I have not managed to reproduce your problem.
However, I applied the same treatment to compress my PDF, except that I used pdftops instead of pdf2ps.
I hope it will help.
Sorry for my english (translate.google)
Clark,
It sounds like www.pstill.com will do everything you need and more in one tool. There is a Linux command-line version available for a very reasonable price. I have used the tool on a few different PDFs for different reasons and it has always worked as advertised.
From their website.
Putting the 'Portable' back in PDF - PDF to PDF Transcoding
Your PDF cannot be printed on some printers or processed with some applications? PStill can sanitize, simplify, reprocess, flatten transparency and recompress PDF-Files, this process also known as 'transcoding' create a new PDF that has better compatibility, is often smaller in file size, can be optional encrypted/secured and contain only a uniform set of font types. Fonts can be normalized to plain PostScript Type 1 formats, can be subsetted, missing fonts included and bad fonts repaired/replaced. PStill can detect and remove duplicate elements in the PDF. Text can be converted to outlines which makes it perfect for creating 'fontless' PDF. Transcoding can be used to repair bad PDF or simplify the PDF structure so more limited output devices can process it.
Andrew.

Batik svg conversion upon click event

I have a simple webpage with an editable .svg image preview which includes some text, which the user can enter via a standard html form. When they're happy and want to continue to the next step, they click the save button. Theoretically the image would then be converted into a .jpg and saved to the server.
I have just come across the Batik SVG-to-image converter and have successfully used it from the command line, as follows:
C:\inetpub\wwwroot\batik>java -jar batik-rasterizer.jar samples/input.svg -d orders -m image/jpeg -q 0.99 -dpi 150
My question is... can this batik tool be configured to take the svg, after an onclick event (button) and then convert and save it to a specified folder? In fact, is this the right tool at all?
Any ideas or direction would be greatly appreciated.
cheers
Dec
It should work fine. You just need to deal with getting the text back to the server and updating a server copy of the SVG file. Then raster it with Batik. The first step would be best achieved with AJAX. How you do the server end depends on what sort of server you have running. If it is PHP, Python etc, the simplest way is probably to use the command line interface for Batik (as per your example). If you are running a Java web framework like Jetty or Tomcat, you could use the Batik library directly.
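For a Python back end, the command-line route could look roughly like the sketch below. It reuses the flags from the example command in the question. The function names, default paths, and the idea of calling this from the request handler that receives the AJAX post are assumptions made for illustration.

```python
import subprocess


def batik_command(svg_path, out_dir, jar_path="batik-rasterizer.jar",
                  quality=0.99, dpi=150):
    """Build the Batik rasterizer call, mirroring the flags used in the question."""
    return [
        "java", "-jar", jar_path,
        svg_path,
        "-d", out_dir,        # output directory
        "-m", "image/jpeg",   # output MIME type
        "-q", str(quality),   # JPEG quality
        "-dpi", str(dpi),     # output resolution
    ]


def rasterize(svg_path, out_dir, **options):
    """Run Batik on a server-side copy of the SVG; raises if the command fails."""
    subprocess.run(batik_command(svg_path, out_dir, **options), check=True)


print(batik_command("samples/input.svg", "orders"))
```

The server would first write the updated SVG to disk and then call `rasterize` with that path. Java and the Batik jar must be available on the server for the call to succeed.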
