Remove images (with transparency/alpha channel) from PDF - linux

How to remove images with alpha channel (transparency) in a PDF file?
I need to remove all images with transparency from a PDF file because the file needs to be optimized with pdf2ps and ps2pdf (to reduce file size). PostScript doesn't handle transparency properly, so when the PDF contains images with an alpha channel the whole page ends up being converted to one big image.

I have not managed to reproduce your problem.
On the other hand, I applied the same treatment to compress my PDF, except that I used pdftops instead of pdf2ps.
I hope this helps.
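For reference, here is a minimal sketch of that compression pipeline, driven from Python; it assumes poppler-utils (for pdftops) and Ghostscript (for ps2pdf) are installed, and the file names are placeholders.

import subprocess

def compress_pdf(src="input.pdf", ps="intermediate.ps", dst="output.pdf"):
    # PDF -> PostScript with poppler's pdftops (instead of pdf2ps)
    subprocess.run(["pdftops", src, ps], check=True)
    # PostScript -> PDF again; Ghostscript recompresses images along the way
    subprocess.run(["ps2pdf", ps, dst], check=True)

compress_pdf()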

Clark,
It sounds like www.pstill.com will do everything you need and more in one tool. There is a Linux command-line version available for a very reasonable price. I have used the tool on a few different PDFs for different reasons and it has always worked as advertised.
From their website.
Putting the 'Portable' back in PDF - PDF to PDF Transcoding
Your PDF cannot be printed on some printers or processed with some applications? PStill can sanitize, simplify, reprocess, flatten transparency in, and recompress PDF files. This process, also known as 'transcoding', creates a new PDF that has better compatibility, is often smaller in file size, can optionally be encrypted/secured, and contains only a uniform set of font types. Fonts can be normalized to plain PostScript Type 1 formats, can be subsetted, missing fonts included, and bad fonts repaired/replaced. PStill can detect and remove duplicate elements in the PDF. Text can be converted to outlines, which makes it perfect for creating 'fontless' PDFs. Transcoding can be used to repair bad PDFs or simplify the PDF structure so that more limited output devices can process them.
Andrew.

Related

Adobe Acrobat/Python PDF Outputs Varying

I've noticed that when I use OCR to turn a scanned PDF document into text, in this case with Adobe Acrobat Pro, I get very different outputs depending on how I extract the data.
In the above photo you can see a piece of a PDF that has been OCR'ed into fairly good quality text. If I select it in Adobe and copy it to, say, a Word or txt doc, it pastes over perfectly fine.
However, if I export it using Adobe to Rich Text Format, use Python's PDFMiner, or use Python's Apache Tika, then I get the above photo, which as you can see completely jumbles it. The extraction results are very consistent between the approaches: all three jumble it in exactly the same way.
Would any of you have any idea why an OCR'd PDF can be copied just fine into a text editor but extracts in such a bizarre way?
Thank you!
Regards,
Mano
What ended up working for me was running the initial parsing with Apache Tika and then, for the few files it didn't work on, passing them through PyPDF2. My theory is that PyPDF2 uses a different parsing mechanism that, unlike Tika, doesn't rely on the root of the PDF, and that root is what seems to have become corrupted in a few of these OCR'd docs.
Not sure of the initial cause, but that was my solution.
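For anyone wanting to reproduce that flow, here is a minimal sketch (not the poster's actual code): try Apache Tika first and fall back to PyPDF2 only when Tika returns nothing. It assumes the tika and PyPDF2 packages are installed; the file name is a placeholder.

from tika import parser as tika_parser
from PyPDF2 import PdfReader

def extract_text(path):
    # First pass: Tika, which relies on the PDF's document structure
    parsed = tika_parser.from_file(path)
    text = (parsed.get("content") or "").strip()
    if text:
        return text
    # Fallback: PyPDF2, which (per the theory above) doesn't rely on the PDF's root
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

print(extract_text("scanned_document.pdf")[:500])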

Linux PdfToText function return blank text file

I've used a Linux command to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files, but for a small number of them it returns a blank text file.
The unsuccessful PDF files were not encrypted, not protected by a user/owner password, and not read-only.
Converting PDFs to text is not a well-defined process. It can work wonderfully or not at all, depending on the PDF input.
Why is this? Because a PDF's job is mainly to represent the optics of a document, not its textual contents. PDFs can be anything from pure text with positional information to pure graphics of the glyphs of the letters of the text. In the latter case one would need to run OCR on the input in order to recover text information; this is not done by tools like pdftotext.
Sometimes the text in the PDF is scattered throughout the file, e.g. because first all the standard-font letters are listed in the PDF and then, later in the file, all the italic-font letters are listed (with positional information, of course, so a reader of the optical representation won't notice this, even if standard and italics are mixed throughout the text on the page). Rearranging this mess into fluent text is a major task that not very many converters are capable of.
So I guess all you can do is try some more PDF-to-text converters (some are better than others, and some are better only for certain kinds of input), or see whether you can get the text from a source other than the PDF files.
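As a concrete version of the "try more converters" advice, here is a hedged sketch that runs pdftotext first and, if the output is blank, falls back to a pure-Python extractor (pdfminer.six, chosen purely as an example); tool availability is an assumption and the file name is a placeholder.

import subprocess
from pdfminer.high_level import extract_text as pdfminer_extract

def pdf_to_text(path):
    # First attempt: poppler's pdftotext, writing to stdout ("-")
    result = subprocess.run(["pdftotext", "-htmlmeta", path, "-"],
                            capture_output=True, text=True)
    if result.stdout.strip():
        return result.stdout
    # Second attempt: pdfminer.six, which sometimes copes with files
    # that pdftotext renders as blank (and vice versa)
    return pdfminer_extract(path)

print(pdf_to_text("problem_file.pdf")[:500])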

Hidden/Open words in an image file such as PNG or JPG

As far as I can tell, my question is not related to steganography or to the WinRAR-style solutions I've seen, where you are essentially hiding messages.
I am trying to figure out whether there is a way to insert a simple message into a file such as a JPG or PNG that could later be extracted by a program reading the file, without having it encoded into the file through slight differences in pixels or other steganographic tricks.
I basically just want a tag-along message that is part of the file itself, which is not shown by the image viewer but could perhaps be seen by a text reader of some kind.
I'm not sure how possible this is because, for the most part, I don't understand the layout of PNG/JPG/etc. files beyond the RGB pixel data: how the file starts, how the image display tool knows when to stop displaying, and so on.
The way I'm envisioning it would be something like:
pngStartCode -> RGB info -> PNG end code, so the image reader knows to stop -> start sequence that some kind of reader will recognize (possibly a new text reader) -> the text to be communicated -> end code for the reader
I may just be rambling about something ridiculous here but please let me know if this is at least possible.
You can use the following approach (Windows command prompt):
Create a text file with your message, say "message.txt".
Now choose the target file (it can be any file, like a.jpg, a.png, a.exe, etc.), say "image.jpg".
Now execute the following command:
copy /b "image.jpg"+"message.txt" "NewImage.jpg"
The above command combines the files (in binary mode) and creates a new file (in this case NewImage.jpg). If anyone opens the image, they will just see a normal image. If you want to look at the text, open the file with any text editor (Notepad) and scroll down to the end; there you will find the text.
This doesn't change any pixels or anything else in the image; it just appends the text to the image file.
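If you want to read the appended message back programmatically rather than scrolling through a text editor, here is a small sketch; it assumes the message was simply concatenated after a JPEG as in the copy /b command above, so everything after the JPEG End-Of-Image marker (0xFF 0xD9) is the message. The file name is a placeholder.

def read_appended_message(path="NewImage.jpg"):
    data = open(path, "rb").read()
    eoi = data.rfind(b"\xff\xd9")  # last JPEG End-Of-Image marker
    if eoi == -1:
        return None                # not a (complete) JPEG
    return data[eoi + 2:].decode("utf-8", errors="replace")

print(read_appended_message())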
It sounds like the OP is asking about comment tags in the PNG specification (i.e. adding data, but without intent to hide it).
PNG files are broken into "chunks". The image data is usually divided into several IDAT chunks; the color type, size, etc. are stored in the IHDR chunk; and so on.
The iTXt, tEXt, and zTXt chunks are used for conveying text information associated with the image, so typically you'd look into using a tool to add those types of chunks. tEXt holds plain text, zTXt holds compressed text, and iTXt holds international (UTF-8) text.
More info on the PNG specification, including what kinds of chunks are available, can be found here, and you can find chunk viewers on Google.
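If you would rather add such a chunk from code than with a viewer tool, here is a minimal sketch using Pillow (my choice of library, not the answer's; any PNG library with text-chunk support would do). File names and the message are placeholders.

from PIL import Image
from PIL.PngImagePlugin import PngInfo

img = Image.open("image.png")
meta = PngInfo()
meta.add_text("Comment", "a tag-along message stored in a tEXt chunk")
img.save("image_with_text.png", pnginfo=meta)

# Reading it back: the text chunks show up as a dict
print(Image.open("image_with_text.png").text)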
For convenience, at the time of writing (January 2021), here are a couple of tools that will let you view, edit, and add chunks:
Windows 10: http://entropymine.com/jason/tweakpng/
Linux: https://www.systutorials.com/docs/linux/man/n-png/
Mac: https://apps.apple.com/us/app/inspectpng/id498851708?mt=12
NOTE: I do not vouch for the safety of any of the above links. Please use standard caution when downloading any file from the internet. If you don't have your own anti-virus, Virustotal has one online you can upload individual files to for free.

Converting PostScript to Text Using GhostScript

I want to extract text data from PostScript documents. The problem is that when I use GhostScript to do that, some text is extracted normally while other text comes out as weird symbolic characters.
I realized that the text which was extracted correctly used fonts that GhostScript would NOT embed in the PDF because of licensing restrictions. Ironically, the fonts without licensing restrictions, which were embedded in the PDF as usual, weren't being converted back correctly.
I tried both the txtwrite device, to convert the PostScript directly to text, and the pdfwrite device, to first convert the PS to PDF and then extract the text from the PDF document, but neither of them worked.
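For reference, the two Ghostscript routes described above look roughly like this (sketched via Python's subprocess; the exact flags and file names are my placeholders, not from the original post):

import subprocess

# Route 1: extract text straight from the PostScript with the txtwrite device
subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
                "-o", "output.txt", "input.ps"], check=True)

# Route 2: convert to PDF with pdfwrite first, then extract text from the PDF
subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                "-o", "output.pdf", "input.ps"], check=True)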
I thought maybe I could substitute all the fonts with the unsupported ones so that the text data would be extracted correctly, but it turned out there is no simple way to do that.
What do you think I should do?
The cause of this is usually that the characters are encoded in a non-standard fashion. I'm afraid there is not a lot you can do, except possibly work out, by comparing the readable PostScript with the extracted text, which "weird symbolic characters" correspond to which actual characters. Then you might be able to reconstruct the original text by replacing the weird characters with the intended ones.
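As a tiny illustrative sketch of that remapping idea: once you have worked out, by manual comparison, which garbled character stands for which real character, a simple translation table can repair the extracted text. The mappings and file names below are made up purely for illustration.

# Hypothetical mapping discovered by comparing the PostScript with the output
garbled_to_real = str.maketrans({
    "\u25a0": "ti",   # e.g. a box glyph that actually stood for the "ti" ligature
    "\u00b5": "ffi",  # e.g. a stray symbol that stood for the "ffi" ligature
})

extracted = open("extracted.txt", encoding="utf-8").read()
repaired = extracted.translate(garbled_to_real)
open("repaired.txt", "w", encoding="utf-8").write(repaired)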

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
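In case it helps anyone else, here is a condensed sketch of that pipeline using pdfminer.six, ReportLab, and pdftk. It deliberately skips the FontForge/MuPDF font-matching step, so it just uses Helvetica and the overlay positions are only approximate; file names are placeholders.

import subprocess
import unicodedata
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine
from reportlab.pdfgen import canvas

def strip_diacritics(text):
    # "ṣ" -> "s", "ś" -> "s", and so on
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_overlay(src="original.pdf", overlay="overlay.pdf"):
    c = None
    for page in extract_pages(src):
        if c is None:
            c = canvas.Canvas(overlay, pagesize=(page.width, page.height))
        else:
            c.setPageSize((page.width, page.height))
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                text = c.beginText(line.x0, line.y0)
                text.setTextRenderMode(3)   # 3 = invisible (but searchable) text
                text.setFont("Helvetica", 10)
                text.textLine(strip_diacritics(line.get_text().strip()))
                c.drawText(text)
        c.showPage()
    c.save()

build_overlay()
# Stamp the hidden layer onto the original, page for page
subprocess.run(["pdftk", "original.pdf", "multistamp", "overlay.pdf",
                "output", "searchable.pdf"], check=True)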
I have written something similar in C#, adding searchable text by OCR'ing images and converting them to PDF. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image, and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text, create the invisible text objects using the existing font and position information, and save it all out to a new file.
This would basically give you the same PDF you have now, plus both the original and normalised text, just as you are getting now.
QuickPDF is a commercial library; if your existing solution works well for you then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, and it would be worth a look if you had more than a few PDFs to convert.
