Adobe Acrobat/Python PDF Outputs Varying - python-3.x

I've noticed that when I use an OCR to transform a scanned PDF document into text, in this case Adobe Acrobat Pro, I'm getting very different outputs depending on how I extract the data.
In the above photo - you can see a piece of a PDF that has been OCR'ed into fairly good quality text. If I select it in Adobe and copy it to say, a word or txt doc, it paste over perfectly fine.
However, if I export it using Adobe to Rich Text Format, use Python's PDFminer, or Python Apache Tika then I get the above photo which as you can see completely jumbles it. The extraction results are very consistent between the approaches - basically all 3 jumble it in the exact same way.
Would any of you have any idea as to why an OCR'd PDF can be copied just fine to a text editor but is extracting in such a bizarre way?
Thank you!
Regards,
Mano

So what ended up working for me was running the initial parsing with Apache-Tika and then, on the few that didn't work on, pass them through PyPDF2. My theory is that PyPDF2 uses a different mechanism for parsing that doesn't rely on the root of the PDF unlike Tika and that is what seems to have become corrupted in a few of these OCR'd docs.
Not sure of the initial cause but that was my solution.

Related

How to identify if text encoding issue is my processing error or carried from the source pdf

I have a selection of pdfs that I want to text mine. I use tika to parse the text out of each pdf and save to a .txt with utf-8 encoding (I'm using windows)
Most of the pdfs were OCR'd before I got them but when I view the extracted text I have "pnÁnn¿¡c" instead of "Phádraig" if I view the PDF.
Is it possible for me to verify the text layer of the PDF (forgive me if thats the incorrect term) Ideally without needing the full version of Acrobat
It sounds like you are dealing with scanned books with "hidden OCR", ie. the PDF shows an image of the original document, behind which there is a layer of OCRed text.
That allows you to use the search function and to copy-paste text out of the document.
When you highlight the text, the hidden characters become visible (though this behaviour maybe depends on the viewer you use).
To be sure, you can copy-paste the highlighted text to a text editor.
This will allow you to tell if you are actually dealing with OCR quality this terrible, or if your extraction process caused mojibake.
Since OCR quality heavily depends on language resources (dictionaries, language model), I wouldn't be surprised if the output was actually that bad for a low-resource language like Gaelic (Old Irish?).

Photo not loading in markdown python

I've recently began coding for my degree and for a project I am submitting it via a pdf created in Jupyter so that my code can be seen. It all works within Jupyter but when I export to PDF the image that I have embedded in markdown doesn't load. All that loads in Microsoft edge is a small black box with a white cross in and in chrome there is a small image of mountains in two pieces. I am not sure where I'm going wrong. My image is written in like this:
<img src="files/masterbiaspic.png" />
And I don't know how to fix it.
I really don't have a wide knowledge of code so please be simple with your answers.
Kind regards and happy new year,
E
You appear to be using raw HTML to insert your images into your document. What you may not know is that most Markdown parsers do not look at the contents of raw HTML, they simply pass it through unaltered. However, raw HTML is not understood by the PDF file format, and in fact, when converting to PDF, there is no clean way to convert raw HTML to PDF without also parsing the HTML (which is beyond the scope of Markdown parsers). Therefore, if you want to output to PDF, you should only use pure Markdown (without any raw HTML). That way the parser can easily convert everything to a proper format for PDF output.
As it turns out, Markdown includes its own syntax for images (see the documentation for details). Try this:
![alt text](files/masterbiaspic.png)
By doing that, Jupyter Notebook will know about the image and should import it into the PDF properly.
It could be that the above will not resolve the problem. It depends on which method is used to convert to PDF. Some tools may take the HTML output of Markdown and convert that to PDF, which would mean you have a different problem entirely.

Remove images (with transparency/alpha channel) from PDF

How to remove images with alpha channel (transparency) in a PDF file?
I need to remove all images with transparency from a PDF file because it needs to be optimized with pdf2ps and ps2pdf (to reduce filesize).. Postscript doesn't work properly when the PDF contains images with transparency and the PDF will be converted to one big image..
I have not managed to reproduce your problem.
For cons, I did the same treatment to compress my pdf except that I used pdftops instead of pdf2ps.
I hope it will help.
Sorry for my english (translate.google)
Clark,
It sounds like www.pstill.com will do everything you need and more in one tool. There is a Linux command line version available for a very reasonable price. I have used the tool on a few different PDF's for different reasons and it has always worked as advertised.
From their website.
Putting the 'Portable' back in PDF - PDF to PDF Transcoding
Your PDF cannot be printed on some printers or processed with some applications? PStill can sanitize, simplify, reprocess, flatten transparency and recompress PDF-Files, this process also known as 'transcoding' create a new PDF that has better compatibility, is often smaller in file size, can be optional encrypted/secured and contain only a uniform set of font types. Fonts can be normalized to plain PostScript Type 1 formats, can be subsetted, missing fonts included and bad fonts repaired/replaced. PStill can detect and remove duplicate elements in the PDF. Text can be converted to outlines which makes it perfect for creating 'fontless' PDF. Transcoding can be used to repair bad PDF or simplify the PDF structure so more limited output devices can process it.
Andrew.

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you then there is no used buying a commercial engine though. The nice thing though is that it only requires 1 SDK and you would look at it if you had a more than a few PDF's to convert.

Converting troublesome delimited PDF to Excel

I'm trying to convert this delimited PDF to an excel (or some other delimited format). Using Adobe Acrobat 9, I attempt to save it and copy it) as Excel but it gives the error message "BAD PDF; error in processing fonts. [348]".
I'm open to any solution that will create a delimited file, ranging from using Adobe Acrobat, to programming to using other apps. The only limitation is that I don't have a budget to buy anything (such as Able2Extract).
The way I was able to export my images and fonts without buying any extra software to do the conversion was this way. go to Advanced, PDG Optimizer, select all of the options you want on the LEFT COLUMN and where it says MAKE COMPATIBLE WITH select Acrobat 8.0 and later, OK....you are in your road to success
Note: not really an answer, but some suggestions.
Sounds to me that Crystal Reports is not following the PDF spec close enough.
I'd make sure CR is fully updated/patched and try genning another file making sure that "tagging" is enabled - tagging defines the layout structure. I don't have a copy of CR handy, but you may have to define a distiller template to use so when you print to PDF you can select that job option.
You can also tell its a bad PDF by using Preflight in Acrobat, it says there is no tag structure and you can do it manually (draw boxes around each item...). Also that there is no language set, and it is somehow compatible with Acrobat 1.3? which isn't supported anymore and should be 4 at the lowest?
Once you have a "good" pdf can export to xml/word and import that to excel. Also, with Acrobat 8+ you can highlight using the select tool, right-click and choose Open As SpreadSheet. You might be able to get away with just highlighting the whole document -- though I'd hope the xml approach would be best.
Able2Extract does some OCRing and tricky fuzzy logic not only to define tags/layout so it is exportable, but also avoids any font, encoding, etc issue - at least to my knowledge.
In the rare case that you can't get a new file, then exporting to plain text/accessible seems to generate a nice flat text file. You could write a vbscript to parse it (adding your delimiter) and import that into excel.

Resources