I wrote a bash script that extracts plain text from scanned PDF files. I have lots of PDFs, but some are scanned and others are not. My main goal now is to improve the script by checking whether a PDF is already searchable, so that no OCR is needed.
I've tried:
pdftotext -nopgbrk pdf_file.pdf wordlist
to store possible OCR'ed text in wordlist, so then I can check if it's empty and figure out whether it's a searchable PDF or not.
I've also tried pdffonts pdf_file.pdf to check whether the PDF contains fonts, and therefore whether it contains text.
Both approaches work fairly well but fail in some cases.
For example, some of the PDFs I need to OCR are digitally signed, and those signatures always add a text layer to the PDF. When I run either of those two commands, the output contains the signature's text or the font it uses. It looks as if plain text had been found, just because of the signature. So a scanned PDF with a digital signature gets detected as a plain-text PDF.
The digital signatures always add text like this (in Helvetica):
Signed by: Name
Date: Date CEST
Company: Company Name
So with:
pdftotext -nopgbrk pdf_file.pdf - | grep -v -E 'Signed|Date|Company'
I can remove those lines, so if it really is a scanned PDF, the output will be empty.
It worked for some PDFs until I came across a signature in a different format, so this feels like a workaround rather than a real solution.
Is there any way to check whether a PDF is fully searchable? I just need a way to extract a PDF's text while omitting digital signatures. As it stands, grep -v will always depend on our signature's format, and if that changes it will break my script.
Thanks.
Unfortunately, there really isn't an easy way to do this in a "non-hacky" way without significantly more involved analysis of the file which would be far beyond the scope and scale of a bash script.
When pdftotext outputs the text for the digital signature, that text is not coming from the digital signature itself. That is stored as an object in the PDF with metadata that pdftotext ignores. Instead, what pdftotext picks up is just that: text which has also been added to the file.
Here's an example from Adobe's sample signed PDF document. First, the digital signature's metadata:
And here is the text which is inserted into the document:
Technically, you can have one without the other, and there is no established format for the text that generally accompanies a digital signature. Therefore, you're stuck either:
Ignoring specific text with grep, as you are doing now, which can be unreliable.
Running OCR on all files and then checking if there is a difference in the text before/after OCR, but then this defeats the whole purpose of checking in the first place.
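One way to make the first option less dependent on the signature's wording is to count characters per page instead of matching specific strings. The sketch below is a heuristic, not something the tools provide: it assumes poppler-utils (pdfinfo, pdftotext) is installed, and the function names and the 100-character threshold are made up for illustration. A scanned page carrying only a short signature stamp would fall under the threshold, but so would a genuine text page with very little on it.

```shell
#!/bin/sh
# Count the non-whitespace bytes arriving on stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Classify one page from its character count.
# $1 = count, $2 = threshold (hypothetical default: 100).
classify_page() {
    if [ "$1" -ge "${2:-100}" ]; then echo text; else echo scan; fi
}

# Print the numbers of the pages that look scanned.
# Assumes pdfinfo and pdftotext (poppler-utils) are installed.
pages_needing_ocr() {
    pdf=$1
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1
    page=1
    while [ "$page" -le "$pages" ]; do
        # Extract a single page to stdout and measure how much text it has.
        n=$(pdftotext -f "$page" -l "$page" -nopgbrk "$pdf" - | count_text_chars)
        if [ "$(classify_page "$n")" = scan ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}
```

Calling pages_needing_ocr pdf_file.pdf would then print one page number per line for the pages to send through OCR; empty output would suggest the file is already searchable.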
I am looking for a way to convert or save a text file in the UCS-2 LE format; specifically without a BOM, I guess.
I have zero knowledge of what any of that actually means, but I know I need it because of this wiki page on what I am trying to accomplish: https://developer.valvesoftware.com/wiki/Closed_Captions
In other words:
This is for a specific game engine, "Source Engine," which requires the format in order to compile in-game closed captions for sounds.
I have tried saving the file in Notepad++ using the "UCS-2 LE BOM" option under the Encoding menu. There is no option for plain "UCS-2 LE", however, and because of this the captions cannot be compiled for the game engine. I need to save without a BOM, "I guess" (because, again, I don't know what I'm talking about, and I'm only assuming by deduction that I need to leave out the BOM, whatever that actually means).
I would like to know a way to either save a text file in that encoding, or convert one to it.
In my specific case, it appears that my problem boils down to "the program is weird."
What I mean by this is that Notepad++ actually does save in the correct format. I failed to realize that because of a quirk in the caption compiler: it only works if you drag the file onto it, not via the command line as I previously thought.
I will accept this as the answer when I am allowed to in 2 days.
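For anyone who does need the conversion itself, iconv can write UCS-2 LE without a byte order mark. A minimal sketch, assuming the source file is UTF-8 and that the installed iconv accepts the UCS-2LE name (glibc and GNU libiconv do); the function names are hypothetical:

```shell
#!/bin/sh
# Convert UTF-8 text on stdin to UCS-2 LE on stdout.
# Because the target name spells out the byte order, iconv writes no BOM.
to_ucs2le_nobom() {
    iconv -f UTF-8 -t UCS-2LE
}

# Print "yes" if the file starts with the little-endian BOM (bytes FF FE),
# otherwise "no". Useful for checking what an editor actually saved.
has_le_bom() {
    first=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    if [ "$first" = "fffe" ]; then echo yes; else echo no; fi
}
```

Usage would look like to_ucs2le_nobom < captions_utf8.txt > captions_ucs2.txt (both file names are placeholders), followed by has_le_bom captions_ucs2.txt to confirm that no BOM is present.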
I'm curious: how does PDF securing work? I can lock a PDF file so that the system can't recognize the text or manipulate the file. Everything I found was about "how to lock/unlock", but nothing about "how does it work". Is there anyone who could explain it to me? Thx
The OP clarified in a comment
I mean lock on text recognition or manipulation with PDF file. There should be nothing about cryptography imho just some trick.
There are some options, among them:
You can render the text as a bitmap and include that bitmap in the PDF
-> no text information.
Or you can embed the font in question using a non-standard encoding without using standard glyph names
-> text information in an unknown encoding.
E.g. cf. the PDF analysed in this answer.
A special case: make the encoding wrong only for a few characters, maybe just one, probably a digit. This way an inattentive person thinks everything was extracted correctly, and the errors only start causing trouble when the data is actually used, which is hard to fix, especially in the case of wrong digits. E.g. cf. the PDF analysed in this answer.
Or you can put text in structures where text extraction software or copy&paste routines usually don't look, like creating a large pattern tile containing the text for some text area and filling the area with the matching pattern color.
-> text information present but not seen by most extractors.
E.g. cf. this answer; the technique here is used to make the text of a watermark non-extractable.
Or you can put extra text all over the page but make it invisible, e.g. under images, drawn in rendering mode 3 (invisible), located in some disabled optional content group (layer), ... Text extractors often do not check whether the text they extract actually is visible.
-> text information present but polluted by garbage text bits.
...
I've used a Linux command to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files,
but for a small number of them it gives me a blank text file.
The PDF files that failed were not encrypted, not protected by a user name or password, and not read-only.
Converting PDFs to text is not a well-defined process. It can work very well or not at all, depending on the PDF input.
Why is this? Because a PDF's main task is to represent the visual appearance of a document, not its textual contents. A PDF can be anything from pure text with positional information to a pure image of the glyphs that make up the text. In the latter case one would need to run OCR on the input in order to obtain text information. Tools like pdftotext do not do this.
Sometimes the text in the PDF is scattered throughout the file, e. g. because first all standard-font letters are mentioned in the PDF, then, later in the file, all the italics-font letters are mentioned (of course with positional information, so a reader of the optical representation won't notice this, even if standard and italics are mixed throughout the text on the page). To rearrange this mess to a fluent text is a major task not very many converters are capable of.
So I guess all you can do is try some more converters for PDF to text (some are better than others, and some are better just for some specific input) or see that you can get the text from another source than the PDF files.
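To illustrate the reordering problem described above: if a tool gives you text fragments together with their positions, restoring reading order amounts to sorting by vertical and then horizontal position. A minimal sketch, assuming a hypothetical tab-separated input with top, left and text per fragment (top measured from the top of the page); real converters also have to cope with columns, rotated text and slightly uneven baselines, all of which this ignores:

```shell
#!/bin/sh
# stdin:  one fragment per line, as "top<TAB>left<TAB>text" (hypothetical format)
# stdout: fragments in reading order, one output line per distinct "top" value
reorder_fragments() {
    # Sort numerically by top, then by left, using TAB as the field separator.
    sort -t "$(printf '\t')" -k1,1n -k2,2n |
    awk -F '\t' '
        NR > 1 && $1 != prev { printf "\n" }   # new line on the page
        NR > 1 && $1 == prev { printf " " }    # same line: separate with a space
        { printf "%s", $3; prev = $1 }
        END { if (NR > 0) printf "\n" }'
}
```

Fed the fragments "world" (top 10, left 50) and "hello" (top 10, left 0) in that scrambled order, the function prints them as one line in left-to-right order.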
My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, so it would be worth a look if you had more than a few PDFs to convert.
The question probably sounds a little odd, but the actual task is relatively simple, I swear!
I'm automatically generating some PDFs from a webform, using PDFCreator to merge a generated FDF into a preexisting PDF. I created the preexisting PDF in NitroPDF. This setup works great - almost. The problem is that when you view the generated PDFs in Adobe Reader 9 (the most common reader) a subset of the fields are just blank. The information is still there; using previous versions of Adobe Reader or a different reader like Foxit Reader shows the entire PDF. No clue what's going on, and Adobe tech support was useless since I didn't create the PDF with Adobe software. (If you'd like to help fix this problem instead of the following, feel free to email me.)
However, if I take the resultant PDF and print it to a fresh PDF using a PDF printer driver, it works great everywhere. This is time-consuming and annoying for our sales department to do themselves, so I want to perform this step automagically upon creating the first PDF.
I'm on Ubuntu, and have command-line root access to the server. The program is written in PHP, and can easily make system calls. I'm just having trouble figuring out how to tie things together properly so that I can automatically print a known file using a specific printer driver to another known file.
You could try putting your PDF files through Ghostscript. I have found that this is enough to fix many problematic PDFs.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pdf
(The same command can also be used to merge several PDF files into one, just specify multiple input files.)
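To run this step automatically after each PDF is generated, the command can be wrapped in a small script that the PHP program calls. A sketch, in which the function names, the -fixed suffix and the GS override variable are hypothetical conveniences, and gs is assumed to be on the PATH:

```shell
#!/bin/sh
# Derive the output path for an input PDF: order.pdf -> order-fixed.pdf
fixed_name() {
    printf '%s\n' "${1%.pdf}-fixed.pdf"
}

# Rewrite one PDF through Ghostscript's pdfwrite device and, on success,
# print the path of the new file. Set GS to use a different gs binary.
reprint_pdf() {
    in=$1
    out=$(fixed_name "$in")
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -sOutputFile="$out" "$in" && printf '%s\n' "$out"
}
```

From PHP, a script built around reprint_pdf could be invoked through exec(), with the file path passed through escapeshellarg() first.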