extract location of a specified string in a pdf file

I'm not familiar with the PDF rendering model or PostScript, and I'd like to know whether, in principle, it is possible to extract the location of a string in a PDF. That is:
given a PDF with regular text paragraphs (not form fields/text boxes or other objects, just simple text),
search for a specific string in the file, and
get the x,y coordinates of its first letter.
I've looked at PDF libraries in many languages, but they don't seem to support such an operation.
Does the PDF standard support this?

The closest thing I could find involves finding the location of a text box
(see here)
Depending on your use case, this could help.
For instance, in my case I wanted to replace a specified string with another string. A possible solution for me:
Include a text box in the original PDF (the author of the PDF can do that using Adobe Acrobat Pro or equivalent).
Find the text box using code and extract its location.
Remove the text box from the document and insert your text at the extracted position (a sketch of this follows below).
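A minimal sketch of steps 2–3, assuming PyMuPDF (the fitz package), which is not mentioned above; the placeholder string and file names are made up, so verify the calls against the library's documentation. Here the marker is located by searching for a unique placeholder string rather than a form-field text box:

import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("input.pdf")
placeholder = "{{REPLACE_ME}}"   # hypothetical marker placed by the PDF's author
replacement = "the new text"

for page in doc:
    hits = page.search_for(placeholder)      # list of fitz.Rect, one per match
    for rect in hits:
        page.add_redact_annot(rect)          # mark the old text for removal
    if hits:
        page.apply_redactions()              # actually erase the marked text
        for rect in hits:
            # (x0, y1) is roughly the old baseline; insert the new text there
            page.insert_text((rect.x0, rect.y1), replacement, fontsize=11)

doc.save("output.pdf")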

Related

How to identify if text encoding issue is my processing error or carried from the source pdf

I have a selection of PDFs that I want to text-mine. I use Tika to parse the text out of each PDF and save it to a .txt file with UTF-8 encoding (I'm on Windows).
Most of the PDFs were OCR'd before I got them, but the extracted text contains "pnÁnn¿¡c" where the PDF shows "Phádraig" when I view it.
Is it possible for me to verify the text layer of the PDF (forgive me if that's the incorrect term), ideally without needing the full version of Acrobat?
It sounds like you are dealing with scanned books with "hidden OCR", i.e. the PDF shows an image of the original document, behind which there is a layer of OCR'd text.
That allows you to use the search function and to copy-paste text out of the document.
When you highlight the text, the hidden characters become visible (though this behaviour may depend on the viewer you use).
To be sure, you can copy-paste the highlighted text to a text editor.
This will allow you to tell if you are actually dealing with OCR quality this terrible, or if your extraction process caused mojibake.
Since OCR quality heavily depends on language resources (dictionaries, language model), I wouldn't be surprised if the output was actually that bad for a low-resource language like Gaelic (Old Irish?).
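If you want to check the PDF's own text layer programmatically rather than by copy-pasting, a rough sketch in Python (assuming Poppler's pdftotext is installed; the file names are made up) is to dump the text layer as UTF-8 and compare it with the Tika output:

import subprocess

pdf_path = "scan.pdf"          # hypothetical input
tika_txt = "scan_tika.txt"     # the text Tika produced

# -enc UTF-8 forces UTF-8 output so Windows code pages don't interfere
subprocess.run(["pdftotext", "-enc", "UTF-8", pdf_path, "layer.txt"], check=True)

layer = open("layer.txt", encoding="utf-8").read()
tika = open(tika_txt, encoding="utf-8").read()

for probe in ["Phádraig"]:     # names you expect to see in the document
    print(probe, "in text layer:", probe in layer, "| in Tika output:", probe in tika)

If the garbled form already appears in the pdftotext dump, the OCR layer itself is that bad; if only the Tika output is garbled, the problem is in your extraction or encoding step.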

After using pdftotext: find page of string from txt

I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
That text file is split up into a list of strings. Using regular expressions I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, where distance means the number of words between them.
However, after finding the positions of the words I would like to refer back to the original PDF. Specifically, I am interested in the page, and maybe even the line (if PDF supports that kind of structure), where these words are located.
One idea is to run this process for each page of the PDF, so that when I find the words I know what page they were on. But this has the big disadvantage that page breaks are not always natural: I would lose matches if the words happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that will allow you to extract the text and examine the content of paragraphs, sentences within those paragraphs, and words within those sentences. You'll not only be able to tell the distance between the words but whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
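If a full toolkit is more than you need, a lighter-weight sketch in Python: pdftotext separates pages with a form-feed character (\f), so you can keep a page number per word while still searching across the whole document, which avoids losing matches at page breaks. The file names and search words here are made up.

import re
import subprocess

subprocess.run(["pdftotext", "input.pdf", "out.txt"], check=True)
pages = open("out.txt", encoding="utf-8").read().split("\f")   # one entry per page

# Flatten into (word, page_number) pairs so distances span page breaks.
words = []
for page_no, page_text in enumerate(pages, start=1):
    for w in re.findall(r"\S+", page_text):
        words.append((w, page_no))

def find_word(target):
    # index and page of the first occurrence (assumes the word is present)
    for i, (w, page_no) in enumerate(words):
        if w == target:
            return i, page_no

i1, p1 = find_word("first")
i2, p2 = find_word("second")
print("distance in words:", abs(i2 - i1), "pages:", p1, "and", p2)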

is it possible to find which page and/or line number a given text was found using full text search and filestream?

I just started using the FILESTREAM and full-text search features available in Microsoft SQL Server. I can index and search txt and pdf files; however, when I get the results I can't see the text, nor the page and/or line number where that text was found inside the pdf. Is it possible to at least retrieve the text from the document when a search is made? I believe it's not possible to return a "region" of text, but maybe there is something I can use to look up in the file afterwards?
I'm trying to figure out the advantages of doing a search like this if I can't see the text that was found.
After doing a lot of research I concluded it isn't possible to find out which page of an indexed pdf a hit is on, so I decided to use Solr instead and index the information the way I need to search it later.
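For what it's worth, "indexing the information the way I need" can look like this: index each page as its own Solr document with a page-number field, so every hit carries its page with it. A rough sketch with PyMuPDF and pysolr (neither is mentioned above, the core URL is made up, and the field names assume a matching Solr schema):

import fitz      # PyMuPDF, for per-page text extraction
import pysolr    # thin Solr client

solr = pysolr.Solr("http://localhost:8983/solr/pdfs", always_commit=True)

doc = fitz.open("contract.pdf")                  # hypothetical file
solr.add([
    {"id": f"contract.pdf-p{page_no + 1}",
     "file": "contract.pdf",
     "page": page_no + 1,
     "text": page.get_text()}
    for page_no, page in enumerate(doc)
])

# A hit now tells you the page directly:
for hit in solr.search("text:indemnification", rows=5):
    print(hit["file"], "page", hit["page"])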

Linux pdftotext returns blank text file

I've used a Linux command-line tool to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files,
but for a small number of them it returns a blank text file.
The unsuccessful PDF files were not encrypted, not protected by a user password, and not read-only.
Converting PDFs to text is not a well-defined process. It can work awesome or not at all, depending on the PDF input.
Why is this? Because a PDF's task is mainly to represent the look of a document, not its textual contents. PDFs can be anything from pure text with positional information to pure graphics of the glyphs of the letters of the text. In the latter case one would need to run OCR on the input in order to recover text information. This is not done by tools like pdftotext.
Sometimes the text in the PDF is scattered throughout the file, e.g. because first all standard-font letters are written in the PDF and then, later in the file, all the italics-font letters (of course with positional information, so a reader of the visual representation won't notice this, even if standard and italics are mixed throughout the text on the page). Rearranging this mess into fluent text is a major task that not many converters are capable of.
So I guess all you can do is try some more PDF-to-text converters (some are better than others, and some are better just for some specific input), or see whether you can get the text from a source other than the PDF files.
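One way to handle such files automatically is to check whether the text layer is empty and, if so, fall back to OCR. A sketch assuming Poppler's pdftotext plus ocrmypdf (a tool not mentioned above) are installed; the file name is made up:

import subprocess

def extract_text(pdf_path):
    # Return the PDF's text; if the text layer is empty, OCR the file first.
    out = subprocess.run(["pdftotext", pdf_path, "-"],
                         capture_output=True, text=True, check=True)
    if out.stdout.strip():
        return out.stdout                       # the PDF had a real text layer

    # No text layer: likely a pure image PDF, so run OCR and try again.
    ocred = pdf_path.replace(".pdf", "_ocr.pdf")
    subprocess.run(["ocrmypdf", pdf_path, ocred], check=True)
    out = subprocess.run(["pdftotext", ocred, "-"],
                         capture_output=True, text=True, check=True)
    return out.stdout

print(extract_text("blank_result.pdf")[:500])   # hypothetical problem file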

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a PDF with lots of Roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the PDF, I would like to add an additional layer, much as one does with hOCR, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding an hOCR layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough), and then add it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the PDF results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The PDF spec doesn't really support my objective, so I don't imagine I can do better than this hackish solution.
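For anyone wanting to reproduce that pipeline, here is a condensed sketch (pdfminer.six for extraction, ReportLab for the invisible overlay, pdftk to merge). The font-extraction step via FontForge/MuPDF is left out and Helvetica is substituted, so positions will only be approximate; file names are made up.

import unicodedata
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine
from reportlab.pdfgen import canvas

def strip_diacritics(s):
    # "ṣ" -> "s", "á" -> "a", etc.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

c = canvas.Canvas("overlay.pdf")
for page_layout in extract_pages("original.pdf"):
    c.setPageSize((page_layout.width, page_layout.height))
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            t = c.beginText()
            t.setTextRenderMode(3)              # 3 = invisible (neither filled nor stroked)
            t.setFont("Helvetica", 10)          # original embedded font omitted here
            t.setTextOrigin(line.x0, line.y0)   # both libraries use bottom-left origins
            t.textLine(strip_diacritics(line.get_text().strip()))
            c.drawText(t)
    c.showPage()
c.save()

# Then stamp the overlay onto the original, e.g.:
#   pdftk original.pdf multistamp overlay.pdf output searchable.pdf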
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library, so if your solution works well for you there is no need to buy a commercial engine. The nice thing, though, is that it only requires one SDK, and it would be worth a look if you had more than a few PDFs to convert.
