After using pdftotext: find page of string from txt - python-3.x

I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
That text is split up into a list of strings. Using regular expressions I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, where by distance I mean the number of words between them.
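For reference, a minimal sketch of that approach (the tokenizing regex and the example words are just placeholders):

    import re

    def word_distance(text, first, second):
        # Number of words strictly between two words, or None if either is absent.
        words = re.findall(r"\w+", text.lower())   # naive tokenizer
        try:
            i = words.index(first.lower())
            j = words.index(second.lower(), i + 1)
        except ValueError:
            return None
        return j - i - 1

    # e.g. word_distance("gross amount of the total", "gross", "total") -> 3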
However, after finding the positions of the words, I would like to be able to refer back to the initial PDF. Specifically, I am interested in the page, and maybe even the line (if the PDF supports that kind of structure), where these words are located.
One idea is to run this process for each page of the PDF, so that when I find the words I know which page they were on. But this has the big disadvantage that page breaks are not always natural: I would lose the ability to find a pair of words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?

You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file; which one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that allows you to extract the text and examine the content of paragraphs, the sentences within those paragraphs, and the words within each sentence. You'll not only be able to tell the distance between the words but whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.

Related

Find a file with all given strings on Linux

This is similar to the question "How do I find all files containing specific text on Linux?", but I want all the files that contain multiple given strings (the strings are not necessarily next to each other or on the same line, just in the same file).
My use case: I am looking at a UI and want to modify the file which controls this particular screen. The codebase, though, is huge, and it is difficult to locate the file. All I have to go on are some of the hardcoded strings on this screen, which I would like to search on. The strings are quite generic, such as 'Done', 'Close', 'View Details'... Searching on any of these strings individually, using the answer from the linked question above, brings back too many results, but I think searching on all of them together will filter it down enough to find the file.
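For illustration, here is a minimal sketch of that combined search in Python (the root directory and the strings are placeholders; a real run over a large codebase would probably also want to skip binary files):

    import os

    def files_containing_all(root, needles):
        # Yield paths under root whose contents contain every one of the strings.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as f:
                        text = f.read()
                except OSError:
                    continue
                if all(needle in text for needle in needles):
                    yield path

    for match in files_containing_all("src", ["Done", "Close", "View Details"]):
        print(match)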

Delimiting quick-open path with fullstops in Sublime Text 3?

I'm making the move to ST3, and I'm having some trouble. I'd like to be able to delimit the quick-open filepath (⌘ + T) with periods instead of slashes or spaces. However, I can't find the setting to do that.
For example:
component.biz_site_promotions.presentation
should be able to open the file that
component biz_site_promotions presentation
would.
Any help would be greatly appreciated!
There is no setting in Sublime that changes the way this works; the search term is always used to directly match the text in the list items (except for space characters).
Note however that the Goto Anything panel uses fuzzy matching on the text that you're entering, so in many cases trying to enter an entire file name is more time consuming anyway.
As an example, to find the file you're mentioning, you could try entering the text cbspp, which in this case is the first letters of all of the parts of the file name in question.
As you add to the search term, the file list immediately filters down to text that matches what you entered: first only filenames that contain a c, then only filenames that contain a c followed somewhere later by a b, and so on.
Depending on the complexity and number of files that you have in your project, you may need to add in a few extra characters to dial in better (e.g. comb_s_pp). Usually this search method will either end you up at the exact file you want, or filter the list so much that the file that you want will be easier to find and select.
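To make the mechanism concrete, here is a minimal sketch of plain subsequence matching; it is only an illustration of the idea, not Sublime's actual algorithm, which also scores and ranks the matches:

    def is_subsequence(term, candidate):
        # True if the characters of term appear, in order, within candidate.
        it = iter(candidate.lower())
        return all(ch in it for ch in term.lower())

    files = ["component/biz_site_promotions/presentation.py", "common/setup.py"]
    print([f for f in files if is_subsequence("cbspp", f)])
    # -> ['component/biz_site_promotions/presentation.py']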
Additionally, when you select an item and there was more than one possible match, Sublime remembers which item you selected for that particular search term and brings it to the top of the search results next time you do it, under the assumption that you want the same thing again.
As you use Sublime more (and with different projects) you will quickly get a handle on what partial search terms work the best for you.
In addition to finding files, you can do other things with that panel as well, such as jumping to a specific line and/or column or searching inside the file for a search term and jumping directly to it. This applies not only to the current file but also the one that you're about to open.
For more complete details, there is a page in the Unofficial Documentation that covers File Navigation with Goto Anything.
As an extra aside, starting with Sublime Text build 3154, the fuzzy searching algorithm handles spaces differently than previous builds.
Historically, spaces in the search term were essentially ignored, and the entire input was treated as one search term to be matched character by character.
Starting in build 3154, spaces are handled by splitting up a single search term into multiple search terms, which are applied one after the other.
This allows multiple search terms to hit out of order. For example, index doc in build 3154 will find doc/index.html, but previous versions won't, because the terms aren't in the right order.
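Sketching the new behaviour under the same illustrative model, a search is split on spaces and every piece must then match as its own subsequence:

    def is_subsequence(term, candidate):
        it = iter(candidate.lower())
        return all(ch in it for ch in term.lower())

    def matches_3154(search, candidate):
        # Each space-separated piece must match as its own subsequence, in any order.
        return all(is_subsequence(piece, candidate) for piece in search.split())

    print(matches_3154("index doc", "doc/index.html"))    # True
    print(is_subsequence("indexdoc", "doc/index.html"))   # False: one term, wrong order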
As such, assuming you're not currently using such a build (as of right now it's a development build, so only licensed users have access to it), be aware that if you keep searching the way you describe in your question, you might start getting more results than you expect once you move to a newer build.

Linux PdfToText function returns blank text file

I've used a Linux command-line tool to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files, but for a small number of them it returns a blank text file.
The unsuccessful PDF files were not encrypted, not protected by a user password, and not read-only.
Converting PDFs to text is not a well-defined process. It can work great or not at all, depending on the PDF input.
Why is this? Because a PDF's main job is to represent the visual appearance of a document, not its textual contents. PDFs can be anything from pure text with positional information to pure graphics of the glyphs of the letters of the text. In the latter case one would need to run OCR on the input in order to recover text information; this is not done by tools like pdftotext.
Sometimes the text in the PDF is scattered throughout the file, e.g. because first all standard-font letters are listed in the PDF and then, later in the file, all the italic-font letters (of course with positional information, so a reader of the visual representation won't notice this, even if regular and italic text are mixed throughout the page). Rearranging this mess into fluent text is a major task that not very many converters are capable of.
So I guess all you can do is try some more converters for PDF to text (some are better than others, and some are better just for some specific input) or see that you can get the text from another source than the PDF files.
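As a practical triage step, you could run pdftotext over each file and flag the ones whose output is effectively empty as candidates for OCR; a minimal sketch (ocrmypdf is named only as one example of an OCR tool):

    import os
    import subprocess
    import tempfile

    def has_extractable_text(pdf_path):
        # Run pdftotext and report whether it produced any non-whitespace output.
        fd, out_path = tempfile.mkstemp(suffix=".txt")
        os.close(fd)
        try:
            subprocess.run(["pdftotext", pdf_path, out_path], check=True)
            with open(out_path, errors="ignore") as f:
                return bool(f.read().strip())
        finally:
            os.remove(out_path)

    # Files failing this test are probably scanned images: they need OCR
    # (e.g. a tool such as ocrmypdf) before pdftotext can see any text.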

Efficient Way of Recording Page Numbers from a Search of a PDF

I have a list of ~1200 queries (part numbers) that are specified somewhere inside a 100-page PDF. Pretty much what I need to do is record which pages each of the queries appears on in the PDF. I can't think of a clever way of doing this. Doing it search by search would take me 5-20 hours, so if someone can give me a good idea before the 5-hour mark, that would be great!
Assuming you can determine programmatically what a "query" is in your context (for example, by using regular expressions):
You could split your PDF into different files (1 file per page) using pdftk
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Then convert those files to text with a pdf-to-text utility like this one:
http://www.fileguru.com/PDF-To-TXT-Converter/download
or this one
http://www.pdf2text.com/
And finally write yourself a simple script using your favorite programming language to determine which of those files contains a "query" (whatever that looks like).
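For what it's worth, pdftotext itself can limit extraction to a single page with its -f (first page) and -l (last page) options, which lets you skip the splitting step; a minimal sketch in Python, where the file name, part numbers, and page count are placeholders:

    import subprocess

    def pages_containing(pdf_path, queries, num_pages):
        # Map each query string to the list of pages it appears on.
        hits = {q: [] for q in queries}
        for page in range(1, num_pages + 1):
            text = subprocess.run(
                ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
                capture_output=True, text=True, check=True,
            ).stdout
            for q in queries:
                if q in text:
                    hits[q].append(page)
        return hits

    # e.g. pages_containing("catalog.pdf", part_numbers, 100)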

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
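To make that concrete, here is a hedged sketch of the two core pieces, stripping diacritics via Unicode decomposition and writing invisible (render mode 3) text with ReportLab; the coordinates, font, and page size below are placeholders that the real pipeline takes from PDFMiner's character objects:

    import unicodedata
    from reportlab.pdfgen import canvas

    def strip_diacritics(text):
        # Decompose to NFD and drop the combining marks, e.g. 'ṣ' -> 's'.
        return "".join(ch for ch in unicodedata.normalize("NFD", text)
                       if not unicodedata.combining(ch))

    c = canvas.Canvas("overlay.pdf", pagesize=(595, 842))  # A4 in points (placeholder)
    t = c.beginText()
    t.setTextRenderMode(3)        # render mode 3: neither filled nor stroked, i.e. invisible
    t.setFont("Helvetica", 11)    # should match the original font and size
    t.setTextOrigin(72, 720)      # in practice, taken from PDFMiner's character bbox
    t.textOut(strip_diacritics("ṣaṃskṛta"))
    c.drawText(t)
    c.showPage()
    c.save()                      # merge onto the original afterwards, e.g. with pdftk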
I have written something similar to add searchable text by OCR'ing images and converting it to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, and it would be worth a look if you had more than a few PDFs to convert.
