Is it possible to find the page and/or line number where a given text was found using full-text search and FILESTREAM?

I just started using the FILESTREAM and full-text search features of Microsoft SQL Server. I can index and search txt and pdf files, but the results don't show the matching text, nor the page and/or line number where it was found inside the PDF, for example. Is it possible to at least retrieve the matching text from the document when a search is made? I believe it's not possible to return a "region" of text, but maybe there is something I can use to look it up in the file afterwards?
I'm trying to figure out the advantages of doing a search like this if I can't see the text that was found.

After a lot of research I concluded that it isn't possible to search for a given page in an indexed PDF, so I decided to use Solr instead and index the information the way I need to search it later.
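One way to index "the way I need to search later" is to send Solr one document per page instead of one per PDF, so every hit carries its page number. A minimal sketch, assuming the page texts have already been extracted; the function name and the field names (id, doc_id, page, text) are hypothetical and would have to match your own Solr schema:

```python
import json

def build_page_docs(doc_id, page_texts):
    """Return one Solr-style document (a dict) per page of a PDF.

    page_texts is a list of strings, one per page, in page order.
    The field names are illustrative, not a fixed Solr schema.
    """
    return [
        {
            "id": f"{doc_id}_p{number}",
            "doc_id": doc_id,
            "page": number,
            "text": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]

docs = build_page_docs("manual", ["Introduction ...", "Installation ..."])
payload = json.dumps(docs)  # JSON array of documents, ready to send to Solr
print(payload)
```

A query against the text field then returns page-level documents, and the page field tells you where the match is.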

Related

Extract the location of a specified string in a PDF file

I'm not familiar with PDF rendering or PostScript, and I'd like to know whether, in principle, it is possible to extract the location of a string in a PDF. That is:
given a PDF with regular text paragraphs (not form fields, text boxes or other objects, just simple text),
search for a specific string in the file,
get the x,y coordinates of its first letter.
I've searched PDF libraries in many languages, but they don't seem to allow such an operation.
Does the PDF standard support this?
The closest thing I could find involves finding the location of a text box
(see here)
Depending on your use case, this could help.
For instance, in my case, I wanted to replace a specified string with another string. A possible solution for me:
Include a text box in the original PDF (the author of the PDF can do that using Adobe Acrobat Pro or equivalent).
Find the text box using code and extract its location.
Remove the text box from the document and insert your text at the extracted position.

Azure search unable to find location & highlighting of text in the PDF file

I am working with Azure Search on 2 PDF files that contain images and text. The search itself works properly, but I am unable to highlight text that is inside an image.
In short, I am not able to narrow the results down to only the content I searched for.
I am using Postman to get the results.
Example:
When I call
https://XXXXXXXX.search.windows.net/indexes/azureblob-index/docs?api-version=2019-05-06&search=microsoft%20center
it gives me the whole JSON document with a lot of text and no highlighting of the exact match.
NOTE: I have also enabled OCR in the indexer.
Try setting the highlight=[string] parameter in your request.
See the Search Documents documentation for details.
If that doesn't help, can you explain more about your expected results?
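As a rough illustration of the highlight suggestion, the request URL from the question can be rebuilt with a highlight parameter, which takes a comma-separated list of searchable field names. This is only a sketch: the helper name is made up, and the field name "content" is an assumption (text recognized by OCR may live in a different field, depending on how your indexer and skillset map it).

```python
from urllib.parse import urlencode

def build_search_url(service, index, search_text, highlight_fields):
    """Build a Search Documents GET URL that asks for hit highlighting."""
    params = {
        "api-version": "2019-05-06",
        "search": search_text,
        # Comma-separated list of searchable fields to highlight.
        "highlight": ",".join(highlight_fields),
    }
    base = f"https://{service}.search.windows.net/indexes/{index}/docs"
    return f"{base}?{urlencode(params)}"

# "content" is an assumed field name; replace it with a field from your index.
url = build_search_url("XXXXXXXX", "azureblob-index", "microsoft center", ["content"])
print(url)
```

With highlighting enabled, each matching document in the response should carry an @search.highlights entry containing the matching fragments instead of only the full field text.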

After using pdftotext: find page of string from txt

I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
I split the resulting text into a list of strings. Using regular expressions I am able to find the specific words I am interested in. The reason I split the text into a list is that I want to measure the distance between two specific words, and by distance I mean the number of words between them.
However, after finding the positions of the words I would like to be able to refer back to the original PDF. Specifically, I am interested in the page and maybe even the line (if PDF supports this kind of structure) where these words are located.
One idea I have is to run this process separately for each page of the PDF, so that when I find these words I know which page they were on. But this has the big disadvantage that page breaks do not necessarily fall at natural boundaries, so I would lose the ability to find the words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that will allow you to extract the text and examine the content of paragraphs, sentences within those paragraphs, and words within those sentences. You'll not only be able to tell the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
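If you would rather stay with pdftotext, note that its output normally separates pages with a form feed character (unless page breaks are suppressed with -nopgbrk). A minimal sketch, assuming that separator is present: it keeps one continuous word list, so word distances still work across page breaks, while remembering the page of every word. The function names and the sample text are illustrative only.

```python
import re

def words_with_pages(text):
    """Return (word, page_number) pairs for pdftotext output.

    Assumes pages are separated by form feed characters (ASCII 12).
    """
    pairs = []
    for page_number, page_text in enumerate(text.split(chr(12)), start=1):
        for word in re.findall(r"\w+", page_text):
            pairs.append((word, page_number))
    return pairs

def positions(pairs, target):
    """Return (index, page) for every occurrence of target, ignoring case."""
    return [
        (index, page)
        for index, (word, page) in enumerate(pairs)
        if word.lower() == target.lower()
    ]

# Two "pages" joined by a form feed, standing in for real pdftotext output.
sample = "alpha beta gamma" + chr(12) + "delta epsilon alpha"
pairs = words_with_pages(sample)
first = positions(pairs, "beta")[0]
second = positions(pairs, "delta")[0]
words_between = second[0] - first[0] - 1
print(first, second, words_between)
```

Line numbers could be tracked the same way by splitting each page on newlines, although lines in pdftotext output only approximate the layout of the PDF.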

Aptana Search Project Ignoring Certain File Extensions

I've tried searching for this but have been unable to find anything. Basically, I'd like to be able to search the project that I'm working on, but only in files with certain file extensions. Currently it searches everything, which adds time to the search on files that will never contain what I'm looking for. Is there a way to do this or am I just out of luck?
Does your search dialog not look like this?
Ctrl + H or Search > Search will bring up this dialog which allows you to put in a file name pattern to limit your search... what are you doing?

Efficient Way of Recording Page Numbers from a Search of a PDF

I have a list of ~1200 queries (part numbers) that appear somewhere inside a 100-page PDF. What I need to do is record which pages of the PDF each query appears on. I can't think of a clever way of doing this. It would take me 5-20 hours to do this search by search, so if someone can give me a good idea before the 5 hour mark that would be great!
Assuming you can determine programmatically what a "query" is in your context from the plain text (for example, by using regular expressions):
You could split your PDF into different files (1 file per page) using pdftk
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Then convert those files to text with a pdf-to-text utility like this one:
http://www.fileguru.com/PDF-To-TXT-Converter/download
or this one
http://www.pdf2text.com/
And finally write yourself a simple script using your favorite programming language to determine which of those files contains a "query" (whatever that looks like).
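The final script could look roughly like this. It is a minimal sketch that assumes the per-page text has already been loaded into a list of strings, one per page; the function name and the part numbers are made up for illustration.

```python
def pages_for_queries(page_texts, queries):
    """Map each query to the list of 1-based page numbers whose text contains it."""
    found = {query: [] for query in queries}
    for page_number, text in enumerate(page_texts, start=1):
        for query in queries:
            if query in text:  # plain substring match
                found[query].append(page_number)
    return found

# Hypothetical page contents and part numbers.
page_texts = ["Part AB-100 and AB-200", "No parts here", "AB-100 again"]
result = pages_for_queries(page_texts, ["AB-100", "AB-200", "ZZ-999"])
for query, pages in result.items():
    print(query, pages)
```

Because this uses plain substring matching, a part number that is a prefix of another (for example AB-100 and AB-1000) would match both; a regular expression with word boundaries would be stricter.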
