How do digital newspaper libraries' search engines work? - search

For example, if I search the word "mexico" in the National Newspaper Library, one of the results looks like this:
http://www.hndm.unam.mx/consulta/resultados/visualizar/558a36117d1ed64f16c21cdc?resultado=2&tipo=pagina&intPagina=1&palabras=mexico
The site highlights my search term in the page view. However, the exportable PDF apparently contains none of the text that the search engine uses to match my query; it is an image-only PDF (/Subtype/Image).
Where is the text information stored, and how is the engine able to highlight the result on the PDF page?
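One arrangement that would explain this behaviour (an assumption about how such sites are often built, not something confirmed for this library): OCR is run once on the page scans, the recognised words and their pixel boxes are stored server-side, separately from the image-only PDF, and the viewer draws rectangles over the page image for the words matching the `palabras` URL parameter. A minimal Python sketch of the lookup step follows; `OcrWord`, `find_highlights`, and the sample coordinates are hypothetical.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One recognised word and its pixel box on the page image (hypothetical schema)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def normalize(token: str) -> str:
    """Lowercase, drop accents, and strip surrounding punctuation."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.strip(".,;:!?\"'()")


def find_highlights(words, query):
    """Return the (x, y, width, height) boxes of words equal to the query."""
    target = normalize(query)
    return [(w.x, w.y, w.width, w.height) for w in words if normalize(w.text) == target]


# Made-up OCR output for one line of a scanned page.
page_words = [
    OcrWord("Ciudad", 120, 80, 95, 22),
    OcrWord("de", 220, 80, 28, 22),
    OcrWord("México,", 253, 80, 110, 22),
    OcrWord("1910", 370, 80, 60, 22),
]

# A viewer could draw a translucent rectangle over the image for each box.
boxes = find_highlights(page_words, "mexico")
```

In such a setup the PDF export would simply wrap the page image, which would be consistent with the exported file containing no searchable text.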

Related

OCR from Slightly Different PDFs

I am working on a project where I have to extract information from PDF documents. While the documents follow a similar format, a few of them differ slightly. How do I handle this using Python?
I am working with form 483 available on the FDA website.
I want to extract the employee information at the bottom of the page. The format of the documents varies slightly. How can I extract this information?
Example Documents:
https://www.fda.gov/media/101442/download
https://www.fda.gov/media/135387/download
https://www.fda.gov/media/89200/download
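One way to cope with slightly different layouts, once the text has been extracted (by OCR if the PDFs are scans), is to try several patterns in order and take the first that matches. The sketch below is only an illustration: the label wording "EMPLOYEE(S) NAME AND TITLE", the two layouts, and the sample strings are assumptions made up for the example, not text taken from the linked documents.

```python
import re

# Assumed label; "(S)" is optional so both "EMPLOYEE" and "EMPLOYEE(S)" match.
LABEL = r"EMPLOYEE\(?S?\)?\s+NAME\s+AND\s+TITLE"

# Patterns are tried in order; the stricter same-line form comes first.
PATTERNS = [
    # Layout A (assumed): "LABEL: value" on one line.
    re.compile(LABEL + r"\s*:\s*(?P<value>[^\n]+)", re.IGNORECASE),
    # Layout B (assumed): label on its own line, value on the next line.
    re.compile(LABEL + r"[^\n]*\n\s*(?P<value>[^\n]+)", re.IGNORECASE),
]


def extract_employee(text):
    """Return the first matching employee field, or None if no layout fits."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("value").strip()
    return None


# Hypothetical extracted text in the two assumed layouts.
text_same_line = "EMPLOYEE NAME AND TITLE:  John Roe, Investigator"
text_next_line = (
    "EMPLOYEE(S) NAME AND TITLE (Print or Type)\n"
    "Jane Doe, Investigator\n"
    "DATE ISSUED\n"
    "01/02/2020"
)

same_line_value = extract_employee(text_same_line)
next_line_value = extract_employee(text_next_line)
```

Each new layout variant then becomes one more entry in `PATTERNS`, and documents that return `None` can be set aside for manual review.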

Extracting badly formatted text from a PDF

I'm trying to extract some entries from a PDF, but the bad formatting is making it inconvenient to simply parse through like a normal document. There isn't any consistent positioning for the text, so each entry is a unique scramble with no consistent pattern I can find. I only want the entry name and the info on the right, not the field name or description.
I've tried experimenting with headers and layout info using the PyPDF2 module, but the PDF doesn't seem to have any metadata besides basic author info.
My idea was using the Google Cloud Vision API to transcribe the text, but that brings up issues of auto-positioning.
Does anyone know of a better methodology for this, or if not, simply how to execute the positioning for the Cloud Vision API?

Search keywords in PDF blob - Azure Search

I am trying to search for keywords contained in the metadata of a PDF doc. I am unsure if this is possible. Any guidance would be much appreciated!
Here is an example of the keywords/tags in a PDF I am referring to
I know it is possible to add fields to the search index, but am unsure how to map it. I have tried the following but it did not work.
Here is how the keywords metadata would work -
Adding keywords as metadata inside the PDF file does not work, because only a selected set of metadata properties is supported for PDFs.
See this document: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
A workaround is to add the metadata tag to the PDF's blob itself.
After we create an index in Azure Search for "All Metadata"/Storage Metadata, this key starts appearing in the list of field names to select (search, retrieve, filter, etc.).
Finally, we can search on the custom keywords.
The Keywords tag is not one of the ones we support through the metadata_ format (the supported ones are listed here). If you add a field called "Keywords" to the index, does it get extracted? Also, if you look at the properties of the PDF in something like Azure Storage Explorer, I assume this keyword metadata is still there and is called "Keywords". If not, that might give some additional insight.

Searching from File title as well as file content in media library

I managed to search the contents of text files using a custom search index, as described in the link below: https://docs.kentico.com/k8/custom-development/miscellaneous-custom-development-tasks/smart-search-api/creating-custom-smart-search-indexes
But it is not able to search in the filename. For example, if my search text is "Roman", the file "RomanRaj.txt" should show up in the results. Please help.
Try adding the file name to your search index by customizing the index content. See the documentation on this topic.
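The idea of indexing the file name together with the content can be sketched in a language-neutral way. This is plain Python, not Kentico's smart search API; `IndexedFile`, `search`, and the sample files are hypothetical, and the substring match stands in for whatever matching the real index performs.

```python
from dataclasses import dataclass


@dataclass
class IndexedFile:
    """A file as stored in a toy search index (hypothetical structure)."""
    name: str
    content: str

    def searchable_text(self) -> str:
        # Put the file name into the indexed text alongside the content.
        return f"{self.name}\n{self.content}".lower()


def search(files, term):
    """Return names of files whose name or content contains the term."""
    needle = term.lower()
    return [f.name for f in files if needle in f.searchable_text()]


files = [
    IndexedFile("RomanRaj.txt", "Quarterly notes."),
    IndexedFile("report.txt", "Prepared by Roman."),
    IndexedFile("other.txt", "Nothing relevant."),
]

# "RomanRaj.txt" matches on its name, "report.txt" on its content.
matches = search(files, "Roman")
```

Note that a tokenizing index may treat "RomanRaj" as a single term, in which case a prefix or substring search type would be needed for "Roman" to match.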
I'd suggest NOT creating a custom smart search index, but instead using attachments and searching those. Out of the box, Kentico will allow you to search attachments and their contents without writing any code.

Does SharePoint 2013 file search favour Microsoft documents over PDFs?

I have a Content Source which crawls a network folder containing Word, PowerPoint and PDF documents. I have in addition a Result Source based directly on this content source and a Search Results web part which uses the Result Source as its query. If I search for “Digital Cameras” the first result is a PowerPoint document entitled “Digital Cameras: Thriving Amidst a Declining Market.” However, there is a PDF file also in the directory with the exact same title, but this file does not appear unless I filter by PDF Result Type, at which point it appears at the top of the list. In fact, with Result Type set to All, I cannot see the PDF version of the file even if I click through all the pages of the initial search.
I thought it might be considered a duplicate but I have “Show View Duplicates” checked and “Trim Duplicates” set to false. The pop-out next to the initial search item does not show a duplicate.
How do I get the PDF document to appear in the basic search next to the PowerPoint document with the same title?
In your search center (which may or may not be your main site or a specific subsite designated through Central Admin), go to site settings and then Search Result Types and there should be a list of the result types included in your default search.
Provided you are crawling the PDFs in your search crawl (Central Admin > go to your Search Service > Crawling > File Types), and they have content that can be indexed (namely text in the document body and title, not just images inside the PDF), you should then see them in the results.
OK, I have added the question at the other place:
https://sharepoint.stackexchange.com/questions/137512/does-sharepoint-2013-file-search-favour-microsoft-documents-over-pdfs
Graham, thanks. I checked the result types and the crawled file types, and all looks OK. Besides, if the item were not being crawled, it would not show up in the search even with the filtering.
