Search keywords in PDF blob - Azure Search


I am trying to search for keywords contained in the metadata of a PDF doc. I am unsure if this is possible. Any guidance would be much appreciated!
Here is an example of the keywords/tags in a PDF I am referring to
I know it is possible to add fields to the search index, but am unsure how to map it. I have tried the following but it did not work.

Here is how the keywords metadata would work:
Adding keywords as document metadata inside the PDF file does not work, because only a selected set of metadata tags is supported for PDF.
Refer to this document: https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
A workaround is to add the metadata tag to the PDF file blob itself.
After we create an index in Azure Search with "All Metadata" / Storage Metadata selected, this key appears in the list of field names, where it can be marked as searchable, retrievable, filterable, and so on.
Finally, we can search on the custom keywords.
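As a rough sketch of the workaround above, the snippet below only builds the JSON bodies one might send when creating the index and the blob indexer; it makes no network calls. The names `pdf-index`, `pdf-indexer`, `pdf-blobs`, and the field name `Keywords` are assumptions for illustration. It relies on the blob indexer copying a custom blob metadata key into an index field of the same name, and uses `metadata_storage_path` with the `base64Encode` mapping function as the document key, which is one common choice rather than the only one.

```python
import json


def build_index(index_name, keyword_field="Keywords"):
    """Index definition with a field named after the blob metadata key (assumed: 'Keywords')."""
    return {
        "name": index_name,
        "fields": [
            # Document key, filled from the encoded blob path by the indexer.
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            # Text extracted from the PDF body.
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Custom blob metadata lands in the field with the same name.
            {"name": keyword_field, "type": "Edm.String",
             "searchable": True, "filterable": True, "retrievable": True},
        ],
    }


def build_indexer(indexer_name, data_source_name, index_name):
    """Indexer definition that maps the blob path to the index key."""
    return {
        "name": indexer_name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "fieldMappings": [
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }


index_body = build_index("pdf-index")
indexer_body = build_indexer("pdf-indexer", "pdf-blobs", "pdf-index")
print(json.dumps(index_body, indent=2))
print(json.dumps(indexer_body, indent=2))
```

With these definitions in place, a query such as `search=invoice&searchFields=Keywords` would be expected to match blobs whose `Keywords` metadata contains that term, assuming the metadata was set on the blob before the indexer ran.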

The Keywords tag is not one of the properties we support through the metadata_ format (the supported ones are listed here). If you add a field called "Keywords" to the index, does the indexer extract it? Also, if you look at the properties of the PDF in something like Azure Storage Explorer, I assume this keyword metadata is still there and is called "Keywords". If not, that might give some additional insight.

Related

Azure search adding documents to index approaches

I am not sure I will be able to describe this right, but I'll give it a go.
We are working on implementing Azure Search. At the core we have searchable PDF documents, and we want their text added to the index so that all of them are searchable.
The initial thought was to submit each document to the index via the add document REST API. The thinking was that this would be the simplest and quickest path to getting the text of the document into the index. We also considered using an indexer: keep all the searchable PDF docs in a blob store and have the indexer crawl them every 10-15 minutes.
We also looked into (based on a recommendation) submitting a standalone JSON file containing the text from the PDF, either via the same add document API or by placing that file in a blob store. Within the JSON document we would need document identifiers that give the index the location of the PDF, so that when the text is found via search we can make the result clickable and open the PDF.
It seems to me that we could push the JSON file in with the add document API, index it, and, when it turns up in a search, use the doc ID to link back to the PDF and open it.
For those of you who have used Azure Search: how did you implement this?

If you're totally sure that only PDFs will live in this particular index, then the first approach is faster to implement, since the native indexer can both extract the content of the PDF documents and push it to the index.
Both approaches will work, but for the second one you would need to extract the PDF text yourself using an external tool.
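For the push approach, a minimal sketch of the request body for the add documents API might look like the following. It only builds the payload; extracting the text from the PDF and sending the request are left out. The field names `id`, `url`, and `content` and the example blob URL are assumptions and would have to match your own index schema. The key is derived from the PDF URL with URL-safe base64, since document keys may only contain letters, digits, underscores, dashes, and equal signs.

```python
import base64
import json


def make_key(pdf_url):
    """Turn a PDF URL into a string that is valid as a document key."""
    return base64.urlsafe_b64encode(pdf_url.encode("utf-8")).decode("ascii")


def key_to_url(key):
    """Recover the PDF URL from a key produced by make_key."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


def build_upload_batch(docs):
    """Build an add-documents body from (pdf_url, extracted_text) pairs."""
    return {
        "value": [
            {
                "@search.action": "upload",  # insert, or replace if the key exists
                "id": make_key(pdf_url),
                "url": pdf_url,              # lets a search hit link back to the PDF
                "content": text,
            }
            for pdf_url, text in docs
        ]
    }


batch = build_upload_batch([
    ("https://example.blob.core.windows.net/docs/report.pdf", "Text extracted from the PDF"),
])
print(json.dumps(batch, indent=2))
```

Storing the URL in its own retrievable field keeps the link-back simple: the search result already carries the location of the PDF, so nothing has to be decoded on the client.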

Extract Keywords from Office Documents with Sharepoint Flow

I am trying to implement a document management system using SharePoint. One major issue is that colleagues cannot find documents in the current setup (a local file server). They have asked for a system that scans uploaded documents, automatically looks for keywords in them, and then populates a "Meta" column.
I have had some success with OCR on image files, but so far no success getting keywords out of Office documents (doc, xls, etc.).
Is there a way to set up a flow to do this task for me?
Any help is much appreciated.
I tried "Get file metadata" and Azure "Text analysis", but it seems to take the raw data of the files (XML, I assume) and reports that the document is too large to analyse.

There is something vague about this requirement: how is a keyword defined in a document?
Therefore, the first obvious solution would be to assign keywords to each file when it is uploaded. You may create a process for this with Flow, with tasks, reminders and so on.
Automating this with OCR first means that you need to use an OCR service that works with MS Flow, and there you have only one choice: ElasticOCR. Then, in your flow:
- feed the document content to the ElasticOCR action
- keep in mind that OCR is not 100% accurate
- analyze the generated text content according to your keyword definition
- finally, write the metadata back to the library in the corresponding columns.
Having worked on a similar requirement, we asked uploaders to publish their documents with a short abstract (a column from the content type). The assumption is that the abstract contains the keywords; it is stored in a multi-line column, which makes it searchable site-wide.
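To make the "analyze the generated text content according to your keyword definition" step concrete, here is a small sketch under one possible definition: a keyword is any term from a fixed, agreed vocabulary that occurs in the extracted text. The function name `find_keywords` and the sample vocabulary are hypothetical, and the matching is deliberately simple (single words, case-insensitive), so it is an illustration rather than a recommendation.

```python
import re


def find_keywords(text, vocabulary):
    """Return the vocabulary terms that occur as whole words in the text."""
    # Lower-case the text and split it into alphanumeric words.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    # Keep each vocabulary term whose lower-cased form is one of those words.
    return sorted(term for term in vocabulary if term.lower() in words)


ocr_text = "Invoice for the Berlin office, Q3 budget."
vocabulary = ["invoice", "budget", "contract"]
print(find_keywords(ocr_text, vocabulary))
```

The resulting list could then be joined into a single string and written to the "Meta" column. Because OCR output is not fully accurate, misspelled terms will simply be missed by this kind of exact matching.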

File of custom type are not searchable in Alfresco using Advance search

I am seeing the behavior below in Alfresco. I have read a lot of the related Alfresco documentation but have not found a clear answer.
These are the steps I took to search for a file.
1. Uploaded a file named "Test.txt" to a folder that has a single rule, which sets a custom type on uploaded docs.
2. When I select content in the "look for" option in advanced search, my test file appears in the search results, as shown below.
Then I searched for it in advanced search by the name property, selecting my custom type in the "look for" option, and it returned 0 files.
But when I set any property on the Test.txt file, it becomes searchable by custom type in advanced search.
My question is: if I just upload a file, how can it become searchable by custom type in advanced search?
When is the index generated for uploaded files of a custom type?
I am using Alfresco 4.1 and Solr as search engine.
Thanks,
Fouad

SOLR indexes Alfresco every 15 seconds by default, so there's no reason why your uploaded file wouldn't be indexed right away.
Are you sure your rule actually works?
I'd suggest taking the file's nodeRef and using Node Browser to look at its type, aspects and properties right after it enters the folder (and triggers the rule), and again after you change something by hand. That might show why it works in one case and not in the other.
Additionally, you could search for unindexed nodes and see if your file is there:
http://docs.alfresco.com/5.0/concepts/solr-index-fix.html

Searching from File title as well as file content in media library

I managed to search the contents of text files using a custom search index, as described in the link below: https://docs.kentico.com/k8/custom-development/miscellaneous-custom-development-tasks/smart-search-api/creating-custom-smart-search-indexes
But it is not able to search by file name. For example, if my search text is "Roman", the file "RomanRaj.txt" should show up in the results. Please help.

Try adding the file name to your search index by customizing the index content. See the documentation on this topic.

I'd suggest NOT creating a custom smart search index, and instead looking at using attachments and searching those. Out of the box, Kentico will allow you to search attachments and their contents without writing any code.

How to retrieve the file path column "ows_EncodedAbsUrl" in search result.

I am passing the search query to search.asmx and retrieving the search results through the web service. The results return a document path for .txt files and images, and this path can be used to open the file directly.
txt file: "http://server:24669/jap/ww.txt" - It will open the file.
PDF file: "http://server:100/456efg/Forms/DispForm.aspx?ID=3&RootFolder=/456efg" - It will show the PDF properties or the parent folder.
So I need to get the URL that opens the PDF doc. The "ows_EncodedAbsUrl" column has the document URL, but it is not retrievable in the search results. Is there any way to solve this issue?

If you add a PDF iFilter to your SharePoint environment, PDF files will no longer be treated as list items (which is what causes the property view link).
Of course, Adobe posts the instructions for this as a PDF.
This change will also start indexing the text of your PDF documents, so they will be more searchable. Be aware that a percentage of each PDF document's size will be added to your search storage costs, so plan ahead.
This is a cure for the symptom; I do not know if there are other ways to do this.
