I am not sure if I am going to be able to describe this right, but I'll give it a go.
We are working on implementing Azure Search. At the core we have searchable PDF documents, and we want their text added to the index so all of them are searchable.
The initial thought was to just submit each document to the index via the Add Documents REST API, the thinking being that this would be the simplest and quickest path to getting the text of that document into the index. We also considered using an indexer: keep all the searchable PDF docs in a blob store and have the indexer crawl them every 10-15 minutes.
We also looked into (based on a recommendation) submitting a standalone JSON file with the text from the PDF in it, pushing it to the index either via the same Add Documents API or by placing that file in a blob store. Within the JSON document we would need document identifiers that give the index the location of the PDF, so that when that text is found via search we can make it clickable and, as a result, open the PDF.
It seems to me the way to go is pushing in the JSON file with the Add Documents API, indexing that, and, when it comes back as part of a search result, using the doc id to link back to the PDF and open it.
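To make that concrete, here is a rough sketch of what such a push could look like against the Add Documents REST API; the service name, index name, field names, and key below are made-up placeholders, not part of any existing setup:

```python
import requests

# Hypothetical service/index names and key - replace with your own.
SERVICE = "my-search-service"
INDEX = "pdf-docs"
API_KEY = "<admin-api-key>"

url = (f"https://{SERVICE}.search.windows.net"
       f"/indexes/{INDEX}/docs/index?api-version=2020-06-30")

# One JSON document per PDF: the extracted text plus an identifier
# that points back to the PDF's location so search hits can link to it.
payload = {
    "value": [
        {
            "@search.action": "mergeOrUpload",
            "id": "contract-2021-001",            # index key
            "content": "Full text extracted from the PDF goes here.",
            "sourceUrl": "https://myfiles.example.com/contracts/contract-2021-001.pdf",
        }
    ]
}

resp = requests.post(url, json=payload, headers={"api-key": API_KEY})
resp.raise_for_status()
print(resp.json())  # per-document status codes
```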
For those of you that have used Azure Search: how did you implement it?
If you're totally sure that only PDFs will live on this particular index, then the indexer approach is faster to implement, since the native indexer can extract the content of the PDF documents as well as push it to the index.
Both approaches will work, but for the push approach you would need to extract the PDF text yourself using an external tool.
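For comparison, a minimal sketch of that indexer route - a blob data source plus an indexer on a 15-minute schedule so the service extracts the PDF text itself. All names and the connection string below are placeholders:

```python
import requests

SERVICE = "my-search-service"
API_KEY = "<admin-api-key>"
BASE = f"https://{SERVICE}.search.windows.net"
HEADERS = {"api-key": API_KEY}
API = {"api-version": "2020-06-30"}

# Data source pointing at the blob container that holds the searchable PDFs.
datasource = {
    "name": "pdf-blob-ds",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "pdfs"},
}
requests.post(f"{BASE}/datasources", params=API, headers=HEADERS,
              json=datasource).raise_for_status()

# Indexer that crawls the container every 15 minutes and extracts
# content plus metadata from each PDF into the target index.
indexer = {
    "name": "pdf-indexer",
    "dataSourceName": "pdf-blob-ds",
    "targetIndexName": "pdf-docs",
    "schedule": {"interval": "PT15M"},
    "parameters": {
        "configuration": {
            "parsingMode": "default",
            "dataToExtract": "contentAndMetadata",
        }
    },
}
requests.post(f"{BASE}/indexers", params=API, headers=HEADERS,
              json=indexer).raise_for_status()
```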
Just looking for guidance or even a general outline on approach here.
I am using Azure Search to OCR a batch of PDFs. I have turned on hit highlighting and I am successfully getting results back that I am looping through and displaying in my view for the end user. I was looking at expanding that functionality to show the PDF images with the highlighting on the images themselves, like in the JFK Azure example. I am not proficient in React and seem to be getting lost there.
I am assuming I need to save off the OCR images to a data store for reference, using the normalized_images that are created? I do have PDFs locally I can load, but I assume the OCR images may be different. I have turned on GeneratedNormalizedImagesPerPage and turned on the cache, which creates files in my storage account.
Then I assume I need to pull the associated image, display it, use the highlight results, and pull a corresponding bounding box where the phrase was detected? The problem with that approach is that I do not see any association between the highlight hit and the location (bounding box) of the hit, nor the associated image file the hit was on.
I am probably way off on the approach here, but any guidance is appreciated.
Edit 1
I did notice the items on this page in the JFK example: https://github.com/microsoft/AzureSearch_JFK_Files/tree/master/JfkWebApiSkills/JfkWebApiSkills
Would replicating the ImageStore (so those images are stored in my storage account) and the HocrGenerator (which appears to handle points in a doc) in my skillset for my index be the right approach?
There are a few steps here:
- You need to save the layoutText from the OCR skill somewhere the UI can access it. The JFK Files demo converts it to HOCR (to display in the UI) and saves it as a field in the index so that it is retrieved with the search results. HOCR isn't necessary, and you may find it more efficient to store the layout in blobs using a knowledge store object projection.
- Save the extracted images into blob storage using a file projection into the knowledge store. Keep in mind that the images may be resized in the process, and the coordinates will match the resized image saved to the store. If you want to map the coordinates to the original image, see this.
- At search time, map the highlight to the metadata. You will find this code in the Node.js frontend, however it may be simpler to follow in the original demo by following the code here. Essentially you just find the first occurrence of the highlighted word in the metadata, display the associated image, and calculate the bounding region of the word (see the sketch after this list).
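A rough sketch of that last step, assuming the OCR layout has been stored per document as a list of word entries with text and boundingBox points (the exact field names depend on how you project layoutText, so treat these as illustrative):

```python
import re

def bounding_region(highlight, layout_words):
    """Find the first highlighted word in the stored layout and return a
    rectangle covering it, or None if the word is not found."""
    # Azure Search wraps hits in <em> tags by default; take the first hit.
    match = re.search(r"<em>(.*?)</em>", highlight)
    if not match:
        return None
    target = match.group(1).strip().lower()

    for word in layout_words:
        if word["text"].strip(".,;:").lower() == target:
            xs = [p["x"] for p in word["boundingBox"]]
            ys = [p["y"] for p in word["boundingBox"]]
            return {"left": min(xs), "top": min(ys),
                    "width": max(xs) - min(xs), "height": max(ys) - min(ys)}
    return None

# The returned rectangle can then be overlaid on the normalized image for
# the page the word was found on, scaled to the displayed image size.
```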
I am trying to implement a document management system using SharePoint. One major issue is that colleagues cannot find documents in the current setup (a local file server). They have asked for a system that scans uploaded documents, automatically looks for keywords in them, and then populates a "Meta" column.
I have had some success with OCR on image files, but so far I have had no success getting keywords out of Office documents (doc, xls, etc.).
Is there a way to set up a flow to do this task for me?
Any help is much appreciated.
I tried "Get file metadata" and Azure "Text Analysis", but it seems to take the raw data of the files (XML, I assume) and returns that the document is too large to analyse.
There is something vague about this requirement - how is a keyword defined in a document?
Therefore, the first obvious solution would be to assign keywords to each file upon uploading it. You could create a process for this with Flow - with tasks, reminders, and so on.
Automating this with OCR means you need an OCR service that works with MS Flow, and there you have only one choice - ElasticOCR. Then, in your flow:
- feed the document content to the ElasticOCR action
- keep in mind that OCR is not 100% accurate
- analyze the generated text content according to your keyword definition (see the sketch after this list)
- finally, write the metadata back to the library in the corresponding columns.
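The analysis step can be as simple as matching the OCR output against a keyword list you maintain. A rough sketch of that logic (written in Python rather than Flow expressions, with a made-up keyword list, purely to illustrate the idea):

```python
# Hypothetical keyword list maintained by your team; the OCR text comes
# from the ElasticOCR (or any other OCR) step earlier in the flow.
KEYWORDS = ["invoice", "contract", "purchase order", "warranty"]

def extract_keywords(ocr_text):
    """Return the keywords found in the document text, to be written
    back into the library's Meta column."""
    text = ocr_text.lower()
    return [kw for kw in KEYWORDS if kw.lower() in text]

# e.g. "; ".join(extract_keywords(text)) -> value for the Meta column
```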
Having worked on a similar requirement, we asked uploaders to publish their documents with a short abstract (a column from the content type). The assumption is that the abstract contains the keywords and is stored in a multi-line column - making it searchable site-wide.
I am trying to search for keywords contained in the metadata of a PDF doc. I am unsure if this is possible. Any guidance would be much appreciated!
Here is an example of the keywords/tags in a PDF I am referring to
I know it is possible to add fields to the search index, but am unsure how to map it. I have tried the following but it did not work.
Here is how the keywords metadata would work -
Adding keywords metadata to the PDF file itself would not work, as only selected custom metadata tags are supported for PDFs.
Refer to this document - https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
A workaround for this problem could be to add the metadata tag to the PDF file's blob itself.
After we create an index in Azure Search for ("All Metadata"/Storage Metadata), this key starts appearing in the list of field names to select (search/retrieve/filter, etc.).
And finally, we can now search on the custom keywords.
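A minimal sketch of that workaround using the azure-storage-blob SDK; the connection string, container, and blob names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders - use your own connection string, container and blob names.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="documents", blob="report.pdf")

# Preserve any existing metadata and add/overwrite the custom Keywords tag.
metadata = blob.get_blob_properties().metadata or {}
metadata["Keywords"] = "budget;fy2021;finance"
blob.set_blob_metadata(metadata=metadata)
```

Once the blob indexer runs with storage metadata included, Keywords shows up as a source field that can be mapped to a searchable index field.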
The Keywords tag is not one of the ones we support through the metadata_ format (the ones that are supported are listed here). If you add a field to the index called "Keywords", does it extract it? Also, if you look at the properties of the PDF in something like Azure Storage Explorer, I assume this keyword metadata is still there and is called "Keywords". If not, this might give some additional insight.
How do I search in MS Access (ver 2010) for data in files attached to records? If I do a "Find" and specify text I KNOW is in a txt file attached to a particular record, there are no hits, while if I have the same data in a Text field or Memo field, Access finds it. I understood from one of the Access help screens I found that it is possible to search attachments from within Access, but I have not been able to do this yet.
BTW, I did try using the query tool and searching for text I knew was in the attachment, but it was not successful, although it did find the same text within a memo field in another record.
I'm fairly certain that there is no mechanism in Access to find records based on text within a file attachment. A bit of web searching found an earlier question here and the responses seem to agree that there isn't.
One reference from Microsoft here says
By using attachments, you open documents and other non-image files in their parent programs, so from within Access, you can search and edit those files.
but I think that statement could be misinterpreted. I believe what they meant to say was that
"...from within Access you can open an attachment in its parent program and then work on it as usual (e.g., edit it, search it, print it, and so on)."
You can use the FileSystemObject, open the file as a string, and search it sequentially. That's as close as you'll get.
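The same idea sketched in Python rather than VBA, assuming the attachments have first been saved out of the Attachment field to a folder on disk (folder path and search text are illustrative):

```python
from pathlib import Path

def files_containing(folder, needle):
    """Return the files in `folder` whose text contains `needle`
    (case-insensitive), scanning each file sequentially."""
    hits = []
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle.lower() in text.lower():
            hits.append(path.name)
    return hits

print(files_containing(r"C:\exported_attachments", "invoice 1042"))
```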
I am passing the search query to search.asmx to get the search results.
Through the web service I am retrieving the search results. The results return a document path for .txt files and images, and that path can be used to open the file directly.
txt file: "http://server:24669/jap/ww.txt" - it will open the file.
PDF file: "http://server:100/456efg/Forms/DispForm.aspx?ID=3&RootFolder=/456efg" - it will show the PDF properties or parent folder.
So I need to get the URL that opens the PDF doc. The "ows_EncodedAbsUrl" column has the document URL, but it's not retrievable in the search results. Is there any way to solve this issue?
If you add a PDF iFilter to your SharePoint environment, PDF files will no longer be treated as list items (thus the property view link).
Of course, Adobe posts the instructions for this as a PDF.
This change will also start indexing the text of your PDF documents, so they will be more searchable. Be aware that a percentage of the PDF documents' size will be added to your search storage costs, so plan ahead.
This is a cure for the symptom; I do not know if there are other ways to do this.