How to find string in file contents - sharepoint-online

I'm trying to search for a string in a file that gets uploaded to SharePoint.
I'm using the "Send an HTTP request to SharePoint" action to get the file content, but $content comes back as Base64 inside an object. How would I search for a string in the returned $content?
/_api/web/getfilebyserverrelativeurl('/sites/mySite/Shared%20Documents/myFileInWordOrPDFformat.docx')/$value?binaryStringResponseBody=true
headers:
{
"accept": "application/json;odata=nometadata",
"content-type": "application/json;odata=nometadata",
"odata-version": ""
}
Returns
{
"$content-type": "application/octet-stream",
"$content": "UEsDBBQAAAAIANZualN4vZl3IWQAAO2wAQARABwAd29-BlahBlahBlah-QQAAPCJAAAAAA=="
}
Posts like
https://sharepoint.stackexchange.com/questions/273774/how-to-get-content-of-file-in-sharepoint-using-sharepoint-rest-api
https://debajmecrm.com/binary-to-base64-in-microsoft-power-automate-microsoft-flow/
https://linnzawwin.blogspot.com/2021/02/handle-base64-and-binary-file-content.html
have been helpful but I'm still not clear.
Do I need to convert the string I'm searching for to Base64 and search for it that way?
Any suggestions? Also, I don't have the option of utilizing premium connectors.
TIA!

This may not suit you, but the way I would do it is to use the Cloudmersive document conversion actions to extract the text. It's a premium connector though, so unless your user is enabled, it'll be a blocker.
It's the nicest way I've found, and it can still be free if you call it no more than 800 times a month and no more than once per second.
If your requirement exceeds that, you'll have to look at a paid subscription.
I looked at ways of using the Syncfusion PDF functionality in an Azure Function, which could be paired with the Adobe services to convert the DOCX file to a PDF, but it started to become a bit too much effort for an answer on Stack Overflow. The Adobe task was easy; it was reading the PDF that became a little too much.
It's still an option though ... https://www.syncfusion.com/kb/7178/how-to-use-pdf-control-in-nodejs-environment
This is an example that allows you to search for text; you'd just need to adapt it using the previous link and those instructions. FYI, the reason for using Node is that C# Azure Functions only support .NET Framework in the V1 runtime, which is very much obsolete ... https://learn.microsoft.com/en-us/azure/azure-functions/functions-versions ... and .NET Framework is what the relevant Syncfusion functionality requires. I'm happy to be proven wrong by someone else in the community though.
This is what the Cloudmersive connector will do for you ...
Sample Document
Flow
Result
Note: One thing I noticed is that the word "Microsoft" comes out wrong. It's a bit bizarre but you could always take that up with Cloudmersive.
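On the Base64 question itself: searching for a Base64-encoded version of the string generally won't work for a .docx, because the file is a ZIP archive (the "UEsDBB" prefix in the sample response is the Base64 form of the ZIP "PK" signature) and the text inside it is compressed. If running a little code outside the flow is an option (an assumption, for example a script or a function you call), a minimal Python sketch would decode $content, open the archive, and search the text of word/document.xml. The function names are hypothetical, and this covers DOCX only; a PDF would need a PDF parsing library instead.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:p> and <w:t> elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text_from_response(response: dict) -> str:
    """Decode the Base64 "$content" field and return the document's plain text."""
    raw = base64.b64decode(response["$content"])
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A phrase can be split over several runs, so join the runs of each paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def contains(response: dict, needle: str) -> bool:
    """Case-insensitive search in the extracted text (hypothetical helper)."""
    return needle.lower() in docx_text_from_response(response).lower()
```

Called with the JSON object returned by the request above, contains(response, "some phrase") would report whether the phrase appears in the body text. Headers, footers, and comments live in other parts of the archive and are not covered by this sketch.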

Related

What format is this template string/formula field in?

I am building an integration for a client with the HelloWorks API. This is a service that allows you to create user-fillable PDF forms.
It appears that their product supports formatted/template strings/formula fields as outputs to the PDF files, because when I create a Date field, it creates an output field with the following data:
#{format(:date, #field_XXXXX, "{0M}/{0D}/{YYYY}")}
However, I cannot find any kind of documentation as to what is the format of this template string/formula field.
In case anyone is still looking for an answer, this is what the HelloWorks support team had to say:
As for documentation for the format, we do not have any documentation for that at this time and the ability to make those edits are going to be depreciated in the future as our product team wants to bring those capabilities to the portal without custom edits.

Does Office 365 image search work? If so, how?

According to Microsoft ("Image Analysis" in https://techcommunity.microsoft.com/t5/Microsoft-SharePoint-Blog/Enrich-your-SharePoint-Content-with-Intelligence-and-Automation/ba-p/194174, from May 21, 2018), we should be able to search for text within images.
Is this working for you/anyone? If so, I would like to know what you had to do to get it to work.
I have a SharePoint modern team site with PNG images that contain clearly readable text...but search will not find anything. I have requested re-indexing.
I have had a Microsoft Support request (#10638094) open since June 27 with this question/issue, and no one--even after escalation--has been able to answer it.
Based on the article above, it appears that "MediaService" column(s) should be added to the library to support this; however, I can find no such columns in the environment (using PnP export to review).
Naomi Moneypenny and Kathrine Hammervold highlighted this functionality at Ignite 2017 (https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK2181, about 27:00), but it doesn't seem to be available/working (at least not for me).
August 24: After further research and digging, I have an escalated support ticket at Microsoft (#10638094, unsolved), and there are conversations at https://techcommunity.microsoft.com/t5/Intelligent-Search-Discovery/Search-for-words-in-your-images-in-Office-365/ba-p/135703, https://techcommunity.microsoft.com/t5/Microsoft-SharePoint-Blog/Enrich-your-SharePoint-Content-with-Intelligence-and-Automation/bc-p/236625, and "Does Office 365 image search work? If so, how?". I have yet to hear of this functionality working for anyone. I will keep digging, and I will certainly post if I hear anything. :)
After some digging, official sources suggest this was already released at the end of 2017. However, there is no related documentation or official guide for the text-in-image search function.
There are two ways I can think of to perform text-in-image search:
Perform OCR yourself on the image before uploading it, and embed the text in the image metadata.
Use a supported image type such as TIF (IIRC), which is recognized.
In your case, you can upload the image, add another column that contains the text, and apply that metadata to the image in a list/library column.
OneDrive, on the other hand, also has this function. For example, search for things like "cat" and it *should* pull up most pictures you have of cats. It's more likely using tags as labels for the image than reading the picture itself.
Also, I believe OneNote indexes recognizable text and handwriting. Maybe this can point you in the right direction.
Microsoft Azure's Computer Vision offers a service to recognize text in images. Maybe this can help.
"Is this working for you/anyone?" Yes, I responded to this post elsewhere and see it posted here, as well. Unfortunately, I cannot tell you HOW to get it to work or how to verify that it is correctly configured. I can only suggest a test for you to see if it is working for you, as it works for me. I have not tested every way in which it could or should work. I have only discovered it working with PNGs I inserted into Wiki Pages in SharePoint Online. Those PNGs are generated using Snag-It to take screen captures, and I do not see where Snag-It would be doing any OCR on the image to embed anything. OCR is not even in the Snag-It help file, so I believe the PNG files are just simple PNGs. I insert them into the SharePoint Wiki page, which uploads them to the Site Assets library. And, when I search for a word in the image, the image is returned as a result, not the Wiki page. So, I suggest you try a simple test: insert a PNG with text in it into a Wiki Page and give the index a bit of time to run to see if it works for you.
It seems like the functionality has matured recently. I have been testing it more thoroughly, and I have documented the results in my blog at http://www.collaboration-foundry.com/SharePointImageAnalysis.
Bottom line: It works for me in OneDrive and SharePoint (modern and classic), but I've only seen it work on the out-of-the-box Document content type, which limits custom solutions somewhat.
It's cool functionality when it works. Looking forward to seeing Microsoft build on this.
John

Google Docs: Table of content page numbers

We are currently building an application on the Google Cloud Platform which generates reports in Google Docs. For them, it is really important to have a table of contents ... with page numbers. I know this has been a feature request for a few years and there are add-ons (Paragraph Styles +, which didn't work for us) that provide this, but we are considering building it ourselves. If anybody has a suggestion on how we could start, it would be a great help!
thanks,
Best bet is to file a feature request on the product forums.
Currently the only way to do that level of manipulation of a doc to provide a custom TOC is to use Apps Script. It provides enough access to the document structure to build and insert a basic table of contents, but I'm not sure there's enough to do paging correctly (unless you force a page break on every page...). There's no method to answer the question of "what page is this element on?"
Hacks like writing to a DOCX and converting don't work because TOCs are recognized for what they are and show up without page numbers.
Of course you could write a DOCX or PDF with the TOC as you'd like and upload as a blob rather than as a Google Doc. They can still be viewed in Drive and such.

Automating book citation search

I have a list of books listed by their titles in a text file. I want to write a script which can use a web service like Google Scholar or Amazon to search for the books and return an XML or BibTeX file with citation info for each book.
Which programming tools can I use for this kind of automated search?
Python would be my recommendation.
Get names from the text file, simple file reading
Construct a REST URL request to Google's Books API
http://books.google.com/books/feeds/volumes?q=Elizabeth+Bennet&start-index=21&max-results=10
Simple Python code to get data from this URL (may need an API key; in Python 3, use urllib.request with error handling, which replaces the old urllib/urllib2 split)
Sample code,
import urllib.request
url = 'http://foo.api.request'  # placeholder URL
# Fetch the raw response body as bytes
data = urllib.request.urlopen(url).read()
See the return schemas for this API (you can use the XML however you like).
See BibTeXML for conversion between the two formats.
HTH
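The steps in this answer can be sketched roughly as follows. This is only an illustration under some assumptions: it targets the Google Books API v1 volumes endpoint rather than the older feeds URL quoted above, it assumes the response items carry a volumeInfo object with title, authors, publisher and publishedDate fields, and all function names are made up for the example. The BibTeX output is built by hand here instead of going through BibTeXML.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumption: the Google Books API v1 volumes endpoint, not the older feeds URL.
API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title: str, max_results: int = 1) -> str:
    """Return a search URL restricted to the title field."""
    params = {"q": "intitle:" + title, "maxResults": max_results}
    return API_URL + "?" + urllib.parse.urlencode(params)


def to_bibtex(volume_info: dict) -> str:
    """Format one volumeInfo-style dict as a BibTeX @book entry."""
    authors = volume_info.get("authors", [])
    year = volume_info.get("publishedDate", "")[:4]
    # Citation key: first author's last name plus year, letters and digits only.
    surname = authors[0].split()[-1] if authors else "unknown"
    key = re.sub(r"[^A-Za-z0-9]", "", surname + year)
    fields = {
        "title": volume_info.get("title", ""),
        "author": " and ".join(authors),
        "publisher": volume_info.get("publisher", ""),
        "year": year,
    }
    # Skip fields the API did not return.
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return f"@book{{{key},\n{body}\n}}"


def fetch_first_volume(title: str):
    """Query the API (needs network access) and return the first hit's volumeInfo, or None."""
    with urllib.request.urlopen(build_query_url(title), timeout=10) as resp:
        payload = json.load(resp)
    items = payload.get("items", [])
    return items[0]["volumeInfo"] if items else None
```

To process the text file, you would loop over its lines, call fetch_first_volume for each title, and write to_bibtex(info) for every hit that is not None. The first search result is not guaranteed to be the right edition, so the output is worth checking by hand.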
I think it would be useful if you specified what kind of script you want to write!
Anyway... you could do some low-level work and write your own HttpRequest for Google and Amazon, or you could just rely on their APIs, for example: http://code.google.com/apis/books/
There is a great project which does something similar to what you want to do; it's called Shelves. It's written for Android but should give you some ideas on how to handle your requests. Instead of downloading citations, it downloads the cover.
http://code.google.com/p/shelves/
Just as a quick side note, saving your books in an XML file could be an option as well. In some cases it makes parsing them easier.

How to accurately determine if an SPFile instance is a converted file?

I have been working on a document conversion feature for converting a docx file to a pdf file using MOSS 2007. The SPFile.Convert() call is being made in the ItemAdded event and the ItemFileConverted event is fired fine as well. The eventing seems to be working fine, but the IsConvertedFile and SourceLeafName properties of the converted SPFile instance are not always set by the conversion process. This is what I was attempting to use to determine if a call to SPFile.Convert should be made.
In digging into the code for SPFile IsConvertedFile, GeneratingConverterId and SourceLeafName properties, it seems these are based on SPFile.Properties "vti_dttransformerid" and "vti_dtparentleafname". The problem is, these two properties are not being set consistently whenever I have code in my ISPConversionProcessor.PostProcess() implementation in which I was hoping to do some post processing of the file. If there is no code in the PostProcess method (only the runDefaultPostProcessing = true; statement) the properties are set more consistently.
I have some additional details here in a Wiki page about what is going on, but using .NET Reflector to determine where these fields are updated from hit a brick wall at OWSTIMER.EXE (I could find all of the reads for the properties, but even the HtmlLauncher and LoadBalancer services had no mention of these properties).
Has anyone done a complete Document Conversion implementation and used the SPFile.IsConvertedFile and SPFile.SourceLeafName properties successfully?
If you can't trust the API, store the IsConverted metadata in the property bag for the SPListItem. Or if you prefer to show it in the UI, add another field to your list. This should all work fine from the event handler.
It's annoying to do the extra work but I guess there might be additional metadata that you can add which SPFile wouldn't have been able to provide anyway.
I have created a PDF Converter for SharePoint, but didn't use the Document Converter functionality as it didn't match our needs and was not flexible enough.
Not sure if this reply will be thrown out as spam as I am now going to link you to the place where you can download a free trial version. Download PDF Converter for SharePoint.
I feel a bit dirty now, but I may have actually helped you ;-)
