I am trying to implement a document management system using SharePoint. One major issue is that colleagues cannot find documents in the current setup (a local file server). They have asked for a system that scans uploaded documents, automatically looks for keywords in them, and then populates a "Meta" column.
I have had some success with OCR on image files, but so far no success extracting keywords from Office documents (doc, xls, etc.).
Is there a way to set up a flow to do this task for me?
Any help is much appreciated.
I tried "Get file metadata" and Azure "Text analysis", but it seems to take the raw data of the files (XML, I assume) and returns that the document is too large to analyse.
There is something vague about this requirement - how is a keyword defined in a document?
Therefore, the first obvious solution would be to assign keywords to each file upon uploading it. You could create a process for this with a flow - tasks, reminders, and so on.
Automating this with OCR instead means you need an OCR service that works with Microsoft Flow, and there is currently only one choice: ElasticOCR. Then, in your flow:
- feed the document content to the ElasticOCR action
- keep in mind that OCR is not 100% accurate
- analyze the generated text content according to your keyword definition (see the sketch after this list)
- finally, write the metadata back to the corresponding columns in the library.
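For the analysis step, here is a minimal sketch of the keyword matching in C#, assuming the keyword definition is simply a fixed list. The class name and keywords are illustrative, and something like this could live in a service (e.g. an Azure Function) that the flow calls:

using System;
using System.Collections.Generic;
using System.Linq;

static class KeywordExtractor
{
    // Illustrative keyword list; in practice it might come from a
    // SharePoint list or the term store.
    static readonly string[] Keywords = { "invoice", "contract", "purchase order" };

    // Returns every keyword found in the OCR'd text, ready to be
    // written into the "Meta" column.
    public static IEnumerable<string> FindKeywords(string ocrText) =>
        Keywords.Where(k => ocrText.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
}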
Having worked on a similar requirement, we asked uploaders to publish their documents with a short abstract (a column from the content type). The assumption is that the abstract contains the keywords; it is stored in a multi-line column, making it searchable site-wide.
I am trying to build a flow based on the Power Automate template "Create Planner task and add attachments to SharePoint on new email arrival".
This template works fine, in that it saves all the mail attachments to my SharePoint. But the task only shows the link to the last attachment.
I have worked around that by adding a string variable and appending all the SharePoint paths to it.
With my flow, everything runs smoothly. But the stored files are about 10-20% bigger than the originals, and they turn out to be corrupted.
The only difference I can spot in the saving of the file is as follows:
The template section has "Get attachment" and the corresponding body('Get attachment'),
while in my version I can only select "Get attachment (V2)" and the corresponding body('Get attachment (V2)').
There is an option with V2 that allows or disallows chunking, but it has no effect on my file size.
The other difference is that my flow creates a separate folder based on the task ID, since there were errors when an attachment with the same name arrived a second time. But I have tried my flow without the added folders, and there is no difference in file size.
[Screenshots comparing the sizes of the original files and the corrupted files]
It makes no difference whether I use the SharePoint link the flow adds to my new Planner task or open the files directly within SharePoint; the result is an error either way.
Can anyone guess why my flow seems to store something extra within the file, thus corrupting it? I can provide the other parts of the flow in more detail too. Here is the overview of my custom flow: [screenshot]
I actually found the answer after rewriting it from scratch:
Using the old template had me looking for the wrong information when adding the attachment content to SharePoint. I had always searched for "body", which was what the template used, and got this: [screenshot]
But searching for "attachment", the dynamic content actually showed me the right pieces. I am not sure if I missed it before, or if recoding a template hid them somehow. With the rewrite from scratch I found this: [screenshot]
So, to make a long story short: use "Content Bytes" of the "Get_Attachment_(V2)" action and everything works fine.
In Cognos, is there a way to get the definitions (filters, selected fields) from a number of reports in a folder?
I've inherited around 500 reports defined in a folder, and they all need to be checked and fixed as they have business errors (not technical errors). If it were possible to get all their definitions in a single extract, that would save an enormous amount of time over clicking multiple times to get that information from each report one by one.
In Access this can be done with VBA (for query definitions), but I'm not sure if there is a scripting language that can be used with Cognos to achieve a similar result.
It sounds like you may want to "validate" each of these 500 reports (effectively equivalent to pressing the "validate" button on each individual report if it was open in the authoring studio).
Validation will ensure that a report specification XML is still syntactically correct, references a package which is still present in the content store, references only query items from that package which still exist, generates valid SQL against the underlying datasource, etc.
If that's what you're looking for, an easy way to do batch validation of all 500 reports would be to use MotioPI (it's a free admin tool for Cognos). Here's a short article which walks you through the process:
http://info.motio.com/Blog/bid/70357/Batch-Validation-of-Cognos-Reports
If you want to retrieve the actual report specification (XML) for each of these 500 objects, you'd need to write a program that uses the Cognos SDK to pull the specification XML from each report object. After that, you'd need to add logic that examines each of these 500 XML documents, looking for whatever it is you're looking for.
We solved this by exporting the XML of the reports using a SQL query on the content store.
The output is processed with a Python script to convert XML to table layout in CSV format.
This CSV file can easily be imported into Excel.
You could also process the report XML directly in a SQL query with the xmltable function. In our situation this turned out to be a heavy process that we didn't want to burden the content store database with, but for a small set of reports it works fine.
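The Python script itself isn't shown above, but the same flattening can be sketched in C# with LINQ to XML. The folder layout and the element names filterExpression and dataItem are assumptions about the report specification schema, which varies by Cognos version:

using System.IO;
using System.Linq;
using System.Xml.Linq;

class ReportSpecDump
{
    static void Main()
    {
        using var csv = new StreamWriter("report-definitions.csv");
        csv.WriteLine("report,kind,value");

        // One exported specification XML per report, e.g. from the
        // content store SQL extract described above.
        foreach (var file in Directory.GetFiles("specs", "*.xml"))
        {
            var doc = XDocument.Load(file);
            var name = Path.GetFileNameWithoutExtension(file);

            // Comparing on LocalName sidesteps the version-specific namespaces.
            foreach (var e in doc.Descendants().Where(x => x.Name.LocalName == "filterExpression"))
                csv.WriteLine($"{name},filter,\"{e.Value.Trim()}\"");
            foreach (var e in doc.Descendants().Where(x => x.Name.LocalName == "dataItem"))
                csv.WriteLine($"{name},field,\"{(string)e.Attribute("name")}\"");
        }
    }
}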
Problem - Generate a Word document from information retrieved from a database.
My solution - Create a Word document template and add fields/tags in the places where values need to be inserted. The template will require tables and charts as well. Using the Document Reflector that comes with the Open XML SDK, reflect on the document template, extract the w:document section, and port it to C#. The rest of the logic simply revolves around finding the fields/tags, replacing them, etc. A very simple approach, but not very flexible!
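For reference, the find-and-replace step can be sketched with the Open XML SDK instead of porting the whole template into code. This is only a sketch: it assumes each {{tag}} placeholder sits inside a single text run, which Word does not guarantee - part of why this approach is brittle:

using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class TemplateFiller
{
    // Replaces {{key}} placeholders in the document body with the given values.
    public static void Fill(string path, IDictionary<string, string> values)
    {
        using var doc = WordprocessingDocument.Open(path, true);
        foreach (var text in doc.MainDocumentPart.Document.Body.Descendants<Text>())
            foreach (var pair in values)
                text.Text = text.Text.Replace("{{" + pair.Key + "}}", pair.Value);
        doc.MainDocumentPart.Document.Save();
    }
}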
Challenge - I want the user to have the ability to customize the template or the generated document output. But this will not be possible if I embed the template logic in code.
Any other possibilities - I looked at templating with T4 and RazorEngine, but could not find any concrete examples of how to create Word documents using these two technologies.
Now what is the best approach?
I would really appreciate your input on the best and most flexible way to generate Word documents using C#.
I'm actually working on a project where the business users design Word templates with mail merge fields and we populate the values using a third-party software package, Aspose.Words: http://www.aspose.com/categories/.net-components/aspose.words-for-.net/default.aspx
The software includes a library for merging data from DataTables into the mail merge fields in the Word document.
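The merge call itself is small; a minimal sketch (the file names and sample table are placeholders):

using System.Data;
using Aspose.Words;

class MergeExample
{
    static void Main()
    {
        // Placeholder data; in the real project this comes from the database.
        var table = new DataTable("Customers");
        table.Columns.Add("CustomerName");
        table.Rows.Add("Contoso Ltd.");

        // Fills MERGEFIELDs whose names match the column names.
        var doc = new Document("template.docx");
        doc.MailMerge.Execute(table);
        doc.Save("output.docx");
    }
}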
I also wrote a customized Word task pane add-in that retrieves data views from the database and lists the fields in a drag-and-drop interface that mimics a Crystal or SQL report-writing interface.
It probably would have been easier to just use Crystal or SQL reporting, though...
It's certainly possible to generate the contents of an Office doc using T4 or Razor and then package it up. The TestScribe power tool for Visual Studio Test Manager does just that with T4. There is a thread by Sally Cavanagh in the Q&A on this page http://visualstudiogallery.msdn.microsoft.com/e79e4a0f-f670-47c2-9b8a-3b6f664bf4ae that suggests a way to look at the T4 templates it uses, which might get you jump-started.
Here is a sample of how to populate a Word document template with C#.
You could use a content control databinding approach.
The XML Mapping Task Pane for Word 2007/2010 is an authoring tool.
To create an instance document, you just attach your XML data file.
If the resulting documents will be opened in Word, that is all that is required: Word will bind the data itself. If your consuming application is not Word, you might want to resolve the bindings yourself (e.g. via the Open XML SDK).
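If you do attach or swap the data yourself, here is a rough Open XML SDK sketch. It assumes the template holds exactly one custom XML part, created with an authoring tool such as the XML Mapping Task Pane:

using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

static class DataBinder
{
    // Overwrites the data in the document's custom XML part; content
    // controls bound to it show the new values when Word opens the file.
    public static void AttachData(string docPath, string xmlPath)
    {
        using var doc = WordprocessingDocument.Open(docPath, true);

        // Assumes the template holds exactly one custom XML part.
        var part = doc.MainDocumentPart.CustomXmlParts.Single();
        using var data = File.OpenRead(xmlPath);
        part.FeedData(data);
    }
}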
Content control databinding isn't intended to support repeats and conditionals. For a way to handle those, look at my OpenDoPE convention.
Take a look at Templater. Disclaimer: I'm the author.
Check out JODReports or Docmosis. They are Java-based, but some of the templating features and output options might be ideal. You can call the command-line interfaces unless they also have something better to reach from C#.
This is primarily a question of possibilities more than instructions. I'm a programming consultant working on a WSS project site system for my client. We have a document library in which files are uploaded to go through a complex approval process. With multiple stages in this process, we have an extra field which dictates what the current status of the document is.
Now, my client has become enamored with the idea of PDF watermarking. He wants the document (which is already a PDF) to be affixed with a watermark corresponding to the current status, such that with each stage of the approval process the watermark will change.
One method, the traditional one for PDF watermarking, is to keep one "clean" copy of the document hidden somewhere on the site and, at each stage of the approval process, create a new PDF from it bearing the current watermark. Since the filename never changes, this new PDF can be uploaded continually to a public library, always overwriting the old version and simulating a "dynamically changing" watermark. However, at various stages people will also upload clean copies with corrections and suggestions, never mind the complexity of juggling two libraries and the fact that we double the number of files stored. My client and I agree that this is not a practical path to choose.
What we would like is to be able to "modify" the watermark in a PDF, so that we only have to keep one copy of the file. Unfortunately, from what I've seen, when you create something like a watermark, which by its nature is supposed to be unmodifiable, you usually cannot edit it later. So, is it possible to have a part of a PDF that cannot be changed by anyone who downloads the file, but can be changed as part of a workflow or other object model process?
PDF Watermarking in SharePoint is a common request. I have written extensively on this topic. See:
Adding a dynamic watermark to a PDF file from a SharePoint Workflow
Adding a (static) watermark to a PDF file from a SharePoint Workflow
Use SharePoint Workflows to inject JavaScript into PDFs and print the ‘open date’
You could use event handlers so that code runs every time a document is checked in. In that code you could perform the fix-up/check that makes the watermark what you want it to be. This assumes you can write code that manipulates a PDF's internal structure to produce the watermark you desire.
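A sketch of such an event receiver with the server object model; the "Approval Status" column name and the PdfWatermark helper are hypothetical (the helper would wrap whatever PDF library you choose):

using Microsoft.SharePoint;

public class WatermarkReceiver : SPItemEventReceiver
{
    // Fires after each check-in; re-stamps the PDF to match the current status.
    public override void ItemCheckedIn(SPItemEventProperties properties)
    {
        SPFile file = properties.ListItem.File;

        // "Approval Status" is an assumed column name; PdfWatermark is a
        // hypothetical helper wrapping a PDF library.
        string status = properties.ListItem["Approval Status"] as string;
        byte[] stamped = PdfWatermark.Apply(file.OpenBinary(), status);

        DisableEventFiring(); // avoid re-triggering this receiver
        file.SaveBinary(stamped);
        EnableEventFiring();
    }
}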
It sounds to me like you want to allow people to modify the PDF they download, but not its watermark. This is probably going to be nigh on impossible if the watermark is embedded in the PDF (afaict), but what if the watermark image were external to the PDF? Is it possible to embed a watermark in a PDF that is sourced via HTTP? Then you could embed:
<watermark image="http://sharepoint/site/_vti_bin/docstatus.asmx?id=5">
Of course, I have no idea about PDFs, so this might not be possible, but you get the concept.
-Oisin
It is possible if you use a third-party tool. Then you can insert dynamically bound values from your SharePoint metadata, with conditions, rules, etc.: http://www.pdfsharepoint.com
One of our customers has a problem that we cannot reproduce. We programmatically copy a document's properties to a destination file using SPFile.Properties. However, for some reason the file's properties do not match the metadata specified on the list the file is stored in.
Now, we can probably solve this by copying SPFile.Item.Properties (not tested yet), but I am just wondering under what circumstances SPFile.Properties is unequal to SPFile.Item.Properties.
Update: We have just received an update from our customer. Using SPFile.Item.Properties always returns up-to-date information. However, we would still like to understand the original question.
There is a slight difference between SPFile.Properties and SPFile.Item's fields, and the first one is much, much slower to call.
You have most probably seen a Microsoft Office document's "properties" window (this one: http://dradisframework.org/images/tutorial/custom_document_properties.png). These are the properties that are read when you access SPFile.Properties. Reading them is slow, since some code infrastructure has to parse the binary DOC file to find the properties (it can take up to 30 milliseconds or so for every property access). See more here: http://msdn.microsoft.com/en-us/library/microsoft.sharepoint.spfile.properties.aspx
In SharePoint, every item is an SPListItem, and its field values (I deliberately avoid the word "properties" here) are stored in SharePoint's content database. So, when you access SPFile.Item.Properties, you are actually looking at the SPListItem to which the file is attached and reading its values from SharePoint's content database.
What happens behind the scenes when you upload a file with some "Office properties" set is that SharePoint copies them into same-named fields on the SPListItem. (Some information about it here: http://weblogs.asp.net/bsimser/archive/2004/11/22/267846.aspx)
This is why these properties typically have the same value, BUT that only happens if SharePoint knows how to read the metadata from your file and write it back. So if you put a .txt file in your SharePoint store, you will not get any SPFile.Properties back.
The user always sees the ListItem properties, not the SPFile properties, in a document library, so using the ListItem properties in the copy is the way to go.
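A minimal sketch of that copy using the list items rather than SPFile.Properties (field handling is simplified; in practice you would skip more system fields):

using Microsoft.SharePoint;

static class MetadataCopier
{
    // Copies field values via the list items (i.e. the content database),
    // not via the file's embedded Office properties.
    public static void CopyItemProperties(SPListItem source, SPListItem destination)
    {
        foreach (SPField field in source.Fields)
        {
            if (field.ReadOnlyField || field.Hidden)
                continue;
            if (destination.Fields.ContainsField(field.InternalName))
                destination[field.InternalName] = source[field.InternalName];
        }
        destination.Update();
    }
}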
I believe this issue is related to the SharePoint property promotion/demotion feature, which enables document properties to be embedded in the physical MS Office file and travel with it to the client, etc. However, to my knowledge this is currently only supported for Office file types.
Jonathan
Trying to find "official documentation" for anything in SharePoint is pretty much impossible. :-D The online docs suck; you are better off using blog entries, etc.
P.S. I agree with Alex here. Although an SPFile never exists in a list without an accompanying SPListItem, the connection between the two can get corrupted (i.e. the list item can be edited but the file cannot be opened). This suggests to me that information about the two is stored in different locations in the content DB. I have had this happen before.