Read out properties of an MS Word document on a Linux system


We want to write a pre-commit hook for our Linux-based SVN server that checks the properties of MS Word documents (e.g. author, version) during their initial check-in.
Is there any way to read out these properties on a Linux system, for example with a scripting language or C++ code?

Depending on what version of Word you're working with, possibly.
The DOCX format is really a ZIP file which contains a number of files (many XML) that make up the Word document. It's based on the Office Open XML format. If you unzip it and look in the docProps directory that's created, core.xml contains several nodes that may be of use to you: dc:creator, cp:lastModifiedBy, cp:revision. Interrogate those with your scripting language/XML library of choice.
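As a minimal sketch of the approach above, the following Python reads the three nodes mentioned from docProps/core.xml using only the standard library. The function name `read_core_properties` is hypothetical, and the namespace URIs are the ones normally bound to the `dc` and `cp` prefixes in core.xml; a property that is missing from the file comes back as `None`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces normally bound to the dc: and cp: prefixes in docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(docx_file):
    """Return creator, lastModifiedBy and revision from a DOCX package.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "lastModifiedBy": root.findtext("cp:lastModifiedBy", namespaces=NS),
        "revision": root.findtext("cp:revision", namespaces=NS),
    }
```

Because the function also accepts a file-like object, a hook that already holds the document bytes in memory could wrap them in `io.BytesIO` rather than writing a temporary file.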

Related

How can I use python to edit docx and/or doc file tags on a windows system?

I have a folder with a large number of .doc and .docx files. I would like to develop a Python script to edit the tags of each file so I can find a file in the folder using the tags, which would make my life a little easier.
I am unsure how to even start and was hoping someone could point me to a library or provide some sample code to help me get started.
I am not sure whether the file extension matters, because this seems to be a Windows property (right-click file > Properties > Details > Tags > type in tags), but if the extension matters I can change all the files to .docx.

The python-docx package provides methods to access most of the metadata in a Word file. In particular, the class docx.opc.coreprops.CoreProperties allows you to modify author, category, etc. I didn't see tags mentioned, but if you do some more research I'm sure you can find it.

docx.opc.coreprops.CoreProperties.keywords can be used to update the tags of a .docx file.

SharePoint changes the Size of MS Office files when first saved, since Metadata is added. Possible to confirm that Content has not changed?

Suppose we are migrating a set of MS Office files from (say) a Shared drive to SharePoint (eg SharePoint Online). Limited to Office 2007 onwards, so file extensions like DOCX, XLSX.
We see that the size of the file changes when it is saved to SharePoint - as certain metadata is added.
(Though file sizes of non MS Office files such as PDF or JPEG do NOT change).
These MS Office files are "containers" in which are placed a number of component parts - this situation can be viewed crudely by changing the Extension of an XLSX file (say) to ZIP, and opening it with WinZip.
For good sound integrity reasons we want to assure ourselves that the "File Content" component part has not changed.
How can we identify the component parts within those containers which represent the Content?
Are such component parts invariant when saved to SharePoint as described?
If so, are there any utilities which could analyse a pair of such files and confirm whether the content is the same or has been changed? Is there perhaps some checksum we could generate from both files and compare?
If no such utility exists what sort of environment would be best for creating one? - could it be done in VB.NET and/or C# for instance?
Thanks.
This previous post relates to the same issue, but does not provide the sort of answer we need: C# - Hash contents of MS Office documents without metadata

Interesting topic.
How can we identify the component parts within those containers which represent the Content?
Within the docx you will need to assess each of the content files. Please be aware that the files within a docx are compressed using deflate, so you will probably have to inflate them. This covers not only the document.xml and document.xml.rels files but also:
- the header xml files (there can be more than one)
- the header .rels files
- the footer xml files (again, possibly multiple files)
- the footer .rels files
- the media files (containing images)
You even have to check the core.xml file if SharePoint property demotion alters a field like title.
To summarise, you cannot compare the docx files at the docx level. You will need to unpack them and compare (use e.g. CRC32 or MD5) each of the "content" files.
I am not aware of utilities providing this functionality.
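The unpack-and-compare idea above can be sketched in Python with the standard library: `zipfile` inflates each part on read, and `hashlib` produces a digest per part. The function names are hypothetical, SHA-256 is swapped in for the CRC32/MD5 suggested above, and the list of prefixes treated as metadata (`docProps/`, `customXml/`) is an assumption that would need checking against what SharePoint actually modifies.

```python
import hashlib
import zipfile

# Assumption: parts under these prefixes hold metadata rather than content.
METADATA_PREFIXES = ("docProps/", "customXml/")


def content_digests(package, skip_prefixes=METADATA_PREFIXES):
    """Map each non-metadata part name to the SHA-256 of its inflated bytes."""
    digests = {}
    with zipfile.ZipFile(package) as archive:
        for info in archive.infolist():
            if info.is_dir() or info.filename.startswith(skip_prefixes):
                continue
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def changed_parts(original, migrated, skip_prefixes=METADATA_PREFIXES):
    """Return the sorted names of parts that differ, or exist in only one file."""
    before = content_digests(original, skip_prefixes)
    after = content_digests(migrated, skip_prefixes)
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

An empty list from `changed_parts` means every compared part is byte-identical after inflation. Relationship and content-type parts are still compared in this sketch, so they will be reported if the upload rewrites them.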
Note: if you just need to upload the files to SharePoint for archiving then placing them into separate zip files might be an alternative. This is of course only an option if you just need to store the content and do not expect the users to make any changes.
Paul

How could I access the source code of a .one OneNote file?

How could I access the source code of a .one OneNote file?
I've tried renaming the .one file to .zip, which works for .docx files, in order to access its source code, but .one doesn't seem to work like that.
Also, I've tried to open it with Notepad++, but it isn't in a plain-text format.
I regard this as a programming question because:
I'm using content-editing automation scripts (e.g. RegEx-based find-and-replace scripts). Accessing the source code of .one files would help me apply bulk automated edits to their content using RegEx.

.one files aren't technically source code - they contain the data that describes the pages in a section and their content.
Opening them as text won't show you anything meaningful as they are binary data.
Microsoft has released the way this data is structured in .one files in the following documentation. You can use this to parse the binary file to obtain the information you need.
https://msdn.microsoft.com/en-us/library/dd924743(v=office.12).aspx
https://support.office.com/en-us/article/File-format-changes-in-OneNote-2016-for-Windows-a9129622-1755-470b-91e7-b2a461194036

The .one file format is super-complicated, as it has to store images and all revisions, so it's binary and not XML-based like the rest of the Office suite.
That said, if you do want to see the XML structure of the notebook or specific page content, you can use OMSpy:
https://blogs.msdn.microsoft.com/johnguin/2011/07/28/onenote-spy-omspy-for-onenote-2010/
It works fine for 2016 Desktop.

Detect correct file extension for OpenXmls?

If we have been provided only the XMLs of a document (as an input stream, in unzipped form, or in a byte array), can we detect the file extension by parsing the XMLs? My aim is to know which particular node in which XML determines whether this is a DOCX, PPTX, or XLSX file.

I unzipped the documents, dug around, and found this:
In \docProps\app.xml, the Application node defines it:
<Application>Microsoft Excel</Application> for Excel,
<Application>Microsoft Office PowerPoint</Application> for PowerPoint, and
<Application>Microsoft Office Word</Application> for Word.
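The lookup described above could be sketched as follows, given the bytes of app.xml. The function name `guess_extension` is hypothetical, and matching on the words "Excel", "PowerPoint" and "Word" is an assumption based on the three values listed; files written by other producers may carry a different Application value, in which case the function returns `None`.

```python
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

# Assumption: the Application text contains one of these product names.
KEYWORD_TO_EXTENSION = (
    ("Excel", ".xlsx"),
    ("PowerPoint", ".pptx"),
    ("Word", ".docx"),
)


def guess_extension(app_xml):
    """Guess the file extension from the Application node of app.xml."""
    root = ET.fromstring(app_xml)
    application = root.findtext(f"{{{EP_NS}}}Application", default="")
    for keyword, extension in KEYWORD_TO_EXTENSION:
        if keyword in application:
            return extension
    return None
```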

A Batch File which contains a Lotus Script

Is it possible to run a batch file containing LotusScript? Would it also be possible to include LotusScript together with another language, for example ksh? If yes, could you please give me some samples or tutorials on how to do it?
What I need to do is this:
There is already an existing batch file containing a ksh script that updates the values in an Excel file every time it is executed.
I need to add two new functions: first, download the Excel file from a rich text field in a Lotus Notes document, then run the existing ksh functions described above.
After that, I need to re-upload it, or update the Excel file in the Lotus Notes document. I used LotusScript for the added functionality.
I also don't know how to use or create ksh scripts and batch files. Thanks.

Personally, I would turn the logic around: why not use a scheduled LotusScript or Java agent, detach the file from the rich text item, and then run the ksh script from there (e.g. using the Shell command of LotusScript)?
That way you can code what you need in the languages that are best for your purpose. You could even attach the ksh script to a configuration document and detach it on the fly, or build the ksh script completely on the fly (with write commands). That lets this solution replicate to any number of servers without having to distribute your ksh script to each of them.

LotusScript runs only within a scripting host engine provided by IBM Lotus, but LotusScript isn't the only way to access Lotus Notes data.
You haven't said what platform you are running ksh on. You mention that you are operating on Excel files, so if you are running your scripts on Windows it may be possible for you to use the Lotus Notes COM classes. Those classes are almost exactly the same as the back-end classes that you would have available in LotusScript, but I have no idea whether any version of ksh (not to mention whatever version you are using) supports the CreateObject call or any other way to access COM classes.
However, a ksh script can certainly run Java programs, and there are Java classes for Lotus Notes that are (again) almost exactly the same as the back-end classes that you would use in LotusScript. It seems to me that the obvious thing for you to do is write a small Java program to retrieve the file from the Domino server, and another Java program to re-upload it after. Then have your script run the program to do the download, run the commands to modify the Excel data, and then run the program to do the upload.
