If we are given only the XML parts of a document (as an input stream, as unzipped files, or as a byte array), can we detect the file type by parsing the XML? I want to know which node in which XML part determines whether this is a DOCX, PPTX, or XLSX file.
I unzipped the documents, dug around, and found this:
In \docProps\app.xml, the Application node identifies it:
<Application>Microsoft Excel</Application> for Excel,
<Application>Microsoft Office PowerPoint</Application> for PowerPoint, and
<Application>Microsoft Office Word</Application> for Word.
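A minimal Python sketch of this check, assuming the package is available as a byte array and that docProps/app.xml is present (it is an optional part, and producers other than Microsoft Office may write a different Application string or omit it). The function name `detect_office_type` and the keyword matching are illustrative choices, not part of any library:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

def detect_office_type(data: bytes) -> str:
    """Return 'docx', 'xlsx', 'pptx', or 'unknown' for a package held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        try:
            app_xml = package.read("docProps/app.xml")
        except KeyError:
            return "unknown"  # app.xml is optional, so it may be missing

    root = ET.fromstring(app_xml)
    application = (root.findtext(f"{{{EP_NS}}}Application") or "").lower()

    # Assumption: the Application string names the Office program, as in the
    # three examples quoted above.
    for keyword, extension in (("word", "docx"), ("excel", "xlsx"), ("powerpoint", "pptx")):
        if keyword in application:
            return extension
    return "unknown"
```

Because the result depends on a free-text string written by the producing application, it is best treated as a hint rather than a guarantee.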
At work we use different types of documents every day to hold data, for example DOC, XLSX, and PDF files. Sometimes we use software (like Adobe Reader) to convert Excel files to PDF.
As far as I know, another way to convert a document from Excel to PDF is to change the document type with the Save As option (correct me if I am wrong) or to change the file extension.
My question is: when we change the document type with the Save As option, does it change the code behind the file?
Another silly question: if we can convert the file by changing the extension, why are we paying for third-party software?
Every document type has its own format, so behind the scenes each type is stored differently. For example, the XLSX format is a combination of XML and ZIP compression. PDF is a rich document format created by Adobe and derived from PostScript.
When you save a document as XLSX, it is written according to the XLSX standard, so the way the data is stored changes. To answer your first question: yes, the underlying encoding changes when you save as a different type.
For the second question: converting between file formats is not always an easy task. A conversion has to re-encode the contents of the file. When you change the extension, no conversion is applied. You are only telling your computer "This is an ... file", but the encoding of the file is still unchanged.
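To illustrate that renaming does not convert anything, here is a small Python sketch that guesses the real format from the first bytes of the file rather than from its name. The function name `sniff_format` is hypothetical, and only three well-known signatures are checked:

```python
def sniff_format(data: bytes) -> str:
    """Guess the container format from the leading bytes, ignoring the file name."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip"   # XLSX, DOCX and PPTX are all ZIP packages
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"  # legacy binary DOC/XLS/PPT container
    return "unknown"
```

An XLSX file renamed to .pdf would still be reported as "zip" here, which is why a PDF viewer cannot open it.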
Suppose we are migrating a set of MS Office files from (say) a shared drive to SharePoint (e.g. SharePoint Online). We are limited to Office 2007 onwards, so file extensions like DOCX and XLSX.
We see that the size of the file changes when it is saved to SharePoint - as certain metadata is added.
(Though file sizes of non MS Office files such as PDF or JPEG do NOT change).
These MS Office files are "containers" in which are placed a number of component parts - this situation can be viewed crudely by changing the Extension of an XLSX file (say) to ZIP, and opening it with WinZip.
For good sound integrity reasons we want to assure ourselves that the "File Content" component part has not changed.
How can we identify the component parts within those containers which represent the Content?
Are such component parts invariant when saved to SharePoint as described?
If so, are there any utilities which could analyse a pair of such files and confirm whether the content is the same or has been changed? Is there perhaps some checksum we could generate from both files and compare?
If no such utility exists what sort of environment would be best for creating one? - could it be done in VB.NET and/or C# for instance?
Thanks.
This previous post relates to the same issue, but does not provide the sort of answer we need: C# - Hash contents of MS Office documents without metadata
Interesting topic.
How can we identify the component parts within those containers which represent the Content?
Within the docx you will need to assess each of the content files. Please be aware that the files within a docx are compressed using Deflate, so you will probably have to inflate them first. The content is not only the document.xml and document.xml.rels files but also:
- the header xml files (can be more than one)
- header .rels files
- footer xml files (again, multiple files)
- footer .rels files
- the media files (containing images)
You even have to check the core.xml file if SharePoint property demotion alters a field like title.
To summarise, you cannot compare the docx files at the docx level. You will need to unpack them and compare (use e.g. CRC32 or MD5) each of the "content" files.
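The unpack-and-compare approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the "content" parts are assumed to be everything under the word/ folder (which covers the document, headers, footers, their .rels files, and media in a typical docx; an xlsx would use xl/ instead), and SHA-256 is swapped in for the CRC32 or MD5 mentioned above. Python's zipfile module inflates each part when it is read.

```python
import hashlib
import io
import zipfile

def part_hashes(data: bytes, content_prefix: str = "word/") -> dict:
    """Map each content part name to the SHA-256 of its uncompressed bytes."""
    hashes = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in package.namelist():
            # Assumption: parts outside content_prefix (e.g. docProps/) are metadata.
            if name.startswith(content_prefix):
                hashes[name] = hashlib.sha256(package.read(name)).hexdigest()
    return hashes

def changed_parts(before: bytes, after: bytes, content_prefix: str = "word/") -> list:
    """Return the sorted names of content parts that were added, removed, or altered."""
    a = part_hashes(before, content_prefix)
    b = part_hashes(after, content_prefix)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `changed_parts` means every part under the chosen prefix is byte-identical after inflation. If core.xml also matters, as noted above, it would have to be compared separately.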
I am not aware of utilities providing this functionality.
Note: if you just need to upload the files to SharePoint for archiving then placing them into separate zip files might be an alternative. This is of course only an option if you just need to store the content and do not expect the users to make any changes.
Paul
We want to write a pre-commit hook on our Linux-based SVN server which runs some checks on the properties of MS Word documents (e.g. author, version, etc.) during their initial check-in.
Is there any way to read these properties with, for example, a scripting language or C++ code on a Linux system?
Depending on what version of Word you're working with, possibly.
The DOCX format is really a ZIP file which contains a number of files (many XML) that make up the Word document. It's based on the Office Open XML format. If you unzip it and look in the docProps directory that's created, core.xml contains several nodes that may be of use to you: dc:creator, cp:lastModifiedBy, cp:revision. Interrogate those with your scripting language/XML library of choice.
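As one possible way to do this on Linux, here is a minimal Python sketch using only the standard library. The function name `read_core_properties` is hypothetical, and it assumes the file is a valid DOCX that contains docProps/core.xml; any property missing from the file comes back as None.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefixes used by docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def read_core_properties(docx):
    """Return selected core properties from a DOCX path or binary file object."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    wanted = ("dc:creator", "cp:lastModifiedBy", "cp:revision")
    return {name: root.findtext(name, namespaces=NS) for name in wanted}
```

A hook script could call `read_core_properties` on the file being committed and reject the commit when, for example, the dc:creator value is None or empty.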
How could I access the source code of a .one OneNote file?
I've tried renaming the .one file to .zip, as works with .docx files, in order to access the source code, but .one doesn't seem to work like that.
Also, I've tried to open it with Notepad++, but it isn't in a plain-text format.
I regard this as a programming question because:
I'm using content-editing automation scripts (e.g. RegEx-based find-and-replace scripts). Accessing the source code of .one files would help me apply bulk automated edits to their content using RegEx.
.one files aren't technically source code - they contain the data that describes the pages in a section and their content.
Opening them as text won't show you anything meaningful as they are binary data.
Microsoft has released the way this data is structured in .one files in the following documentation. You can use this to parse the binary file to obtain the information you need.
https://msdn.microsoft.com/en-us/library/dd924743(v=office.12).aspx
https://support.office.com/en-us/article/File-format-changes-in-OneNote-2016-for-Windows-a9129622-1755-470b-91e7-b2a461194036
The .one file format is very complicated, as it has to store images and all revisions, so it's binary and not XML-based like the rest of the Office suite.
That said, if you do want to see the XML structure of the notebook or of specific page content, you can use OMSpy:
https://blogs.msdn.microsoft.com/johnguin/2011/07/28/onenote-spy-omspy-for-onenote-2010/
It works fine for 2016 Desktop.
I have created a simple Excel file using SpreadsheetGear. If I save it as an xls file
workbook.SaveAs("file.xls", SpreadsheetGear.FileFormat.Excel8);
and attach it to an email, I can open it on my phone (tested both with iPhone and Android).
If I save it as an xlsx file
workbook.SaveAs("file.xlsx", SpreadsheetGear.FileFormat.OpenXMLWorkbook);
and attach it to an email, I CANNOT open it on my phone.
If I open the xlsx file attachment on my computer and save it with no changes and attach it to an email, I now can open it on my phone.
Apparently Excel saves the file differently than SSG. The file size of the xlsx file attachment is 9 KB. When I open it on my computer and save it, the new file size is 24 KB.
Some of my users prefer the xlsx format. Is there anything I can do to make the SSG-generated file attachment open like an Excel-generated file attachment?
iOS depends on certain attributes being present in the worksheet data of the Open XML file format to properly parse these files. SpreadsheetGear does not write these attributes out because they are listed as optional in the Open XML file format specification and, also, omitting them reduces file size, as you have noted. Excel, for whatever reason, always writes out these optional attributes and other third-party components often times rely on their presence to function correctly. SpreadsheetGear V5 added a workaround to write out these attributes by enabling a certain "Experimental" option. This option was added because the OLE DB provider also exhibits this errant behavior. You might try something like the following and see if this helps in getting SpreadsheetGear to better work with your viewer:
// Requires: using SpreadsheetGear;
IWorkbookSet workbookSet = Factory.GetWorkbookSet();
// Enable the workaround that writes out the optional Open XML attributes.
workbookSet.Experimental = "OleDbOpenXmlWorkaround";
IWorkbook workbook = workbookSet.Workbooks.Open(@"C:\temp\BadWorkbook.xlsx");
workbook.SaveAs(@"C:\temp\GoodWorkbook.xlsx", FileFormat.OpenXMLWorkbook);
Please see the SpreadsheetGear.IWorkbookSet.Experimental property for more information on this feature.
From what I can tell, iOS, Android, etc. often also depend on certain other optional features of the file formats that SpreadsheetGear either doesn't support or doesn't write out by default. For instance, iOS depends on a "data cache" stored within charts to display chart series data points, and SpreadsheetGear's support for writing out this data cache is limited. This can result in charts not displaying as expected in iOS, Android, etc.