I would like to extend the question asked here to other types of byte streams. How can I map a byte stream's file extension to SAPI_ENUM_FILE_TYPE? I know that PDF files should be mapped to SAPI_ENUM_FILE_TYPE.SAPI_ENUM_FILE_ADOBE, but I am not sure how to do this for other types of files (e.g. Office documents).
Taken from the online documentation:
SAPI_ENUM_FILE_NONE - Not in use.
SAPI_ENUM_FILE_WORD - Word file (.doc file).
SAPI_ENUM_FILE_ADOBE - PDF file (Adobe).
SAPI_ENUM_FILE_TIFF - TIFF file.
SAPI_ENUM_FILE_DETACHED - Not supported in the current version.
SAPI_ENUM_FILE_P7M - Not supported in the current version.
SAPI_ENUM_FILE_XML - XML File. Supported from version 5.
SAPI_ENUM_FILE_OFFICE_XML_PACKAGE - Office 2007 file type (.docx file or .xlsx file).
SAPI_ENUM_FILE_INFOPATH_XML_FORM - InfoPath 2007/2010/2013 form (.xml file).
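The list above can be turned into a simple lookup table. The following is only a sketch in Python: the enum member names are copied from the documentation excerpt, but the table itself, the function name `sapi_type_for` and the `is_infopath_form` flag are assumptions made for illustration and are not part of the SAPI library. Note that an .xml extension alone cannot distinguish a plain XML file from an InfoPath form, so the caller has to supply that information.

```python
import os

# Extension -> SAPI_ENUM_FILE_TYPE member name, taken from the list above.
# This table is an illustrative assumption, not part of the SAPI library.
EXTENSION_TO_SAPI_TYPE = {
    ".doc": "SAPI_ENUM_FILE_WORD",
    ".pdf": "SAPI_ENUM_FILE_ADOBE",
    ".tif": "SAPI_ENUM_FILE_TIFF",
    ".tiff": "SAPI_ENUM_FILE_TIFF",
    ".docx": "SAPI_ENUM_FILE_OFFICE_XML_PACKAGE",
    ".xlsx": "SAPI_ENUM_FILE_OFFICE_XML_PACKAGE",
    ".xml": "SAPI_ENUM_FILE_XML",
}


def sapi_type_for(filename, is_infopath_form=False):
    """Return the enum member name for a file name, or None if unknown."""
    ext = os.path.splitext(filename)[1].lower()
    # The extension cannot tell an InfoPath form from ordinary XML,
    # so the caller must say which one it is (hypothetical flag).
    if ext == ".xml" and is_infopath_form:
        return "SAPI_ENUM_FILE_INFOPATH_XML_FORM"
    return EXTENSION_TO_SAPI_TYPE.get(ext)
```

Returning None for unknown extensions, rather than SAPI_ENUM_FILE_NONE, keeps "not recognised" separate from a value the documentation describes as not in use.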
I'm trying to implement a minimal version of .zip file generation following this spec: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
I don't actually need compression, I just need a way to string together a bunch of files into a single widely adopted archive format with the capability to stream in file data while streaming out the zip.
So far I'm partially successful: 7-Zip and the Windows built-in zip extractor can extract the archives just fine, but WinRAR and the macOS built-in zip extractor report corrupted-archive errors.
I can't for the life of me find the actual problem (or problems), though. As far as I can tell the .zip files are built 100% to the specification, but the spec is a big wall of text, and with sweeping changes from one zip version to the next, along with legacy attributes taking on new functions, it is a tad confusing.
Does anyone know of an extraction tool that can give me more specific errors than just "archive is corrupt"?
Or perhaps a zip generation utility where I can pick and choose between all the different ways of building a zip file so I can go and compare the results byte by byte?
Does anyone know of an extraction tool that can give me more specific errors than just "archive is corrupt"?
The unzipada tool from the Zip-Ada project will do exactly that:
Testing archive ko.zip
raised ZIP.ARCHIVE_CORRUPTED : Bad (or no) end-of-central-directory
[C:\Ada\za\unzipada.exe]
Zip.Find_First_Offset at zip.adb:589
Unzip.Extract at unzip.adb:667
Unzipada at unzipada.adb:259
By browsing the code at the reported location (for example zip.adb, line 589) you can narrow down what is wrong with the archive. To build the tool, download the sources and follow the readme.txt file. There are also pre-built binaries for Windows.
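The "Bad (or no) end-of-central-directory" error in the trace above points at the record that closes every zip file. As a rough illustration of the kind of check involved, here is a Python sketch (not the Zip-Ada logic) that locates that record and reports specific inconsistencies. The field layout follows the end-of-central-directory record in APPNOTE.TXT; the function name and messages are made up for this example, and it assumes a single-disk archive without Zip64 and without the signature bytes occurring inside the archive comment.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
CENTRAL_HEADER_SIGNATURE = b"PK\x01\x02"
# signature, disk number, disk with central directory, entries on this disk,
# total entries, central directory size, central directory offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def check_end_of_central_directory(data):
    """Return a list of human-readable problems (empty if none were found)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        return ["no end-of-central-directory signature found"]
    if pos + EOCD_SIZE > len(data):
        return ["end-of-central-directory record is truncated"]

    (_sig, _disk, _cd_disk, entries_here, entries_total,
     cd_size, cd_offset, comment_len) = struct.unpack_from(EOCD_FORMAT, data, pos)

    problems = []
    if pos + EOCD_SIZE + comment_len != len(data):
        problems.append("comment length does not match the bytes after the record")
    if entries_here != entries_total:
        problems.append("entry counts differ (multi-disk archive or bad count)")
    if cd_offset + cd_size != pos:
        problems.append("central directory offset + size does not end at the record")
    if entries_total and data[cd_offset:cd_offset + 4] != CENTRAL_HEADER_SIGNATURE:
        problems.append("no central file header at the recorded offset")
    return problems
```

Running it on the raw bytes of a generated archive, e.g. `check_end_of_central_directory(open("ko.zip", "rb").read())`, gives a first hint about which offsets or counts to compare against a known-good archive.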
As a kind of follow-up to this question, I would like to ask about binary files (such as Excel files) and versioning.
Let's say I want to use github to store a programming project. No problem there since the majority of files are text (no matter the language).
But I have also documentation. What if I put it into a folder of the github project? (I have seen projects that do this)
I have read that git is not good at this, so how can I handle versioning for, say, Excel files?
You could save your Excel file as .fods, which is a regular .ods file saved as flat XML. This format is probably not supported by MS Office, so you may need to install LibreOffice for this (it is free).
Since .fods is regular XML, it can be versioned like any text file, with diffs and (with some luck) even merges between branches.
You could also save other Open Document formats as flat XMLs:
.fodt for word processing (text) documents
.fods for spreadsheets
.fodp for presentations
.fodg for graphics
So if migrating to LibreOffice is not a problem, this is probably the best solution.
If this is not an option, you may consider using Git LFS for storing binaries. But if the files are small and you don't change them often, you can just ignore the whole problem: a few small binary files will not hurt your repository. You should just make an estimate. If you start versioning a 1 MB binary file and save 100 versions of it, the size of your repository will grow by about 100 MB (less if the file compresses well). You need a really large codebase to reach 100 MB in a repository with text source files only, so in this case your repository will consist mainly of binary files.
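The estimate above is simple arithmetic and can be written down as a tiny Python helper. This is a back-of-the-envelope sketch only: the function name and the `compression_ratio` parameter are hypothetical, and the real growth depends on how well git manages to compress each version.

```python
def estimated_growth_mb(file_size_mb, versions, compression_ratio=1.0):
    """Rough repository growth from storing every version of a binary file.

    compression_ratio is a hypothetical knob: 1.0 means the file does not
    compress at all, 0.5 means each stored version takes half its size.
    """
    return file_size_mb * versions * compression_ratio


# 1 MB file, 100 versions, no compression: about 100 MB, as in the text above.
print(estimated_growth_mb(1, 100))
```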
BTW: GitHub released a tool for measuring the size of a git repository: git-sizer. It may give you some hints about potential problems with your repository.
// First, install the dependencies:
//   npm install xlsx jsonfile
// Input file: sample.xlsx; output file: data.json
var XLSX = require('xlsx');
var fs = require('fs');
var jsonfile = require('jsonfile');

var file = 'data.json';
// Read the workbook into memory and parse it
var buf = fs.readFileSync('sample.xlsx');
var wb = XLSX.read(buf, {type: 'buffer'});
console.log(wb.Sheets);
// Write all worksheets to data.json; log only if an error actually occurred
jsonfile.writeFile(file, wb.Sheets, function (err) {
  if (err) console.error(err);
});
Interesting question. The simple answer is: write some code to convert your Excel file (.xls or .xlsx) to a JSON file and commit that content to git.
This idea is valid only for a simple excel sheet and not for complex ones involving a lot of math and charts.
I need to parse data from xlsx files. Currently I'm using Jakarta-POI (v. 3.11) to do that. It handles some xlsx files fine, but not all. I noticed that the files that are not parsed properly are "strict xlsx" files saved with Office 2013. To be more exact, these files are compliant with ISO 29500 rather than ECMA-376. The difference is that an ISO 29500 file contains relationships of type:
http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument
and Jakarta-POI is looking for:
String CORE_DOCUMENT =
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Is there a way to make Jakarta-POI read these files?
OOXML Strict Converter for Office 2010 may help if you need to resave the docs using an older format.
Some of the purl namespaces are listed on http://pyxb.sourceforge.net/PyXB-1.2.2/bundles.html (Jethro's link above appears to no longer work).
The up to date XML schema files can be found at:
http://www.ecma-international.org/publications/standards/Ecma-376.htm
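To find out in advance which files are "strict" and need resaving, the relationship type quoted in the question can be checked directly, since an .xlsx file is a zip package. The following Python sketch uses only the standard library; it assumes the package-level relationships live in the `_rels/.rels` part, and the function name `ooxml_flavor` and its return values are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# The two relationship types quoted in the question.
STRICT_REL = "http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument"
TRANSITIONAL_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def ooxml_flavor(source):
    """Classify an OOXML package as 'strict', 'transitional' or 'unknown'.

    source may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("_rels/.rels"))
    # Collect every Type attribute, regardless of element namespace.
    types = {element.get("Type") for element in root.iter()}
    if STRICT_REL in types:
        return "strict"
    if TRANSITIONAL_REL in types:
        return "transitional"
    return "unknown"
```

Files reported as "strict" are the ones to run through a converter before handing them to the parser.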
My system uses Apache POI to manage some xls files. Now I've got almost 300 xls files, but it appears that they are in an old format, so I get this exception:
The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)
Is there a way to handle that, or to automatically convert all those files to the BIFF8 format?
Convert the files to the OOXML (.xlsx) format; POI supports both BIFF8 and the newer OOXML. Download the official Microsoft converter pack:
http://www.microsoft.com/en-us/download/details.aspx?id=3
Convert the files by running excelcnv.exe -oice <input file> <output file>. You can try running it directly from your code as an external program, or create a batch file. There is a good explanation from mrdivo at social msdn here.
EDIT
The download mentioned above from microsoft.com is no longer available as of 6/21/2018. However, excelcnv.exe is a standard part of some Microsoft Office installations. It has been confirmed to be deployed with Office 2010 (Office14) and Office 2016 (Office16), and possibly other versions. It can be found at:
C:\Program Files (x86)\Microsoft Office\root\Office16 (or Office14).
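With almost 300 files, the conversion is worth scripting. Below is a minimal Python sketch that builds one excelcnv.exe command per .xls file and runs them in turn. The `-oice` flag and the executable location come from the answer above; the function names, the default path constant and the choice of .xlsx as the output extension are assumptions for illustration, so adjust them to your installation.

```python
import subprocess
from pathlib import Path

# Assumed location, based on the path given above; adjust to your installation.
EXCELCNV = r"C:\Program Files (x86)\Microsoft Office\root\Office16\excelcnv.exe"


def build_commands(src_dir, dst_dir, exe=EXCELCNV):
    """Return one 'excelcnv.exe -oice <input> <output>' command per .xls file."""
    commands = []
    for src in sorted(Path(src_dir).glob("*.xls")):
        dst = Path(dst_dir) / (src.stem + ".xlsx")
        commands.append([exe, "-oice", str(src), str(dst)])
    return commands


def convert_all(src_dir, dst_dir, exe=EXCELCNV):
    """Run the converter for every file; raises if a conversion fails."""
    for command in build_commands(src_dir, dst_dir, exe):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it easy to print the commands first and inspect them before converting anything.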
It seems Apache POI can't handle the BIFF5 format.
You could try the Java Excel API instead: http://jexcelapi.sourceforge.net/
I want to identify the file-format of the input file given to my shell script - whether a .pst or a .dbx file. I checked How to check the extension of a filename in a bash script?. That one deals with txt files and two methods are given there -
check if the extension is txt
check if the mime type is application/text etc.
I tried file -ib <filename> on a .pst and a .dbx file and it showed application/octet-stream for both. However, if I just do file <filename>, then I get
this for the dbx file -
file1.dbx: Microsoft Outlook Express DBX File Message database
and this for the pst file -
file2.pst: Microsoft Outlook binary email folder (Outlook >=2003)
So, my questions are -
Is it better to use MIME type detection every time, when the input can be anything and we need a proper check?
How to apply mime type check in this case - both returning "application/octet-stream"?
Update
I didn't want to do extension-based detection because on a Unix system we can't be sure that a .dbx file truly is a dbx file. Since file <filename> returns a line that contains the correct information about the file (e.g. "Microsoft Outlook Express DBX File Message database"), the file command is evidently able to identify the file type properly. Why, then, does file -ib <filename> not report the correct type?
Will parsing the string output of file <filename> be fine? Is it advisable, assuming I only need to identify a narrow set of data storage files of the Outlook family (MS Outlook Express, MS Office Outlook 2003, 2007, 2010, etc.)? A small text identifier like application/dbx which could be compared would be all I need.
The file command relies on having a file type detection database which includes rules for the file types that you expect to encounter. It may not be possible to recognize these file types if the file content doesn't have a unique code near the beginning of the file.
Note that the -i option to emit mime types actually uses a separate "magic" numbers file to recognize file types rather than translating long descriptions to file types. It is quite possible for these two databases to be out of sync. If your application really needs to recognize these two file types I suggest that you look at the Linux source code for "file" to see how they recognize them and then code this recognition algorithm right into your app.
If you want to do the equivalent of DOS file type detection, then strip the extension off the filename (everything after the last period) and look up that string in your own table where you define the types that you need.
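Both approaches described above, content-based recognition coded into the application and an extension lookup as fallback, can be sketched in a few lines of Python. The signature bytes below are what I believe the `file` magic database uses for these two formats ("!BDN" for PST, 0xCFAD12FE for DBX); treat them as assumptions and verify them against the magic file on your system. The function names and the returned labels are hypothetical.

```python
import os

# Assumed leading signatures; verify against your system's magic database.
PST_MAGIC = b"!BDN"              # Microsoft Outlook personal folders (.pst)
DBX_MAGIC = b"\xcf\xad\x12\xfe"  # Microsoft Outlook Express (.dbx)


def detect_mail_store(path):
    """Return 'pst', 'dbx' or None based on the first four bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == PST_MAGIC:
        return "pst"
    if head == DBX_MAGIC:
        return "dbx"
    return None


def extension_of(path):
    """DOS-style fallback: everything after the last period, lower-cased."""
    return os.path.splitext(path)[1].lstrip(".").lower()
```

Checking the content first and consulting the extension only when the content is inconclusive avoids trusting a file name that may have been changed.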