fast xls conversion on linux - linux

I need to extract text from an xls file for purpose of indexing. This is just an indexing requirement so formatting , macros sheets etc are of no concern,
using xlhtml is very heavy and heavily brings down the indexing capability.
What is the best way for this conversion ?

I used tika http://tika.apache.org/ as a daemon.
Works much better now

Related

What is the common knowledge about NPOI, EPPlus and Koogra as of 2015?

Yes, Koogra only reads. EPPlus only supports .xlsx and is buggy in edge cases.
What else should one know for choosing between them?
Is one of them much slower than others?
NPOI seems way to complicated and is a Java port, so is it worth
using?
Should one use EPPlus for .xlsx and NPOI for .xls?
What is the general knowledge about them today?
Jet/ACE OLE DB either read worksheet as strings, or as typed columns, so you either lose numbers precision or you must have headers in the first row. Thus, they are to be avoided.
No library supports XLSB.
Speed.
For a large XLS, the time of reading for NPOI:Jet:Koogra:EDR is 14:8:7:5.
For the same XLSX, the time for EPPlus:NPOI:Koogra:EDR is 52:36:20:16.
For relatively small files with many tabs EPPlus can be a bit faster than EDR.
Errors (#DIV/0!, #VALUE!) etc.
EDR and Koogra don't explicitly support errors. EDR reads them as usual strings, Koogra -- as blank cells.
NPOI and EPPlus do.
Koogra reads dates as [OLE date] numbers and they are undistinguishable from real numbers. Also it sometimes reads numbers with many decimals digits incorrectly. EDR gets this fine. So, no to Koogra.
NPOI is complicated, 5 dlls of 4 MB. Koogra and EDR are simple, 200 KB and two dlls (themselves and zip) each.
EDR works as a IDataReader, so it reads data sequentially. It also has built-in function to get a DataSet. With sequential read yoou can only go through first sheet in the work book. Koogra supports random access to cells and sheets.
EDR is based on SharpZip, Koogra is based on Ionic.Zip. The former allows to open a file from .zip a Stream which can be benefical for other parts of the project.
I haven't looked at writing aspects of NPOI, so without the need to distinguish errors, I would go with EPPlus for .xlsx and with EDR for reading .xls.

Merging PDF files in Haskell

The Preview application on the Mac allows one to merge multiple PDF files, although the functionality is rather obscure. I'm writing a utility in Haskell that needs to perform a similar task, that is, merge an arbitrary number of PDF files into one new file.
Does anyone have a suggestion as to where to start with this? Obviously if there's a library on Hackage that will do most of the work out of the box that would be ideal, but if not, then some pointers about where to start would be very much appreciated.
I'm working on pdf library, that supports parsing and generating. It is low level, higher level tools are in todo list yet (because it is hard to design good high level API).
Here is an example of unpacking and decrypting of PDF file. It is easy to implement PDF merging, but you need to be familiar with PDF internals.
ADDED:
I create a basic example of merging PDF files in Haskell. 150 lines of code total, but it lacks few features (see comments at on the top of the file). They are easy to add, so let me know if you are interested.
The PDF file format isn't that complicated. Adobe has an official specification document for it somewhere. Essentially a PDF file contains a set of numbered "objects". You'd have to get all the objects from each PDF file, renumber them so they're unique, and then you need to fiddle with the page index so all the pages actually show up.
There appear to be a couple of packages on Hackage for writing PDF files, but I don't see much for reading them. You may like to look at the source code for pdfsplit for ideas. Also HPDF.

plot the graph using different Excel files

I have some data stored in different Microsoft Excel Worksheet (.xlsx) .
Now i want to ploat a graph by using these data (which are in different .xlsx) files. How can i do this ? means which language or platform i should use or any other help related to that.
You can use Java API's to create graph if you are familiar with java. You can use Apache POI or jfreechart or openxmlformats(api provided by apache).
Matlab has the built-in function xlsread which parses data from Excel files. Depending on how the files are organized, writing some code to read them all should be easy, and concatenating the matrices of data and plotting them is also pretty easy.
if you use saveAs to save them as CSV files, you can read them in very easily to Java and split the values with String.split(","). If you want to keep them as .xlsx files, JExcelApi might be able to help, though personally I've never used it. As for graphing, I'd recommend Javaplot/gnuplot, Jzy3D, or JMathPlot depending on what you want to graph. They're free and relatively easy to use.

How to read excel(2007+ xlsx) sheet using actionscript(AIR)?

How to read excel(2007+ xlsx) sheet using actionscript(AIR)?
as3xls
An Actionscript 3 library for reading and writing Excel files. Currently reading numbers, text, and formulas from Excel version 2.0-2003 and writing numbers, text, and dates to Excel 2.0 is supported. No server-side help is needed.
SUPPORT INFORMATION
Documentation and samples are at http://code.google.com/p/as3xls/
I wrote this: https://github.com/childoftv/as3-xlsx-reader I'd love to know if it helps
Do you have any idea how... Inefficient this is?
Excel uses a complex setup for files, and unless you want to write a full-scale parser for its spreadsheets (which, believe me, will be difficult, alone to figure out what the format chars do), you'd be better off finding another solution.
Say, using a "save to XML" option would make your job a few thousand times easier, without exaggeration. AS3 has no native support for Excel, there is no real point for it to have such. But it has great integrated methods for working with XML.
If possible, save the Excel files to XML and parse those.
Better still, use databases, and parse them as XML through PHP.
I did a search and came up with this: http://code.google.com/p/php-excel-reader/
Once you've got it in PHP, passing it on to Flash is no problem at all. I'd recommend turning it into straight arrays of objects and converting it to AMF3 via Zend_Amf, AMFPHP or WebOrb, whichever one you're most comfortable with. You can then create tables, manipulate the data or whatever you like. It'd also be a lot faster and lighter than using XML.
PK
I took a look at the xlsx breakdown and it would take me 1 week to write an xlsx writer that could do basic formatting and formulas. I've only spent 1 hour perusing through the directories in an xlsx file and all you'd have to do is create the same directory structure...mostly cut and paste some strings..and then zip it and call it xlsx.
I tried this theory by manually making an xlsx file using 7zip. I downloaded childoftv's reader and, though I don't need the reader, the package includes a few zip/unzip classes that would prove helpful for anyone who wants to make a xlsx writer.
Long story short, the setup isn't complex, somebody just has to take a week out of their busy schedule to do it. I need this functionality so if nobody's done it yet, then I'll have to. Hopefully my search will find something better than a forum where the general consensus is "it's too hard, give up."

How does the .doc format work?

I recently learned about the basic structure of the .docx file (it's a specially structured zip archive). However, docx is not formated like a doc.
How does a doc file work? What is the file format, structure, etc?
It's not a direct answer to your question, but I highly recommend reading Joel Spolsky's article, Why are the Microsoft Office file formats so complicated? (And some workarounds). It will give you some insight into how complex the .doc format really is - and why. Joel also gives a very basic overview of what the .doc format consists of:
You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file
systems inside a single file. These are sufficiently complicated that you have to read
another 9 page spec to figure that out. And these “specs” look more like C data
structures than what we traditionally think of as a spec. It's a whole hierarchical file
system.
(The quote refers to Excel files but it applies to Word docs as well). Informative article and helpful in understanding why .docx and ODF files are structured and designed so much more logically when being examined from an outside perspective.
The full format for binary .doc files is documented in this pdf from (the Wikipedia article on .doc)
The basic idea behind the MS Word DOC format is an OLE Compund Document which, as Kibbee has already written, is basically a memory dump. It's a very complex and convoluted way to store documents, but if you've ever really dug into the application Word you'll know how insanely many features it has, and if you have used it in a business setting you'll have a good feeling for how it integrates with other programs in the Office series.
In general, OLE Compund Documents are very extensible structures that allows you to stuff all kinds of data into one file and even to some degree handle data you don't have an application installed for. For example, if you insert an Equation object (from the MS Equation Editor) into a document it gets stored as a sub-object which is like a file inside the file, but this object doesn't just contain the data required for Equation Editor to edit and render it, it also has a generic bitmap (or metafile, maybe) representation stored so it can be displayed, though not edited, on a machine without Equation Editor installed.
This was the why, for the how you'll have to read the specifications other people have linked to already ;)
If you want the easy way out to work with the files though, make sure your software runs on a Windows machine with Word installed, then use COM/OLE Automation to open and manipulate the documents. You won't have to worry about file format then.
Doc is the binary format of word document - here's the Microsoft Office Word 97-2007 Binary File Format Specification [*.doc] document.
The .doc format is quite complex. Like most Microsoft formats, it reflects a long history of changes between versions and legacy support. They published it not too long ago, so if you want to view it (and other pre-Office 2007 formats), knock yourself out here.
There's Microsoft Word's .doc and then there's plain text .doc. It sounds like you're wondering about the proprietary Microsoft format.
From Wikipedia:
The DOC format varies among Microsoft Office Word Formats. Word versions up to 97 used a different format from Microsoft Word version between 97 and 2003.
It wasn't until Word 2007 where .docx, although a packaged file, is not necessarily a .zip archive. It is a structured XML document.

Resources