Creating & Editing MS-Word documents on a linux server? - linux

Looking to develop server-side application that will process documents. The source documents are mostly MS-Word 2003, 2007, i.e. the MS version of Docx. Want the server application to be able to run on both linux or windows.
Wanting to know what is the best tool or library for reading and writing MS-Word files under linux. Compatibility is the most important consideration. Must preserve source document formatting including tables.
I have seen a kind of similar post here but it was specific to python. I don't care what language or libraries are used as long as they are available for windows and linux.
Must not require MS-Word to read the Word files.
I am aware of Open Office but am looking for a solution which has a high degree of compatibility with MS-Word files.
Also just came across this solution which looks promising. aspose.com
Anyone had any experience using Aspose.Words for Java or similar 3rd party packages? It looks promising but it's pricey at over $2K for an OEM subscription. That said if it delivers as advertised it may still be the best solution out there.
thanks
There have been a couple of suggestions but nothing so far which would fits the bill (or the budget).

Have you considered using b2xtranslator to convert binary .doc to .docx. (On Linux, you'd have to run it in Mono)
You could then use POI or docx4j to manipulate the docx. Not a solution if you need to save as .doc though (unless you use OO for that bit)

Ok, I'll have another go at an answer ;-)
What about using unaconv
It can convert any document OpenOffice can read to any document OpenOffice can write. You should be able to use that to convert both to/from MS-Word documents (providing they're not overly complicated which I've found open office can't handle very well).
The only caveat is that you need to have an instance of OpenOffice running on the linux server for unoconv to interact with.

Mono has recently acquired support for the system.io.packaging .net class, which allows some degree of manipulation of docx files. If the kind of thing you want to do is add/remove resources and recurse over the text, it's probably the right thing.

Related

ADO with XLSX files in Delphi XE

The issue here is ADO connection with Excel - is this still the standard way to read/write excel files within a Dephi XE environment? We're coming up with multiple issues when reading/writing using the ACEOLEDB driver (ACE 12) and this includes
Reading cells with hashtags don't return results
"Invalid Floating Point" when exporting grids.
We've also noticed that there's many versions of the ACE 12 driver out on Microsoft's website (via Access Database driver executables) and they each seem to have different issues with Delphi.
With these things in mind,
Is using ADO with Excel bad at this point?
Does anyone else have these issues and what did you do to resolve them (other than using XLS files instead of XLSX)?
ADO in Delphi is leaning to TDataSet model, which mean strictly tabular data... that excel is not. Each excel sheet has a random cells filled, some of those may constitute quazi-tabular ranges, or may not.
Depending on the installed software you can
1) use Excel application to open XLSX, read the cells and pass them to your program. This is most easy and compatible method, though is noticeably slow due to COM IPC marshalling and switching. There are tricks to fasten it, like hiding Excel window, copying arrays of data instead of cell-by-cell approach and such.
Start exploring TExcelApplication component - http://docwiki.embarcadero.com/RADStudio/XE3/en/Using_Component_Wrappers
2) If you do not want to rely on having commercial Excel installed, you may try reading XLSX files with OpenOffice. Vanilla OpenOffice can only read them though, but some other distro's can write them as well. OpenOffice also exposes external APIs both COM-based and HTTP-based. I know there are Delphi projects of Delphi - OOo interacting but personally did not used them and apart of noting that approach i can say no detailed assesment of it.
3) Microsoft also used to sell Office for Developers or such, that gave you Access and Excel kernels as redistributables, that you could pass with your application and install them and use them. Dunno if it is still feasible though.
4) there is a set of commercial components reading and writing those files directly w/o need to have external EXE's doing the job. While that would be the most fast way to work, it would only support some subset of features (which may or may not be ok for your particular goals) and may have troubles with "future compatibility" as Microsoft would roll out updated versions of XLS and XLSX formats (which again may be of some or none concern to you). Like there was TXLSFile for Biff8 format, there is for example OExport library. There is also a component from well known TMS Studio and maybe some more.
5) You can join some open-source project and try to enhance it for your needs, then again that depends upon how much the subset you need.
I know, many people succefulyl use OLE DB to access Excel data, but for me it always sounded as some perversino, because Excel files do not have any internal regular data arrangement at all, less so strictly-tabular RDBMS-like one.
I've really only ever found it possible to manipulate Excel via COM. I've tried alternatives, like ADO, but they always seem so full of archane bugs - or maybe it's just my ignorance.
COM is definitely slow in certain areas. I've used a combination of COM and (within the Excel file) VBA to achieve what I need to do.
Given that Excel is NOT going to go away BUT Microsoft cannot be relied upon not to betray its users by, for instance, doing away with VBA and COM support, it would be great if someone, somewhere (and I wish I had the skills) could create some proper support for Excel from Delphi.

Tracking Excel files in Version Control

We are branching out beyond the development team and trying to get other groups within my company to use version control for important documents that need change tracking. One frequent need is for Excel spreadsheets. These are large spreadsheets, modified fairly frequently (weekly or monthly) but with only a small portion of the cells changed each time.
Just sticking the files in subversion (the particular tool we are using) gives a history of changes and keeps old versions. And the TortoiseSVN client makes it easy for non-technical users. Recent versions of TortoiseSVN even contain a script which can be used to perform nice visual diffs between Excel documents.
My remaining concern is disk space. These are large documents. The diffs between versions are small, but I worry that the version control will notice that the file is binary and fall back to storing each version separately. Does anyone know of a solution to this? For instance, a format we could save in in which the diffs would be small so only differences would be saved, or a version control system which is specifically aware of Excel files? I have not yet done performance testing, but our version control server is already badly taxed and if there is a better solution I'd love to know what it is.
Currently SVN cannot efficently store those types of files. There has been some discussion about it though
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=651443
This SO question shows a graph when storing an OpenXML office document. The results were pretty linear
Will Subversion efficiently store OpenXML Office documents?
Although your question wasn't specifically about that format it may still apply. You might just need to run a test in SVN and see what kind of storage it takes. SVN is pretty good at storing binary files, so it might not be too terrible. The SO question above also mentions saving the file as a plain text XML 2003 document, which you might investigate also.
One consideration is using Team Foundation Server for source control (if that's an option), which will just store your delta changes, although it may be a bit heavy for what you're looking for.
From my understanding, binary vs. text doesn't have an impact on the storage size in SVN: http://help.collab.net/index.jsp?topic=/faq/svnbinary.html

Product backlog management using Excel (or Excel's XML Spreadsheet 2003 feature with SVN merging)

We have Product Backlog in an Excel Spreadsheet that we also commit to SVN, so everyone can open use it and update to latest version.
The problem we have is:
How do you enable Excel spreadsheet to be simultaneously used by many people and not override each other's data. Some kind of merging data? Is it at all possible?
We would like to keep our data in Excel spreadsheet since it provides all the functionality we need. We tried Google spreadsheets that are much better in terms of collaboration, but they don't support cell drop-down values...
EDIT
I've found out that I can also save my Excel file in XML format (XML Spreadsheet 2003). This format preserves all formating, formulas, conditional styles etc. Only graphs fall out, but I can live with that. I suppose that SVN Merge tools more or less support XML file merging when multiple people work on it.
So I thought of this feature, but I don't know how much does Excel change XML document between saves. Maybe it reorders a lot of stuff hence making it impossible for merge tools to work with expected results. Anyone has any experience with this?. It would basically make it possible for multiple people working on the same file.
EDIT2
I've tried managing Excel spreadsheets in a plain XML file (XML Spreadsheet 2003 file format) that makes it possible for SVN to merge it since it's a text file. I've learned to not use conditinal formatting, because it doesn't always work, so you'll have problems reopening the file. Also graphs won't be preserved.
Testing outcome: I've tried simultaneous work on the same XML, but Excel works with these files in no particular order (looks like it converts it to Excel and back to XML when saving it), so even if you make a small change to data, your XML will look completely different. So this format is a no go for collaborative purposes with merge capabilities.
FYI note: I've moved my operations from Excel to a more collaborative and similarly capable solution: Google Spreadsheets. Simultaneous collaboration is just working and working great (hopefully Microsoft will someday make collaboration this way on all its Office range) and it supports versioning as well as all the capabilities I'm using in Excel. COnditional formatting and with some additional script code I can use conditional formatting on whole rows based on single cells as well, so I can easily background colour whole rows of when set particular story status to "Completed".
Excel data can be merged but it's somewhat hard and error prone.
What I've found useful is to put the svn:needs-lock property on that kind of hard-to-merge files. The file will be read-only by default to communicate "do not edit this" and acquiring a svn lock makes it a read-write file, while preventing other users from acquiring a lock at the same time. It's a communication mechanism, not a 100% foolproof merge conflict prevention mechanism.
More information in svnbook.
Excel has a built-in share/merge capability (search the help for "share workbook") but it's meant for live copies on shared drives. Still, for this particular problem that might be a better solution than SVN.
Excel doesn't support the scenario you described, you need to use a multi-user database, or an proper work-tracking system. There are some hosted ones you can try (Jira, FogBugz) but you're not going to get multi-user with Excel (unless you use the shared spreadsheet option, but as David Moles said, that's meant for shared xls files on a file server).
SVN allows eveyone to get the latest version and work on the latest version without a "checking out" / "make read-only". This enables multi-users working the same file, and would assume that the Merge features only work on Text docuements (code files)
(This is based on my possibly limited experience with SVN)
Its probabaly time to upgrade from an Excel system to a more systematic multi-user database system.
I wonder if you're using the right tools for the job. Perhaps you should be investigating some Scrum tools to help you manage your projects.
Excel works well in a simple, non-shared scenario but I would investigate something more powerful to allow your team to manage the scrum process simultaneously.
After hearing about this project on several podcasts, I've started using Zen. It's based on a kanban board.
I'm pretty sure TortoiseSVN has some hooks (merge-docx.js, et al) which enable you to merge complex file types like office documents.
Or perhaps something like this (at a push)
have you looked at the Greenhopper planning backlog tool.
You could try XLLoop. This lets you write excel functions (UDFs) on an external server. It includes server implementations in many different languages.
So instead of keeping the data on your spreadsheet, you could have it stored in an external database then create a template excel sheet that calls the functions on your server. This will allow the data to be automatically updated and also allow multiple people to use the sheet at the same time.
BTW, I work on the project so let me know if you have any questions.

How to manage application resources?

We are developing a web application which is available in 3 languages.
There are these key-value pairs to translate everything. At this moment we use Excel (key, german, french, english) for this. But this does not work well ... if there is more than 1 person editing this file, you have no chance to automatically merge the different files.
Is there a good (and free) tool which can handle this job?
--- additional information ---
(This is a STRUTS application) But the question is how to manage these kinds of information in general (or at least in an conveinient way, which also supports multiple users editing this single file ("mergeable" filetypes))
Why not use gettext and manage separate .po files? See that blog entry.
If you can store this information in plain text then you will be able to use a version control system like subversion to help you with merging changes. Subversion is free.
The free guide (the "Red Book") to subversion gives a fairly good explanation of how this kind of merging works.
http://svnbook.red-bean.com/en/1.5/svn.basic.vsn-models.html#svn.basic.vsn-models.copy-merge
EDIT: Another thought - if you really want to stay using a spreadsheet - Google Docs supports simultaneous editing of a spreadsheet. You could import your existing spreadsheet and get your multi-user merging wishes for free with very little change to how you work.
Good Question.
There are some "Best Practice" depending on what you actually code in (java, ms-windows c#).
I solved this (but I think there must be a better way) by using a SQL db instead of excel file, and a wrote a plug for VS (VB6,........,..., emacs) that was able to insert new keys into the db without going to round trip with version control. The keys are the developers name of what they think is a best guess for a label. (key => save, sv => "spara", no => "", en => "save").
This db can then be generated as a module, class, obj, txt, to appropriate code(platform)
and can be accessed, depending on the ide, so in c#, bt,label = corelang.save;
Someone else can then do all the language stuff, and then we just update the db and rerun the generation to the platform resources.
After years of seeing localization done, including localization at large companies like Sony. I can only say the "standard" is Excel :)
There are tons of good ideas around, and probably many better ways to do it, but in real-life excel seems to be the best/cost effective solution that doesn't require training or making complex new tools to get the job done.
Found out, that Intellij Idea (at leas in version 7 and 8) has an editor for application resources. But it is not free at all. And it does not scale for bigger resource files with more than 1.000 keys.
Another good choice would be to use Google's spreadsheets ... for those who don't know it - it is like an "online Excell web-application". It can handle concurrent access from multiple users. Yay! But sadly, it comes from Google. This makes it impossible to be used in commercial projects.
So,
still searching...
cheers,
mana

What is a good web-based Grid that accepts Excel clipboard data?

Any good recommendations for a platform agnostic (i.e. Javascript) grid control/plugin that will accept pasted Excel data and can emit Excel-compliant clipboard data during a Copy?
I believe Excel data is formatted as CSV during "normal" clipboard operations.
dhtmlxGrid looks promising, but the online demo's don't actually copy contents to my clipboard!
I'm currently using dhtmlxGrid and we have the Excel copy/paste functionality working. dhtmlXGrid is the most full featured javascript grid package that I've found.
On their website, dhtmlXGrid claims to support Clipboard functionality in the Professional version. (However, I noticed the Sample on their site isn't working on my Firefox. EDIT: It's probably the permissions issue that Nathan mentioned.)
In any case, we had to do some extra work to get the exact Excel copy and paste functionality we wanted. We essentially had to override some of their functionality to get the desired behavior. Their support was pretty good in helping us come up with a solution.
So to answer your question, you should be able to get them to support copy and paste if you purchase the Professional version. I'm just warning you that it may take some additional work to fine tune that behavior.
Overall, I'm happy with dhtmlXGrid. We use a lot of their features. Their support is pretty good. They usually take one day to respond since they are in Europe (I think). And Javascript is by its very nature open source so I can always dive in when I need to.
Not an answer, but a warning: my company bought the 2007 Infragistics ASP.NET controls just for the Grid, and we regret that choice.
The quality of API is horrible (in our opinion at least), making it very hard to program against the grid (for example, inconsistent naming conventions, but this is just an inconvenience, we have complaints about the object model as well).
So I can't say that I know of a better option, I just know I will give a try to something else before paying for Infragistics products again (and the email support we got was horrible as well).
I was wrestling with this problem several years ago (2004 I think). We ran into the problem that Firefox doesn't allow scripts to read the clipboard by default (but you can grant access to the clipboard).
There's other ways of reading the clipboard data as well...Flash, for instance, can read the clipboard. There's a good article on ajaxian to explain how do to this behind the scenes.
In the end, we couldn't find a web-based Grid that fit the bill, so we had to create our own in a mixture of Actionscript and Javascript.
I'd hate to be Captain Obvious here...but what about a plain old .NET Gridview control? You can copy Excel data into it and out of it...and you can run it on any system with the .NET platform installed.
http://dhtmlx.com/dhxdocs/doku.php?id=dhtmlxgrid:clipboard_operations

Resources