I want to write a desktop program that prints Microsoft Office files (doc, docx, xls and xlsx) on a Linux machine, but I don't know how to print them without the output being corrupted.
Is there a way to print the file, or convert it to another format, so that it looks 100% the same as it does in Microsoft Office?
The LibreOffice API might be a good place to start, particularly the examples:
http://api.libreoffice.org/
I haven't used the API myself but have used OpenOffice/LibreOffice as an alternative to Word for quite a while.
However, you say '100%' the same as in Office? I wouldn't be confident of that. Depending on the document it's likely to be fine, but there are some things which don't seem to convert well. If you're working on Linux, you're not likely to have the same fonts installed as whichever Windows/Mac machine made the document.
If the documents you're processing are all of the same or a similar layout/template, and you're able to test a few first, it should be fine. But if you're processing any sort of Word document, some may not convert completely without a bit of human input. It depends how much difference you can tolerate. If you want completely consistent printing across platforms, I guess that's what PDFs are for.
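As a rough sketch of the PDF route: LibreOffice ships a command-line converter (soffice --headless --convert-to pdf) that can be driven from a script. The helper names below are made up for illustration, and the code assumes that soffice is on the PATH. The resulting PDF can then be handed to whatever print tool the system provides. Font substitution can still change the layout, so this does not guarantee a 100% match.

```python
import shutil
import subprocess
from pathlib import Path

# File types from the question; anything else is rejected up front.
OFFICE_SUFFIXES = {".doc", ".docx", ".xls", ".xlsx"}


def build_convert_command(source, out_dir, soffice="soffice"):
    """Return the argv list that asks LibreOffice to convert `source` to PDF."""
    source = Path(source)
    if source.suffix.lower() not in OFFICE_SUFFIXES:
        raise ValueError(f"unsupported file type: {source.suffix}")
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_to_pdf(source, out_dir, soffice="soffice"):
    """Run the conversion and return the path of the PDF LibreOffice writes."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} not found on PATH")
    subprocess.run(build_convert_command(source, out_dir, soffice), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

Calling convert_to_pdf("report.docx", "out") would be expected to leave out/report.pdf behind; report.docx is a placeholder file name.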
We moved to SharePoint Online (SPO) this year and are collaborating heavily in Excel365. The experience is something of a disaster and I am hoping that I might get some responses here about "best practice" or people's experiences in that environment.
Observation #1:
All Excel files now open with AutoSaveOn=True. This is a total disaster because people often open Office files simply to take a look. Early users were horrified when they realized they had accidentally changed an archive file. We addressed this issue by setting AutoSaveOn=False in Auto_open.
(The AutoSaveOn=True experience requires a complete rethink of accustomed operating patterns. The established pattern when changing a file is: 1) open the original; 2) make changes; 3) save under a new name. With AutoSaveOn=true, you must: 1) make a copy of the file in the file manager; 2) open the copied file; 3) make changes. We have not managed to retrain our user base.)
We now get a different issue. In a network environment where 5 users have an Excel open purely for reference, the 6th user who is editing that file often has to fight against Office/SPO to somehow get it saved. There exists a concept of a "file lock" and someone has it. Who is unclear, and releasing it seems next to impossible. Opening an Office file "read-only" on purpose does not seem to be a concept Microsoft recognizes.
Observation #2:
All our production Excels are XLSMs, not least because of the Auto_Open above. That is, working Online is not an option. (Which may be just as well because the online experience does not compare for the Excel pro.) All our users therefore synchronize relevant SPO archives to their OneDrive (OD). This should be a good thing anyway, since we are also nomadic folk, often working from the road when there is no internet. SHOULD NOT be a problem, right?
Right ... Turns out that Excel surreptitiously replaces links to other Excels on the OD with links to SPO such that when you hit the road, Excel will just hang up trying to access stuff on an unavailable Internet/SPO :( The user is forced to repoint all links manually to his OD to make it work.
Except it won't - because the Auto_Open now fails. Turns out, AutoSaveOn is not a valid property in the absence of a network connection. It seems that there are two code bases of Excel. One that is invoked with a network present, the other when there is none.
Observation #3:
When the user has somehow survived the horrific offline experience and returns to a network, the spreadsheet links fail again. Excel now throws a completely incomprehensible hissy-fit, complaining that none of the OD files exist, even though they are patently there. The only way I know to cure this issue is by going through all links again and repointing them to the identical paths. (Behind the scenes, Excel of course uses this dialog to replace links to OD with links to SPO.)
Conclusion:
It seems that the modern Excel really wants to work in AutoSave mode but Microsoft doesn't really manage the experience. There is also no transparent switching possible between working offline and working online. All of this appears to be owed to two different code bases - online and offline - trading under the same name "Excel". We do not really require the online experience. It would be perfectly adequate for our purposes to work with the offline code on OD only, and OD can update SPO when a network is present.
Question: Does anyone know how to fool Excel into using the offline code base even in the presence of a network connection?
Any other experiences or pointers?
I'm new to borb, which seems to me a very promising Python package.
Trying to load a small sample of PDF documents, just to get some hands-on experience, I've found that borb can open some of them without problems; in some cases I got messages such as "Unable to process XMP meta-data"; in yet other cases I got assertion errors.
Thus, before posting specific issues, I'm looking for information about current limitations of borb, with reference to PDF versions, and on tools I could use first to detect files to be considered invalid PDF documents. Thanks.
I'm using borb release v2.0.20, just cloned from GitHub, and Python 3.6.5 on Windows 10.
Disclaimer: I am Joris Schellekens, author of the aforementioned library borb.
The problem is that the PDF spec (ISO-32000) leaves some room for interpretation at various points throughout. That means some PDF libraries will interpret the spec in a given way, and produce documents that may not always be compliant according to other tools.
borb tends to be very strict when it comes to PDF parsing. As soon as an error is detected, it will throw the stack trace right back at you, whereas other PDF software (e.g. Adobe Reader) tends to be much more forgiving about the PDF documents it accepts as input.
Although I certainly understand your frustration at being unable to process what you perceive to be "perfectly good PDF documents", I assure you that processing them might lead to even more issues.
I know for instance that there are cases where Adobe Reader tries to correct a bad PDF document, and as a result ends up corrupting the signatures in the document (very undesirable).
If you experience issues, and you can share the PDF, feel free to log a ticket on the GitHub repository.
Off the top of my head, the current limitations of borb are:
signatures
encrypted PDF documents
XREF not found
some images with transparent pixels
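As for a first-pass tool to flag suspicious files before handing them to a parser, here is a minimal standard-library sketch. It is a heuristic only and not part of borb: the function name and the checks are assumptions chosen to mirror the limitations listed above (missing cross-reference pointer, encryption), and a file that passes can still be invalid.

```python
def precheck_pdf(data: bytes) -> list:
    """Return human-readable warnings; an empty list means nothing obvious was found."""
    warnings = []
    if not data.startswith(b"%PDF-"):
        warnings.append("missing %PDF- header at start of file")
    # The trailer lives at the end of the file, so only the tail is inspected.
    tail = data[-1024:]
    if b"%%EOF" not in tail:
        warnings.append("no %%EOF marker near end of file")
    if b"startxref" not in tail:
        warnings.append("no startxref keyword near end of file")
    # Crude check: an /Encrypt entry usually means the document is encrypted.
    if b"/Encrypt" in data:
        warnings.append("document appears to be encrypted")
    return warnings
```

To use it, read a file in binary mode and pass the bytes, for example precheck_pdf(open(path, "rb").read()), then skip or report any file that yields warnings.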
I'm starting a project at work where the workers are supposed to get a scanner to scan barcodes on the wares that they use. Optimally we would have a system supporting this, but we don't.
My thought is to have Excel running in the background on a computer that the workers also use for several other things, like reading newspapers and looking up today's weather. My understanding is that scanners work just like a keyboard when connected to a computer. Problems may then arise if someone scans a barcode while someone else is reading the newspaper in Internet Explorer: the barcode might pop up as a number in the URL bar when it should really go to a specific cell in Excel.
My question: Is it possible to make a scanner always send its values (the scanned barcodes) to Excel, even though the computer may be in use for something else at the same time?
Thanks for every thought and comment!
Have a nice weekend!
I do not think Excel would be the best solution for this. It may be possible by linking to the scanner API and using external libraries to listen on the scanner port. However, this kind of application is best installed as a system service (e.g. a Windows Service) or as some other background application written in .NET, Java, Python or whatever. Excel is not the first-choice technology for this sort of thing. Excel can, however, well be used for outputting the data.
What is more, honestly, the solution and feasibility will depend on the scanner API or driver.
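To illustrate the split between a background collector and Excel as the output, here is a minimal sketch of the collector side. It assumes the scanner can be read as a stream of lines, one barcode per line (for example through a serial or vendor interface rather than keyboard emulation), which depends entirely on the scanner. The function name and the CSV layout are made up for illustration. Excel can then open the resulting CSV file.

```python
import csv
import io
from datetime import datetime


def log_scans(lines, out_file, now=datetime.now):
    """Append one CSV row (timestamp, barcode) per non-empty line; return the row count."""
    writer = csv.writer(out_file)
    count = 0
    for line in lines:
        code = line.strip()
        if not code:
            continue  # ignore blank reads
        writer.writerow([now().isoformat(timespec="seconds"), code])
        count += 1
    return count


if __name__ == "__main__":
    # Demo with in-memory stand-ins for the scanner stream and the CSV file.
    fake_scanner = io.StringIO("4006381333931\n\n5901234123457\n")
    sheet = io.StringIO()
    print(log_scans(fake_scanner, sheet), "rows written")
```

In a real deployment, lines would be whatever iterable the scanner driver exposes, and out_file a file opened in append mode with newline="".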
BACKGROUND
I am using a commercial application on Windows that creates a drawing.
This application allows only two output options: (1) save as a bitmap file and (2) print to a printer.
The bitmap is useless for my purposes - I want the vectors.
Looking at the print output (I sent it to the Windows XPS print driver), it seems clear, based on the amount of zooming I can do without loss of detail, that the underlying vectors are being sent to the print driver.
Once I get the vectors, I will be writing some code to transform them for some other use.
MY QUESTION
What are my options for getting the vectors from the print output? (I am open to both commercial and open source.)
OPTIONS I HAVE THOUGHT OF SO FAR
Take the bitmap and use a program like VectorMagick to trace it. I have tried this approach. It does not produce the fidelity I seek even when the original bitmap is large. Practically speaking, I believe that no tracing approach will give me the quality of vectors I need.
Print to the Adobe PDF driver. This technically works. I have Adobe CS4, so I can print to it, save the resulting PDF, import the PDF into Illustrator and then export it as some other vector format. The problem with this approach is money/licensing. I own a personal copy of Adobe CS4 - so this is fine for me. But I need to capture the vectors at work for business purposes - and no, I'm not going to install my personal copy of CS4 at work.
Is there a "print driver" that captures the print output directly into a vector format? I have seen some commercial ones via Google. If you've used them, I would like to hear about your experience with this technique. I could also write my own; in that case, do you have links to any existing code that I can start with?
If this is an ongoing solution you need, then you might need to buy something or build your own. If it's a one-time affair, you might look at using an 'older' Lexmark PCL printer driver. I'd recommend something like the T610. If you download the PCL driver and install it, you can modify the defaults and change the Graphics option from XL or Autoselect to GL/2. This will force the driver to produce GL/2 output, which is vector (GL/2 is a plotter language). This might do the trick for you. Other printer drivers may have the ability to force GL/2 (vs. raster) but I'm not sure. I used to work for Lexmark and have used this before for a similar requirement.
Ensure you use the Lexmark 'Custom' driver, as I don't think the Microsoft-based one supports this feature.
...pausing while I investigate a few things............I'm back...
Another option is to find another GL/2 driver or build your own... I just took a few minutes to search the web and came up with a few other options that might work.
Build your own:
I've built drivers (minidrivers) using the Windows Driver Development Kit (DDK); it's quite simple to construct basic drivers. It looks like there is a setting you can use to enable GL/2 output: Enabling HP-GL/2 Vector Graphics Support (PCL-5e) in the GPD
Alternate drivers:
Depending on the OS you are on, there is probably a 'generic' GL/2 driver built in. I believe XP has a Hewlett-Packard HP-GL/2 Plotter. You might need to check the license (as with the Lexmark solution), but it might work for you, and as it's part of the OS there shouldn't be any concern about using it. It's probably written by and copyrighted to Microsoft.
Keep in mind you will have to do some work to convert GL/2 to whatever output you want, but it should be a matter of writing a simple translator to convert each set of commands. There may be tools out there to help. Here is a quick link to the Lexmark GL/2 reference, which might be enough to get you going; check out the GL/2 information under the PCL section: Lexmark Technical Reference Guide
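To give an idea of what such a translator could look like, here is a minimal sketch that handles only the basic pen commands PU (pen up), PD (pen down) and PA (plot absolute) with absolute coordinates, and ignores everything else. It assumes the HP-GL/2 has already been extracted as plain text from any surrounding PCL escape sequences. The function names are made up for illustration, and real driver output will use more commands than this.

```python
def hpgl_to_polylines(text):
    """Convert PU/PD/PA commands (absolute coordinates) into lists of (x, y) points."""
    polylines = []
    current = []       # points of the polyline currently being drawn
    pos = (0.0, 0.0)   # current pen position
    pen_down = False
    for raw in text.replace("\n", "").split(";"):
        cmd = raw.strip()
        if len(cmd) < 2:
            continue
        op, args = cmd[:2].upper(), cmd[2:]
        if op not in ("PU", "PD", "PA"):
            continue  # IN, SP and all other commands are ignored in this sketch
        if op in ("PU", "PD"):
            new_state = (op == "PD")
            if pen_down and not new_state and len(current) > 1:
                polylines.append(current)  # lifting the pen ends the polyline
            if new_state and not pen_down:
                current = [pos]            # lowering the pen starts a new one
            elif not new_state:
                current = []
            pen_down = new_state
        nums = [float(n) for n in args.replace(" ", ",").split(",") if n]
        for x, y in zip(nums[0::2], nums[1::2]):
            pos = (x, y)
            if pen_down:
                current.append(pos)
    if pen_down and len(current) > 1:
        polylines.append(current)
    return polylines


def polylines_to_svg_path(polylines):
    """Render polylines as the `d` attribute of an SVG <path> (y axis not flipped)."""
    return " ".join(
        "M " + " L ".join(f"{x:g},{y:g}" for x, y in line) for line in polylines
    )
```

For example, the input "IN;SP1;PU0,0;PD100,0,100,100;PU;PA200,200;PD300,200;" yields two polylines, which the second helper turns into an SVG path string. Note that plotter coordinates have the y axis pointing up, so a real converter would also need to flip and scale them.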
Postscript:
The last option I have is to use a generic Postscript driver. Postscript should output the vector images as vector graphics in the Postscript but my knowledge of this is limited at best.
Output:
If you need the output routed to a file, you can set the port to FILE:, which requires user intervention, or install something like Redmon (or connect with me and I'll send you our port monitor that allows for automatic output to file).
Hope this helps in some way.
My favorite is the open source (GPL) PDFCreator
http://sourceforge.net/projects/emfprinter/
I recently learned about the basic structure of the .docx file (it's a specially structured zip archive). However, docx is not formatted like a doc.
How does a doc file work? What is the file format, structure, etc?
It's not a direct answer to your question, but I highly recommend reading Joel Spolsky's article, Why are the Microsoft Office file formats so complicated? (And some workarounds). It will give you some insight into how complex the .doc format really is - and why. Joel also gives a very basic overview of what the .doc format consists of:
You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file
systems inside a single file. These are sufficiently complicated that you have to read
another 9 page spec to figure that out. And these “specs” look more like C data
structures than what we traditionally think of as a spec. It's a whole hierarchical file
system.
(The quote refers to Excel files but it applies to Word docs as well). Informative article and helpful in understanding why .docx and ODF files are structured and designed so much more logically when being examined from an outside perspective.
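The difference between the two containers is visible in the first few bytes of a file. The sketch below is an illustration with made-up function and key names, not a parser for either format. It recognizes the OLE compound file signature used by binary .doc/.xls files and the ZIP signature used by .docx/.xlsx, reads two header fields from the former and lists the parts of the latter. The header offsets are taken from the published compound file header layout.

```python
import io
import struct
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header


def describe_office_file(data: bytes) -> dict:
    """Classify raw file bytes as an OLE compound file, a ZIP package, or unknown."""
    if data[:8] == OLE_MAGIC and len(data) >= 32:
        # Major version sits at offset 26 and the sector shift at offset 30,
        # both as little-endian 16-bit integers.
        major, = struct.unpack_from("<H", data, 26)
        shift, = struct.unpack_from("<H", data, 30)
        return {"kind": "ole", "major_version": major, "sector_size": 1 << shift}
    if data[:4] == ZIP_MAGIC:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            return {"kind": "zip", "parts": sorted(package.namelist())}
    return {"kind": "unknown"}
```

Passing the bytes of a .docx file would be expected to list parts such as word/document.xml, while a binary .doc file would be reported as an OLE container with its sector size.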
The full format for binary .doc files is documented in this PDF (linked from the Wikipedia article on .doc).
The basic idea behind the MS Word DOC format is an OLE Compound Document which, as Kibbee has already written, is basically a memory dump. It's a very complex and convoluted way to store documents, but if you've ever really dug into Word as an application you'll know how insanely many features it has, and if you have used it in a business setting you'll have a good feeling for how it integrates with other programs in the Office series.
In general, OLE Compound Documents are very extensible structures that allow you to stuff all kinds of data into one file and even, to some degree, handle data you don't have an application installed for. For example, if you insert an Equation object (from the MS Equation Editor) into a document, it gets stored as a sub-object, which is like a file inside the file. This object doesn't just contain the data required for Equation Editor to edit and render it; it also has a generic bitmap (or metafile, maybe) representation stored so it can be displayed, though not edited, on a machine without Equation Editor installed.
This was the why; for the how, you'll have to read the specifications other people have linked to already ;)
If you want the easy way out to work with the files though, make sure your software runs on a Windows machine with Word installed, then use COM/OLE Automation to open and manipulate the documents. You won't have to worry about file format then.
Doc is the binary format of word document - here's the Microsoft Office Word 97-2007 Binary File Format Specification [*.doc] document.
The .doc format is quite complex. Like most Microsoft formats, it reflects a long history of changes between versions and legacy support. They published it not too long ago, so if you want to view it (and other pre-Office 2007 formats), knock yourself out here.
There's Microsoft Word's .doc and then there's plain text .doc. It sounds like you're wondering about the proprietary Microsoft format.
From Wikipedia:
The DOC format varies among Microsoft Office Word Formats. Word versions up to 97 used a different format from Microsoft Word version between 97 and 2003.
It wasn't until Word 2007 that .docx was introduced. A .docx file is a packaged file: a ZIP archive containing structured XML documents.