What are known limitations of borb related to PDF versions?

I'm new to borb, which seems to me a very promising Python package.
Trying to load a small sample of PDF documents, just to get some hands-on experience, I've found that borb can open some of them without problems; in some cases I got messages such as "Unable to process XMP meta-data"; in yet other cases I got assertion errors.
Thus, before posting specific issues, I'm looking for information about the current limitations of borb with respect to PDF versions, and about tools I could use first to detect files that should be considered invalid PDF documents. Thanks.
I'm using borb release v2.0.20, just cloned from GitHub, and Python 3.6.5 on Windows 10.

Disclaimer: I am Joris Schellekens, author of the aforementioned library borb.
The problem is that the PDF spec (ISO-32000) leaves some room for interpretation at various points throughout. That means some PDF libraries will interpret the spec in a given way, and produce documents that may not always be compliant according to other tools.
borb tends to be very strict when it comes to PDF parsing: as soon as an error is detected, it will throw the stacktrace right back at you, whereas other PDF software (e.g. Adobe Reader) tends to be much more forgiving about what it accepts as input.
Although I certainly understand your frustration at being unable to process what you perceive to be "perfectly good PDF documents", I assure you that processing them might lead to even more issues.
I know for instance that there are cases where Adobe Reader tries to correct a bad PDF document, and as a result ends up corrupting the signatures in the document (very undesirable).
If you experience issues, and you can share the PDF, feel free to log a ticket on the GitHub repository.
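For triaging a folder of samples, something like the following sketch can separate the files borb accepts from the ones it rejects. This assumes borb 2.x's PDF.loads entry point and a "samples" directory, both of which you may need to adjust:

from pathlib import Path

from borb.pdf.pdf import PDF

def try_load(path: Path) -> str:
    # attempt a full parse; borb raises (often AssertionError) on bad input
    try:
        with open(path, "rb") as fh:
            PDF.loads(fh)
        return "OK"
    except AssertionError as e:
        return "assertion failed: %s" % e
    except Exception as e:
        return "%s: %s" % (type(e).__name__, e)

for p in sorted(Path("samples").glob("*.pdf")):
    print(p.name, "->", try_load(p))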
Off the top of my head, the current limitations of borb are:
signatures
encrypted PDF documents
XREF not found
some images with transparent pixels
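As for tooling to flag invalid files up front: borb itself does not ship a validator, but one option, assuming the qpdf command-line tool can be installed on your system, is to run its checker over the sample set first:

import subprocess
from pathlib import Path

def qpdf_says_ok(path: Path) -> bool:
    # qpdf --check exits with status 0 when it finds no errors in the file
    result = subprocess.run(["qpdf", "--check", str(path)],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

for p in sorted(Path("samples").glob("*.pdf")):
    print(p.name, "valid" if qpdf_says_ok(p) else "suspect")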

Related

Merging PDF files in Haskell

The Preview application on the Mac allows one to merge multiple PDF files, although the functionality is rather obscure. I'm writing a utility in Haskell that needs to perform a similar task, that is, merge an arbitrary number of PDF files into one new file.
Does anyone have a suggestion as to where to start with this? Obviously if there's a library on Hackage that will do most of the work out of the box that would be ideal, but if not, then some pointers about where to start would be very much appreciated.
I'm working on a PDF library that supports parsing and generation. It is low-level; higher-level tools are still on the to-do list (because it is hard to design a good high-level API).
Here is an example of unpacking and decrypting a PDF file. It is easy to implement PDF merging, but you need to be familiar with PDF internals.
ADDED:
I've created a basic example of merging PDF files in Haskell. It is 150 lines of code in total, but it lacks a few features (see the comments at the top of the file). They are easy to add, so let me know if you are interested.
The PDF file format isn't that complicated. Adobe has an official specification document for it somewhere. Essentially a PDF file contains a set of numbered "objects". You'd have to get all the objects from each PDF file, renumber them so they're unique, and then you need to fiddle with the page index so all the pages actually show up.
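To make the renumbering idea concrete, here is a toy sketch in Python rather than Haskell. Objects are modeled as plain dicts and indirect references as ("ref", n) tuples; a real implementation parses actual PDF object syntax, but the bookkeeping is the same:

def rewrite_refs(obj, renumber):
    # every indirect reference inside an object must be rewritten too
    if isinstance(obj, tuple) and obj[0] == "ref":
        return ("ref", renumber[obj[1]])
    if isinstance(obj, dict):
        return {k: rewrite_refs(v, renumber) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rewrite_refs(v, renumber) for v in obj]
    return obj

def merge(documents):
    # documents: list of (objects dict, page object numbers) pairs
    merged, pages, next_num = {}, [], 1
    for objects, page_nums in documents:
        renumber = {}
        for old in objects:
            renumber[old], next_num = next_num, next_num + 1
        for old, obj in objects.items():
            merged[renumber[old]] = rewrite_refs(obj, renumber)
        pages.extend(renumber[p] for p in page_nums)
    return merged, pages

docs = [({1: {"Type": "Page"}}, [1]), ({1: {"Type": "Page"}}, [1])]
print(merge(docs))  # two objects, renumbered 1 and 2; page list [1, 2]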
There appear to be a couple of packages on Hackage for writing PDF files, but I don't see much for reading them. You may like to look at the source code for pdfsplit for ideas. Also HPDF.

Viewing Microsoft Office files programmatically on Linux

I want to code a desktop program to print Microsoft Office files (doc, docx, xls and xlsx) on a Linux machine. But I don't know how to print them without corrupting the output.
Is there a way to print the files, or convert them to another format, so that the result looks 100% the same as it does in Microsoft Office?
The libreoffice API might be a good place to start, particularly the examples:
http://api.libreoffice.org/
I haven't used the API myself but have used open/libre-office as an alternative to word for quite a while.
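As a lighter-weight alternative to the full API, LibreOffice's headless mode can convert Office files to PDF from the command line; a minimal sketch, assuming the soffice binary is on PATH (report.docx is a placeholder):

import subprocess

def office_to_pdf(path: str, outdir: str = ".") -> None:
    # headless conversion; writes <name>.pdf into outdir
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", outdir, path],
        check=True,
    )

office_to_pdf("report.docx")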
However, you say '100% the same' as in Office? I wouldn't be confident of that. Depending on the document it's likely to be fine, but there are some things which don't seem to convert well. If you're working on Linux, you're not likely to have the same fonts installed as whichever Windows/Mac machine made the document.
If the documents you're processing are all of the same/similar layout/template, and you're able to test a few first, it should be fine. But if you're processing any sort of word document, some may not convert completely without a bit of human input. Depends how much difference you can tolerate. If you want completely consistent printing across platforms, I guess that's what pdfs are for.

SVG to PDF on a shared linux server

I have a website which uses SVG for an interactive client side thingamabob. I would like to provide the option to download a PDF of the finished output. I can pass the final SVG output back to the server, where I want to convert to PDF, then return it to the client for download.
This would need to work on a headless shared linux server, where installation or compilation is either an enormous pain, or impossible. The website is PHP, so the ideal solution would be PHP, or use software that's easily installed on a shared webserver. Python, perl and ruby are available, along with the usual things you might expect on a linux box. Solutions that involve cairo, scripting inkscape, or installation more complex than 'FTP it up' are probably out. Spending large amounts of money are also out, naturally. As this is a shared server, memory and/or CPU hungry solutions are also out, as they will tend to get killed; this more or less rules out Batik.
The nearest that I've got so far is this XSL transform which I can drive from PHP and then squirt the resulting postscript through ps2pdf (which is already installed). The only problem with this is that it doesn't support SVG paths - if it did, it would be perfect.
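For reference, that pipeline amounts to something like this (Python is available on the host; svg2ps.xsl stands in for the linked transform):

import subprocess

# SVG -> PostScript via the XSL transform, then PostScript -> PDF
subprocess.run(["xsltproc", "-o", "out.ps", "svg2ps.xsl", "in.svg"], check=True)
subprocess.run(["ps2pdf", "out.ps", "out.pdf"], check=True)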
There are a bunch of related questions on StackOverflow, all of which I've read through, but they all assume that you can either install stuff, spend money, or both.
Does anyone have an off-the-shelf solution to this, or should I just spend some downtime trying to add paths support to that XSL transform?
Thanks,
Dunc
I stumbled across TCPDF today which would have been perfect for this, had I known about it at the time. It's just a collection of pure PHP classes, no external dependencies for most things.
It can build PDFs from scratch and you can include pretty much anything you want in there, including SVG (amongst many, many other things), as shown in these examples:
http://www.tcpdf.org/examples.php
Main project page is here:
http://www.tcpdf.org/
Sourceforge page is here:
http://sourceforge.net/projects/tcpdf/
You can use Apache FOP's free Batik SVG toolkit which has a transcoder api to transform SVG to PDF.
download link
You will need to write a tiny bit of Java. There are code examples here – note you will need to set the transcoder to org.apache.fop.svg.PDFTranscoder instead of the one used in the examples.
You should be able to do this without installing anything on your machine – just drag the jars on there and run a script. I quote:
All other libraries needed by Batik are included in the distribution. As a consequence the Batik archive is quite big, but after you have downloaded it, you will not need anything else.
Have you looked at ImageMagick? I suspect you'd also need Ghostscript to complete the loop, which might make installation and performance a problem.
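If ImageMagick's convert tool (and an SVG delegate) does happen to be on the box, the conversion itself is a one-liner; a hedged sketch:

import subprocess

# `convert` renders the SVG and writes a PDF; output quality depends on
# which SVG delegate the ImageMagick build uses
subprocess.run(["convert", "drawing.svg", "drawing.pdf"], check=True)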
I'd suggest giving princexml a try, they provide various addons (including one for PHP) and can output PDF from SVG/HTML/XML.
I have used TCPDF (http://www.tcpdf.org/) in many projects and it works in almost every use case.
Here is an example with SVG: https://tcpdf.org/examples/example_058/
The following code may help you:
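// Assumes $pdf is an instance of TCPDF with a page already added.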
$pdf->ImageSVG($file='images/testsvg.svg', $x=15, $y=30, $w='', $h='', $link='http://www.tcpdf.org', $align='', $palign='', $border=1, $fitonpage=false);
$pdf->ImageSVG($file='images/tux.svg', $x=30, $y=100, $w='', $h=100, $link='', $align='', $palign='', $border=0, $fitonpage=false);

Writing Color Calibration Data to a TIFF or PNG file

My custom homebrew photography processing software, running on 64 bit Linux/GNU, writes out PNG and TIFF files. These are to be sent to a quality printing shop to be made into fine art. Working with interior designers - it's important to get the colors just right!
The print shops usually have no trouble with TIFFs and PNGs made with commercial software such as Photoshop. Even though I have the TIFF 6.0 spec, the PNG spec, and other information in hand, it is not clear how to include color calibration data or implement a color management system on Linux. My files are often rejected as faulty, without error reports detailed enough to make fixes.
This has been a nasty problem for many people for a while. Even my contacts at the Hollywood post-production studios are struggling with this issue. One studio even wanted to hire me to take care of their color calibration, thinking I was the expert - but no, I am just as blind and lost as everyone else!
Does anyone know of good code examples, detailed technical information, or have any other enlightenment? Or time to switch to pure Apple?
Take a look at LittleCMS
http://www.littlecms.com/
This page has the code for applying it to TIFF
http://www.littlecms.com/newutils.htm
The basic thing you need to know is that color profile data is something you need to store in the metadata of the file itself.
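As a hedged illustration of that embedding step (using Pillow rather than LittleCMS itself; the file names and the profile are placeholders):

from PIL import Image

# "printer_profile.icc" stands in for the shop's target profile
with open("printer_profile.icc", "rb") as fh:
    icc = fh.read()

img = Image.open("photo_source.tif")
img.save("photo_tagged.png", icc_profile=icc)  # ICC bytes go in the file's metadata
img.save("photo_tagged.tif", icc_profile=icc)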
There is a consultant called Charles Poynton who specialises in this area. I work for one of the post-production studios you mention (albeit in London, not Hollywood), and have seen him speak on the subject a couple of times. His website contains a lot of the material he presents, and you might find something of use there. He also has a book called Digital Video and HDTV Algorithms and Interfaces, which is not as heavy as the title might suggest! While these resources might not answer your question directly, they might provide a springboard to other solutions.
More specifically, which libraries are you using to write the png and tif files - you mention they are homebrew, but how custom are they exactly? Postprocessing the images in an image manipulation program (such as ImageMagick or dcraw) might allow you to inject this information into the header more successfully.
Sorry, I don't have any specific answers, but maybe something that will point you a bit further in the right direction...
As a GNU/Linux user, you’ll want to consider DispcalGUI – http://dispcalgui.hoech.net/ – a GNOME-based GUI that centralizes color management, ICC profile management, and (crucially for your case) device calibration. It can talk to well-known pro- and mid-level hardware, e.g., i1, X-Rite, Spyder, etc.
But before you get into that – you say you are generating your files to spec; are you validating your output using a test suite specific to the format in question? If not, here are three to get you started:
imagetestsuite supports the well-known formats: https://code.google.com/p/imagetestsuite/w/list?can=1&q=
The Luminous* test suite is a JIRA plugin, if that’s your thing: https://marketplace.atlassian.com/plugins/com.luminouslead.plugin.jira.testsuite.LuminousTestSuite
FLOSS decoder implementations often have one you can use, e.g. OpenJPEG – https://code.google.com/p/openjpeg/wiki/TestSuiteDocumentation
But even barring all of those, it seems like your problem is with embedded ICC data – which is two specs in one. First, there’s the host image-file format, and they all handle embedding differently (meaning the ICC data will likely look totally different when embedded in a TIFF than, say, a JPEG or WebP file). Second, there is the ICC spec itself. It is documented here: http://color.org/v4spec.xalter – and you may also want to look at the source for the aforementioned dispcalGUI, which includes a very legible and hackable ICC profile class in Python: http://sourceforge.net/p/dispcalgui/code/HEAD/tree/trunk/dispcalGUI/ICCProfile.py
Full disclosure: I have contributed to that very ICC profile class, to which I just linked in that last ¶
That’s the basics (many of which you have no doubt covered)... beyond that, if you post more information about what exactly is going wrong, I’d be interested to look it over. Good luck with it either way.
* NB. This project is unrelated to the long-standing photography website, “the Luminous Landscape”

Search by hash?

I had the idea of a search engine that would index web items like other search engines do now but would only store the file's title, url and a hash of the contents.
This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.
This would be more useful for non-textual items like images, executables and archives.
I was wondering if there is already something similar?
Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.
The main trick for numerical hashes is basically dimension reduction. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
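As a minimal sketch of the string flavor, a toy simhash might look like this (MD5 here is just a convenient per-token hash, not a recommendation):

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # each token votes on every bit position; similar texts share most
    # tokens, so their hashes end up close in Hamming distance
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming(simhash("the quick brown fox"), simhash("the quick brown cat")))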
I'm also doing a little research in this field, although I guess stackoverflow might not be the right place for nascent work.
The question seems to focus on exact match hashes, which we understand better than nearest-neighbor approaches, and are indeed worthwhile, especially if people can share tags and other metadata that way.
As #rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this, though they have shut down and their Bitpedia (Digital Media Encyclopedia) is no longer hosted there; some of it, at least, is still available at Archive.org.
Bitzi also produced software like Bitcollider (SourceForge.net),
and the Magnet URI scheme, which allows for specifying a file by hash and is thus a content-based identifier. Various applications support searching at various databases via Magnet URIs as described at that Wikipedia page.
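A sketch of such a content-based identifier (Magnet xt fields conventionally carry a base32-encoded SHA-1; the file path below is a placeholder):

import base64
import hashlib

def magnet_for(path: str) -> str:
    with open(path, "rb") as fh:
        digest = hashlib.sha1(fh.read()).digest()
    # 20-byte SHA-1 encodes to exactly 32 base32 characters, no padding
    return "magnet:?xt=urn:sha1:" + base64.b32encode(digest).decode()

print(magnet_for("some_file.bin"))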
The same idea is popular in the password-cracking scene - see e.g. findmyhash - Python script to crack hashes using online services etc.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time), but still tag them myself and add other metadata, via external tag databases. If my applications knew how to grab the tags, it would seem much better than the current system where we modify and copy around big files just to move tags from e.g. my desktop to my phone.
See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).
Well, for images, there's http://tineye.com, which will one-up that, and find you similar images too.
It's not a bad idea. Sometimes I find myself stumbling upon some file and trying to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained by various means - web browser, download manager, or simply by copying from a network share.
If I understand your proposal right, http://bitzi.com/ has done this for a while.
