Merging PDF files in Haskell

The Preview application on the Mac allows one to merge multiple PDF files, although the functionality is rather obscure. I'm writing a utility in Haskell that needs to perform a similar task, that is, merge an arbitrary number of PDF files into one new file.
Does anyone have a suggestion as to where to start with this? Obviously if there's a library on Hackage that will do most of the work out of the box that would be ideal, but if not, then some pointers about where to start would be very much appreciated.

I'm working on a PDF library that supports parsing and generating. It is low level; higher-level tools are still on the to-do list (because it is hard to design a good high-level API).
Here is an example of unpacking and decrypting a PDF file. It is easy to implement PDF merging, but you need to be familiar with PDF internals.
ADDED:
I created a basic example of merging PDF files in Haskell. It is 150 lines of code in total, but it lacks a few features (see the comments at the top of the file). They are easy to add, so let me know if you are interested.

The PDF file format isn't that complicated. Adobe has an official specification document for it somewhere. Essentially a PDF file contains a set of numbered "objects". You'd have to get all the objects from each PDF file, renumber them so they're unique, and then you need to fiddle with the page index so all the pages actually show up.
There appear to be a couple of packages on Hackage for writing PDF files, but I don't see much for reading them. You may like to look at the source code for pdfsplit for ideas. Also HPDF.
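The renumbering step described above can be sketched on a toy model. This is only an illustration in Python, not a PDF parser and not the API of any Hackage package: the `Ref` class, the `{"objects": ..., "pages": ...}` layout and the `merge` function are all assumptions made up for the example. A real merge would also have to rebuild the page tree and the cross-reference table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to an object number."""
    num: int


def shift(value, offset):
    """Return a copy of value with every Ref moved up by offset."""
    if isinstance(value, Ref):
        return Ref(value.num + offset)
    if isinstance(value, dict):
        return {key: shift(item, offset) for key, item in value.items()}
    if isinstance(value, list):
        return [shift(item, offset) for item in value]
    return value


def merge(docs):
    """Merge toy documents shaped like {"objects": {num: value}, "pages": [Ref]}."""
    objects, pages, offset = {}, [], 0
    for doc in docs:
        # Renumber every object and every reference inside it by the same offset.
        for num, value in doc["objects"].items():
            objects[num + offset] = shift(value, offset)
        # Append this document's pages to the combined page list.
        pages.extend(shift(doc["pages"], offset))
        # The next document starts after the highest number used so far.
        offset += max(doc["objects"], default=0)
    return {"objects": objects, "pages": pages}
```

With two one-page documents that both use object numbers 1 and 2, the second document's objects end up as 3 and 4, and the combined page list points at objects 1 and 3.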


Producing PDF files in NodeJS - simpler than puppeteer/chromium but a bit less basic than low level libraries

I'd like to be able to produce PDF files in NodeJS.
Currently, we use puppeteer. We need to produce highly designed documents and so puppeteer/chromium gives me the ability to create a complex layout in HTML with the added benefit of also having the HTML version of the file.
It's great for relatively small documents where design is key.
The problem is when I try to produce long report documents. These documents do not require elaborate design. These are pretty much just a header with some information, and then a simple table with lots of records that stretch far as the eye can see, so they tend to be large. Like, really really large.
When I try using puppeteer for that, it pretty much just crashes and burns, because loading such huge layouts into the underlying browser is just too much.
Currently I do "stitching". I create the document by having puppeteer create the doc in parts, and then I connect all those "doclets" into one using PDFKit.
But then I have problems like when one "doclet" ends and a new one begins, there are blank lines. (partially empty pages for no good reason from the perspective of a customer viewing it)
What I'm looking for is a library that has basic layout functionality but that doesn't use a browser (or perhaps uses something lightweight).
Problem is that libraries like PDFkit and pdf-lib seem to be too low level.
I'm going to literally have to "draw" the documents by telling it where exactly the text should be.
If I want tables, I'm going to straight up have to draw rectangles and stuff.
Having to create all of this manually would be a nightmare.
All I want is the ability to create simple layouts (tables, titles, text wrapping, background color) without having to use a library that just launches chromium.
Please, let me know if you know of any such option.
Thanks in advance!
What I tried:
PDFkit/pdf-lib - too low level. Unless I'm getting something wrong, there doesn't seem to be a way to create word wrapped layouts with basic tables.
jsPDF doesn't seem to be able to use the HTML functionality on the server (I think to get it to work I'd have to let it use a browser? If so, that doesn't really help).
Puppeteer/other libraries that pilot a browser - well, it uses a browser so a no-go for large docs.
Praying to Odin - No luck so far.

What are known limitations of borb related to PDF versions?

I'm new to borb, which seems to me a very promising Python package.
Trying to load a small sample of PDF documents, just to put hands on, I've found that borb can open some of them without problems; in some cases I got messages such as "Unable to process XMP meta-data"; yet in other cases I got assertion errors.
Thus, before posting specific issues, I'm looking for information about current limitations of borb, with reference to PDF versions, and on tools I could use first to detect files to be considered invalid PDF documents. Thanks.
I'm using borb release v2.0.20, just cloned from GitHub, and Python 3.6.5 on Windows 10.
Disclaimer: I am Joris Schellekens, author of the aforementioned library borb.
The problem is that the PDF spec (ISO-32000) leaves some room for interpretation at various points throughout. That means some PDF libraries will interpret the spec in a given way, and produce documents that may not always be compliant according to other tools.
borb tends to be very strict when it comes to PDF parsing. As soon as an error is detected, it will throw the stacktrace right back at you, whereas other PDF software (e.g. Adobe Reader) tends to be much more forgiving in terms of what it accepts as input PDF documents.
Although I certainly understand your frustration at being unable to process what you perceive to be "perfectly good PDF documents", I assure you that processing them might lead to even more issues.
I know for instance that there are cases where Adobe Reader tries to correct a bad PDF document, and as a result ends up corrupting the signatures in the document (very undesirable).
If you experience issues, and you can share the PDF, feel free to log a ticket on the GitHub repository.
Off the top of my head, the current limitations of borb are:
signatures
encrypted PDF documents
XREF not found
some images with transparent pixels
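For a first triage of sample files before handing them to a strict parser, a rough pre-check along the following lines may help. It is a minimal sketch that does not use borb at all: the function name `quick_pdf_check` and the 1024-byte windows are assumptions chosen for the example. It only looks for the `%PDF-x.y` header, the `startxref` keyword and the `%%EOF` marker, so passing it does not mean a file is valid.

```python
import re


def quick_pdf_check(data: bytes) -> dict:
    """Report the header version and whether the usual trailer markers are present."""
    # The header normally sits at the very start; a small window tolerates leading junk.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # The pointer to the cross-reference table and the end marker live at the end.
    tail = data[-1024:]
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }
```

A file that reports no `startxref` is a likely candidate for the "XREF not found" case listed above, though that mapping is a guess rather than documented borb behaviour.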

How to read excel(2007+ xlsx) sheet using actionscript(AIR)?

as3xls
An Actionscript 3 library for reading and writing Excel files. Currently reading numbers, text, and formulas from Excel version 2.0-2003 and writing numbers, text, and dates to Excel 2.0 is supported. No server-side help is needed.
SUPPORT INFORMATION
Documentation and samples are at http://code.google.com/p/as3xls/
I wrote this: https://github.com/childoftv/as3-xlsx-reader I'd love to know if it helps
Do you have any idea how... Inefficient this is?
Excel uses a complex setup for files, and unless you want to write a full-scale parser for its spreadsheets (which, believe me, will be difficult, if only because you have to figure out what the format characters do), you'd be better off finding another solution.
Say, using a "save to XML" option would make your job a few thousand times easier, without exaggeration. AS3 has no native support for Excel, there is no real point for it to have such. But it has great integrated methods for working with XML.
If possible, save the Excel files to XML and parse those.
Better still, use databases, and parse them as XML through PHP.
I did a search and came up with this: http://code.google.com/p/php-excel-reader/
Once you've got it in PHP, passing it on to Flash is no problem at all. I'd recommend turning it into straight arrays of objects and converting it to AMF3 via Zend_Amf, AMFPHP or WebOrb, whichever one you're most comfortable with. You can then create tables, manipulate the data or whatever you like. It'd also be a lot faster and lighter than using XML.
PK
I took a look at the xlsx breakdown and it would take me 1 week to write an xlsx writer that could do basic formatting and formulas. I've only spent 1 hour perusing through the directories in an xlsx file and all you'd have to do is create the same directory structure...mostly cut and paste some strings..and then zip it and call it xlsx.
I tried this theory by manually making an xlsx file using 7zip. I downloaded childoftv's reader and, though I don't need the reader, the package includes a few zip/unzip classes that would prove helpful for anyone who wants to make a xlsx writer.
Long story short, the setup isn't complex, somebody just has to take a week out of their busy schedule to do it. I need this functionality so if nobody's done it yet, then I'll have to. Hopefully my search will find something better than a forum where the general consensus is "it's too hard, give up."
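The "it is just a zipped directory structure" point can be illustrated with a small sketch, written here in Python rather than ActionScript. It builds an archive containing only a worksheet part and reads the numeric cells back. The helper names `build_archive` and `read_numeric_cells` are made up for the example, and the archive is deliberately incomplete: a file Excel would open also needs parts such as `[Content_Types].xml`, the workbook part and the relationship files, and text cells normally live in a separate shared strings part.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

SHEET_XML = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheetData><row r="1">'
    '<c r="A1"><v>42</v></c><c r="B1"><v>3.5</v></c>'
    "</row></sheetData></worksheet>"
)


def build_archive() -> bytes:
    """Zip a single worksheet part under the path an xlsx package uses for it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", SHEET_XML)
    return buf.getvalue()


def read_numeric_cells(archive: bytes, part: str = "xl/worksheets/sheet1.xml") -> dict:
    """Return {cell reference: float} for cells whose type is numeric."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        root = ET.fromstring(zf.read(part))
    cells = {}
    for cell in root.iterfind(".//m:c", NS):
        value = cell.find("m:v", NS)
        # A missing "t" attribute means the cell holds a plain number.
        if value is not None and cell.get("t", "n") == "n":
            cells[cell.get("r")] = float(value.text)
    return cells
```

Calling `read_numeric_cells(build_archive())` yields the two cells written above, keyed by `A1` and `B1`.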

Combining resources into a single binary file

How does one combine several resources for an application (images, sounds, scripts, xmls, etc.) into a single/multiple binary file so that they're protected from user's hands? What are the typical steps (organizing, loading, encryption, etc...)?
This is particularly common in game development, yet a lot of the game frameworks and engines out there don't provide an easy way to do this, nor describe a general approach. I've been meaning to learn how to do it, but I don't know where to begin. Could anyone point me in the right direction?
There are lots of ways to do this. m_pGladiator has some good ideas, especially with serialization. I would like to make a few other comments.
First, if you are going to pack a bunch of resources into a single file (I call these packfiles), then I think that you should work to avoid loading the whole file and then deserializing out of that file into memory. The simple reason is that it takes more memory. That's really not a problem on PCs I guess, but it's good practice, and it's essential when working on a console. While we don't (currently) serialize objects as m_pGladiator has suggested, we are moving towards that.
There are two types of packfiles that you might have. One would be a file where you want arbitrary access to the contents of the files. A second type might be a collection of files where you need all of those files when loading a level. A basic example might be:
An audio packfile might contain all the audio for your game. You might only need to load certain kinds of audio for the menus or interface screens and different sets of audio for the levels. This might fall into the first category above.
A type that falls into the second category might be all models/textures/etc. for a level. You basically want to load the entire contents of this file into the game at load time because you will (likely) need all of its contents while a player is playing that level or section.
Many of the packfiles that we build fall into the second category. We basically package up the level contents, and then compress them with something like zlib. When we load one of these at game time, we read a small amount of the file, uncompress what we've read into a memory buffer, and then repeat until the full file has been read into memory. The buffer we read into is relatively small, while the final destination buffer is large enough to hold the largest set of uncompressed data that we need. This method is tricky, but again, it saves on RAM, it's an interesting exercise to get working, and you feel all nice and warm inside because you are being a good steward of system resources. Once the packfile has been completely uncompressed into its destination buffer, we run a final pass on the buffer to fix up pointer locations, etc. This method only works when you write out your packfile as structures that the game knows. In other words, our packfile writing tools share structs (or classes) with the game code. We are basically writing out and compressing exact representations of data structures.
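The read-a-little, decompress-a-little loop can be sketched as follows. This is a minimal Python illustration, not the answerer's engine code: `inflate_stream` and its `chunk_size` parameter are names invented for the example, the destination is a growable `bytearray` rather than a preallocated buffer, and the pointer fix-up pass is left out.

```python
import io
import zlib


def inflate_stream(src, chunk_size=4096):
    """Decompress a zlib stream from a file-like object one small chunk at a time."""
    decompressor = zlib.decompressobj()
    destination = bytearray()
    while True:
        # Only chunk_size bytes of compressed data are held at any moment.
        chunk = src.read(chunk_size)
        if not chunk:
            break
        destination += decompressor.decompress(chunk)
    # Emit anything the decompressor was still holding back.
    destination += decompressor.flush()
    return bytes(destination)
```

A typical call would be `inflate_stream(open("level1.pack", "rb"))`, where the file name is hypothetical. The compressed input never has to sit in memory as a whole, which is the saving described above.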
If you simply want to cut down on the number of files that you are shipping and installing on a user's machine, you can do so with something like the first kind of packfile that I describe. Maybe you have thousands of textures and would simply like to cut down on the sheer number of files that you have to zip up and package. You can write a small utility that reads the files that you want to package together, writes a header containing the file names and their offsets in the packfile, and then writes the contents of the files, one after the other, into your large binary file. At game time, you can simply load the header of this packfile and store the filenames and offsets in a hash. When you need to read a file, you can hash the filename and see if it exists in your packfile, and if so, you can read the contents directly from the packfile by seeking to the offset and then reading from that location in the packfile. Again, this method is basically a way to pack data together without regard for encryption, etc. It's simply an organizational method.
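The header-plus-offsets layout described above might look like this in a few lines of Python. The on-disk format here is an assumption invented for the example (a 4-byte little-endian index length, a JSON index, then the raw file contents), as are the function names. A production format would more likely use a fixed binary index.

```python
import io
import json
import struct


def write_packfile(out, files):
    """Write {name: bytes} as: 4-byte index length, JSON index, then the blobs."""
    index, offset = {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))  # offset is relative to the first blob
        offset += len(data)
    header = json.dumps(index).encode("utf-8")
    out.write(struct.pack("<I", len(header)))
    out.write(header)
    for data in files.values():
        out.write(data)


def load_index(f):
    """Read only the header and return {name: (absolute offset, length)}."""
    f.seek(0)
    (size,) = struct.unpack("<I", f.read(4))
    index = json.loads(f.read(size))
    base = 4 + size  # the blobs start right after the header
    return {name: (base + off, length) for name, (off, length) in index.items()}


def read_entry(f, index, name):
    """Seek straight to one entry and read just its bytes."""
    offset, length = index[name]
    f.seek(offset)
    return f.read(length)
```

At load time only `load_index` touches the header. Each later `read_entry` call is a dictionary lookup followed by one seek and one read, so the rest of the packfile stays on disk.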
But again, I do want to stress that if you are going a route like I or m_pGladiator suggest, I would work hard to not have to pull the whole file into RAM and then deserialize to another location in RAM. That's a waste of resources (that you perhaps have plenty of). I would say that you can do this to get it working, and then once it's working, you can work on a method that only reads part of the file at a time and then decompresses to your destination buffer. You must use a compression scheme that will work like this, though. zlib and LZW both do (I believe). MD5 would not apply here, since it is a hash function rather than a compression scheme.
Hope that this helps.
Do as Java does: pack it all in a zip, and use a filesystem-like API to read directly from there.
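As a rough illustration of the zip approach (in Python rather than Java), the standard `zipfile` module can read a single member without extracting the archive. The helper names and the resource paths used with them are hypothetical.

```python
import io
import zipfile


def build_resources(files):
    """Pack {path: bytes} into an in-memory zip archive and return it as a file object."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


def load_resource(archive, name):
    """Read one member by its path inside the archive, without unpacking the rest."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(name) as member:
            return member.read()
```

`load_resource` also accepts a path on disk, so the game could ship one archive such as `resources.zip` (again a made-up name) and address its contents by path. Note that this only bundles files; a plain zip does not protect them from users.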
Personally, I never used the already available tools to do that. If you want to prevent your game to be hacked easily, then you have to develop your own resource manipulation engine.
First of all read about serializing objects. When you load a resource from file (graphic, sound or whatever), it is stored in some object instance in the memory. A game usually uses dozens of graphical and sound objects. You have to make a tool, which loads them all and stores them in collections in the memory. Then serialize those collections into a binary file and you have every resource there.
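Then you can encrypt this file with a symmetric cipher such as AES. Note that MD5 is a hash function rather than an encryption algorithm, so it can verify the file's integrity but cannot be used to encrypt it.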
Also, you can use zlib or other compression library to make this big binary file a bit smaller.
In the game, you should load the encrypted binary file and unpack it. Then decrypt it. Then deserialize the object collections and you have all resources back in memory.
Of course you can make this more comprehensive by storing in different binary files the resources for different levels and so on - there are plenty of variants, depending on what you want. Also you can first zip, then encrypt, or make other combinations of the steps.
Short answer: yes.
In Mac OS 6, 7 and 8 there was a substantial API devoted to this exact task. Look up the "Resource Manager" if you are interested. Edit: The ROOT physics analysis package has one too.
Not that I know of a good tool right now. What platform(s) do you want it to work on?
Edited to add: All of the two or three tools of this sort that I am aware of share a similar structure:
The file starts with a header and index
There are a series of blocks, some of which may have their own headers and indices, some of which are leaves
Each leaf is a simple serialization of the data to be stored.
The whole file (or sometimes individual blocks) may be compressed.
Not terribly hard to implement your own, but I'd look for a good existing one that meets your needs first.
For future people, like me, who are wondering about this same topic, check out the two following links:
http://www.sfml-dev.org/wiki/en/tutorials/formatdat
http://archive.gamedev.net/reference/programming/features/pak/

How does the .doc format work?

I recently learned about the basic structure of the .docx file (it's a specially structured zip archive). However, docx is not formatted like a doc.
How does a doc file work? What is the file format, structure, etc?
It's not a direct answer to your question, but I highly recommend reading Joel Spolsky's article, Why are the Microsoft Office file formats so complicated? (And some workarounds). It will give you some insight into how complex the .doc format really is - and why. Joel also gives a very basic overview of what the .doc format consists of:
You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It's a whole hierarchical file system.
(The quote refers to Excel files but it applies to Word docs as well). Informative article and helpful in understanding why .docx and ODF files are structured and designed so much more logically when being examined from an outside perspective.
The full format for binary .doc files is documented in this pdf from (the Wikipedia article on .doc)
The basic idea behind the MS Word DOC format is an OLE Compound Document which, as Kibbee has already written, is basically a memory dump. It's a very complex and convoluted way to store documents, but if you've ever really dug into the application Word you'll know how insanely many features it has, and if you have used it in a business setting you'll have a good feeling for how it integrates with other programs in the Office series.
In general, OLE Compound Documents are very extensible structures that allow you to stuff all kinds of data into one file and even, to some degree, handle data you don't have an application installed for. For example, if you insert an Equation object (from the MS Equation Editor) into a document, it gets stored as a sub-object, which is like a file inside the file. This object doesn't just contain the data required for Equation Editor to edit and render it; it also has a generic bitmap (or metafile, maybe) representation stored so it can be displayed, though not edited, on a machine without Equation Editor installed.
This was the why, for the how you'll have to read the specifications other people have linked to already ;)
If you want the easy way out to work with the files though, make sure your software runs on a Windows machine with Word installed, then use COM/OLE Automation to open and manipulate the documents. You won't have to worry about file format then.
Doc is the binary format of word document - here's the Microsoft Office Word 97-2007 Binary File Format Specification [*.doc] document.
The .doc format is quite complex. Like most Microsoft formats, it reflects a long history of changes between versions and legacy support. They published it not too long ago, so if you want to view it (and other pre-Office 2007 formats), knock yourself out here.
There's Microsoft Word's .doc and then there's plain text .doc. It sounds like you're wondering about the proprietary Microsoft format.
From Wikipedia:
The DOC format varies among Microsoft Office Word versions. Word versions up to 97 used a different format from the one used by Word versions between 97 and 2003.
It wasn't until Word 2007 that .docx was introduced. A .docx file is a ZIP package whose parts are structured XML documents.
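Because the binary .doc format is an OLE compound document and .docx is a ZIP package, the two containers can be told apart from their first few bytes. The sketch below is only a container sniff with a made-up function name. It cannot tell a Word file from an Excel file in the same container, and a plain-text file that merely carries a .doc extension comes back as unknown.

```python
# Signature at the start of every OLE compound document (.doc, .xls, .ppt, ...).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature that starts an ordinary ZIP archive (.docx, .xlsx, ...).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_container(data: bytes) -> str:
    """Classify a file by its leading bytes, ignoring its extension."""
    if data.startswith(OLE_MAGIC):
        return "ole-compound"
    if data.startswith(ZIP_MAGIC):
        return "zip-package"
    return "unknown"
```

Reading the first eight bytes of a file is enough, for example `sniff_word_container(open(path, "rb").read(8))` for some `path`.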
