Brotli compression multithreading

It's my understanding that Brotli stores block-size information in a meta-block header that contains only the uncompressed size of the block, with no information about the compressed length (section 9.2). I'm guessing that a wrapper would need to be created in order to use it with multiple threads, possibly something similar to Mark Adler's pigz.
Would the same threading principles apply to Brotli as they do with gzip in this case, or are there any foreseeable issues to be aware of when it comes to multithreading implementations?

You can use the brotli format as is for this purpose. I got them to add the option of putting metadata in empty meta-blocks (where "empty" means that the meta-block produces zero uncompressed data). You can put markers in metadata to aid in finding meta-blocks. An inserted empty meta-block also starts the next meta-block at a byte boundary.
Each meta-block can be independent of the other meta-blocks. If the stream is constructed that way, then there is no issue with combining them when compressing or with decompressing them separately. The areas of possible dependency are the ring buffer of the last four distances used, and backward references past the beginning of the current meta-block. For parallel use, a meta-block can, and must, be constructed so as not to depend on the last four distances: it must not refer to the ring buffer until the buffer has been filled with distances from the current meta-block. In addition, distances that reach back before the start of the current meta-block must not be used (which also rules out static dictionary references). Lastly, you would append an empty or metadata meta-block to bring the sequence to a byte boundary for easy concatenation.
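As a rough sketch of the "empty meta-block carrying metadata" idea, here is a C fragment that appends such a meta-block using the field layout of RFC 7932 section 9.2: ISLAST = 0, MNIBBLES code 3 (empty), a reserved zero bit, MSKIPBYTES, MSKIPLEN - 1, zero fill to a byte boundary, then the metadata bytes. The bitwriter type, its fixed 512-byte buffer, and the function names are hypothetical; a real encoder would use its own bit writer, and compressing the independent meta-blocks themselves is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical LSB-first bit writer: brotli packs fields starting at the
   least significant bit of each byte. */
typedef struct {
    uint8_t buf[512];
    size_t nbits;                  /* bits written so far */
} bitwriter;

static void put_bits(bitwriter *w, uint32_t value, unsigned count) {
    for (unsigned i = 0; i < count; i++) {
        size_t byte = w->nbits >> 3;
        if ((w->nbits & 7) == 0) w->buf[byte] = 0;          /* fresh byte */
        if ((value >> i) & 1u)
            w->buf[byte] |= (uint8_t)(1u << (w->nbits & 7));
        w->nbits++;
    }
}

/* Appends a non-last meta-block that produces no uncompressed data and
   carries len bytes of metadata (len may be 0, e.g. as a marker-free
   alignment block). Returns 0 on success, -1 if len exceeds 2^24 or the
   buffer would overflow. */
static int put_metadata_block(bitwriter *w, const uint8_t *meta, uint32_t len) {
    unsigned skipbytes = 0;                      /* MSKIPBYTES */
    if (len > 0) {
        uint32_t v = len - 1;                    /* MSKIPLEN - 1 */
        /* Use the fewest bytes, so a multi-byte length never ends in a
           zero byte (which the format forbids). */
        skipbytes = 1;
        while (skipbytes < 3 && (v >> (8 * skipbytes)) != 0) skipbytes++;
        if ((v >> (8 * skipbytes)) != 0) return -1;
    }
    size_t header_bits = 6 + 8 * (size_t)skipbytes;
    size_t end_bits = ((w->nbits + header_bits + 7) & ~(size_t)7) + 8 * (size_t)len;
    if (end_bits > 8 * sizeof w->buf) return -1;

    put_bits(w, 0, 1);                           /* ISLAST = 0 */
    put_bits(w, 3, 2);                           /* MNIBBLES code 3: empty meta-block */
    put_bits(w, 0, 1);                           /* reserved, must be zero */
    put_bits(w, skipbytes, 2);                   /* MSKIPBYTES */
    if (skipbytes) put_bits(w, len - 1, 8 * skipbytes);  /* MSKIPLEN - 1 */
    while (w->nbits & 7) put_bits(w, 0, 1);      /* zero fill to a byte boundary */
    if (len) memcpy(w->buf + (w->nbits >> 3), meta, len);
    w->nbits += 8 * (size_t)len;
    return 0;
}
```

Started on a byte boundary with no metadata, the whole meta-block is the single byte 0x06. Started mid-byte, its zero fill bits are what bring the stream back to a byte boundary.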
By the way, it looks like you're linking to an older version of the draft format. Here is a link to the current version.

Related

Haskell alternative for Doubly-linked-list coupled with Hash-table pattern

There's a useful pattern in imperative programming, namely, a doubly-linked-list coupled with a hash-table for constant time lookup in the linked list.
One application of this pattern is an LRU cache. The head of the doubly-linked-list contains the least recently used entry in the cache, and the last element of the doubly-linked-list contains the most recently used entry. The keys in the hash-table are the keys of the entries, and the values are pointers to the corresponding nodes in the linked-list. When an entry is queried in the cache, the hash-table is used to find its node in the linked-list; the node is then removed from its current location and placed at the end of the list, making it the most-recently-used entry. For eviction, we simply remove entries from the head of the linked-list, as they are the least recently used ones. Both lookup and eviction take constant time.
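For reference, the imperative pattern described above might look like this in C. This is a sketch assuming int keys and values, a fixed 64-bucket chained hash table, and a capacity of at least one; the lru type and the lru_get/lru_put names are hypothetical. Each operation touches one hash chain plus a constant number of list links, which is where the expected O(1) cost comes from.

```c
#include <stdlib.h>

#define NBUCKETS 64

typedef struct node {
    int key, value;
    struct node *prev, *next;   /* recency list: head = LRU, tail = MRU */
    struct node *chain;         /* next node in the same hash bucket */
} node;

typedef struct {
    node *buckets[NBUCKETS];
    node *head, *tail;
    size_t size, capacity;      /* capacity must be >= 1 */
} lru;

static size_t bucket_of(int key) { return (unsigned)key % NBUCKETS; }

static void unlink_node(lru *c, node *n) {
    if (n->prev) n->prev->next = n->next; else c->head = n->next;
    if (n->next) n->next->prev = n->prev; else c->tail = n->prev;
    n->prev = n->next = NULL;
}

static void append_node(lru *c, node *n) {       /* make n the MRU entry */
    n->prev = c->tail;
    n->next = NULL;
    if (c->tail) c->tail->next = n; else c->head = n;
    c->tail = n;
}

static node *find(lru *c, int key) {
    for (node *n = c->buckets[bucket_of(key)]; n; n = n->chain)
        if (n->key == key) return n;
    return NULL;
}

static void remove_from_bucket(lru *c, node *n) {
    node **p = &c->buckets[bucket_of(n->key)];
    while (*p != n) p = &(*p)->chain;
    *p = n->chain;
}

/* Lookup: hash to the node, then move it to the MRU end of the list. */
static int lru_get(lru *c, int key, int *out) {
    node *n = find(c, key);
    if (!n) return 0;
    unlink_node(c, n);
    append_node(c, n);
    *out = n->value;
    return 1;
}

/* Insert or update; when full, evict the LRU entry at the list head. */
static int lru_put(lru *c, int key, int value) {
    node *n = find(c, key);
    if (n) {
        n->value = value;
        unlink_node(c, n);
        append_node(c, n);
        return 1;
    }
    if (c->size == c->capacity) {
        node *victim = c->head;
        unlink_node(c, victim);
        remove_from_bucket(c, victim);
        free(victim);
        c->size--;
    }
    n = malloc(sizeof *n);
    if (!n) return 0;
    n->key = key;
    n->value = value;
    n->chain = c->buckets[bucket_of(key)];
    c->buckets[bucket_of(key)] = n;
    append_node(c, n);
    c->size++;
    return 1;
}
```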
I can think of implementing this in Haskell using two TreeMaps, and I know that the time complexity will be O(log n). But I am a little uncomfortable, as the constant factor seems rather high. Specifically, to perform a lookup, I first need to check whether the entry exists and save its value, then delete it from the LRU map and re-insert it with a new key. This means that each lookup results in three root-to-node traversals.
Is there a better way of doing this in Haskell?
As comments indicate, mutable vectors are perfectly acceptable when required. However, I think there's an issue with the way you've stated the question: unless the idea is to duplicate the imperative code "as closely as possible" (without mutable structures), why bother having 2 treemaps? A single priority search queue (see packages pqueue or PSQueue) would be an appropriate structure whilst maintaining purity. It efficiently supports both priorities (for eviction) and searching (for lookups of your desired cached argument).
On a related note, some structures support e.g. Data.Map's alterF, which effectively provides you with a continuation allowing you to "do something else" dependent on the Maybe value at a key, while "remembering" where you are, thus avoiding paying the full cost of re-traversing the structure to subsequently modify the value at this key. See also the at lens.

DEFLATE (RFC1951) dynamic huffman "incomplete length"

I've been studying RFC1951 and 'puff.c', and have a question about the issue of "incomplete length".
As near as I can tell, defining a "dynamic" Huffman code table that allows for more codes than specified by HLIT+257 will produce an error, at least in puff.c. For example, 'puff.c' produces an error if, as a simple debugging test, I use a Huffman table of all 9-bit codes to define only 257 lit/lens. Is this outcome intentional or a bug? And can I assume that any inflater based on the 'zlib' library will produce the same error?
I can't find anything in RFC 1951 that REQUIRES the use of a sufficiently tight Huffman code. Certainly, I can see that using an "under-subscribed" Huffman table might be inefficient in terms of compression, but I'm not sure why such a table should be prohibited.
My interest isn't simply hypothetical. I really want to use an under-subscribed, literal-only, Huffman code (but NOT the example cited above) to compress some application specific images into PNG files. But I want to make sure it will work with any PNG image viewer.
The RFC specifies that the codes are Huffman codes, which by definition are complete codes. (Complete means that all bit patterns are used.)
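One way to see what "complete" means is the Kraft sum: a prefix code with code lengths l_i is complete exactly when the sum of 2^(-l_i) over the used symbols equals 1, incomplete when it is less, and oversubscribed when it is more. The C helper below (a hypothetical kraft_class, assuming lengths of 1 to 15 as in deflate, with 0 meaning "unused") classifies a set of lengths that way. It is an illustration of the rule, not zlib's code.

```c
#include <stdint.h>

/* Returns 0 if the code lengths form a complete code (Kraft sum == 1),
   -1 if incomplete (sum < 1), +1 if oversubscribed (sum > 1).
   lengths[i] == 0 means symbol i is unused; other lengths must be 1..15. */
static int kraft_class(const unsigned char *lengths, int n) {
    uint32_t sum = 0;                          /* in units of 2^-15 */
    for (int i = 0; i < n; i++)
        if (lengths[i]) sum += (uint32_t)1 << (15 - lengths[i]);
    if (sum == (uint32_t)1 << 15) return 0;
    return sum < ((uint32_t)1 << 15) ? -1 : 1;
}
```

For the question's test case, 257 codes of 9 bits give a sum of 257/512, well under 1, so the code is incomplete. The single one-bit distance code in the RFC's exception is incomplete in the same way (sum 1/2).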
zlib will reject incomplete or oversubscribed codes, except in the special case noted in the RFC:
If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.
There the incomplete code 0 for the single symbol, with code 1 unused, is permitted.
(That, by the way, is unnecessary. If there is only one distance symbol, then you don't need any bits to specify it. You know that that distance symbol must be used with any length. If that symbol needs extra bits, then those extra bits immediately follow the length. But, oh well -- for that case Phil Katz put an extraneous zero bit in every match, and now we're stuck with it.)
The fact that the RFC even had to note this special case is another clue that incomplete codes are not accepted otherwise.
There is sort of another exception in deflate, in that the fixed literal/length code is incomplete, with two unused codes at the end.
The bottom line is, no, you will not be able to use an incomplete code in a dynamic header (except the special case) and expect zlib or any compliant deflate decoder to be able to decode it.
As for why this strictness is useful, constraints on dynamic headers permit rapid detection of non-deflate streams or corrupted deflate streams. Similarly, a dynamic header with no end code is not permitted by zlib, so as to avoid the case of a bogus dynamic header permitting any following random bits to be decodable forever, never detecting an error. The unused fixed codes also help in this regard, since eventually they trigger an error in random input.
By the way, if you want to define a fixed, complete Huffman code for your case, it's super simple, and would reduce the size of almost all of your codes by one bit. Just encode eight bits for the symbols 0..253, using that symbol number directly as the code (reversing the bits of course), and nine bits for symbols 254..257, using the codes 508..511 (bits reversed).
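To check the arithmetic: 254 codes of 8 bits plus 4 codes of 9 bits give 254/256 + 4/512 = 1, so the code is complete. Feeding those lengths to the canonical code assignment of RFC 1951 section 3.2.2 (which is how a decoder rebuilds the codes from the lengths in a dynamic header) yields exactly the codes 0..253 and 508..511. A C sketch of that assignment plus the bit reversal follows; the function names are hypothetical.

```c
#include <stdint.h>

#define MAX_BITS 15

/* Canonical Huffman code assignment from code lengths, following the steps
   in RFC 1951 section 3.2.2. codes[i] is stored MSB-first; unused symbols
   (length 0) get code 0. */
static void canonical_codes(const unsigned char *len, int n, uint16_t *codes) {
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    for (int i = 0; i < n; i++) bl_count[len[i]]++;
    bl_count[0] = 0;
    unsigned code = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;   /* first code of this length */
        next_code[bits] = code;
    }
    for (int i = 0; i < n; i++)
        codes[i] = len[i] ? (uint16_t)next_code[len[i]]++ : 0;
}

/* Deflate writes Huffman codes starting with their most significant bit
   into an LSB-first bit stream, so the code value is bit-reversed first. */
static uint16_t reverse_bits(uint16_t code, unsigned len) {
    uint16_t r = 0;
    for (unsigned i = 0; i < len; i++) {
        r = (uint16_t)((r << 1) | (code & 1));
        code >>= 1;
    }
    return r;
}
```

For instance, the 9-bit code 508 (binary 111111100) is emitted as 001111111, i.e. the value 127 written LSB-first.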

Important algorithm involving random access to a string?

I am implementing a different string representation in which accessing a string in a non-sequential manner is very costly. To avoid this, I am trying to implement position caches or character blocks, so that one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms that require scanning a string from right to left, or random access to its characters, so that I have a set of test cases for actual benchmarking and can build a model to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
For random access, I can't find any algorithm at all, except when one uses prefix tables and checks the string at scattered positions for matches longer than the prefix.
Does anyone know of other algorithms in which either right-to-left or random access to string characters is required?
[Update]
The hash code of a String is computed from every character, accessed from left to right, with the running value kept in a local primitive variable. So this is not an example of random access.
Likewise, MD5 and CRC process the complete string sequentially. So I cannot find any random-access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
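To make the two operations concrete, here is Boyer-Moore-Horspool in C. Note that Horspool is a simplified variant of BM that keeps only the bad-character shift and drops the good-suffix rule, so this is an illustration of the access pattern rather than full Boyer-Moore. The inner loop compares the pattern backwards and the outer loop skips forward by a table-driven amount; both assume O(1) indexing into the text. The horspool_find name is hypothetical.

```c
#include <stddef.h>

/* Returns the index of the first occurrence of pat[0..m) in text[0..n),
   or -1 if there is none. */
static long horspool_find(const char *text, size_t n, const char *pat, size_t m) {
    if (m == 0) return 0;
    if (m > n) return -1;
    size_t shift[256];                        /* bad-character skip table */
    for (size_t c = 0; c < 256; c++) shift[c] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;
    size_t pos = 0;
    while (pos + m <= n) {
        size_t j = m;                         /* compare right to left */
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) j--;
        if (j == 0) return (long)pos;
        /* skip forward based on the text character under the pattern's end */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return -1;
}
```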
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.

Space leaks with Haskell's cereal library?

As a hobby project called 'beercan', I'm reverse-engineering the resource files of the Torchlight games. Using an okay-ish hex editor, I try to guess the structure of the files, and then I model my ideas, use cereal to write Getters (and later some Putters), and try to decode every file in an application of the library.
I've just started on Torchlight's compiled layout files (*.LAYOUT in TL1, *.LAYOUT.cmp in TL2). The format turns out to be a little trickier than the dat files, but I think I've figured out the basic structure and how it is encoded in the TL2 files. So I'm trying to make a map of file versions, tag numbers, and guessed data types.
To do so, I wrote an application that flattens the data structure, leaving only the guessed type of the values of the leaves, each annotated with the file version and the node and leaf tag numbers. I turn this into a map from the file version and tag numbers to a set of the guessed types. For every file, I'd expect this Map to maybe take twice the file size in memory. (Not sure, though.) Then, I merge these maps, and I print the map.
For some reason, even if I only take 20MB worth of files (100 files), memory usage increases linearly to about 200MB, then decreases to the final size of the resulting map, and then deflates rapidly as I print it.
I wouldn't expect this memory usage. Does anyone know how I could fix it? I've tried to force values after decoding them (using deepseq), I've tried adding bangs to data types, but this hasn't really helped. I've tried copying all bytestrings I keep in the file structure, which brought down the memory usage a bit, but it's still unacceptably high, especially when I want to analyze the entire dataset (200MB+ of original files).
-edit- I've pushed a (not very S)SCCE to demonstrate the performance issue, (accidentally) along with my profiling results.
Clone the repository.
cabal configure, with flags to enable profiling (is it normal to need --enable-library-profiling --enable-executable-profiling --ghc-options="-rtsopts -prof"?)
cabal build
cd test, and run StressTest.sh.
This script tries to load a regular TL2 layout file 100 times. On my machine, top says it takes about 500MB of memory, and the profiling results are consistent with my description above.
I totally agree with @petrpudlak that we would need actual code to make any meaningful comments on the question "why does my code use so much memory?" :) (sorry, you did offer code). However, some of the patterns you describe are pretty typical in Haskell, and some generic discussion is possible.
First of all, note that native Haskell types use a lot more memory than you might guess. Take a look at the GHC memory footprint page at http://www.haskell.org/haskellwiki/GHC/Memory_Footprint. Note that even a simple Char takes a full 16 bytes of memory! Add to that the pointers for the linked-list cells of a String, and you can easily use more than an order of magnitude more memory than you might have guessed. If memory is important, you should use another data type, like Data.Text or Data.ByteString, which store strings internally more like C would (as a block of bytes in memory, with 1-4 bytes per character, depending on the encoding and the character used). If data other than strings is the problem, you can use unboxed arrays for element types that have an unboxed representation (such as Int, Double, or Word8).
Second, if possible, you can cut down memory usage by processing items in series (so that memory can be garbage collected right away). Haskell's laziness often does this for you automatically; for instance, try running the following program:
import Data.Char
main = interact $ map toUpper
As you type, the output will appear continuously (your OS, not Haskell, may buffer full lines, so you may need to hit 'enter' before seeing anything, but you will see output update for each 'enter'). Rather than loading the whole input into memory and then processing all at once, Char memory is being created and garbage collected Char by Char.
Of course, this isn't always possible (e.g., if you have to process the data in a very nonlocal way), but most of the time at least parts of the code can be refactored this way to cut down total memory usage.
Edit: Sorry, I just realized that you did post a link to the code, and that you are using ByteString, so some of what I wrote doesn't apply. But I do still see boxed lists and unpacking of the ByteString, so I will leave the answer as it is.
The memory usage pattern sounds like your application is building up a lot of unnecessary thunks, and memory consumption starts going down when those thunks get evaluated. I only glanced at your code quickly, but one simple change you could try is to replace all imports of Data.Map with Data.Map.Strict. This is especially important if you are doing a lot of updates on the values inside a Map without forcing evaluation in between.
Another thing you should be aware of is that replicateM is quite inefficient with large counts in a strict monad (see e.g. this answer). I'm not sure what kinds of counts you usually deal with in your application, but it's good to keep in mind.
It might also help to use strict fields in simple container data types like your LeafValue type and compile with -funbox-strict-fields (and -O2 of course).

Most efficient data structure to add styles to text

I'm looking for the best data structure to add styles to a text (say in a text editor). The structure should allow the following operations:
Quick lookup of all styles at absolute position X
Quick insert of text at any position (styles after that position must be moved).
Every position of the text must support an arbitrary number of styles (overlapping).
I've considered lists/arrays which contain text ranges but they don't allow quick insert without recalculating the positions of all styles after the insert point.
A tree structure with relative offsets supports #2 but the tree will degenerate fast when I add lots of styles to the text.
Any other options?
I have never developed an editor, but how about this:
I believe it would be possible to extend the scheme that is used to store the text characters themselves, depending of course on the details of your implementation (language, toolkits etc.) and your performance and resource usage requirements.
Rather than use a separate data structure for the styles, I'd prefer having a reference accompany each character, pointing to an array or list of the applicable styles. Characters with the same set of styles could point to the same array or list, so that it could be shared.
Character insertions and deletions would not affect the styles themselves, apart from changing the number of references to them, which could be handled with a bit of reference counting.
Depending on your programming language you could even compress things a bit more by pointing halfway into a list, although the additional bookkeeping for this might in fact make it more inefficient.
The main issue with this suggestion is memory usage. In an ASCII editor written in C, bundling a pointer with each char would raise its effective memory usage from 1 byte to 16 bytes on a typical 64-bit system, because struct alignment pads the char out to the pointer's 8-byte alignment.
I would look into breaking the text into small variable-size blocks that would allow you to efficiently compress the pointers. E.g. a 32-character block might look like this in C:
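#include <stdint.h>
struct _BLK_ {
unsigned char size;   /* number of characters stored (at most 32) */
uint32_t styles;      /* bit k set: character k has its own style pointer */
char content[];       /* the characters, followed by the style pointers */
};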
The interesting part is the metadata processing on the variable part of the struct, which contains both the stored text and any style pointers. The size element would indicate the number of characters. The styles integer (hence the 32-character limit) would be seen as a set of 32 1-bit fields, with each one indicating whether a character has its own style pointer, or whether it should use the same style as the previous character. This way a 32-char block with a single style would only have the additional overhead of the size char, the styles mask and a single pointer, along with any padding bytes. Inserting and deleting characters into a small array like this should be quite fast.
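A minimal sketch of how a character's style could be found from the styles mask, under one reading of the scheme: bit 0 is always set (the first character always carries a pointer), and the style pointers are stored in order after the text. The style pointer for character i is then the number of set bits in positions 0..i, minus one. The style_slot name is hypothetical.

```c
#include <stdint.h>

/* Index into the block's style-pointer array for character i (0..31).
   Assumes bit 0 of mask is set, so every character resolves to a style. */
static unsigned style_slot(uint32_t mask, unsigned i) {
    uint32_t upto = (uint32_t)(((uint64_t)2 << i) - 1);   /* bits 0..i */
    uint32_t bits = mask & upto;
    unsigned count = 0;
    while (bits) {                  /* portable popcount */
        bits &= bits - 1;           /* clear the lowest set bit */
        count++;
    }
    return count - 1;
}
```

With a mask of 0x9 (bits 0 and 3 set), characters 0 to 2 use pointer 0 and characters 3 onward use pointer 1; a mask of 0x1 means the whole block shares a single style pointer.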
As for the text storage itself, a tree sounds like a good idea. Perhaps a binary tree where each node value would be the sum of the children values, with the leaf nodes eventually pointing to text blocks with their size as their node value? The root node value would be the total size of the text, with each subtree ideally holding half of your text. You'd still have to auto-balance it, though, with sometimes having to merge half-empty text blocks.
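The size-annotated tree can be sketched in C as follows. The tnode type and the locate/grow_at names are hypothetical, block_id stands in for a pointer to a text block, and balancing and block splitting/merging are left out. Locating a position walks down comparing against the left subtree's size, and an insertion only adds to the sizes along one root-to-leaf path, so positions after the insertion point shift implicitly without rewriting any stored offsets.

```c
#include <stddef.h>

typedef struct tnode {
    size_t weight;                 /* number of characters in this subtree */
    struct tnode *left, *right;    /* both NULL for a leaf */
    int block_id;                  /* stands in for a text-block pointer */
} tnode;

/* Finds the leaf holding absolute position pos (pos < root weight) and
   stores the position within that leaf's block in *offset. */
static const tnode *locate(const tnode *t, size_t pos, size_t *offset) {
    while (t->left) {                        /* internal nodes have two children */
        if (pos < t->left->weight) {
            t = t->left;
        } else {
            pos -= t->left->weight;          /* skip the whole left subtree */
            t = t->right;
        }
    }
    *offset = pos;
    return t;
}

/* Accounts for n characters inserted at absolute position pos: every node on
   the root-to-leaf path grows by n. Returns the leaf whose block should
   receive the characters; a position on a block boundary goes right. */
static tnode *grow_at(tnode *t, size_t pos, size_t n) {
    for (;;) {
        t->weight += n;
        if (!t->left) return t;
        if (pos < t->left->weight) {
            t = t->left;
        } else {
            pos -= t->left->weight;
            t = t->right;
        }
    }
}
```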
And in case you missed it, I am no expert in trees :-)
EDIT:
Apparently what I suggested is a modified version of this data structure:
http://en.wikipedia.org/wiki/Rope_%28computer_science%29
as referenced in this post:
Data structure for text editor
EDIT 2:
Deletion in the proposed data structure should be relatively fast, as it would come down to byte shifting in an array and a few bitwise operations on the styles mask. Insertion is pretty much the same, unless a block fills up. It might make sense to reserve some space (i.e. some bits in the styles mask) within each block to allow for future insertions directly in the blocks, without having to alter the tree itself for relatively small amounts of new text.
Another advantage of bundling characters and styles in blocks like this is that its inherent data locality should allow for more efficient use of the CPU cache than other alternatives, thus improving the processing speed to some extent.
Much like any complex data structure, though, you'd probably need either profiling with representative test cases or an adaptive algorithm to determine the optimal parameters for its operation (block size, any reserved space etc).
