Do any of the Python compression module algorithms simply store the data for speed optimisation? - python-3.x

From Wikipedia, about ZPAQ compression:
ZPAQ has 5 compression levels from fast to best. At all but the best level, it uses the statistics of the order-1 prediction table used for deduplication to test whether the input appears random. If so, it is stored without compression as a speed optimization.
I've been working with the Python Data Compression and Archiving module, and wonder if any of those implementations (ZLIB, BZ2, LZMA) do the same? Do any of them simply store the data 'as-is' when it looks almost random? I'm not a coding expert and can't really follow the source code.
Related: How to efficiently predict if data is compressible

Some incomplete / best-guess remarks:
LZMA2 seems to do that, although for a different reason: to improve the compression ratio, not the compression time.
This is indicated at wiki:
LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters.
The XZ LZMA2 encoder processes the input in chunks (of up to 2 MB uncompressed size or 64 KB compressed size, whichever is lower), handing each chunk to the LZMA encoder, and then deciding whether to output an LZMA2 LZMA chunk including the encoded data, or to output an LZMA2 uncompressed chunk, depending on which is shorter (LZMA, like any other compressor, will necessarily expand rather than compress some kinds of data).
The latter quote also shows that no compression-speed gain should be expected, since it is more or less a "do both and pick the shorter" approach.
(The article seems to focus on the xz-based LZMA2; it probably transfers to whatever is inside Python, but no guarantees.)
Above, together with python's docs:
Compression filters:
FILTER_LZMA1 (for use with FORMAT_ALONE)
FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)
would make me think you have everything you need and just have to use the right filter.
So check your reasoning again (compression time or compression ratio) and, if you don't want to trust this blindly, try the LZMA2 filter with custom-prepared mixed data.
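A minimal sketch of such an experiment, assuming Python's standard lzma module. The helper name xz_sizes and the sample data are made up for illustration. It compresses a repetitive part, a random part and their concatenation with an explicit LZMA2 filter chain and reports input and output sizes, so you can see for yourself how much the random part grows.

```python
import lzma
import os

# Explicit filter chain: a single LZMA2 filter using the default preset.
LZMA2_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": lzma.PRESET_DEFAULT}]

def xz_sizes(chunks):
    """Return (input_size, compressed_size) for each chunk, compressed separately."""
    sizes = []
    for chunk in chunks:
        packed = lzma.compress(chunk, format=lzma.FORMAT_XZ, filters=LZMA2_FILTERS)
        assert lzma.decompress(packed) == chunk  # sanity check: round trip
        sizes.append((len(chunk), len(packed)))
    return sizes

if __name__ == "__main__":
    text_part = b"the quick brown fox jumps over the lazy dog\n" * 20000
    random_part = os.urandom(1 << 20)  # stands in for "almost random" input
    for raw, packed in xz_sizes([text_part, random_part, text_part + random_part]):
        print(f"{raw:>9} bytes in -> {packed:>9} bytes out ({packed / raw:.2%})")
```

This only measures size. To answer the speed question you would have to time the calls as well, for example with time.perf_counter.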
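Intuition: I don't expect the more classic zlib/bz2 formats to exploit incompressible data (but it's a pure guess). As far as I know that holds for bz2, but the DEFLATE format used by zlib does define "stored" (uncompressed) blocks, which the encoder can emit when compressing a block would make it larger; as with LZMA2, that is about size rather than speed.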
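Rather than relying on a guess, the behaviour of zlib and bz2 on incompressible input can be measured. The following is a small sketch using only the standard library. The function name expansion_on_random is an assumption of this example, and no particular numbers are claimed, since they depend on the library versions installed.

```python
import bz2
import os
import zlib

def expansion_on_random(size=1 << 20):
    """Compress `size` random bytes with zlib and bz2; return the extra bytes per codec."""
    data = os.urandom(size)  # random bytes are effectively incompressible
    return {
        "zlib": len(zlib.compress(data, 9)) - size,
        "bz2": len(bz2.compress(data, 9)) - size,
    }

if __name__ == "__main__":
    for name, extra in expansion_on_random().items():
        print(f"{name}: {extra:+d} bytes relative to the input")
```

A codec that falls back to storing the data should add only a small, roughly constant overhead per block, while one that always codes its input will grow in proportion to the input size.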

Related

Re-order lines in textfile for better compression ratio

I have lots of huge text-files which need to be compressed with the highest ratio possible. Compression speed may be slow, as long as decompression is reasonably fast.
Each line in these files contains one dataset, and they can be stored in any order.
A Similar problem to this one:
Sorting a file to optimize for compression efficiency
But for me compression speed is not an issue. Are there ready-to-use tools to group similar lines together? Or maybe just an algorithm I can implement?
Sorting alone gave some improvement, but I suspect a lot more is possible.
Each file is about 600 million lines long, ~40 bytes each, 24GB total. Compressed to ~10GB with xz
Here's a fairly naïve algorithm:
Choose an initial line at random and write it to the compression stream.
While remaining lines > 0:
    Save the state of the compression stream
    For each remaining line in the text file:
        write the line to the compression stream and record the resulting compressed length
        roll back to the saved state of the compression stream
    Write the line that resulted in the lowest compressed length to the compression stream
    Free the saved state
This is a greedy algorithm and won't be globally optimal, but it should be pretty good at matching together lines that compress well when one follows the other. It's O(n^2), but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well but actually measures it.
If you use zlib, it provides a function deflateCopy that duplicates the state of the compression stream, although it's apparently pretty expensive.
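The steps above can be sketched in Python, whose zlib module exposes the same idea as deflateCopy through the copy() method of a compression object. This is an illustration under assumptions, not a tuned tool: the function name greedy_order is made up, the first line is used as the start instead of a random one so that runs are reproducible, and the growth of the stream is measured with a sync flush on a throwaway copy.

```python
import zlib

def greedy_order(lines):
    """Greedy O(n^2) ordering: repeatedly append the line that grows the stream least."""
    if not lines:
        return []
    remaining = list(lines)
    stream = zlib.compressobj(9)
    ordered = [remaining.pop(0)]  # the answer picks at random; fixed here for reproducibility
    stream.compress(ordered[0])   # output is discarded, only the ordering is wanted
    while remaining:
        best_index, best_size = 0, None
        for index, candidate in enumerate(remaining):
            trial = stream.copy()  # snapshot, so the real stream is never modified
            size = len(trial.compress(candidate) + trial.flush(zlib.Z_SYNC_FLUSH))
            if best_size is None or size < best_size:
                best_index, best_size = index, size
        chosen = remaining.pop(best_index)
        stream.compress(chosen)
        ordered.append(chosen)
    return ordered

if __name__ == "__main__":
    sample = [b"alpha,1,red\n", b"zulu,9,blue\n", b"alpha,2,red\n", b"zulu,8,blue\n"]
    print(greedy_order(sample))
```

Every candidate in a round is measured against the same snapshot, so the pending output common to all of them does not affect which line wins.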
Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between adjacent lines in the sequence, then this problem reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and all your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them to this. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest.
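As one concrete instance of the TSP view, here is a sketch of the nearest-neighbour heuristic with Levenshtein distance as the metric. Nearest neighbour is just one of many TSP heuristics and is chosen here only because it is short; the function names are assumptions of this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (classic dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def nearest_neighbour_order(lines):
    """Greedy TSP heuristic: always move to the closest line not yet visited."""
    if not lines:
        return []
    remaining = list(lines)
    tour = [remaining.pop(0)]
    while remaining:
        nearest = min(remaining, key=lambda line: edit_distance(tour[-1], line))
        remaining.remove(nearest)
        tour.append(nearest)
    return tour

if __name__ == "__main__":
    print(nearest_neighbour_order(["aaaa", "zzzz", "aaab", "zzzy"]))
```

Like the compression-based greedy loop, this is quadratic in the number of lines, so it is only practical on samples or on pre-sorted buckets of a file this large.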

Is there a binary kind of SVG?

It just seems to me that when writing code for dynamic data visualization, I end up doing the same things over and over in different languages/platforms. Now if I had a cross-platform language (which I do) and something like a binary version of SVG, I could make my code target that and use/create interpreters for whatever platform I currently need to use it on.
The reason I don't want SVG is because the plaintext part makes it too slow for my purposes. I could of course just create my own intermediary format but if there is something already out there that's implemented by various things then the less work for me!
Depending on what you mean by “too slow”, the answer varies:
Filesize too large
Officially, the closest thing SVG has to a binary format is SVGZ, which is a gzipped SVG file with the .svgz extension. All conforming SVG viewers should be able to open it. Making one is simple on *nix systems:
gzip yourfile.svg && mv yourfile.svg.gz yourfile.svgz
You could also try Brotli compression, which tends to have smaller filesize at the cost of more compression time.
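If you would rather produce the .svgz from code than from the shell, the gzip step can be sketched with Python's standard library. The function name svg_to_svgz is made up for this example, and unlike the gzip command above it leaves the original .svg in place.

```python
import gzip
import shutil
from pathlib import Path

def svg_to_svgz(svg_path):
    """Write a gzip-compressed copy of `svg_path` next to it with a .svgz suffix."""
    source = Path(svg_path)
    target = source.with_suffix(".svgz")
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)  # stream the file, no need to load it whole
    return target
```

Calling svg_to_svgz("yourfile.svg") writes yourfile.svgz in the same directory and returns its path.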
Including other assets is inefficient
SVG can only bundle bitmaps and other binary data through base64 encoding, which has a fair amount of overhead.
PDF can include “streams” of raw binary data, and is surprisingly efficient when programmatically generated.
Parsing the text data takes too long
This is tricky. PDF and its brother, Encapsulated PostScript, are also old, well-supported vector graphic formats. Unfortunately, they too are text at their core, with optional compression.
You could try Computer Graphics Metafiles, which can be compiled ahead of time. But I’m unsure how well-supported they are across consumer devices.
From a comment:
Almost nothing about the performance of SVG other than the transmission cost of sending it over a network is down to it being plaintext
No, that's completely wrong. I worked at CSIRO using XML for massive 3D models. GeoScience Australia did a formal study into the parsing speed - parsing floating point numbers from text is relatively expensive for big data sets, compared to reading a 4 or 8 byte binary representation.
I've spent a lot of time optimising my internal binary formats for Touchgram and am now looking at vector art.
One of the techniques you can use is a combination of
variable-length integer coding and
normalising your points to a scale represented by integers, then storing paths as sequences of deltas
That can yield paths where often only 1 or 2 bytes are used per step, as opposed to the typical 12.
Consider a basic line
<polyline class="Connect" points="100,200 100,100" />
I could represent that with 4 bytes instead of 53.
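To make the combination concrete, here is a sketch of delta coding plus variable-length integers in Python. It is not the author's format, which the post does not spell out: the zigzag mapping for negative deltas, the base-128 varint layout and all function names are assumptions of this example.

```python
def zigzag(n):
    """Map signed integers to unsigned ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1

def unzigzag(z):
    """Inverse of zigzag()."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def write_varint(value, out):
    """Append a non-negative integer as a base-128 varint, low 7 bits first."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)

def encode_path(points):
    """Encode integer (x, y) points as varint-coded deltas from the previous point."""
    out = bytearray()
    prev_x = prev_y = 0
    for x, y in points:
        write_varint(zigzag(x - prev_x), out)
        write_varint(zigzag(y - prev_y), out)
        prev_x, prev_y = x, y
    return bytes(out)

def decode_path(data):
    """Rebuild the list of (x, y) points produced by encode_path()."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(unzigzag(current))
            current, shift = 0, 0
    points, x, y = [], 0, 0
    for dx, dy in zip(values[0::2], values[1::2]):
        x, y = x + dx, y + dy
        points.append((x, y))
    return points
```

With this particular layout the polyline above, encode_path([(100, 200), (100, 100)]), comes out at 7 bytes rather than 4, because any value of 64 or more in magnitude needs a second byte. The 4-byte figure presumably relies on the normalised scale mentioned in the list, which would make the coordinates smaller.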
So far, all I've been able to find in binary SVG is this post about a Go project linking to the project description and repo
Adobe Flash SWF files may work. Due to its previous ubiquity, 'players' and libraries were written for many platforms. The specifications were open and license permitting. For simple 2D graphics, earlier, more compatible versions would do fine.
The files are binary and extraordinarily small.

Linux native config file format read efficiency

Is there a common practice for reading Linux config files in a native format efficiently? Files are handled as streams, so I guess the standard way to look up a key by parsing is O(n*m) (for n lines of average length m).
Is it common practice to build search trees from config files, or has this already been implemented, by e.g. QSettings?
Why is this a problem? A QSettings file usually holds only a few kB of data, which can be loaded and parsed in a fraction of a second. You don't really need to care about it; reading a few kB on a desktop takes too little time to measure.
The recommendation is usually to create a QSettings object when needed (on the stack) and use it to read when needed. This is what I see in some Qt applications. The Qt documentation mentions that this is lightweight. I usually maintain a global variable in my QMainWindow.
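If lookup cost ever did matter, the usual alternative to a search tree is to parse the file once into a hash map, after which each lookup is a constant-time dictionary access. A minimal sketch in Python, assuming INI-style content; the function name load_settings and the "section/key" naming are choices of this example, not something QSettings prescribes.

```python
import configparser

def load_settings(text):
    """Parse INI-style text once and return a flat dict keyed by 'section/key'."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        f"{section}/{key}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }

if __name__ == "__main__":
    settings = load_settings("[window]\nwidth = 800\nheight = 600\n")
    print(settings["window/width"])
```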

what is the equivalent of the DirectDraw Surface (DDS) format for opengl on linux?

The DDS format was made for DirectX, right? So I guess it is optimized for DirectX and not for OpenGL.
So, is there another format? If yes, which format is a good choice, and for what reasons?
Also, since I'm working on Linux, I also need to create textures on Linux, so I need a format that can be imported and exported by GIMP.
The DDS format is useful for storing compressed textures. If you store the file in the same compression as it will be stored in the GPU memory, you don't need to decode and re-encode for GPU storage, instead you can just move it directly to memory.
The DDS format is basically used to store S3 Texture Compression data. The internal DDS formats DXT3 and DXT5 are for example S3TC formats that are also supported by OpenGL:
http://www.opengl.org/wiki/S3_Texture_Compression
DDS also can store pre-calculated mipmaps. Again this is a trade-off (larger file size) for reducing loading times, as the mipmaps could also be calculated at loading time.
As you can see, if you have the right code to parse the DDS file, i.e. the payload is taken in its compressed form and not decoded on the host machine, then it is perfectly fine to use a DDS.
For an alternative, #CoffeeandCode pointed out the KTX format in his answer. These files use a different compression algorithm (see here). The advantage is that this compression is mandatory in newer OpenGL versions, while S3TC compression was always only available as an extension (and has patent problems). I don't know how they compare in quality and if you can expect OpenGL 4.3 on your target platforms.
My Take: If you are targeting recent hardware and OpenGL support (OpenGL ES 3 or OpenGL 4.3), you should use the KTX format and respective texture formats (libktx will generate the texture objects for you). If you need to be able to run on older hardware or happen to already have a lot of DDS data, you should probably stick with DDS for the time being.
There is nothing particularly D3D-"optimized" about DDS.
Once you read the header correctly, the (optionally) pre-compressed data is binary compatible with S3TC. Both DDS and OpenGL's equivalent (KTX) are capable of storing cubemap arrays and mipmaps in a single image file, that is their primary appeal.
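For a sense of what "reading the header correctly" involves, here is a sketch that pulls the basic fields out of the 128-byte legacy DDS header with Python's struct module. The function name read_dds_header is made up, the extended DX10 header is not handled, and the sketch stops before any OpenGL upload.

```python
import struct

DDS_MAGIC = b"DDS "
HEADER_SIZE = 128  # 4-byte magic followed by the 124-byte DDS_HEADER

def read_dds_header(data):
    """Return basic fields from the first 128 bytes of a DDS file."""
    if len(data) < HEADER_SIZE or data[:4] != DDS_MAGIC:
        raise ValueError("not a DDS file")
    # dwSize, dwFlags, dwHeight, dwWidth, dwPitchOrLinearSize, dwDepth, dwMipMapCount
    size, _flags, height, width, linear_size, _depth, mipmaps = struct.unpack_from("<7I", data, 4)
    if size != 124:
        raise ValueError("unexpected DDS_HEADER size")
    # The pixel format structure starts at byte 76; its FourCC sits at byte 84.
    fourcc = data[84:88]
    return {
        "width": width,
        "height": height,
        "mipmaps": mipmaps,
        "linear_size": linear_size,
        "fourcc": fourcc.decode("ascii", "replace"),  # e.g. "DXT1", "DXT3", "DXT5"
    }
```

Everything after byte 128 is the payload; for the DXT formats that is the block-compressed data that can be handed to the GPU without re-encoding.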
With other formats, you may wind up using the driver to compress and/or generate mipmaps, and the driver usually does a poor job quality wise. The drivers are usually designed to do this quickly because they have a lot of textures that need to be compressed / mipmapped. You can use a higher quality mipmap downsample filter / compression implementation offline since the amount of time required to complete is rather unimportant.
The key benefits of DDS / KTX are:
1. Offline computation of mipmaps
2. Offline compression
3. Store all layers of a texture array/cubemap
Doing (1) and (2) offline can both improve image quality and reduce the overhead of loading textures at run-time. (3) is mostly for convenience, but a welcomed one.
I think the closest OpenGL equivalent to DirectX's DDS is KTX, but even DDS works fine under OpenGL once parsed.

visualization of compressed (deflated, gzipped) content structures

I have some ideas I would like to experiment with relating to data compression, but am finding it difficult to decipher some parts of how the standards are applied "in real life". I would like to look at some sample compressed files to observe how the blocks are arranged and the Huffman tree(s) are structured.
Are there any tools in existence which can help visualize this for a given compressed file (zip/gzip/deflate etc)? I'm picturing something like a tree view or some form of graph visualizer.
You might be interested in this (if you are still interested that is :-P)
http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/
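As a small starting point for looking at block structure yourself, the first three bits of a raw DEFLATE stream already tell you whether the first block is stored, fixed-Huffman or dynamic-Huffman. The following sketch uses Python's zlib to produce a raw stream and decodes just that header; the function name first_block_header is made up, and walking the later blocks would need a full bit-level parser.

```python
import zlib

BLOCK_TYPES = {0: "stored", 1: "fixed Huffman", 2: "dynamic Huffman", 3: "reserved"}

def first_block_header(data, level):
    """Compress `data` as raw DEFLATE and decode the 3 header bits of its first block."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)  # wbits=-15: no zlib/gzip wrapper
    raw = compressor.compress(data) + compressor.flush()
    first = raw[0]
    bfinal = first & 1        # bit 0: set on the last block of the stream
    btype = (first >> 1) & 3  # bits 1-2: block type
    return bfinal, BLOCK_TYPES[btype]

if __name__ == "__main__":
    for level in (0, 1, 9):
        print(level, first_block_header(b"to be or not to be " * 200, level))
```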
I made an "entropy image" tool.
The entropy_image tool replaces each pixel with the (estimated) number of bits necessary to encode that pixel using range coding or Huffman compression.
I hope this isn't the only compression visualization tool in the world.
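The same idea can be tried on arbitrary bytes rather than pixels. The sketch below is not the entropy_image tool itself: it assumes the simplest possible model, where the estimated cost of a byte is -log2 of its overall frequency in the input, and the function name bit_costs is made up for this example.

```python
import math
from collections import Counter

def bit_costs(data):
    """Estimated cost in bits of each byte under an order-0 (frequency-only) model."""
    counts = Counter(data)
    total = len(data)
    cost = {symbol: -math.log2(count / total) for symbol, count in counts.items()}
    return [cost[symbol] for symbol in data]

if __name__ == "__main__":
    sample = b"aaaaaaab"
    for byte, bits in zip(sample, bit_costs(sample)):
        print(chr(byte), f"{bits:.2f} bits")
```

Plotting these per-byte costs as brightness values gives a rough picture of which regions of a file an entropy coder would find cheap or expensive.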

Resources