When to form a new DEFLATE block?

When compressing a file or directory into a zip file using DEFLATE, when should a new DEFLATE block be formed? Furthermore, since the maximum code length is 15 bits in DEFLATE, should a new block be formed whenever the Huffman tree exceeds a depth of 15? Thanks!

Whenever you like, but not too often.
No. You can squash the Huffman tree.
zlib emits a deflate block once a selected number of literals plus length/distance pairs have been generated. By default, that number is 16383. It can be changed via the memory-usage option. At the end, the last block carries whatever remains.
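In Node's zlib binding, for example, that knob surfaces as the memLevel option. A minimal sketch (the exact symbol-buffer sizing is an internal detail of zlib):

var zlib = require('zlib');
// memLevel trades memory for compression ratio/speed; it also sizes the
// internal symbol buffer, which is what determines how many literals and
// length/distance pairs accumulate before a block is emitted. The default
// memLevel of 8 corresponds to the 16383-symbol figure above.
var deflate = zlib.createDeflate({ memLevel: 8 });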
zopfli tries to be more intelligent by making large blocks and splitting them so long as the compression ratio goes up, stopping when the next split would make the compression ratio go down.
You don't want deflate blocks to be too small, because then the size of the dynamic header describing the codes used in the block will become a significant factor in the size, reducing the compression ratio. You don't want the blocks to be too large, because then the codes, fixed for the duration of the block, will not be able to adapt to local statistical variations in the data being compressed.
As for the maximum depth, zlib and other deflators will happily make blocks for which a code has a depth greater than 15 by the normal Huffman algorithm. They will then squash the code down to make the depth 15.
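For illustration, a minimal sketch of that squashing step, modeled loosely on gen_bitlen() in zlib's trees.c (here counts[b] is the number of symbols the unconstrained Huffman algorithm assigned code length b; the function name is mine):

// Clamp code lengths to maxBits, then repair the tree so the Kraft
// inequality holds again, mirroring zlib's overflow fix-up.
function limitCodeLengths(counts, maxBits) {
  // Fold every level deeper than maxBits into maxBits, counting how
  // many leaves now over-subscribe the tree.
  var overflow = 0;
  for (var b = counts.length - 1; b > maxBits; b--) {
    overflow += counts[b];
    counts[maxBits] += counts[b];
    counts[b] = 0;
  }
  while (overflow > 0) {
    var bits = maxBits - 1;
    while (counts[bits] === 0) bits--;
    counts[bits]--;        // move one leaf down the tree
    counts[bits + 1] += 2; // it pairs up with one over-deep leaf as siblings
    counts[maxBits]--;
    overflow -= 2;
  }
  return counts;
}

zlib then reassigns actual code lengths to symbols from the repaired counts; the resulting codes are slightly longer than optimal but stay within 15 bits.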


Node zlib incremental inflate

I've located the end of a local file header in the download stream of a large zip file. It specifies deflate compression with bit 3 set, indicating that the length of the compressed data follows the compressed data. I would now like to inflate that data using Node zlib, but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self-terminated.
Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?
Or is this an insane thing to do, because it would imply I'm inflating on the UI thread, and what I should really do is save the downloaded file and, once downloaded, use an NPM package? Hm.. well.. either the network is faster than inflation, in which case streaming inflation would slow the network (bummer), or the network is slower than streaming inflation, so why inflate while streaming (which I can't figure out how to do anyway) when I could simply save to disk and reload-and-inflate while I'm sitting around waiting for the network..
Still, for my edification, I'd like to know if Node supports streaming inflation.
var zlib = require('zlib');
var fs = require('fs');

var data = bufferOfChunkOfDeflatedData; // a Buffer holding part of the compressed data
var inflate = zlib.createInflateRaw();  // zip entries are raw deflate (no zlib header)
var stream = inflate.pipe(fs.createWriteStream(path)); // path: output file
var result = inflate.write(data);       // write to the inflater, not the file stream
// ...but result only signals backpressure; it doesn't indicate
// whether the inflate stream has terminated.
Describes deflate headers and how they encode the length of the stream:
https://www.bolet.org/~pornin/deflate-flush-fr.html
In memory stream:
https://www.npmjs.com/package/memory-streams
Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152
Node code for configuring convenience method:
https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83
Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback
Eh, looks like Node is set up to return the decompressed buffer as one block to the callback; it doesn't look like Node is set up to figure out the end of the deflate stream.
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says "The callback function must be called only when the current chunk is completely consumed," and here's the spot where it passes the chunk to zlib: https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the stream was only partially consumed..
But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153
And from the zip spec:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
So looks like everyone just looks for the sig.
Mark says that's a no-no... So don't do that. And know that if you're using an NPM lib to unzip, there's a good chance the lib is doing that. To do it right would require, I think, grokking this from the zlib API docs: https://zlib.net/manual.html
The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.
This seems to indicate the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate that the header starting with the magic sig, the one everyone is using but shouldn't, is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
How can the end of the deflate stream not be byte aligned but the following data descriptor be byte aligned? (Presumably the unused bits of the final byte are just padding, so the descriptor starts at the next byte boundary.)
Is there a nice reference implementation?
Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c
This guy reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?
This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
I don't see why that would be the case. The streams self-terminate... and Mark verifies they can be streamed.
And here is where Node.js checks for Z_STREAM_END!
It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.
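Putting that together, a minimal sketch of streaming inflation (it assumes Node's InflateRaw ends its readable side once the engine reports Z_STREAM_END, which the zlib.js source above suggests; handling of trailing bytes past the terminator may vary by Node version):

var zlib = require('zlib');

var inflate = zlib.createInflateRaw(); // zip entries are raw deflate
var ended = false;

inflate.on('data', function (chunk) {
  // ...consume inflated output...
});
inflate.on('end', function () {
  ended = true;
  // bytesWritten counts compressed bytes consumed by the engine, so the
  // data descriptor should begin at that offset in the input.
  console.log('deflate stream ended after', inflate.bytesWritten, 'bytes');
});

// Feed arbitrary download chunks until the stream terminates itself.
function feed(chunk) {
  if (!ended) inflate.write(chunk);
}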

Deflating data from MSZIP format

I'm trying to read a compressed binary .x mesh file, but my decompression is failing. The file is basically some DirectX header info and then a bunch of data in MSZIP format (i.e. 2 bytes are an int blockSize, 2 bytes are a "magic number", and then there are blockSize deflated bytes, repeating until there's no more data), so for each block I'm just getting the compressed bytes and inflating like so:
using System.IO;
using System.IO.Compression;

internal static byte[] DecompressBlock(byte[] data) {
    using (var ms = new MemoryStream(data))
    using (var ds = new DeflateStream(ms, CompressionMode.Decompress))
    using (var outStream = new MemoryStream()) {
        ds.CopyTo(outStream);
        // ToArray(), not GetBuffer(): GetBuffer() returns the whole internal
        // buffer, including unused trailing (zero) bytes.
        return outStream.ToArray();
    }
}
The first block inflates as expected. Subsequent blocks are the right inflated size but, seemingly randomly, some bytes are 0 when they shouldn't be, usually in groups of 4-12.
How can I inflate different blocks of zipped data while maintaining the same history buffer?
UPDATE: After a little more research, it looks like in MSZIP compression these blocks ARE the results of separate deflate operations, but the "history buffer" is maintained between them. I don't know if DeflateStream is going to be able to handle this. Updated the actual question.
Yes, there is something that you are missing. Each deflate block can and does use history from the previous deflate block. So at each block you must initialize the deflate dictionary with the last 32K of uncompressed data from the previous blocks.
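In zlib terms, that means calling inflateSetDictionary() before every block after the first. Node's zlib binding exposes the same thing as the dictionary option, so a minimal sketch there (assuming, per the update above, that each MSZIP block is a complete raw deflate stream) might look like:

var zlib = require('zlib');

// blocks: array of Buffers, each holding one MSZIP block's deflate bytes.
function inflateMszip(blocks) {
  var out = [];
  var history = Buffer.alloc(0);
  for (var i = 0; i < blocks.length; i++) {
    // Prime the inflater with up to 32K of previously decoded output.
    var opts = history.length ? { dictionary: history } : {};
    var inflated = zlib.inflateRawSync(blocks[i], opts);
    out.push(inflated);
    history = Buffer.concat([history, inflated]).slice(-32768);
  }
  return Buffer.concat(out);
}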
If anyone is trying to read/write the MSZIP format for DirectX meshes using C# only, I found out it is possible to do it with SharpZipLib.
For reference, the format of the compressed DirectX file is the following:
1. 16 bytes -> DirectX header
2. 4 bytes -> total uncompressed file size (including the 16 bytes of its header)
3. 2 bytes -> size of uncompressed block, up to 32KB
4. 2 bytes -> size of compressed block, plus 2 (because it includes the magic number's size)
5. 2 bytes -> magic number ("CK" in ASCII)
6. a compressed block
Parts 3-6 are repeated until the end of the file. In practice, all blocks but the last will be 32KB when uncompressed.
To decompress the file, you will have to implement the logic that extracts each compressed block, then give them to SharpZipLib's Zip.Compression.Inflater class. Reuse the same inflater for all blocks, but call its Reset method between each block.
This partly works, but you may get the "broken uncompressed block" error for some blocks. To overcome this problem, you will have to modify SharpZipLib's source, specifically the file Inflater.cs. The change is trivial - all you have to do is to disable/skip the DECODE_STORED_LEN2 case and go directly to DECODE_STORED instead.
To compress a file, split its contents into 32KB blocks, and the same logic applies: feed each block to the same Deflater and call Reset between each call. Again, you will have to modify the file DeflaterHuffman.cs, removing the line pending.WriteShort(~storedLength).

How can we distinguish deflate stream from deflateRaw stream?

Some HTTP servers send deflate raw body (without zlib headers) instead of actual deflate body. See discussion at: Why do real-world servers prefer gzip over deflate encoding?
Is it possible to detect them and handle inflation properly in Node.js? I mean, besides trying createInflate on them, catching the error, and then trying createInflateRaw.
If the first byte has a low nybble of 8 (in hex), then it is a zlib stream; otherwise it is a raw deflate stream. (Assuming you know a priori that the only possible choices are a valid zlib stream or a valid raw deflate stream.) A raw deflate stream will never have an 8 in the low nybble of its first byte, but a zlib stream always will.
Background:
The zlib header format puts the compression method in the low nybble of the first byte. That compression method is always 8 for deflate.
The bit sequence in a raw deflate stream starts from the least significant bits of the bytes. If the first three bits are 000 (as they are for an 8), that signifies a stored (not compressed) block, and that it is not the last block. Stored blocks put the bytes of the input on byte boundaries. So the next thing the compressor does after writing the 000 bits is fill out the rest of the byte with zero bits to get to the next byte boundary. Therefore the next bit will never be a 1, so it is not possible for a valid deflate stream to have the first four bits be 1000, i.e. the first nybble be 8. (Note that the bits are read from the bottom up.)
The first (i.e. low) nybble of a valid deflate stream can only be 0..5 or a..d. If you see 6..9, e, or f, then it is not a valid deflate stream.
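A minimal sketch of that check in Node (assuming, as stated, that the input is either a valid zlib stream or a valid raw deflate stream):

var zlib = require('zlib');

function inflateAuto(buf) {
  // A zlib wrapper always has CM = 8 in the low nybble of the first byte;
  // a valid raw deflate stream never does. As a second sanity check, the
  // FCHECK bits make the first two bytes, read big-endian, a multiple of
  // 31 (RFC 1950).
  var isZlib = (buf[0] & 0x0f) === 0x08 &&
               ((buf[0] << 8) | buf[1]) % 31 === 0;
  return isZlib ? zlib.inflateSync(buf) : zlib.inflateRawSync(buf);
}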

How to modify a gzip compressed file

I've a single gzip-compressed file (100 GB uncompressed, 40 GB compressed). Now I would like to modify some bytes / ranges of bytes - I DO NOT want to change the file's size.
For example
Bytes 8 - 10 and bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file size makes no difference (the resulting gzip isn't laid out according to the original file's byte offsets anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin \
bytes8-10.bin \ bytes.zip
bytes11-4999.bin /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
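A minimal sketch of that split, using Node for the carving and an external zip tool for the archive (file names and boundaries follow the example above; the trailing part past byte 40000 is my addition so nothing is lost):

var fs = require('fs');
var execFileSync = require('child_process').execFileSync;

var buf = fs.readFileSync('original.bin'); // hypothetical input file
var parts = [
  ['bytes1-7.bin', buf.slice(0, 7)],
  ['bytes8-10.bin', buf.slice(7, 10)],
  ['bytes11-4999.bin', buf.slice(10, 4999)],
  ['bytes5000-40000.bin', buf.slice(4999, 40000)],
  ['rest.bin', buf.slice(40000)],
];
parts.forEach(function (p) { fs.writeFileSync(p[0], p[1]); });
// zip compresses each member independently, so later you can rewrite just
// one part and re-add it: zip bytes.zip bytes8-10.bin
execFileSync('zip', ['bytes.zip'].concat(parts.map(function (p) { return p[0]; })));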
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.

How to determine the compression level of DEFLATE?

There are ten different compression levels for DEFLATE (0 no compression & fastest, 9 best compression & slowest). What is the best way to determine such level for a raw DEFLATE data?
One obvious (yet slow) method would be to try each and compare sequentially. As a side question, is it guaranteed that the size of compressed data for a file is strictly non-increasing going from compression level 0 to 9? If so, binary search can speed up this procedure by a factor of two/three.
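For reference, a minimal sketch of that brute-force method (it assumes you also have the original uncompressed data, and that the stream came from the same zlib build with default settings; otherwise no level may match exactly):

var zlib = require('zlib');

function guessLevel(uncompressed, compressed) {
  for (var level = 0; level <= 9; level++) {
    // Recompress at each level and look for an exact byte-for-byte match.
    if (zlib.deflateRawSync(uncompressed, { level: level }).equals(compressed)) {
      return level;
    }
  }
  return -1; // no level reproduces the stream exactly
}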
If you only have the compressed data, it does not contain such information. The compression level is only configurable for compression, so it's not encoded in the raw compressed data.
However, if you use something like zlib, it does add a header which includes a rough indication of the compression level. From https://www.rfc-editor.org/rfc/rfc1950 :
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it
is there to indicate if recompression might be worthwhile.
If you don't use a library that adds an informational header, you could implement it yourself (if that's really needed for your application). It's just a matter of putting an extra byte or two (usually) at the beginning.
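A minimal sketch of reading FLEVEL from a zlib-wrapped stream (per RFC 1950; remember it is only a coarse hint, not the exact 0-9 level):

function readFlevel(buf) {
  var cmf = buf[0], flg = buf[1];
  if ((cmf & 0x0f) !== 8) throw new Error('not a zlib-wrapped deflate stream');
  if (((cmf << 8) | flg) % 31 !== 0) throw new Error('bad zlib header check');
  return flg >> 6; // 0 fastest, 1 fast, 2 default, 3 maximum/slowest
}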
Other than the slow method, no.
No, there is no guarantee that the compressed size is monotonic in the compression level. However, non-monotonic cases are pretty rare.
