How to determine the compression level of DEFLATE?

How to determine the compression level of DEFLATE? - zip

There are ten different compression levels for DEFLATE (0 no compression & fastest, 9 best compression & slowest). What is the best way to determine such level for a raw DEFLATE data?
One obvious (yet slow) method would be to try each and compare sequentially. As a side question, is it guaranteed that the size of compressed data for a file is strictly non-increasing going from compression level 0 to 9? If so, binary search can speed up this procedure by a factor of two/three.

If you only have compressed data, it does not contain such information. Compression level is only configurable for compression so it's not encoded in the compressed data.
However, if you use something like a zlib, it does add header which includes compression level. From https://www.rfc-editor.org/rfc/rfc1950 :
FLEVEL (Compression level)
These flags are available for use by specific compression
methods. The "deflate" method (CM = 8) sets these flags as
follows:
0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm
The information in FLEVEL is not needed for decompression; it
is there to indicate if recompression might be worthwhile.
If you don't use library that adds informational header, you could implement it yourself (if that's really needed for your application). It's just a matter of putting extra byte or two (usually) in the beginning.

Other than the slow method, no.
No, there is not a guarantee that the compressed size is monotonic. However not being monotonic is pretty rare.

Related

When to form a new DEFLATE block?

When compressing a file or directory into a zip file using DEFLATE, when should a new DEFLATE block be formed? Furthermore, since the maximum code length is 15 bits in DEFLATE, should a new block be formed whenever the Huffman tree exceeds a depth of 15? Thanks!

Whenever you like, but not too often.
No. You can squash the Huffman tree.
zlib emits a deflate block once a selected number of literals + length/distance pairs have been generated. By default, that number is 16383. It can be changed as part of memory usage option. At the end, the last block has whatever remains.
zopfli tries to be more intelligent by making large blocks and splitting them so long as the compression ratio goes up, stopping when the the next split would make the compression ratio go down.
You don't want deflate blocks to be too small, because then the size of the dynamic header describing the codes used in the block will become a significant factor in the size, reducing the compression ratio. You don't want the blocks to be too large, because then the codes, fixed for the duration of the block, will not be able to adapt to local statistical variations in the data being compressed.
As for the maximum depth, zlib and other deflators will happily make blocks for which a code has a depth greater than 15 by the normal Huffman algorithm. They will then squash the code down to make the depth 15.

Node zlib incremental inflate

I've located the end of a local file header in download stream of a large zip file that
specifies deflate compression with
bit 3 set indicating the length of the compressed data follows the compressed data
and would like to now inflate that data using Node zlib but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self terminated.
Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?
Or is this an insane thing to do because it would imply I'm inflating on the UI thread and what I should really do is save the downloaded file and once downloaded use an NPM package? Hm.. well.. either the network is faster than inflation in which case streaming inflation would slow the network (bummer) or the network is slower than streaming inflation so why deflate while streaming (which I can't figure out how to do anyway) when I could simply saving to disk and reload-deflate while I'm sitting around waiting for the network..
Still, for my edification, I'd still like to know if Node supports streaming inflation.
var zlib = require('zlib')
var data = bufferOfChunkOfDeflatedData
var inflate = zlib.createInflate();
var stream = inflate.pipe(fs.createWriteStream(path));
var result = stream.write(data);
// but result doesn't indicate if the inflate stream has terminated...
Describes deflate headers and how they encode the length of the stream:
https://www.bolet.org/~pornin/deflate-flush-fr.html
In memory stream:
https://www.npmjs.com/package/memory-streams
Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152
Node code for configuring convenience method:
https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83
Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback
Eh, looks like node is setup to return the decompressed buffer as one block to the callback; Doesn't look like node is setup to figure out the end of the deflate stream.
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says The callback function must be called only when the current chunk is completely consumed. and here's the spot where it passes the chunk to zlib https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the stream was partially consumed..
But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153
And from the zip spec:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
So looks like everyone just looks for the sig.
Mark says that's a no-no... So don't do that. And know that if your using an NPM lib to unzip, then there's a good chance the lib is doing that. To do it right would require, I think, grocking this from the zlib API docs: https://zlib.net/manual.html
The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.
This seems to indicate the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate header that starts with the magic sig, the one everyone is using but shouldn't, is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
How can the end of the deflate stream not be byte aligned but the following Data descriptor be byte aligned?
Is there a nice reference implementation?
Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c
This guys reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?
This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
I don't see why that would be the case. The streams describe their length... and Mark verifies they can be streamed.
And here is where Node.js checks for Z_STREAM_END!

It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.

using puff.c with deflate .zip

I have a little project with an STM32 where I send a file via UART and store it at a defined address in the flash memory. This works. Now I want to modify this and store a compressed file in the flash and uncompress it to a defined address somewhere else in the flash. I use 7zip to compress the file as .zip type with deflate method. As I understand correctly the data in the zip file are after the file name and the extra field. So I use a offset value in the local int bits(struct state *s, int need) function. After the start of the puff function I get last = 0 and type = 2 which looks ok. But in the while function in local int dynamic(struct state *s) I get a hard fault. So my questions are:
Is it ok for me to use a .zip file with deflate method?
Is it correct that the deflate data (including the 3 block bits) starts after the extra field?
Regards,
Tobias

Yes, you can use zip to create the file, making sure you use the deflate compression method (the zip format permits other methods), and then puff to decompress the compressed data. However you then have to write some code to find the start of the compressed data, which is not necessarily just a simple offset. You need to decode the local header of the first entry of the zip file, which has variable-length fields. You can find the zip file format documented here.
Since you are compressing just one file, it would be simpler and the result smaller to use gzip instead of zip. The header is simpler, and easier to decode, but still has variable-length fields. Better still would be to use zlib to compress to a zlib stream, which has a fixed-length two-byte header, for the most compact and simple-to-process format. zlib is not a utility, but rather a compression and decompression library for which you would write your own code to do the compression.
In all cases you should check the integrity of the decompressed data using the check values in the respective format (zip, gzip, or zlib).
Also you have the option of using inflate from zlib instead of puff. Puff may be best for your embedded application, since it is small both in code size and memory required for decompression. However if speed is important, inflate is a fair bit faster than puff, at the expense of more code and more memory required.

nodejs crypto module, does hash.update() store all input in memory

I have an API route that proxies a file upload from the browser/client to AWS S3.
This API route attempts to stream the file as it is uploaded to avoid buffering the entire contents of the file in memory on the server.
However, the route also attempts to calculate an MD5 checksum of the file's body. As each part of the file is chunked, the hash.update() method is invoked w/ the chunk.
http://nodejs.org/api/crypto.html#crypto_hash_update_data_input_encoding
var crypto = require('crypto');
var hash = crypto.createHash('md5');
function write (chunk) {
// invoked many times as file is uploaded
hash.update(chunk);
}
function done() {
// will hash buffer all chunks in memory at this point?
hash.digest('hex');
}
Will the instance of Hash buffer all the contents of the file in order to perform the hash calculation (thus defeating the goal of avoiding buffering the entire file's contents in memory)? Or can an MD5 hash be calculated incrementally, without ever having the entire input available to perform the calculation?

MD5 and some other hash functions are based on the Merkle–Damgård construction. It supports the incremental/progressive/streaming hashing of data. After the data is transformed into an internal state (which has a fixed size) a last finalization step is performed to generate the final hash by padding and processing the last block and afterwards by simply returning the final state.
This is probably also why many hashing library functions are designed in such a way with an update and a finalization step.
To answer your question: No, the file content is not kept in a buffer, but is rather transformed into a fixed size internal state.

All modern cryptographic hash functions are created in such a way that they can be updated incrementally.
To allow for incremental updates, the input data of the message is first arranged in blocks. These blocks are processed in order. To do this the implementation usually buffers the input internally until it has a full block, and then processes this block together with the current state to produce a new state, using a so called compression function. The initial state usually simply consists of predetermined constant values. During the call to digest the last block is padded - usually with bit padding and an encoding of the amount of processed bytes - and the final state is calculated; this may require an additional block without any message data. A final operation may be performed and finally the resulting hash value is returned.
For MD5 the Merkle–Damgård construction is used. This common construction is also used for SHA-1 and SHA-2. SHA-2 is a family of hashes based on the algorithms for SHA-256 (SHA-224) and SHA-512 (SHA-384, SHA-512/224 and SHA-512/256). MD5 in particular uses a block size of 512 bits and a internal state of 128 bits. The internal state of the last block (including padding) is simply output directly without any post-processing for MD5, SHA-1, SHA-256 and SHA-512.
Keccak has been chosen to be SHA-3. It is construction based on a sponge, a specific compression function. It isn't a Merkle–Damgård hash - which is a big reason why it has been chosen as SHA-3. It still has all the update properties of Merkle–Damgård hashes and has been designed to be compatible with SHA-2. It splits up and buffers blocks just like the previously mentioned hashes, but it has a larger internal state and performs final operations on the output, making it arguably more secure.
So when you were using a modern hash construction such as MD5 you were unknowingly performing additional buffering. Fortunately, the buffering of a single block of 512 bits + 128 bits for the state size will not likely make you run out of memory. It is certainly not required for the hash implementation to buffer the entire message before the final hash value can be calculated.
Notes:
MD5 and SHA-1 are considered insecure w.r.t. collision resistance and they should preferably not be used anymore, especially when it comes to validating contents;
A "compression function" is a specific cryptographic notion; it is not
LSZIP or anything similar;
There may be specialized, theoretical hashes that perform the calculate the values differently - theoretically speaking there is no requirement to split the input messages into blocks and operate on the blocks sequentially. No worry, those are unlikely to be in the libraries you are using;
Similarly, implementations may decide to buffer more blocks at once, but that is fortunately extremely uncommon as well. Commonly only one block is used as buffer - in some cases it could be more performant to buffer a few blocks instead;
Some low level implementations may require you to supply the blocks yourself for reasons of efficiency.

How to modify a gzip compressed file

i've a single gzip compressed file (100GB uncompressed 40GB compressed). Now i would like to modify some bytes / ranges of bytes - i DO NOT want to change the files size.
For example
Bytes 8 + 10 and Bytes 5000 - 40000
is this possible without recompressing the whole file?
Stefan

Whether you want to change the file sizes makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin \
bytes8-10.bin \ bytes.zip
bytes11-4999.bin /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.

In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.

The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string