Compressed file size after deflate - zip

I am using the deflate function in the zlib library to compress a file. How can I determine the size of the compressed file? Is it the total_out field that indicates the size of the compressed file?

If you are using deflate() correctly, then you are accumulating or writing the compressed output, and can add up the number of output bytes yourself. At each call, the amount of output is strm.avail_out before the deflate() call minus strm.avail_out after the call. See zpipe.c for an example of the usage of deflate() and inflate().
You can use strm.total_out for the total size of the compressed output if you know that that size will fit in an unsigned long.
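As a rough illustration of adding up the output yourself, here is a minimal sketch using Python's zlib binding rather than the C API. There is no strm.avail_out to subtract in that binding, so the length of each returned chunk plays that role. The function name deflate_and_count and the write parameter are made up for this example.

```python
import zlib


def deflate_and_count(chunks, write, level=6):
    """Compress an iterable of byte chunks, hand each piece of output to
    write(), and return the total number of compressed bytes produced.

    deflate_and_count and write are hypothetical names for this sketch.
    """
    comp = zlib.compressobj(level)
    total = 0
    for chunk in chunks:
        out = comp.compress(chunk)   # may be empty while zlib buffers input
        write(out)
        total += len(out)
    tail = comp.flush()              # finish the stream (Z_FINISH)
    write(tail)
    total += len(tail)
    return total
```

Calling it with a file object's write method (or bytearray.extend) gives the compressed size without needing total_out.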

Related

When to form a new DEFLATE block?

When compressing a file or directory into a zip file using DEFLATE, when should a new DEFLATE block be formed? Furthermore, since the maximum code length is 15 bits in DEFLATE, should a new block be formed whenever the Huffman tree exceeds a depth of 15? Thanks!
Whenever you like, but not too often.
No. You can squash the Huffman tree.
zlib emits a deflate block once a selected number of literals + length/distance pairs have been generated. By default, that number is 16383. It can be changed through the memory usage option (memLevel). At the end, the last block has whatever remains.
zopfli tries to be more intelligent by making large blocks and splitting them so long as the compression ratio goes up, stopping when the next split would make the compression ratio go down.
You don't want deflate blocks to be too small, because then the size of the dynamic header describing the codes used in the block will become a significant factor in the size, reducing the compression ratio. You don't want the blocks to be too large, because then the codes, fixed for the duration of the block, will not be able to adapt to local statistical variations in the data being compressed.
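One way to get a feel for this trade-off is to vary memLevel and compare output sizes. The sketch below uses Python's zlib binding; deflate_with_mem_level is a hypothetical helper name. Note the assumption that memLevel also changes other internal buffers (such as the hash table), so any size difference is not due to block size alone.

```python
import zlib


def deflate_with_mem_level(data, mem_level, level=6):
    """Return data compressed as a zlib stream with the given memLevel (1..9).

    A smaller memLevel means fewer symbols are buffered before zlib emits
    a block, so blocks are shorter and block headers are more frequent.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS, mem_level)
    return comp.compress(data) + comp.flush()
```

Comparing len(deflate_with_mem_level(data, 1)) with len(deflate_with_mem_level(data, 8)) on your own data shows how much the more frequent block headers cost.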
As for the maximum depth, zlib and other deflators will happily make blocks for which a code has a depth greater than 15 by the normal Huffman algorithm. They will then squash the code down to make the depth 15.
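To make "squashing" concrete, here is a minimal sketch of one simple way to limit code lengths. It is not zlib's actual routine, just a heuristic based on the Kraft inequality: clamp every length to the maximum, then lengthen the deepest remaining shorter codes until the lengths describe a valid prefix code again. Both function names are invented for this example.

```python
import heapq


def huffman_lengths(freqs):
    """Unrestricted Huffman code lengths for a list of positive frequencies."""
    if len(freqs) == 1:
        return [1]
    # Heap items: (weight, tiebreak, symbols in this subtree)
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    counter = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # every symbol below the new node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


def limit_lengths(lengths, max_bits=15):
    """Squash code lengths to at most max_bits (simple heuristic, not zlib's).

    Assumes the number of symbols does not exceed 2**max_bits.
    """
    lengths = [min(l, max_bits) for l in lengths]
    # Kraft sum scaled by 2**max_bits; a valid prefix code needs kraft <= 2**max_bits
    kraft = sum(1 << (max_bits - l) for l in lengths)
    while kraft > (1 << max_bits):
        # Lengthen the deepest code that is still shorter than max_bits
        i = max((j for j, l in enumerate(lengths) if l < max_bits),
                key=lambda j: lengths[j])
        lengths[i] += 1
        kraft -= 1 << (max_bits - lengths[i])
    return lengths
```

Fibonacci-like frequencies are a convenient input for trying this, since they produce a maximally skewed tree.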

Node zlib incremental inflate

I've located the end of a local file header in the download stream of a large zip file. The header specifies deflate compression and has bit 3 set, indicating that the length of the compressed data follows the compressed data. I would now like to inflate that data using Node zlib, but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self-terminated.
Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?
Or is this an insane thing to do, because it would imply I'm inflating on the UI thread, and what I should really do is save the downloaded file and, once it is downloaded, use an NPM package? Hm, well: either the network is faster than inflation, in which case streaming inflation would slow the network (bummer), or the network is slower than streaming inflation, so why inflate while streaming (which I can't figure out how to do anyway) when I could simply save to disk, then reload and inflate, while I'm sitting around waiting for the network?
Still, for my edification, I'd like to know if Node supports streaming inflation.
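const fs = require('fs');
const zlib = require('zlib');
// bufferOfChunkOfDeflatedData and path are placeholders for a downloaded chunk and the output file
const data = bufferOfChunkOfDeflatedData;
const inflate = zlib.createInflate();
inflate.pipe(fs.createWriteStream(path));
// write the compressed chunk to the inflate stream, not to the file stream
const result = inflate.write(data);
// but result doesn't indicate if the inflate stream has terminated...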
Describes deflate headers and how they encode the length of the stream:
https://www.bolet.org/~pornin/deflate-flush-fr.html
In memory stream:
https://www.npmjs.com/package/memory-streams
Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152
Node code for configuring convenience method:
https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83
Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback
Eh, looks like Node is set up to return the decompressed buffer as one block to the callback. It doesn't look like Node is set up to figure out the end of the deflate stream.
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says "The callback function must be called only when the current chunk is completely consumed." And here's the spot where it passes the chunk to zlib: https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the stream was partially consumed.
But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153
And from the zip spec:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
So looks like everyone just looks for the sig.
Mark says that's a no-no... So don't do that. And know that if you're using an NPM lib to unzip, then there's a good chance the lib is doing that. To do it right would require, I think, grokking this from the zlib API docs: https://zlib.net/manual.html
The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.
This seems to indicate that the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate that the header that starts with the magic sig (the one everyone is using but shouldn't) is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
How can the end of the deflate stream not be byte aligned but the following Data descriptor be byte aligned?
Is there a nice reference implementation?
Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c
This guy reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?
This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
I don't see why that would be the case. The streams describe their length... and Mark verifies they can be streamed.
And here is where Node.js checks for Z_STREAM_END!
It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.
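For comparison, here is a minimal sketch of the same idea in Python's zlib binding (not Node's), where the decompressor object reports the end of the deflate stream directly. The eof attribute becomes true once zlib has returned Z_STREAM_END, and unused_data holds whatever bytes followed the stream in the chunk where it ended. Because input is handed over in whole bytes, any padding bits in the last byte of the deflate stream are skipped and the leftover data starts on a byte boundary. The function name inflate_until_end is made up for this example.

```python
import zlib


def inflate_until_end(chunks):
    """Feed raw-deflate chunks to zlib until the stream terminates itself.

    Returns (inflated bytes, bytes that followed the deflate stream,
    whether the end of the stream was seen).
    """
    d = zlib.decompressobj(-zlib.MAX_WBITS)   # negative wbits: raw deflate, as in zip entries
    out = bytearray()
    leftover = bytearray()
    for chunk in chunks:
        if d.eof:
            leftover += chunk                 # stream already ended; keep the rest as-is
            continue
        out += d.decompress(chunk)
        if d.eof:
            leftover += d.unused_data         # bytes after the final deflate byte
    return bytes(out), bytes(leftover), d.eof
```

In a zip reader, the leftover bytes would begin with the data descriptor (with or without its optional signature), so the end of the entry comes from the deflate stream itself rather than from a signature search.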

using puff.c with deflate .zip

I have a little project with an STM32 where I send a file via UART and store it at a defined address in the flash memory. This works. Now I want to modify this and store a compressed file in the flash and uncompress it to a defined address somewhere else in the flash. I use 7zip to compress the file as .zip type with the deflate method. If I understand correctly, the data in the zip file comes after the file name and the extra field. So I use an offset value in the local int bits(struct state *s, int need) function. After the start of the puff function I get last = 0 and type = 2, which looks OK. But in the while loop in local int dynamic(struct state *s) I get a hard fault. So my questions are:
Is it ok for me to use a .zip file with deflate method?
Is it correct that the deflate data (including the 3 block bits) starts after the extra field?
Regards,
Tobias
Yes, you can use zip to create the file, making sure you use the deflate compression method (the zip format permits other methods), and then puff to decompress the compressed data. However you then have to write some code to find the start of the compressed data, which is not necessarily just a simple offset. You need to decode the local header of the first entry of the zip file, which has variable-length fields. You can find the zip file format documented here.
Since you are compressing just one file, it would be simpler and the result smaller to use gzip instead of zip. The header is simpler, and easier to decode, but still has variable-length fields. Better still would be to use zlib to compress to a zlib stream, which has a fixed-length two-byte header, for the most compact and simple-to-process format. zlib is not a utility, but rather a compression and decompression library for which you would write your own code to do the compression.
In all cases you should check the integrity of the decompressed data using the check values in the respective format (zip, gzip, or zlib).
Also you have the option of using inflate from zlib instead of puff. Puff may be best for your embedded application, since it is small both in code size and memory required for decompression. However if speed is important, inflate is a fair bit faster than puff, at the expense of more code and more memory required.
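The header decoding described in this answer can be sketched as follows. This is a host-side Python illustration, not embedded C, and first_entry_data is a hypothetical name. It assumes the standard 30-byte fixed part of the zip local file header followed by the variable-length file name and extra field, and it only performs the CRC check when flag bit 3 is clear, since otherwise the header's CRC field is not filled in.

```python
import struct
import zlib

# Fixed part of a zip local file header: 30 bytes, little-endian
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")


def first_entry_data(zip_bytes):
    """Return (offset of compressed data, inflated data) for the first entry."""
    (sig, _version, flags, method, _time, _date, crc, _csize, _usize,
     name_len, extra_len) = LOCAL_HEADER.unpack_from(zip_bytes, 0)
    if sig != 0x04034B50:
        raise ValueError("not a local file header")
    if method != 8:
        raise ValueError("entry is not deflate-compressed")
    # The compressed data starts after the variable-length name and extra field
    start = LOCAL_HEADER.size + name_len + extra_len
    d = zlib.decompressobj(-zlib.MAX_WBITS)   # raw deflate, the same input puff expects
    data = d.decompress(zip_bytes[start:])
    if not d.eof:
        raise ValueError("deflate stream is truncated")
    if not (flags & 0x08) and zlib.crc32(data) != crc:
        raise ValueError("CRC-32 mismatch")
    return start, data
```

The returned offset is the value to hand to puff as the start of its source buffer; a fixed offset only works if the name and extra field lengths never change.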

Deflating data from MSZIP format

I'm trying to read a compressed binary .x mesh file but my decompression is failing. The file is basically some DirectX header info and then a bunch of data in MSZIP format (i.e. 2 bytes are an int blockSize, 2 bytes are a "magic number", then there are blockSize deflated bytes, and this repeats until there's no more data), so for each block I'm just getting the compressed bytes and inflating them like so:
// Requires: using System.IO; using System.IO.Compression;
internal static byte[] DecompressBlock(byte[] data) {
    using (var ms = new MemoryStream(data))
    using (var ds = new DeflateStream(ms, CompressionMode.Decompress))
    using (var newStream = new MemoryStream()) {
        ds.CopyTo(newStream);
        // ToArray() returns only the bytes written; GetBuffer() would also include unused capacity
        return newStream.ToArray();
    }
}
The first block inflates as expected. Subsequent blocks are the right inflated size but, seemingly at random, some bytes are 0 when they shouldn't be, usually in groups of 4-12.
How can I inflate different blocks of compressed data while maintaining the same history buffer?
UPDATE: After a little more research, it looks like in MSZIP compression these blocks ARE the results of separate deflate operations, but the "history buffer" is maintained between them. I don't know if DeflateStream is going to be able to handle this. I have updated the actual question.
Yes, there is something that you are missing. Each deflate block can and does use history from the previous deflate block. So at each block you must initialize the deflate dictionary with the last 32K of uncompressed data from the previous blocks.
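A minimal sketch of that dictionary approach, written in Python rather than C# and assuming a Python version whose zlib.decompressobj accepts a zdict for raw deflate streams. The function name inflate_mszip_blocks is invented here, and the input is assumed to be the raw deflate payloads with the "CK" magic already stripped.

```python
import zlib


def inflate_mszip_blocks(payloads):
    """Inflate a sequence of raw-deflate payloads that share history.

    Each block gets a fresh decompressor whose dictionary is preset to the
    last 32K of everything decompressed so far.
    """
    out = bytearray()
    history = b""
    for payload in payloads:
        if history:
            d = zlib.decompressobj(-zlib.MAX_WBITS, history)
        else:
            d = zlib.decompressobj(-zlib.MAX_WBITS)   # first block has no history
        out += d.decompress(payload)
        out += d.flush()
        history = bytes(out[-32768:])                 # carry the window into the next block
    return bytes(out)
```

Without the preset dictionary, back-references that reach into the previous block have nothing to copy from, which would be consistent with the wrong bytes seen in later blocks.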
If anyone is trying to read/write the MSZIP format for DirectX meshes using C# only, I found out it is possible to do it with SharpZipLib.
For reference, the format of the compressed DirectX file is the following:
1. 16 bytes -> DirectX header
2. 4 bytes -> total uncompressed file size (including the 16 bytes of its header)
3. 2 bytes -> size of uncompressed block, up to 32KB
4. 2 bytes -> size of compressed block, plus 2 (because it includes the magic number's size)
5. 2 bytes -> magic number ("CK" in ascii)
6. a compressed block
Parts 3-6 are repeated until the end of file. In practice all blocks but the last will be 32KB when uncompressed.
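The layout above can be walked with a few lines of code. The following Python sketch only splits the file into blocks and does not inflate them. It assumes the integer fields are little-endian, which the description does not state, and split_mszip_blocks is a hypothetical name.

```python
import struct


def split_mszip_blocks(buf):
    """Split a compressed .x file into (uncompressed size, deflate payload) pairs.

    Returns (total uncompressed size from the file, list of blocks).
    Assumes little-endian integer fields.
    """
    pos = 16                                   # skip the 16-byte DirectX header
    (total_size,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    blocks = []
    while pos < len(buf):
        usize, csize = struct.unpack_from("<HH", buf, pos)
        pos += 4
        if buf[pos:pos + 2] != b"CK":
            raise ValueError("missing CK magic at offset %d" % pos)
        payload = buf[pos + 2:pos + csize]     # csize counts the 2 magic bytes
        if len(payload) != csize - 2:
            raise ValueError("truncated block at offset %d" % pos)
        blocks.append((usize, payload))
        pos += csize
    return total_size, blocks
```

Each payload is what would then be handed to the inflater, and the per-block uncompressed size gives a cheap sanity check on the result.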
To decompress the file, you will have to implement the logic that extracts each compressed block, then give them to SharpZipLib's Zip.Compression.Inflater class. Reuse the same inflater for all blocks, but call its Reset method between each block.
This partly works, but you may get the "broken uncompressed block" error for some blocks. To overcome this problem, you will have to modify SharpZipLib's source, specifically the file Inflater.cs. The change is trivial - all you have to do is to disable/skip the DECODE_STORED_LEN2 case and go directly to DECODE_STORED instead.
To compress a file, split its contents into 32KB blocks, and the same logic applies: feed each block to the same Deflater and call Reset between each call. Again, you will have to modify the file DeflaterHuffman.cs, removing the line pending.WriteShort(~storedLength).

How to modify a gzip compressed file

I have a single gzip-compressed file (100GB uncompressed, 40GB compressed). Now I would like to modify some bytes or ranges of bytes. I do NOT want to change the file's size.
For example:
Bytes 8 + 10 and Bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file sizes makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin         \
bytes8-10.bin         \  bytes.zip
bytes11-4999.bin      /
bytes5000-40000.bin  /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
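As a sketch of what such a different representation could look like, the following Python example stores the data as independently compressed fixed-size chunks, so a same-size patch only recompresses the chunks it touches. This is an illustration of the trade-off, not a gzip feature. The names pack, patch, and unpack and the 64K chunk size are arbitrary choices for this example.

```python
import zlib

CHUNK = 1 << 16  # uncompressed bytes per independently compressed chunk (arbitrary)


def pack(data, chunk=CHUNK):
    """Compress data as a list of independent zlib streams."""
    return [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def unpack(chunks):
    """Rebuild the original data from the list of chunks."""
    return b"".join(zlib.decompress(c) for c in chunks)


def patch(chunks, offset, new_bytes, chunk=CHUNK):
    """Overwrite bytes at offset without changing the uncompressed size.

    Only the chunks that overlap the patched range are decompressed and
    recompressed. Returns the indices of those chunks.
    """
    touched = []
    pos = offset
    remaining = bytes(new_bytes)
    while remaining:
        idx, start = divmod(pos, chunk)
        plain = bytearray(zlib.decompress(chunks[idx]))
        n = min(len(remaining), len(plain) - start)
        if n <= 0:
            raise ValueError("patch extends past the end of the data")
        plain[start:start + n] = remaining[:n]
        chunks[idx] = zlib.compress(bytes(plain))
        touched.append(idx)
        pos += n
        remaining = remaining[n:]
    return touched
```

The compressed chunks will generally change size after a patch, so a real file format would also need an index of chunk offsets. Independent chunks also compress less well than one continuous stream, which is the tradeoff mentioned above.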
