Node zlib incremental inflate - node.js

I've located the end of a local file header in download stream of a large zip file that
specifies deflate compression with
bit 3 set indicating the length of the compressed data follows the compressed data
and would like to now inflate that data using Node zlib but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self terminated.
Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?
Or is this an insane thing to do because it would imply I'm inflating on the UI thread and what I should really do is save the downloaded file and once downloaded use an NPM package? Hm.. well.. either the network is faster than inflation in which case streaming inflation would slow the network (bummer) or the network is slower than streaming inflation so why deflate while streaming (which I can't figure out how to do anyway) when I could simply saving to disk and reload-deflate while I'm sitting around waiting for the network..
Still, for my edification, I'd still like to know if Node supports streaming inflation.
var zlib = require('zlib')
var data = bufferOfChunkOfDeflatedData
var inflate = zlib.createInflate();
var stream = inflate.pipe(fs.createWriteStream(path));
var result = stream.write(data);
// but result doesn't indicate if the inflate stream has terminated...
Describes deflate headers and how they encode the length of the stream:
https://www.bolet.org/~pornin/deflate-flush-fr.html
In memory stream:
https://www.npmjs.com/package/memory-streams
Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152
Node code for configuring convenience method:
https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83
Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback
Eh, looks like node is setup to return the decompressed buffer as one block to the callback; Doesn't look like node is setup to figure out the end of the deflate stream.
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says The callback function must be called only when the current chunk is completely consumed. and here's the spot where it passes the chunk to zlib https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the stream was partially consumed..
But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153
And from the zip spec:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
So looks like everyone just looks for the sig.
Mark says that's a no-no... So don't do that. And know that if your using an NPM lib to unzip, then there's a good chance the lib is doing that. To do it right would require, I think, grocking this from the zlib API docs: https://zlib.net/manual.html
The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.
This seems to indicate the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate header that starts with the magic sig, the one everyone is using but shouldn't, is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
How can the end of the deflate stream not be byte aligned but the following Data descriptor be byte aligned?
Is there a nice reference implementation?
Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c
This guys reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?
This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
I don't see why that would be the case. The streams describe their length... and Mark verifies they can be streamed.
And here is where Node.js checks for Z_STREAM_END!

It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.

Related

using puff.c with deflate .zip

I have a little project with an STM32 where I send a file via UART and store it at a defined address in the flash memory. This works. Now I want to modify this and store a compressed file in the flash and uncompress it to a defined address somewhere else in the flash. I use 7zip to compress the file as .zip type with deflate method. As I understand correctly the data in the zip file are after the file name and the extra field. So I use a offset value in the local int bits(struct state *s, int need) function. After the start of the puff function I get last = 0 and type = 2 which looks ok. But in the while function in local int dynamic(struct state *s) I get a hard fault. So my questions are:
Is it ok for me to use a .zip file with deflate method?
Is it correct that the deflate data (including the 3 block bits) starts after the extra field?
Regards,
Tobias
Yes, you can use zip to create the file, making sure you use the deflate compression method (the zip format permits other methods), and then puff to decompress the compressed data. However you then have to write some code to find the start of the compressed data, which is not necessarily just a simple offset. You need to decode the local header of the first entry of the zip file, which has variable-length fields. You can find the zip file format documented here.
Since you are compressing just one file, it would be simpler and the result smaller to use gzip instead of zip. The header is simpler, and easier to decode, but still has variable-length fields. Better still would be to use zlib to compress to a zlib stream, which has a fixed-length two-byte header, for the most compact and simple-to-process format. zlib is not a utility, but rather a compression and decompression library for which you would write your own code to do the compression.
In all cases you should check the integrity of the decompressed data using the check values in the respective format (zip, gzip, or zlib).
Also you have the option of using inflate from zlib instead of puff. Puff may be best for your embedded application, since it is small both in code size and memory required for decompression. However if speed is important, inflate is a fair bit faster than puff, at the expense of more code and more memory required.

ZIP file format. How to read file properly?

I'm currently working on one Node.js project. I want to have an ability to read, modify and write ZIP file without saving it into FS (we receive it by TCP and send it back after modifications were made), and so far it looks like possible bocause of simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read file header, which contains field compressed size, and it could be the perfect way to read file data 1 by it's length. But it's actually not. This field may contain '0' or '0xFFFFFFFF', and those values don't describe its actual length. In that case we have to read file data without information about it's length. But how?..
Compression/Decopression algorithm descriptions looks pretty complex to me, and I plan to use ZLIB for compression itself anyway. So if something useful described there, then I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as
it comes off the socket, rather than reading the complete zip file into
memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode, the latter for zip files larger than 4Gig.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflated (8), use zlib for compression/decompression. You also need to support compression method stored (0). It gets used for very small files where compression isn't appropriate.

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory and time efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set amount of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and as it is able to successfully decode characters, it will emit them. The Node documentation is sufficient at describing how it works.

Deflating data from MSZIP format

I'm trying to read a compressed binary .x mesh file but my decompression is failing. The file is basically some directx header info and then a bunch of data in MSZIP format (i.e. 2 bytes are an int blockSize, 2 bytes are a "magic number" and then there are blockSize deflated bytes and then repeat until there's no more data) so for each block I'm just getting the compressed bytes and deflating like so-
internal static byte[] DecompressBlock(byte[] data) {
var result = new List<byte>();
var ms = new MemoryStream(data);
var ds = new DeflateStream(ms, CompressionMode.Decompress);
var newStream = new MemoryStream();
ds.CopyTo(newStream);
ds.Flush();
ds.Close();
return newStream.GetBuffer();
}
The first block deflates as expected. Subsequent blocks are the right inflated size but, seemingly randomly, some bytes are 0 when they shouldn't be, usually in in groups of 4-12.
How I can deflate different blocks of zipped data while maintaining the same history buffer?
UPDATE: After a little more research it looks like in MSZIP compression these blocks ARE the results of separate deflate operations but the "history buffer" is maintained between them I don't know if deflatestream is gonna be able to handle this. updated the actual question
Yes, there is something that you are missing. Each deflate block can and does use history from the previous deflate block. So at each block you must initialize the deflate dictionary with the last 32K of uncompressed data from the previous blocks.
If anyone is trying to read/write the MSZIP format for DirectX meshes using C# only, I found out it is possible to do it with SharpZipLib.
For reference, the format of the compressed DirectX file is the following:
16 bytes -> DirectX header
4 bytes -> total uncompressed file size (including the 16 bytes of its header)
2 bytes -> size of uncompressed block, up to 32KB
2 bytes -> size of compressed block, plus 2 (because it includes the magic number's size)
2 bytes -> magic number ("CK" in ascii)
a compressed block
Parts 3-6 are repeated until the end of file. In practice all blocks but the last will be 32KB when uncompressed.
To decompress the file, you will have to implement the logic that extracts each compressed block, then give them to SharpZipLib's Zip.Compression.Inflater class. Reuse the same inflater for all blocks, but call its Reset method between each block.
This partly works, but you may get the "broken uncompressed block" error for some blocks. To overcome this problem, you will have to modify SharpZipLib's source, specifically the file Inflater.cs. The change is trivial - all you have to do is to disable/skip the DECODE_STORED_LEN2 case and go directly to DECODE_STORED instead.
To compress a file, split its contents into 32KB blocks, and the same logic applies: feed each block to the same Deflater and call Reset between each call. Again, you will have to modify the file DeflaterHuffman.cs, removing the line pending.WriteShort(~storedLength).

How to modify a gzip compressed file

i've a single gzip compressed file (100GB uncompressed 40GB compressed). Now i would like to modify some bytes / ranges of bytes - i DO NOT want to change the files size.
For example
Bytes 8 + 10 and Bytes 5000 - 40000
is this possible without recompressing the whole file?
Stefan
Whether you want to change the file sizes makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin \
bytes8-10.bin \ bytes.zip
bytes11-4999.bin /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.

Resources