Using puff.c with a deflate-compressed .zip file

I have a small STM32 project in which I send a file via UART and store it at a defined address in flash memory. This works. Now I want to store a compressed file in flash and decompress it to a defined address elsewhere in flash. I use 7-Zip to compress the file as a .zip archive with the deflate method. If I understand correctly, the data in the zip file comes after the file name and the extra field, so I apply an offset in the function local int bits(struct state *s, int need). After puff starts I get last = 0 and type = 2, which looks OK. But in the while loop in local int dynamic(struct state *s) I get a hard fault. So my questions are:
Is it ok for me to use a .zip file with deflate method?
Is it correct that the deflate data (including the 3 block bits) starts after the extra field?
Regards,
Tobias

Yes, you can use zip to create the file, making sure you use the deflate compression method (the zip format permits other methods), and then puff to decompress the compressed data. However, you then have to write some code to find the start of the compressed data, which is not necessarily just a simple offset. You need to decode the local header of the first entry of the zip file, which has variable-length fields. You can find the zip file format documented here.
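To make the header decoding above concrete, here is a minimal sketch in JavaScript (Node.js), chosen only because it is the language of the other snippets on this page; the same arithmetic carries over directly to C on the STM32. The function name dataOffsetOfFirstEntry is made up for illustration. The field offsets (signature at 0, compression method at 8, file name length at 26, extra field length at 28, fixed part of 30 bytes) are taken from the local file header layout in the zip specification.

```javascript
// Sketch: find where the compressed data of the first zip entry begins.
// dataOffsetOfFirstEntry is a hypothetical helper, not part of puff or zlib.
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\x03\x04" read little-endian
const FIXED_HEADER_SIZE = 30;              // bytes before the file name

function dataOffsetOfFirstEntry(zip) {
  if (zip.length < FIXED_HEADER_SIZE) {
    throw new Error('too short to hold a local file header');
  }
  if (zip.readUInt32LE(0) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error('local file header signature not found');
  }
  const method = zip.readUInt16LE(8);       // 8 = deflate, 0 = stored
  const nameLength = zip.readUInt16LE(26);  // variable-length file name
  const extraLength = zip.readUInt16LE(28); // variable-length extra field
  return { method, offset: FIXED_HEADER_SIZE + nameLength + extraLength };
}

// Hand-built header for a deflated entry named "fw.bin" with a 4-byte extra field.
const header = Buffer.alloc(FIXED_HEADER_SIZE + 6 + 4);
header.writeUInt32LE(LOCAL_HEADER_SIGNATURE, 0);
header.writeUInt16LE(8, 8);
header.writeUInt16LE(6, 26);
header.writeUInt16LE(4, 28);
header.write('fw.bin', FIXED_HEADER_SIZE, 'latin1');

console.log(dataOffsetOfFirstEntry(header)); // { method: 8, offset: 40 }
```

The returned offset is what would be passed to puff as the start of the source buffer, instead of a hard-coded constant.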
Since you are compressing just one file, it would be simpler and the result smaller to use gzip instead of zip. The header is simpler, and easier to decode, but still has variable-length fields. Better still would be to use zlib to compress to a zlib stream, which has a fixed-length two-byte header, for the most compact and simple-to-process format. zlib is not a utility, but rather a compression and decompression library for which you would write your own code to do the compression.
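As a rough illustration of the two-byte zlib header mentioned above, the following Node.js sketch compresses a buffer with zlib.deflateSync and decodes the header fields. The helper name describeZlibHeader is invented for this example. It assumes no preset dictionary is used; with one, four more bytes would follow the header. After the two header bytes comes the raw deflate data that puff expects, and the last four bytes are the Adler-32 check value.

```javascript
const zlib = require('zlib');

// describeZlibHeader is a hypothetical helper that decodes the two header bytes.
function describeZlibHeader(stream) {
  const cmf = stream[0];
  const flg = stream[1];
  return {
    method: cmf & 0x0f,                      // 8 means deflate
    windowBits: (cmf >> 4) + 8,              // base-2 log of the window size
    hasPresetDictionary: (flg & 0x20) !== 0, // FDICT flag
    checkOk: ((cmf << 8) | flg) % 31 === 0,  // header check required by the format
  };
}

const original = Buffer.from('hello hello hello hello');
const zlibStream = zlib.deflateSync(original);
const info = describeZlibHeader(zlibStream);

// Skip the 2-byte header and drop the 4-byte Adler-32 trailer to get raw deflate data.
const rawDeflate = zlibStream.subarray(2, zlibStream.length - 4);
const roundTrip = zlib.inflateRawSync(rawDeflate).toString();

console.log(info, roundTrip);
```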
In all cases you should check the integrity of the decompressed data using the check values in the respective format (zip, gzip, or zlib).
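Since puff only inflates raw deflate data and does not verify anything, the check has to be done separately. The sketch below shows the idea for gzip in Node.js: a table-driven CRC-32 (the same CRC is used by zip and gzip) is compared against the trailer, which holds the CRC-32 and the uncompressed length modulo 2^32 as little-endian 32-bit values. The names crc32 and inflateGzipChecked are hypothetical, and the sketch assumes a gzip header without optional fields (FLG byte zero, so the header is exactly 10 bytes), which is an assumption about the input rather than something the format guarantees.

```javascript
const zlib = require('zlib');

// CRC-32 lookup table (reflected polynomial 0xEDB88320), as used by zip and gzip.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c; // the typed array stores it as an unsigned 32-bit value
}

function crc32(buf) {
  let crc = 0xffffffff;
  for (const byte of buf) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Inflate the deflate body without any built-in checking (as puff would),
// then compare the result against the CRC-32 and length in the gzip trailer.
function inflateGzipChecked(gz) {
  if (gz.length < 18 || gz[0] !== 0x1f || gz[1] !== 0x8b || gz[2] !== 8) {
    throw new Error('not a gzip deflate stream');
  }
  if (gz[3] !== 0) {
    throw new Error('optional header fields are not handled by this sketch');
  }
  const data = zlib.inflateRawSync(gz.subarray(10, gz.length - 8));
  const storedCrc = gz.readUInt32LE(gz.length - 8);
  const storedLength = gz.readUInt32LE(gz.length - 4);
  if (crc32(data) !== storedCrc || (data.length >>> 0) !== storedLength) {
    throw new Error('integrity check failed');
  }
  return data;
}

const gz = zlib.gzipSync(Buffer.from('firmware image bytes'));
console.log(inflateGzipChecked(gz).toString()); // firmware image bytes
```

For a zlib stream the same pattern applies with Adler-32 in the last four bytes, and for zip with the CRC-32 stored in the local header or data descriptor.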
You also have the option of using inflate from zlib instead of puff. Puff may be best for your embedded application, since it is small both in code size and in the memory required for decompression. However, if speed is important, inflate is a fair bit faster than puff, at the expense of more code and more memory.

Related

Reading ("tailing") the end of a huge (>300GB) gzipped text file

I have a text file that was originally >300GB in size, and even gzipped it is still >10GB. (It is a database export which ran for days and was then aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). The file does not grow any more, so I don't need to track changes or appended data, a.k.a. tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - |less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files, but also streams of data, it should be possible to search for an entry point somewhere in the file where to start uncompressing, without having to read the file header. Right?
No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
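A small Node.js experiment illustrates both halves of this answer. Cutting the tail off a gzip file and feeding it to the decompressor fails, while decoding from the start works and only the last few bytes of output need to be kept. The data here is synthetic, keepTail is a hypothetical helper, and for a file of this size the chunks would come from a streaming decompressor such as zlib.createGunzip() rather than from one in-memory buffer; the in-memory slicing below only stands in for those chunks.

```javascript
const zlib = require('zlib');

// Synthetic stand-in for the database export.
const lines = [];
for (let i = 0; i < 5000; i++) lines.push(`row ${i} exported`);
const gzFile = zlib.gzipSync(Buffer.from(lines.join('\n') + '\n'));

// Attempt 1: decompress only the last 1000 compressed bytes. This fails,
// because the tail has no gzip header and depends on everything before it.
let tailError = null;
try {
  zlib.gunzipSync(gzFile.subarray(gzFile.length - 1000));
} catch (err) {
  tailError = err;
}

// Attempt 2: decode everything from the start, but retain only the last
// `limit` bytes of output as chunks go by.
function keepTail(tail, chunk, limit) {
  const joined = Buffer.concat([tail, chunk]);
  return joined.length > limit ? joined.subarray(joined.length - limit) : joined;
}

const decoded = zlib.gunzipSync(gzFile);
let tail = Buffer.alloc(0);
for (let pos = 0; pos < decoded.length; pos += 16384) {
  tail = keepTail(tail, decoded.subarray(pos, pos + 16384), 200);
}
const lastLine = tail.toString().trimEnd().split('\n').pop();

console.log(tailError !== null, lastLine); // true row 4999 exported
```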
Quick followup on my own question: This is not possible using gzip without hackery (there are patches for gzip which compress in chunks and you can decode each chunk independently).
BUT you can use xz: with the lowest compression level (-0), the CPU load is comparable to gzip, and so is the compression ratio. And xz can actually decompress parts of a compressed file.
I will consider this for the future.

Node zlib incremental inflate

I've located the end of a local file header in the download stream of a large zip file. The header specifies deflate compression, with bit 3 set to indicate that the length of the compressed data follows the compressed data. I would now like to inflate that data using Node zlib, but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self-terminated.
Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?
Or is this an insane thing to do, because it would imply I'm inflating on the UI thread, and what I should really do is save the downloaded file and, once it is downloaded, use an NPM package? Hm.. well.. either the network is faster than inflation, in which case streaming inflation would slow the network (bummer), or the network is slower than streaming inflation, so why inflate while streaming (which I can't figure out how to do anyway) when I could simply save to disk, then reload and inflate while I'm sitting around waiting for the network..
Still, for my edification, I'd still like to know if Node supports streaming inflation.
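const fs = require('fs');
const zlib = require('zlib');
const data = bufferOfChunkOfDeflatedData;
// Zip entries hold raw deflate data (no zlib header), hence createInflateRaw.
const inflate = zlib.createInflateRaw();
inflate.pipe(fs.createWriteStream(path));
// Write compressed chunks to the inflate stream, not to the file stream.
const result = inflate.write(data);
// but result only signals backpressure; it doesn't indicate if the deflate stream has terminated...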
Describes deflate headers and how they encode the length of the stream:
https://www.bolet.org/~pornin/deflate-flush-fr.html
In memory stream:
https://www.npmjs.com/package/memory-streams
Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152
Node code for configuring convenience method:
https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83
Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback
Eh, looks like Node is set up to return the decompressed buffer as one block to the callback; it doesn't look like Node is set up to figure out the end of the deflate stream.
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says The callback function must be called only when the current chunk is completely consumed. and here's the spot where it passes the chunk to zlib https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the stream was partially consumed..
But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153
And from the zip spec:
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
So looks like everyone just looks for the sig.
Mark says that's a no-no... So don't do that. And know that if you're using an NPM lib to unzip, then there's a good chance the lib is doing that. To do it right would require, I think, grokking this from the zlib API docs: https://zlib.net/manual.html
The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.
This seems to indicate the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate that the header starting with the magic sig, the one everyone is using but shouldn't, is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9.1 This descriptor MUST exist if bit 3 of the general
purpose bit flag is set (see below). It is byte aligned
and immediately follows the last byte of compressed data.
This descriptor SHOULD be used only when it was not possible to
seek in the output .ZIP file, e.g., when the output .ZIP file
was standard output or a non-seekable device. For ZIP64(tm) format
archives, the compressed and uncompressed sizes are 8 bytes each.
How can the end of the deflate stream not be byte aligned but the following Data descriptor be byte aligned?
Is there a nice reference implementation?
Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c
This guy reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?
This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
I don't see why that would be the case. The streams describe their length... and Mark verifies they can be streamed.
And here is where Node.js checks for Z_STREAM_END!
It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.
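One way to observe where a raw deflate stream ends in Node.js, without searching for a signature, is sketched below. It relies on the info option of zlib.inflateRawSync, which returns the inflate engine alongside the output; after a synchronous call, engine.bytesWritten appears to hold the number of input bytes zlib consumed, so treat that reading as an assumption to verify on the Node version you use. The 16-byte descriptor is fabricated test data. On the byte-alignment question: a deflate stream can end in the middle of a byte, the leftover bits of that byte are simply unused, and whatever follows starts at the next byte boundary, which is why counting whole consumed bytes is enough.

```javascript
const zlib = require('zlib');

const payload = Buffer.from('streamed zip entry contents '.repeat(20));
const deflated = zlib.deflateRawSync(payload);

// Fabricated data descriptor placed right after the compressed data.
// Only the commonly used signature is filled in; the rest stays zero.
const descriptor = Buffer.alloc(16);
descriptor.writeUInt32LE(0x08074b50, 0);
const input = Buffer.concat([deflated, descriptor]);

// info: true makes the call return { buffer, engine } instead of just the buffer.
const { buffer, engine } = zlib.inflateRawSync(input, { info: true });

// Assumption: after the sync call, bytesWritten is the count of consumed input bytes,
// so it marks the first byte after the self-terminated deflate stream.
const consumed = engine.bytesWritten;
const nextSignature = input.readUInt32LE(consumed);

console.log(consumed === deflated.length, nextSignature.toString(16));
```

If this holds, the bytes from consumed onward can be parsed as the data descriptor, with or without its optional signature, and checked against the CRC-32 and lengths it contains.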

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the file system (we receive it over TCP and send it back after the modifications are made), and so far this looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read the file header, which contains the compressed size field, and that would be the perfect way to read file data 1 by its length. But it actually isn't. This field may contain '0' or '0xFFFFFFFF', and those values don't describe the actual length. In that case we have to read the file data without any information about its length. But how?..
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use zlib for the compression itself anyway. So if something useful is described there, I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as
it comes off the socket, rather than reading the complete zip file into
memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode, the latter for zip files larger than 4Gig.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflated (8), use zlib for compression/decompression. You also need to support compression method stored (0). It gets used for very small files where compression isn't appropriate.
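The simple case described above (sizes present in the local header, method 0 or 8) might look roughly like this in Node.js. readEntry is a hypothetical helper; it deliberately rejects entries with general purpose bit 3 set or a size of 0xFFFFFFFF, since those need the data descriptor or Zip64 handling that the answer suggests postponing. Zip stores raw deflate data, so zlib.inflateRawSync is used rather than inflateSync. No CRC check is done in this sketch.

```javascript
const zlib = require('zlib');

// Read one entry starting at `offset`; returns its name, data, and the
// offset of whatever follows (the next local header or the central directory).
function readEntry(zip, offset) {
  if (zip.readUInt32LE(offset) !== 0x04034b50) {
    throw new Error('not a local file header');
  }
  const flags = zip.readUInt16LE(offset + 6);
  const method = zip.readUInt16LE(offset + 8);
  const compressedSize = zip.readUInt32LE(offset + 18);
  const nameLength = zip.readUInt16LE(offset + 26);
  const extraLength = zip.readUInt16LE(offset + 28);

  if ((flags & 0x0008) !== 0 || compressedSize === 0xffffffff) {
    throw new Error('streamed or Zip64 entry: not handled by this sketch');
  }

  const nameStart = offset + 30;
  const dataStart = nameStart + nameLength + extraLength;
  const raw = zip.subarray(dataStart, dataStart + compressedSize);

  let data;
  if (method === 0) {
    data = Buffer.from(raw);          // stored: bytes are kept as-is
  } else if (method === 8) {
    data = zlib.inflateRawSync(raw);  // deflated: raw deflate, no zlib header
  } else {
    throw new Error(`unsupported compression method ${method}`);
  }

  return {
    name: zip.toString('utf8', nameStart, nameStart + nameLength),
    data,
    next: dataStart + compressedSize,
  };
}
```

Calling readEntry repeatedly with the returned next offset walks the entries in order until the signature no longer matches a local file header.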

How to modify a gzip compressed file

I have a single gzip-compressed file (100GB uncompressed, 40GB compressed). Now I would like to modify some bytes / ranges of bytes. I do NOT want to change the file's size.
For example
Bytes 8 + 10 and Bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file sizes makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway), but if you split the compressed file into parts so that the parts you want to modify are in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and compressing the entire file.
In your example:
bytes1-7.bin \
bytes8-10.bin \ bytes.zip
bytes11-4999.bin /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.

Compressed file size after deflate

I am using the deflate function in the zlib library to compress a file. How can I determine the size of the compressed file? Is it the element total_out that indicates the size of the compressed file?
If you are using deflate() correctly, then you are accumulating or writing the compressed output, and can add up the number of output bytes yourself. At each call, the amount of output is strm.avail_out before the deflate() call minus strm.avail_out after the call. See zpipe.c for an example of the usage of deflate() and inflate().
You can use strm.total_out for the total size of the compressed output if you know that that size will fit in an unsigned long.
