How to modify a gzip compressed file - linux

I have a single gzip-compressed file (100 GB uncompressed, 40 GB compressed). Now I would like to modify some bytes / ranges of bytes - I DO NOT want to change the file's size.
For example:
bytes 8 - 10 and bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan

Whether you want to change the file size makes no difference, since the resulting gzip stream isn't laid out according to the original byte positions anyway. But if you split the compressed file into parts so that the ranges you want to modify are in isolated chunks, and use a multiple-file archive format instead of the single-stream gzip format, you could update just the changed chunks without decompressing and recompressing the entire file.
In your example:
bytes1-7.bin        \
bytes8-10.bin        \  bytes.zip
bytes11-4999.bin     /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
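The chunking idea above can be sketched with Python's stdlib zipfile module. The range boundaries and member names below just mirror the example and are otherwise arbitrary:

```python
import io
import zipfile

# Each editable byte range becomes its own archive member, so a later
# update only has to recompress that one member. Names/ranges mirror
# the example above and are hypothetical.
RANGES = [("bytes1-7.bin", 0, 7),
          ("bytes8-10.bin", 7, 10),
          ("bytes11-4999.bin", 10, 4999),
          ("bytes5000-40000.bin", 4999, 40000)]

def chunked_zip(data):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, start, end in RANGES:
            z.writestr(name, data[start:end])
    return buf.getvalue()
```

A later edit would then rewrite only the changed members (e.g. `zip bytes.zip bytes8-10.bin` on the command line), leaving the compressed data of the other members untouched.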

In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.

The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
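The decompress/patch/recompress route is mechanical. A minimal sketch with Python's gzip module (a real 100 GB file would need to stream block-by-block instead of holding everything in memory; the function name is made up here):

```python
import gzip

def patch_gzip(path, offset, new_bytes):
    # Decompress fully, overwrite a byte range in place, recompress.
    # The uncompressed size is preserved; the compressed size will
    # generally change, which is unavoidable with a single gzip stream.
    with open(path, "rb") as f:
        data = bytearray(gzip.decompress(f.read()))
    data[offset:offset + len(new_bytes)] = new_bytes
    with open(path, "wb") as f:
        f.write(gzip.compress(bytes(data)))
```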

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archiving format to implement, so I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if one already exists then that will save me having to roll my own packing and unpacking - only packing - which would be nice. Bonus points if it's zip compatible, so that I can try zipping it with compression if possible, and then implement my own without compression otherwise, and not have to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in PowerShell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data, etc.
The header is pure ASCII, so it doesn't require any bit manipulation - you can literally append strings. Once you've written the header, you append the file bytes and pad them with NUL characters until the total is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
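To illustrate how little is involved, here is a sketch in Python following the ustar field offsets from that Wikipedia page. It handles regular files with short names only (no long-name support, no directories):

```python
def tar_header(name, size):
    # 512-byte ustar header: ASCII fields at fixed offsets, numbers in octal.
    hdr = bytearray(512)
    hdr[0:len(name)] = name.encode("ascii")  # file name (max 100 bytes)
    hdr[100:107] = b"0000644"                # mode
    hdr[108:115] = b"0000000"                # uid
    hdr[116:123] = b"0000000"                # gid
    hdr[124:135] = b"%011o" % size           # file size, octal
    hdr[136:147] = b"%011o" % 0              # mtime
    hdr[148:156] = b" " * 8                  # checksum field: spaces while summing
    hdr[156:157] = b"0"                      # typeflag: '0' = regular file
    hdr[257:262] = b"ustar"                  # magic
    hdr[148:155] = b"%06o\0" % sum(hdr)      # checksum = simple byte sum
    return bytes(hdr)

def tar_archive(files):
    # files: list of (name, data-bytes). Data is NUL-padded to 512-byte
    # blocks; the archive ends with two all-zero blocks.
    out = bytearray()
    for name, data in files:
        out += tar_header(name, len(data))
        out += data + b"\0" * (-len(data) % 512)
    out += b"\0" * 1024
    return bytes(out)
```

The only non-string-append step is the checksum, which is just the byte sum of the header with the checksum field counted as spaces.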

Reading ("tailing") the end of a huge (>300GB) gzipped text file

I have a text file which is >300GB in size originally, and gzipped it still has >10GB. (it is a database export which ran for days, and then was aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). This file does not grow any more so I don't need to track changes or appended data a.k.a tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - | less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files, but also streams of data, it should be possible to search for an entry point somewhere in the file where to start uncompressing, without having to read the file header. Right?
No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
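Decoding the whole thing doesn't mean holding the whole thing, though: you can stream-decompress and keep only a tail buffer, so memory stays tiny even if the pass over >10 GB still takes a while. A sketch in Python (function name is made up):

```python
import gzip
from collections import deque

def gzip_tail_lines(path, n=10):
    # Sequentially decompress the stream; only the last n lines are
    # ever kept in memory, courtesy of deque's maxlen.
    with gzip.open(path, "rt", errors="replace") as f:
        return list(deque(f, maxlen=n))
```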
Quick followup on my own question: This is not possible using gzip without hackery (there are patches for gzip which compress in chunks and you can decode each chunk independently).
BUT you can use xz: at the lowest compression level (-0) the CPU load is comparable to gzip, and so is the compression ratio. And xz can actually decompress parts of a compressed file.
I will consider this for the future.

Squashfs check compressed file size

Is there any way to check the final size of a specific file after compression in a squashfs filesystem?
I'm looking through mksquashfs/unsquashfs command line options but I can't find anything.
Using the -info option in mksquashfs only prints the size before the compression.
Thanks
This isn't feasible to do with much granularity, because compression is done at block level, not file level.
A file may be recorded as starting 50 kB into the buffer produced by decompressing block 50, and ending 50 bytes into the decompressed block 52 (ignoring fragments here, which are a separate concern) - but that doesn't let you map back to the position inside the compressed copy of block 50 where that file starts. (You can easily determine the compression ratio for block 51, but you can't easily figure out the ratios for the parts of the file contained in blocks 50 and 52 in our example, because those blocks are shared with other contents.)
So the information isn't exposed because it isn't easily available. This actually makes storage of numerous (similar) small files significantly more efficient, because a single compression context is used for all of them (and decompressing a block to retrieve one file may mean that you've got files next to it already decompressed in memory)... but without potentially-unfounded assumptions (such as assuming that all contents within a block share that block's average ratio) it doesn't help with trying to backtrace how well each individual item compressed, because the items aren't compressed individually in the first place.
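If a rough per-file figure would be enough, one workaround is to compress the file's data yourself in squashfs-sized blocks and accept that the real filesystem (fragments, shared blocks, compressor options) will differ. A hypothetical sketch using zlib, since gzip/zlib is the squashfs default compressor:

```python
import zlib

def approx_squashfs_data_size(data, block_size=131072):
    # Compress each 128 KiB block independently, roughly as squashfs does
    # for data blocks. Ignores fragments, metadata, and compressor options,
    # so treat the result as an estimate only.
    return sum(len(zlib.compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))
```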

using puff.c with deflate .zip

I have a little project with an STM32 where I send a file via UART and store it at a defined address in the flash memory. This works. Now I want to modify this: store a compressed file in the flash and decompress it to a defined address somewhere else in the flash. I use 7zip to compress the file as a .zip with the deflate method. As I understand it, the compressed data in the zip file comes after the file name and the extra field, so I use an offset value in the local int bits(struct state *s, int need) function. After the start of the puff function I get last = 0 and type = 2, which looks OK. But in the while loop in local int dynamic(struct state *s) I get a hard fault. So my questions are:
Is it ok for me to use a .zip file with deflate method?
Is it correct that the deflate data (including the 3 block bits) starts after the extra field?
Regards,
Tobias
Yes, you can use zip to create the file, making sure you use the deflate compression method (the zip format permits other methods), and then puff to decompress the compressed data. However you then have to write some code to find the start of the compressed data, which is not necessarily just a simple offset. You need to decode the local header of the first entry of the zip file, which has variable-length fields. You can find the zip file format documented here.
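Decoding that local header is mechanical; the field layout comes from the zip specification (APPNOTE.TXT). A sketch in Python, assuming the first entry is deflate-compressed and has no data descriptor:

```python
import struct

def deflate_data_offset(zip_bytes):
    # Local file header fields, little-endian: signature, version needed,
    # flags, method, mod time, mod date, crc32, compressed size,
    # uncompressed size, name length, extra-field length.
    (sig, _ver, _flags, method, _t, _d, _crc, _csize, _usize,
     name_len, extra_len) = struct.unpack_from("<IHHHHHIIIHH", zip_bytes, 0)
    assert sig == 0x04034b50          # "PK\x03\x04"
    assert method == 8                # 8 = deflate
    return 30 + name_len + extra_len  # fixed header is 30 bytes
```

The raw deflate stream starting at that offset (including the 3 block-header bits) is exactly what puff() expects, which is why the offset is not a simple constant: both the name and the extra field are variable-length.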
Since you are compressing just one file, it would be simpler and the result smaller to use gzip instead of zip. The header is simpler, and easier to decode, but still has variable-length fields. Better still would be to use zlib to compress to a zlib stream, which has a fixed-length two-byte header, for the most compact and simple-to-process format. zlib is not a utility, but rather a compression and decompression library for which you would write your own code to do the compression.
In all cases you should check the integrity of the decompressed data using the check values in the respective format (zip, gzip, or zlib).
Also you have the option of using inflate from zlib instead of puff. Puff may be best for your embedded application, since it is small both in code size and memory required for decompression. However if speed is important, inflate is a fair bit faster than puff, at the expense of more code and more memory required.

Using sed on a compressed file

I have written a file processing program and now it needs to read from a gzipped file (the .gz may be as large as 2 TB uncompressed).
Is there a sed equivalent for gzipped files (like zcat is for cat)? Otherwise, what would be the best approach to do the following efficiently?
ONE=`zcat filename.gz| sed -n $counts`
$counts: counter to read (line by line)
The above method works, but is quite slow for large file as I need to read each line and perform the matching on certain fields.
Thanks
EDIT
Though not directly helpful, here are a set of zcommands
http://www.cyberciti.biz/tips/decompress-and-expand-text-files.html
Well, you can either have more speed (i.e. use uncompressed files) or more free space (i.e. use compressed files and the pipe you showed)... sorry. Using compressed files will always have an overhead.
If you understand the internal structure of the compression format it is possible that you could write a pattern matcher that can operate on compressed data without fully decompressing it, but instead by simply determining from the compressed data if the pattern would be present in a given piece of decompressed data.
If the pattern has any complexity at all this sounds like quite a complicated project as you'd have to handle cases where the pattern could be satisfied by the combination of output from two (or more) separate pieces of decompression.
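Short of that, the realistic win is to make sure you only decompress once: a single streaming pass that does the matching itself, instead of re-running zcat | sed per counter value. A sketch (the regex stands in for whatever field matching you need):

```python
import gzip
import re

def matching_line_numbers(path, pattern):
    # One sequential decompression pass; lines are matched as they are
    # produced, so nothing is stored and the archive is read only once.
    rx = re.compile(pattern)
    with gzip.open(path, "rt") as f:
        return [i for i, line in enumerate(f, 1) if rx.search(line)]
```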
