gzip and pipe to output (performance consideration) - linux

q1) Can I check: if I do
gzip -c file | encrypt (some parameters)
a) does gzip write its output line by line and pipe it to the encrypt program as it goes, or
b) does gzip run to completion first, with the whole output then piped at once to the encrypt program?
====================================================
q2) Will performing gzip | encrypt give any better performance than running gzip first and then encrypt?
Regards,
Noob

Gzip is a streaming compressor/decompressor. So (for large enough inputs) the compressor/decompressor starts writing output before it has seen the whole input.
That's one of the reasons gzip compression is used for HTTP compression. The sender can compress while it's still generating content; the recipient can work on decompressing the first part of the content, while still receiving the rest.
Gzip does not work "line-by-line", because it doesn't know what a line is. But it does work "chunk-by-chunk", where the compressor defines the size of the chunk.
"Performance" is too vague a word, and too complex an area to give a yes or no answer.
With gzip -c file | encrypt, for a large enough file, you will see encrypt and gzip working concurrently. That is, encrypt will be encrypting the first compressed chunk before gzip has compressed the last chunk of file.
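A quick way to watch this concurrency, as a sketch: assuming the pv utility is installed (file names illustrative), run
pv -cN raw file | gzip -c | pv -cN gzipped > file.gz
pv prints one throughput meter per stage, and both advance at the same time, which shows gzip emitting compressed output while it is still reading its input.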

The size of a pipe buffer is implementation dependent. Under SunOS it's 4 kB, so gunzip < file.gz | encrypt will move data in 4 kB chunks. Again, it depends on the OS; Cygwin might behave completely differently.
I should add that this is in man 7 pipe. Search for PIPE_BUF.
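On Linux you can check the PIPE_BUF figure from bash; ulimit -p reports it in 512-byte units, so a value of 8 means 4096 bytes:
ulimit -p
Note that PIPE_BUF is the atomic-write guarantee, not the total pipe capacity; on recent Linux kernels the capacity typically defaults to 64 kB, and man 7 pipe describes both.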

Related

Reading ("tailing") the end of a huge (>300GB) gzipped text file

I have a text file which is >300GB in size originally, and gzipped it is still >10GB. (It is a database export which ran for days and was then aborted; I want to know the timestamp of the last exported entry so that I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). The file does not grow any more, so I don't need to track changes or appended data, a.k.a. tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - | less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files but also streams of data, it should be possible to find an entry point somewhere in the file at which to start uncompressing, without having to read from the file header. Right?
No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
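Given that, the practical approach with plain gzip is to stream-decompress the whole file but keep only the tail, so nothing close to 300 GB ever hits disk or memory (path from the question):
gunzip -c /mnt/myfile.db.gz | tail -n 20
This still costs a full pass of decompression CPU time, but tail discards everything outside its window, so memory and disk use stay tiny.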
Quick follow-up on my own question: this is not possible using gzip without hackery (there are patches for gzip which compress in chunks, so that each chunk can be decoded independently).
BUT you can use xz, and at the lowest compression ratio (-0) the CPU load is comparable to gzip, and so is the compression ratio. And xz can actually decompress parts of a compressed file.
I will consider this for the future.
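For reference, a sketch of that xz approach (flag names per xz 5.x; the block size and file name are illustrative). Partial decompression only works if the file was compressed as independent blocks in the first place:
xz -0 --block-size=64MiB export.txt    # writes export.txt.xz as many independent blocks
xz --list --verbose export.txt.xz      # shows each block's offset and sizes
Tools built on liblzma can then use that block index to decode only the region they need, such as the final block.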

Custom uncompress function BASH

So, getting right to the point: for a script I'm making, I need a custom function that takes compressed data on STDIN and pipes the uncompressed data to STDOUT, regardless of the type of compression.
Example:
blah blah decryption stuff | custom_uncompress | other_program
With gzip I could do "gzip -d -c", or for lzo "lzop -d -c", but I don't know what compression was used, and I cannot read the magic number from the file itself because it's encrypted.
As others have already noted in the comments, it is impossible to uncompress data if you don't even know what format it is in. The best you can do is capture the first few bytes of the stream and guess among a selection of common compression formats. If the original data was compressed with a method not covered by that magic-number table, you are out of luck.
As posed, then, the question has no fully general answer.
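That said, here is a sketch of the guess-from-magic-bytes fallback described above. The function name and format list are illustrative, and dd and xxd are assumed to be available; dd with bs=1 is used so that exactly six bytes are consumed from stdin before the stashed bytes plus the rest of the stream are replayed into the matching tool.
custom_uncompress() {
    local hdr
    # Sniff the first six bytes of stdin as hex.
    hdr=$(dd bs=1 count=6 2>/dev/null | xxd -p | tr -d '\n')
    case "$hdr" in
        1f8b*)        { xxd -r -p <<<"$hdr"; cat; } | gzip -dc  ;;  # gzip:  1f 8b
        425a68*)      { xxd -r -p <<<"$hdr"; cat; } | bzip2 -dc ;;  # bzip2: "BZh"
        fd377a585a00) { xxd -r -p <<<"$hdr"; cat; } | xz -dc    ;;  # xz:    fd 37 7a 58 5a 00
        894c5a4f*)    { xxd -r -p <<<"$hdr"; cat; } | lzop -dc  ;;  # lzop:  89 "LZO"
        *) echo "custom_uncompress: unrecognized magic $hdr" >&2; return 1 ;;
    esac
}
Anything not in that case table (or a stream whose leading bytes were mangled) still fails, which is exactly the limit described above.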

How to modify a gzip compressed file

I have a single gzip-compressed file (100 GB uncompressed, 40 GB compressed). Now I would like to modify some bytes / ranges of bytes; I do NOT want to change the file's size.
For example:
bytes 8 - 10 and bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you keep the file size the same makes no difference in itself, since the resulting gzip isn't laid out according to the original file's byte positions anyway. But if you split the compressed file into parts, so that the ranges you want to modify live in isolated chunks, and use a multiple-file archive format instead of the single-stream gzip format, you could update just the changed chunks without decompressing and recompressing the entire file.
In your example:
bytes1-7.bin \
bytes8-10.bin \ bytes.zip
bytes11-4999.bin /
bytes5000-40000.bin /
Then you could update bytes8-10.bin and bytes5000-40000.bin while leaving the other two untouched. Whether this actually saves time is dubious, though.
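A sketch of that layout, using the ranges from the question (bs=1 keeps the byte arithmetic obvious but is slow; file names are illustrative):
dd if=data.bin of=bytes1-7.bin        bs=1 count=7
dd if=data.bin of=bytes8-10.bin       bs=1 skip=7    count=3
dd if=data.bin of=bytes11-4999.bin    bs=1 skip=10   count=4989
dd if=data.bin of=bytes5000-40000.bin bs=1 skip=4999 count=35001
zip bytes.zip bytes1-7.bin bytes8-10.bin bytes11-4999.bin bytes5000-40000.bin
After editing one piece, rerunning zip with just that file replaces only that member:
zip bytes.zip bytes8-10.bin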
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
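A sketch of that latter layout (GNU tar assumed, since --delete works only on uncompressed archives; file names are illustrative):
gzip part1 part2 part3                        # produces part1.gz, part2.gz, part3.gz
tar -cf data.tar part1.gz part2.gz part3.gz   # the wrapper tar stays uncompressed
To change only part2 later:
tar -xf data.tar part2.gz
gunzip part2.gz          # edit part2 here, then recompress
gzip part2
tar --delete -f data.tar part2.gz
tar -rf data.tar part2.gz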

How gzip -c works

I'm trying to understand how gzip's "-c" option works.
gzip -c
Usage example:
gzip -c ${micro} >> ${macro}.gz
So this will concatenate micro into macro.gz, but what's the workflow?
Will it first temporarily gunzip macro.gz, append micro, and then gzip it again?
It's important for me to understand this: I have to do some jobs where I don't have much spare space available, and therefore everything has to stay gzipped and never be decompressed.
Thx
First, if the data will never be decompressed, then you can send it to /dev/null instead.
To answer your question, no, it will not gunzip macro.gz. It will simply append a gzip stream to macro.gz. Per the standard, a concatenation of gzip streams is a valid gzip stream, so gunzipping that concatenation will give you the concatenation of the uncompressed inputs. That is, if in fact you do want to decompress it someday.
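You can see that behavior with a tiny sketch (names illustrative):
echo first  | gzip -c >> macro.gz
echo second | gzip -c >> macro.gz
gunzip -c macro.gz    # prints "first", then "second"
Each >> run appends one complete, self-contained gzip member, and gunzip walks through all of them in order.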

How to corrupt the header of a tar.gz for testing purposes

How can I corrupt the header of a tar.gz for testing purposes, so that when the application tries to unzip it, it fails?
Thanks
It's awfully simple to create a file that gzip won't recognize:
dd if=/dev/urandom bs=1024 count=1 of=bad.tar.gz
While of course it's possible to create a valid gzip file with /dev/urandom, it's about as likely as being struck by lightning. Under a clear sky.
Get yourself a hex editor; a previous question recommends bless.
You can try arbitrarily changing bits, but if you want to be more surgical, take a look at the gzip spec, which can tell you exactly which bits to flip in the outer gzip header. Or try the tar specification.
There are checksums embedded in gzip files; those may be a good first choice to change:
If FHCRC is set, a CRC16 for the gzip header is present, immediately before the compressed data. The CRC16 consists of the two least significant bytes of the CRC32 for all bytes of the gzip header up to and not including the CRC16.
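Putting that together, a surgical sketch (it copies the file first; gzip -t is a cheap way to confirm the corruption is detected):
cp good.tar.gz bad-magic.tar.gz
printf '\x00' | dd of=bad-magic.tar.gz bs=1 count=1 conv=notrunc   # clobber the 1f of the 1f 8b magic
gzip -t bad-magic.tar.gz    # fails with "not in gzip format"
cp good.tar.gz bad-crc.tar.gz
# The last 8 bytes of a gzip stream hold the CRC32 and uncompressed size; stat -c %s is GNU stat.
# (There is a small chance the overwritten byte was already 0xff, in which case nothing changes.)
printf '\xff' | dd of=bad-crc.tar.gz bs=1 seek=$(( $(stat -c %s bad-crc.tar.gz) - 8 )) count=1 conv=notrunc
gzip -t bad-crc.tar.gz      # fails with a CRC error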
