I'm trying to understand how gzip's -c option works.
gzip -c
Usage example:
gzip -c ${micro} >> ${macro}.gz
So this will concatenate micro into macro.gz, but what's the workflow?
Will it first temporarily gunzip macro.gz, append micro, and then gzip again?
It's important for me to understand this: I have to do some jobs where I don't have a lot of spare space available, so everything has to stay gzipped and never be decompressed.
Thanks
First, if the data will never be decompressed, then you can send it to /dev/null instead.
To answer your question, no, it will not gunzip macro.gz. It will simply append a gzip stream to macro.gz. Per the standard, a concatenation of gzip streams is a valid gzip stream, so gunzipping that concatenation will give you the concatenation of the uncompressed inputs. That is, if in fact you do want to decompress it someday.
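You can see this for yourself with a quick experiment (the filenames are just illustrative):

```shell
# Write two independent gzip streams into the same file...
printf 'hello\n' | gzip -c >  combined.gz
printf 'world\n' | gzip -c >> combined.gz

# ...and decompressing the concatenation yields the
# concatenation of the uncompressed inputs.
gzip -dc combined.gz
# prints:
# hello
# world
```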
Related
I have a text file which is >300GB in size originally, and gzipped it still has >10GB. (it is a database export which ran for days, and then was aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). This file does not grow any more so I don't need to track changes or appended data a.k.a tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - | less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files, but also streams of data, it should be possible to search for an entry point somewhere in the file where to start uncompressing, without having to read the file header. Right?
No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
Quick follow-up on my own question: this is not possible with stock gzip without hackery (there are patches for gzip that compress in independent chunks, so each chunk can be decoded on its own).
BUT you can use xz: at the lowest compression preset (-0), the CPU load is comparable to gzip, and so is the compression ratio. And xz can actually decompress parts of a compressed file, provided the archive was written as multiple blocks.
I will consider this for the future.
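One caveat worth spelling out: xz can only seek within an archive if it was written as multiple independent blocks; a default single-block .xz file has the same limitation as gzip. A sketch, where the filename and block size are placeholder choices:

```shell
# Create some test data (stands in for the real database dump).
head -c 1000000 /dev/zero > bigfile

# Fastest preset, and split the output into independently
# decodable blocks so tools can later seek within the archive.
xz -0 --block-size=100KiB bigfile

# Show the block layout; each listed block decodes on its own.
xz --list --verbose bigfile.xz
```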
So, getting right to the point: for a script I'm making, I need a custom function that can take compressed data from STDIN and pipe the uncompressed data to STDOUT, regardless of the type of compression.
Example:
blah blah decryption stuff | custom_uncompress | other_program
With gzip I could do "gzip -d -c", or for lzo "lzop -d -c", but I don't know which compression was used, and I cannot read the magic number from the file itself because it's encrypted.
As others have already noted in the comments, it is impossible to decompress data if you don't even know which compression format it is in. The best you can do is capture the first few bytes of the stream and "guess" among a selection of common compression formats by their magic numbers. If the original data was compressed with a method your detection doesn't handle, you're out of luck.
In that sense the question has no fully general answer, only a best-effort one.
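For what it's worth, here is a best-effort sketch of such a function. It buffers the stream to a temporary file (giving up streaming for simplicity) and dispatches on well-known magic numbers; any format not listed simply fails:

```shell
custom_uncompress() {
    # Buffer stdin so we can sniff the magic number first.
    local tmp rc
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # First two bytes, as lowercase hex.
    case "$(head -c 2 "$tmp" | od -An -tx1 | tr -d ' \n')" in
        1f8b) gzip  -dc "$tmp" ;;   # gzip
        425a) bzip2 -dc "$tmp" ;;   # bzip2 ("BZ")
        fd37) xz    -dc "$tmp" ;;   # xz
        894c) lzop  -dc "$tmp" ;;   # lzop
        *)    echo "unknown compression format" >&2; false ;;
    esac
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

Used exactly as in the question: `blah blah decryption stuff | custom_uncompress | other_program`.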
q1) If I do
gzip -c file | encrypt (some parameters)
a) does gzip emit its output incrementally, piping it to encrypt as it is produced, or
b) does gzip run to completion first, with the whole output then piped to encrypt at once?
====================================================
q2) Will performing gzip | encrypt have any performance advantage over running gzip first, then encrypt?
Regards,
Noob
Gzip is a streaming compressor/decompressor. So (for large enough inputs) the compressor/decompressor starts writing output before it has seen the whole input.
That's one of the reasons gzip compression is used for HTTP compression. The sender can compress while it's still generating content; the recipient can work on decompressing the first part of the content, while still receiving the rest.
Gzip does not work "line-by-line", because it doesn't know what a line is. But it does work "chunk-by-chunk", where the compressor defines the size of the chunk.
"Performance" is too vague a word, and too complex an area to give a yes or no answer.
With gzip -c file | encrypt, for a large enough file, you will see encrypt and gzip working concurrently. That is, encrypt will be encrypting the first compressed chunk before gzip has compressed the last chunk of file.
The size of a pipe buffer is implementation-dependent. Under SunOS it's 4 kB; that is, gunzip < file.gz | encrypt will move data in 4 kB chunks. Again, it depends on the OS; Cygwin might behave completely differently.
I should add that this is in man 7 pipe. Search for PIPE_BUF.
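To make q1 concrete, here is a hypothetical version of the pipeline, with openssl enc standing in for the question's unspecified encrypt command (the cipher, passphrase, and filenames are arbitrary choices). The shell starts both processes at once, and data flows between them in pipe-buffer-sized chunks:

```shell
# Some test input (stands in for the question's 'file').
head -c 1000000 /dev/urandom > file

# gzip and openssl run concurrently: openssl encrypts early
# compressed chunks while gzip is still reading later input.
gzip -c file | openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -out file.gz.enc
```

So the answer to q2 is a qualified yes: on a multi-core machine the pipeline overlaps the two jobs, and it also avoids writing the intermediate .gz file to disk.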
I'm trying to understand how a named pipe behaves in terms of performance. Say I have a large file I am decompressing that I want to write to a named pipe (/tmp/data):
gzip --stdout -d data.gz > /tmp/data
and then I sometime later run a program that reads from the pipe:
wc -l /tmp/data
When does gzip actually decompress the data, when I run the first command, or when I run the second and the reader attaches to the pipe? If the former, is the data stored on disk or in memory?
Pipes (named or otherwise) have only a very small buffer if any -- so if nothing is reading, then nothing (or very little) can be written.
In your example, gzip will do very little until wc is run, because until then its attempts to write output will block. Out of the box there is no nontrivial buffer, either on disk or in memory, though tools exist that will implement such a buffer for you, should you want one: see pv with its -B argument, or the no-longer-maintained (and, sadly, removed from Debian by folks who didn't understand its function) bfr.
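A small demonstration of the blocking behavior (filenames illustrative; note that /tmp/data must actually be created as a FIFO with mkfifo first):

```shell
# Set up a FIFO and some compressed data to play with.
rm -f /tmp/data
mkfifo /tmp/data
printf 'one\ntwo\nthree\n' | gzip -c > data.gz

# The writer blocks: opening a FIFO for writing waits until a
# reader opens the other end, so gzip does almost no work yet.
gzip -dc data.gz > /tmp/data &

# Only now does data flow; gzip decompresses as wc consumes it.
wc -l < /tmp/data

wait
rm -f /tmp/data
```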
How can I corrupt the header of a tar.gz for testing purposes, so that when the application tries to unzip it, it fails?
Thanks
It's awfully simple to create a file that gzip won't recognize:
dd if=/dev/urandom bs=1024 count=1 of=bad.tar.gz
While of course it's possible to create a valid gzip file with /dev/urandom, it's about as likely as being struck by lightning. Under a clear sky.
Get yourself a hex editor; a previous question recommends bless.
You can try changing bits arbitrarily, but if you want to be more surgical, take a look at the gzip spec (RFC 1952), which tells you exactly which bits to flip in the outer gzip header. Or try the tar specification.
There are checksums embedded in gzip files; those may be a good first choice to change:
If FHCRC is set, a CRC16 for the gzip header is present, immediately
before the compressed data. The CRC16 consists of the two least
significant bytes of the CRC32 for all bytes of the gzip header up to
and not including the CRC16
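Putting the spec to work, here is one minimal, surgical corruption: overwrite the second byte of the gzip magic number (0x8b, at offset 1) in place. Everything below is a sketch with throwaway filenames:

```shell
# Build a small valid archive to break.
echo hello > f
tar czf f.tar.gz f

# Clobber the second magic byte, in place (notrunc keeps the rest).
printf '\000' | dd of=f.tar.gz bs=1 seek=1 count=1 conv=notrunc

# gzip now rejects the file outright.
gzip -t f.tar.gz || echo "corrupted: gzip refuses to unzip it"
```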