Reading ("tailing") the end of a huge (>300GB) gzipped text file

I have a text file that is over 300GB uncompressed and still over 10GB gzipped. (It is a database export that ran for days before it was aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). The file no longer grows, so I don't need to track changes or appended data, a.k.a. tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - |less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files, but also streams of data, it should be possible to search for an entry point somewhere in the file where to start uncompressing, without having to read the file header. Right?

No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
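If decoding everything is unavoidable, it at least does not have to be held in memory or written to disk. A minimal sketch in Python, assuming the export is UTF-8 text; the function name tail_gzip is made up for illustration. It streams through the whole file once and keeps only the last n lines:

```python
import gzip
from collections import deque


def tail_gzip(path, n=10):
    """Return the last n lines of a gzipped text file.

    The whole stream is still decompressed, but at most n lines
    are held in memory at any time.
    """
    last = deque(maxlen=n)
    try:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                last.append(line.rstrip("\n"))
    except EOFError:
        # An aborted export may leave a truncated gzip stream;
        # keep the lines that were decoded before the break.
        pass
    return list(last)
```

This still costs one full decompression pass in CPU time; it only avoids the 300GB of storage.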

Quick followup on my own question: This is not possible using gzip without hackery (there are patches for gzip which compress in chunks, so that you can decode each chunk independently).
BUT you can use xz: at the lowest compression level (-0) both the CPU load and the compression ratio are comparable to gzip. And xz can actually decompress parts of a compressed file (provided the file was written as multiple blocks).
I will consider this for the future.
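The "compress in chunks" idea does not strictly need a patched gzip, because several gzip members concatenated together still form a valid gzip stream. A rough sketch in Python, assuming you control the export and can record where each member starts; the function names and the lines_per_member value are invented for illustration:

```python
import gzip


def compress_in_members(lines, lines_per_member=1000):
    """Compress lines as independent gzip members.

    Returns the concatenated bytes plus the byte offset at which
    each member starts (a tiny random-access index).
    """
    blob = bytearray()
    offsets = []
    for start in range(0, len(lines), lines_per_member):
        offsets.append(len(blob))
        chunk = "".join(lines[start:start + lines_per_member])
        blob += gzip.compress(chunk.encode("utf-8"))
    return bytes(blob), offsets


def read_last_member(blob, offsets):
    """Decode only the final member, skipping everything before it."""
    return gzip.decompress(blob[offsets[-1]:]).decode("utf-8")
```

The result can still be read by an ordinary gunzip as one file. The price is a somewhat worse compression ratio, since each member starts with an empty dictionary.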

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archiving format to implement, so that I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if a suitable format already exists, I would only have to write the packing side and not the unpacking side, which would be nice. Bonus points if it's zip compatible, so that I can zip with compression where zip is available and fall back to my own uncompressed implementation otherwise, without having to worry about which one was used on the other side.

The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in powershell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data etc.
The header is pure ASCII, so it doesn't require any bit manipulation - you can literally append strings. Once you've written the header, you append the file bytes and pad them with NUL characters until the total is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
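To give an idea of how little is involved, here is a minimal sketch of a ustar writer in Python (not the PowerShell implementation mentioned above). It assumes regular files only, names of at most 100 bytes, and fixed mode/owner values; tar_header and make_tar are names chosen for this example. The archive is closed with two all-zero 512-byte blocks.

```python
BLOCK = 512


def tar_header(name, size, mtime=0):
    """Build one 512-byte ustar header for a regular file."""
    raw_name = name.encode("utf-8")
    if len(raw_name) > 100:
        raise ValueError("this sketch only supports names up to 100 bytes")

    def octal(value, width):
        # Numeric fields are zero-padded octal text ending in NUL.
        digits = width - 1
        return f"{value:0{digits}o}".encode("ascii") + b"\0"

    header = b"".join([
        raw_name.ljust(100, b"\0"),  # file name
        octal(0o644, 8),             # mode (assumed)
        octal(0, 8),                 # uid (assumed)
        octal(0, 8),                 # gid (assumed)
        octal(size, 12),             # file size in bytes
        octal(mtime, 12),            # modification time
        b" " * 8,                    # checksum placeholder (spaces)
        b"0",                        # type flag: regular file
        b"\0" * 100,                 # link name (unused)
        b"ustar\0",                  # magic
        b"00",                       # version
        b"\0" * 32,                  # owner user name
        b"\0" * 32,                  # owner group name
        octal(0, 8),                 # device major
        octal(0, 8),                 # device minor
        b"\0" * 155,                 # name prefix (unused)
    ]).ljust(BLOCK, b"\0")
    # Checksum: sum of all header bytes with the checksum field as spaces.
    checksum = f"{sum(header):06o}".encode("ascii") + b"\0 "
    return header[:148] + checksum + header[156:]


def make_tar(files):
    """Pack a {name: bytes} mapping into an uncompressed tar archive."""
    out = bytearray()
    for name, data in files.items():
        out += tar_header(name, len(data))
        out += data
        out += b"\0" * (-len(data) % BLOCK)  # pad data to a full block
    out += b"\0" * (2 * BLOCK)               # end-of-archive marker
    return bytes(out)
```

Directories need no entries of their own for most extractors: a member named dir1/file1 is enough for tar to create dir1 on extraction.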

Custom uncompress function BASH

So, getting right to the point: for a script I'm making, I need a custom function that takes compressed data from STDIN and pipes the uncompressed data to STDOUT, regardless of the type of compression.
Example:
blah blah decryption stuff | custom_uncompress | other_program
With gzip I could do: "gzip -d -c" or for lzo "lzop -d -c" but I don't know what compression it has and cannot read the magic number from the file because it's encrypted.

As others have already noted in the comments, it is impossible to uncompress data if one doesn't even know which compressed format it is in. The best one can do is capture the first few bytes of the data stream and "guess" among a selection of common compression formats. If the original data was compressed with a method this magic doesn't handle, you're out of luck.
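Such a guess could look roughly like the following Python sketch. It assumes the function sits after the decryption step, so the magic bytes are readable again, and it only knows the three formats in Python's standard library (gzip, bzip2, xz); lzop or anything else would need an extra entry and an external decoder. The name open_uncompressed is made up here.

```python
import bz2
import gzip
import io
import lzma

# Leading magic bytes of each format and a matching stdlib opener.
MAGIC = [
    (b"\x1f\x8b", gzip.open),
    (b"BZh", bz2.open),
    (b"\xfd7zXZ\x00", lzma.open),
]


def open_uncompressed(stream):
    """Return a file object that yields the decompressed bytes of stream.

    The format is guessed from the first bytes without consuming them.
    Raises ValueError when no known magic number matches.
    """
    buffered = stream if hasattr(stream, "peek") else io.BufferedReader(stream)
    # peek() may return fewer bytes than asked for on a slow pipe;
    # a sturdier version would retry until enough bytes have arrived.
    head = buffered.peek(8)[:8]
    for magic, opener in MAGIC:
        if head.startswith(magic):
            return opener(buffered, "rb")
    raise ValueError("unrecognised compression format")
```

In a pipeline this would be called as open_uncompressed(sys.stdin.buffer), with the result copied to sys.stdout.buffer.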
This question is too unreasonable to have a proper answer.

gzip and pipe to output (performance consideration)

Q1) If I run
gzip -c file | encrypt (some parameters)
a) does gzip print its output line by line and pipe it to the encrypt function, or
b) does gzip run to completion first, with the output then piped to the encrypt function all at once?
====================================================
Q2) Does gzip | encrypt perform any better than running gzip first and then encrypt?
Regards,
Noob

Gzip is a streaming compressor/decompressor. So (for large enough inputs) the compressor/decompressor starts writing output before it has seen the whole input.
That's one of the reasons gzip compression is used for HTTP compression. The sender can compress while it's still generating content; the recipient can work on decompressing the first part of the content, while still receiving the rest.
Gzip does not work "line-by-line", because it doesn't know what a line is. But it does work "chunk-by-chunk", where the compressor defines the size of the chunk.
"Performance" is too vague a word, and too complex an area to give a yes or no answer.
With gzip -c file | encrypt, for a large enough file, you will see encrypt and gzip working concurrently. That is, encrypt will be encrypting the first compressed chunk before gzip compresses the last chunk of the file.
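The streaming behaviour is easy to observe with zlib, the library behind gzip. A small sketch in Python, where stream_gzip is a name invented for this example: compressed pieces are handed on as soon as the compressor emits them, long before the input is exhausted.

```python
import zlib


def stream_gzip(chunks):
    """Yield gzip-compressed pieces as soon as they become available.

    chunks is any iterable of bytes; nothing requires the whole input
    to exist before the first piece of output is produced.
    """
    comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:                        # the compressor may buffer internally
            yield out
    yield comp.flush()                 # remaining data plus the gzip trailer
```

A consumer such as an encryption step can pull from this generator piece by piece, which is what the shell pipe does between the two processes.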

The size of a pipe buffer is implementation dependent. Under SunOS, it's 4kB. That is, gunzip < file.gz | encrypt will move in 4kB chunks. Again, it depends on the OS; Cygwin might behave completely differently.
I should add that this is in man 7 pipe. Search for PIPE_BUF.
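For what it's worth, the value can be queried at run time. A short Python sketch for POSIX systems, with a helper name chosen for illustration. Note that PIPE_BUF, as man 7 pipe describes it, is the largest write guaranteed to be atomic; the pipe's total capacity can be larger than that.

```python
import os


def pipe_buf_bytes():
    """Return PIPE_BUF for a freshly created pipe (POSIX systems only)."""
    read_fd, write_fd = os.pipe()
    try:
        return os.fpathconf(write_fd, "PC_PIPE_BUF")
    finally:
        os.close(read_fd)
        os.close(write_fd)
```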

How to modify a gzip compressed file

I have a single gzip-compressed file (100GB uncompressed, 40GB compressed). Now I would like to modify some bytes / ranges of bytes. I DO NOT want to change the file's size.
For example:
Bytes 8 + 10 and Bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan

Whether you want to change the file size makes no difference, since the resulting gzip isn't laid out according to the original byte offsets anyway. However, if you split the data into parts so that the ranges you want to modify sit in isolated chunks, and use a multiple-file compression method instead of the single-file gzip method, you could update just the changed files without decompressing and recompressing everything.
In your example:
bytes1-7.bin         \
bytes8-10.bin         \ bytes.zip
bytes11-4999.bin      /
bytes5000-40000.bin  /
Then you could update bytes8-10.bin and bytes5000-40000.bin without touching the other two. But whether this will take less time is dubious.

In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.

The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
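The "individually compressed pieces" representation can be sketched in a few lines of Python. This is an illustration under assumptions, not a drop-in tool: the chunk size of 4096 bytes and the function names are invented, the chunks live in a plain list rather than in a tar or zip container, and the patched range is assumed to lie inside the existing data.

```python
import zlib

CHUNK = 4096  # uncompressed bytes per chunk (arbitrary choice)


def compress_chunks(data, chunk_size=CHUNK):
    """Compress data as a list of independently decodable chunks."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def patch(chunks, offset, new_bytes, chunk_size=CHUNK):
    """Overwrite bytes in place, starting at an uncompressed offset.

    Only the chunks overlapping the range are decompressed and
    recompressed; their indices are returned.
    """
    end = offset + len(new_bytes)
    touched = []
    for idx in range(offset // chunk_size, (end - 1) // chunk_size + 1):
        plain = bytearray(zlib.decompress(chunks[idx]))
        start = idx * chunk_size
        lo = max(offset, start)
        hi = min(end, start + len(plain))
        plain[lo - start:hi - start] = new_bytes[lo - offset:hi - offset]
        chunks[idx] = zlib.compress(bytes(plain))
        touched.append(idx)
    return touched
```

The uncompressed size stays the same, but the compressed size of a touched chunk usually changes, which is why the chunks have to be stored in a container that tolerates members changing length.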

Using sed on a compressed file

I have written a file processing program, and now it needs to read from a gzipped file (.gz; the unzipped file may get as large as 2TB).
Is there a sed equivalent for zipped files (like zcat for cat), or else what would be the best approach to do the following efficiently?
ONE=`zcat filename.gz| sed -n $counts`
$counts : counter to read(line by line)
The above method works, but is quite slow for large file as I need to read each line and perform the matching on certain fields.
Thanks
EDIT
Though not directly helpful, here are a set of zcommands
http://www.cyberciti.biz/tips/decompress-and-expand-text-files.html

Well, you can either have more speed (i.e. use uncompressed files) or more free space (i.e. use compressed files and the pipe you showed)... sorry. Using compressed files will always have an overhead.
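One overhead that can be avoided is decompressing more than once: if the zcat | sed command is run again for every line number, each run starts decompressing from the beginning of the file. A single pass that does the field matching while streaming might look like this Python sketch; the tab separator, the field index and the function name are assumptions made for illustration.

```python
import gzip


def matching_lines(path, field_index, wanted, sep="\t"):
    """Yield lines whose field at field_index equals wanted.

    The file is decompressed exactly once and never held in memory
    as a whole.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            text = line.rstrip("\n")
            fields = text.split(sep)
            if len(fields) > field_index and fields[field_index] == wanted:
                yield text
```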

If you understand the internal structure of the compression format it is possible that you could write a pattern matcher that can operate on compressed data without fully decompressing it, but instead by simply determining from the compressed data if the pattern would be present in a given piece of decompressed data.
If the pattern has any complexity at all this sounds like quite a complicated project as you'd have to handle cases where the pattern could be satisfied by the combination of output from two (or more) separate pieces of decompression.
