So, getting right to the point: for a script I'm making, I need a custom function that takes compressed data from STDIN and pipes the uncompressed data to STDOUT, regardless of the type of compression.
Example:
blah blah decryption stuff | custom_uncompress | other_program
With gzip I could do "gzip -d -c", or for lzo "lzop -d -c", but I don't know which compression was used and I cannot read the magic number from the file because it's encrypted.
As others have already noted in the comments, it is impossible to uncompress data if you don't even know what compressed format it is in. The best you can do is capture the first couple of bytes from the data stream and "guess" the format from among a selection of common compression formats. If the original data was compressed with a method this magic-number check doesn't cover, you're out of luck.
This question is too unreasonable to have a proper answer.
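That said, here is a hedged sketch of the guess-by-magic-number approach described above, assuming the stream starts with one of a few well-known signatures (the format list and the function body are illustrative, not exhaustive):

custom_uncompress() {
    local magic decomp
    # dd reads exactly 6 bytes, so the rest of the stream is left untouched
    magic=$(dd bs=1 count=6 2>/dev/null | xxd -p)
    case "$magic" in
        1f8b*)        decomp="gzip -d -c"  ;;  # gzip
        425a68*)      decomp="bzip2 -d -c" ;;  # bzip2
        fd377a585a00) decomp="xz -d -c"    ;;  # xz
        894c5a4f*)    decomp="lzop -d -c"  ;;  # lzo (lzop)
        28b52ffd*)    decomp="zstd -d -c"  ;;  # zstd
        *) echo "custom_uncompress: unknown format" >&2; return 1 ;;
    esac
    # put the sniffed bytes back in front of the remaining stream
    { printf '%s' "$magic" | xxd -r -p; cat; } | $decomp
}

It would then be used exactly as in the question: blah blah decryption stuff | custom_uncompress | other_program.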
I have a text file which is >300GB in size originally, and gzipped it still has >10GB. (it is a database export which ran for days, and then was aborted, and I want to know the timestamp of the last exported entry so I can resume the export.)
I am interested in the last few lines of this text file, preferably without having to unzip the whole 300GB (even into memory). This file does not grow any more, so I don't need to track changes or appended data, a.k.a. tail -f.
Is there a way to gunzip only the last part of the file?
tail --bytes=10000000 /mnt/myfile.db.gz | gunzip - | less
does not work (it returns stdin: not in gzip format). Since gzip can compress not just files but also streams of data, it should be possible to find an entry point somewhere in the file from which to start uncompressing, without having to read the file header. Right?
No, not right. Unless the gzip stream was specially generated to allow random access, the only way to decode the last few lines is to decode the whole thing.
Quick followup on my own question: this is not possible with gzip without hackery (there are patches for gzip that compress in chunks, so each chunk can be decoded independently).
BUT you can use xz: at the lowest compression setting (-0), the CPU load and the compression ratio are comparable to gzip. And xz can actually decompress parts of a compressed file.
I will consider this for the future.
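For reference, a sketch of both halves of that; the file names and the 64 MiB block size are just examples:

zcat /mnt/myfile.db.gz | tail -n 5                          # gzip: no random access, decode everything, keep the tail
xz -0 -T0 --block-size=64MiB -c myfile.db > myfile.db.xz    # xz: independent blocks enable partial decompression
xz --list myfile.db.xz                                      # shows the blocks that can be decoded separately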
q1) Can I check: if I do a
gzip -c file | encrypt (some parameters)
a) does gzip print its output line by line and pipe it to the encrypt function, or
b) does gzip run to completion first, and the output is then piped all at once to the encrypt function?
====================================================
q2) Will performing gzip | encrypt have any performance advantage over running gzip first, then encrypt?
Regards,
Noob
Gzip is a streaming compressor/decompressor. So (for large enough inputs) the compressor/decompressor starts writing output before it has seen the whole input.
That's one of the reasons gzip compression is used for HTTP compression. The sender can compress while it's still generating content; the recipient can work on decompressing the first part of the content, while still receiving the rest.
Gzip does not work "line-by-line", because it doesn't know what a line is. But it does work "chunk-by-chunk", where the compressor defines the size of the chunk.
"Performance" is too vague a word, and too complex an area to give a yes or no answer.
With gzip -c file | encrypt, for a large enough file, you will see encrypt and gzip working concurrently. That is, encrypt will be encrypting the first compressed block before gzip has compressed the last chunk of file.
The size of a pipe buffer is implementation dependent. Under SunOS it's 4 kB; that is, gunzip < file.gz | encrypt will move data in 4 kB chunks. Again, it depends on the OS; Cygwin might behave completely differently.
I should add that this is in man 7 pipe. Search for PIPE_BUF.
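One way to watch the concurrency for yourself is to put pv in the middle of the pipe; here openssl is purely a stand-in for whatever encrypt actually is:

gzip -c file | pv | openssl enc -aes-256-cbc -pbkdf2 -pass pass:example -out file.gz.enc

pv reports throughput while data moves through the pipe, so you can see the encryption stage consuming compressed output while gzip is still reading the input file.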
I'm trying to understand how a named pipe behaves in terms of performance. Say I have a large file I am decompressing that I want to write to a named pipe (/tmp/data):
gzip --stdout -d data.gz > /tmp/data
and then I sometime later run a program that reads from the pipe:
wc -l /tmp/data
When does gzip actually decompress the data, when I run the first command, or when I run the second and the reader attaches to the pipe? If the former, is the data stored on disk or in memory?
Pipes (named or otherwise) have only a very small buffer if any -- so if nothing is reading, then nothing (or very little) can be written.
In your example, gzip will do very little until wc is run, because before that point its efforts to write output will block. Out-of-the-box there is no nontrivial buffer either on-disk or in-memory, though tools exist which will implement such a buffer for you, should you want one -- see pv with its -B argument, or the no-longer-maintained (and, sadly, removed from Debian by folks who didn't understand its function) bfr.
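A sketch of the whole sequence, including the pv buffer mentioned above (the 100 MiB size is only an example):

mkfifo /tmp/data                         # the fifo has to exist first
gzip --stdout -d data.gz > /tmp/data &   # blocks almost immediately: no reader yet
wc -l /tmp/data                          # now gzip actually makes progress

# alternatively, give the writer an in-memory buffer so it can run ahead of the reader:
gzip --stdout -d data.gz | pv -q -B 100m > /tmp/data &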
I'm trying to understand how the gzip's "-c" option works.
gzip -c
Usage example:
gzip -c ${micro} >> ${macro}.gz
So this will concatenate micro into macro.gz, but what's the workflow?
Will it first temporarily gunzip macro.gz, append micro, and then gzip it again?
It's important for me to understand this: I have to run some jobs where I don't have a lot of eden space available, and therefore everything has to stay gzipped and never be decompressed.
Thx
First, if the data will never be decompressed, then you can send it to /dev/null instead.
To answer your question, no, it will not gunzip macro.gz. It will simply append a gzip stream to macro.gz. Per the standard, a concatenation of gzip streams is a valid gzip stream, so gunzipping that concatenation will give you the concatenation of the uncompressed inputs. That is, if in fact you do want to decompress it someday.
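A tiny demonstration of that property, with throwaway file contents:

echo "first part"  | gzip -c >> macro.gz   # appends one gzip member
echo "second part" | gzip -c >> macro.gz   # appends another, independent member
gunzip -c macro.gz                         # prints both parts, in order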
I have written a file-processing program and now it needs to read from a zipped file (.gz; the unzipped file may get as large as 2 TB).
Is there a sed equivalent for zipped files (like zcat/cat), or otherwise what would be the best approach to do the following efficiently?
ONE=`zcat filename.gz | sed -n $counts`
$counts: counter for the line to read (line by line)
The above method works, but is quite slow for large files, as I need to read each line and perform the matching on certain fields.
Thanks
EDIT
Though not directly helpful, here is a set of z-commands:
http://www.cyberciti.biz/tips/decompress-and-expand-text-files.html
Well, you can either have more speed (i.e. use uncompressed files) or more free space (i.e. use compressed files and the pipe you showed)... sorry. Using compressed files will always have an overhead.
If you understand the internal structure of the compression format it is possible that you could write a pattern matcher that can operate on compressed data without fully decompressing it, but instead by simply determining from the compressed data if the pattern would be present in a given piece of decompressed data.
If the pattern has any complexity at all this sounds like quite a complicated project as you'd have to handle cases where the pattern could be satisfied by the combination of output from two (or more) separate pieces of decompression.
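In practice, the realistic speed-up is usually to decompress once and do all the per-line matching in that single streaming pass, instead of re-running zcat for every line. A sketch, where the field number and pattern are placeholders:

zcat filename.gz | awk '$3 == "PATTERN" { print NR ": " $0 }'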