All the tools I know of wrap the DEFLATE-compressed stream in some file format with headers, file names, additional checksums, etc.
Is it possible to directly create a file containing only the DEFLATE-compressed stream, as described in RFC 1951, using standard Linux tools and bash?
I've seen some development tools which can do it, but with normal tools it is not (immediately) possible, because the raw compressed stream is generally useless.
Otherwise, on Linux, gzip --no-name produces a compressed stream with a fixed-size 10-byte header. You can trim it with dd, e.g.:
cat something | gzip --no-name | \
( dd of=/dev/null bs=1 count=10; cat > gzip-without-header )
All that's left is to strip the last 8 bytes (CRC-32 and uncompressed size) from the output file:
dd if=gzip-without-header of=gzip-without-anything \
bs=1 count=$(( $(stat -c '%s' gzip-without-header) - 8 ))
P.S. GZip file format is defined in RFC1952.
A slightly better solution (without writing to a temporary file) would be:
cat something | gzip --no-name | tail --bytes=+11 | head --bytes=-8 > gzip-without-anything
gzip -nc file or gzip < file will produce on stdout a deflate stream with a 10-byte header and an 8-byte trailer. You can delete the header and trailer using dd, though you'll need to see how big the output is in order to give the right value to dd to cut the end off.
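The trimming described above can be wrapped in a small function that works out the output size by itself. This is only a minimal sketch, assuming a userland where od, tail -c and head -c are available; the name gzip_to_raw_deflate is made up for illustration. It also checks that the FLG byte of the header is 0, because the header is only exactly 10 bytes when no optional fields (such as a file name) are present.

```shell
# Hypothetical helper: read data on stdin, write the raw DEFLATE stream on stdout.
gzip_to_raw_deflate() {
    tmp=$(mktemp) || return 1
    gzip -nc > "$tmp"

    # Byte 4 of the gzip header is FLG; 0 means no optional fields follow,
    # so the header is exactly 10 bytes long.
    flg=$(od -An -tu1 -j3 -N1 "$tmp" | tr -d '[:space:]')
    if [ "$flg" != "0" ]; then
        echo "unexpected gzip header flags: $flg" >&2
        rm -f "$tmp"
        return 1
    fi

    size=$(wc -c < "$tmp")
    # Skip the 10-byte header, then keep everything except the 8-byte trailer.
    tail -c +11 "$tmp" | head -c "$((size - 18))"
    rm -f "$tmp"
}
```

It would be used as gzip_to_raw_deflate < something > something.deflate. Unlike the one-line pipeline, this does write a temporary file, in exchange for not depending on a negative byte count for head.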
I have to process a file using my Linux machine.
When I try to write my output to a csv file then gzip it in the same line of script:
processing > output.csv | gzip -f output.csv
I get an 'unexpected end of file' error. Even when I download the file using the Linux machine I get the same error.
When I do not gzip via terminal (or in a single line) everything works fine.
Why does it fail like this when the commands are all in a single line?
You should remove > output.csv.
For the same stream (stdout) you can either use a pipe (|) or redirect to a file, not both.
You can redirect errors from stderr to a file with 2>errors.txt; otherwise they are displayed on screen.
When you redirect a process' IO with the > operator, its output cannot be used by a pipe afterwards (because there's no "output" anymore to be piped). You have two options:
processing > output.csv &&
gzip output.csv
This writes the uncompressed output of your program to the file output.csv and then, in a second step, gzips this file, replacing it with output.csv.gz. Depending on the amount of data, this might not be feasible (at its peak, the storage requirement is the full uncompressed output plus the compressed size).
processing | gzip > output.csv.gz
This will compress the output of your process in-line and write it directly to the output file, without storing the uncompressed output in an intermediate file.
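To check that the in-line variant produced a complete archive, gzip -t can test it without unpacking anything. A small sketch follows; the processing function is only a stand-in for the real command from the question, and the temporary directory is an assumption made for the demo.

```shell
# Stand-in for the real command; in the question, "processing" is whatever
# program produces the CSV on stdout.
processing() {
    printf 'id,value\n1,foo\n2,bar\n'
}

workdir=$(mktemp -d)

# Compress in-line; no uncompressed intermediate file is written.
processing | gzip > "$workdir/output.csv.gz"

# gzip -t verifies the integrity of the archive without extracting it.
gzip -t "$workdir/output.csv.gz" && echo "archive OK"
```

If gzip -t reports an unexpected end of file here, the archive was truncated while it was being written or copied.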
I am trying to quickly estimate the number of lines in gzipped files. I do this by checking the uncompressed size of the file, sampling lines from the beginning of the file with zcat filename | head -n 100 (for instance), and dividing the uncompressed size by the average line size of this sample of 100 lines.
The problem is that the data I'm getting from gzip -l is invalid. Mostly the uncompressed size seems too small, in some cases producing negative compression ratios. For example, in one case the compressed file is 1.8 GB and gzip -l lists the uncompressed size as 0.7 GB, when it is actually 9 GB when decompressed. I tried decompressing and recompressing, but still get the same uncompressed size.
gzip 1.6 on Ubuntu 18.04.3
Below is the part of the gzip spec (RFC 1952) where it defines how the uncompressed size is stored in the gzip file.
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
You are working with a gzip archive whose uncompressed size is larger than 2^32 bytes (4 GiB), so the uncompressed size reported by gzip -l is always going to be incorrect.
Note that this design limitation in the gzip file format doesn't cause any problems when uncompressing the archive. The only impact is on gzip -l or gunzip -l.
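To see the stored field for yourself, the ISIZE value can be read from the last four bytes of the file (little-endian, per RFC 1952). This is a rough sketch, assuming a single-member gzip file and a shell with 64-bit arithmetic; gzip_isize and gzip_real_size are hypothetical helper names. The reliable way to get the real size is to decompress and count the bytes, which is slow but stays correct for data over 4 GiB.

```shell
# Hypothetical helper: print the ISIZE field (uncompressed size modulo 2^32).
gzip_isize() {
    # Read the last 4 bytes as unsigned decimals and combine them little-endian.
    set -- $(tail -c 4 "$1" | od -An -tu1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Hypothetical helper: print the true uncompressed size by decompressing.
gzip_real_size() {
    gzip -dc "$1" | wc -c
}
```

For a file like the one in the question, gzip_isize would print the wrapped-around value, while gzip_real_size would print the full size after reading the whole archive.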
I have a big gzip file that I need to convert to bzip2.
The obvious way is to 1) decompress the file in memory, 2) write it to disk, 3) read the file again, compress it with bzip2, and write the result to disk.
Now I'm wondering whether it's possible to avoid the middle phase (writing to disk), do the decompression and compression in memory, and write only the final result to disk.
You could decompress to stdout and then pipe to bzip2, something like this should work:
gunzip -c file.gz | bzip2 > file.bz2
I want to create an iso from an external hard drive.
I used this command:
sudo dd if=/dev/sdb of=usb-image.iso
It works; however, the disk is large (700 GB), and I don't have space on my laptop to store that much.
I was thinking about creating multiple ISO files (5 GB each, for example) so that I can store some parts on other drives.
Any help?
Thanks
I'd use the split program to split the output from dd into different files. You can adjust the split size as you see fit (look at the 5000m argument). Note the trailing dot in the prefix: split appends its suffix directly to the prefix.
dd if=/dev/sdb | split -b 5000m - /tmp/output.gz.
This will yield files like /tmp/output.gz.aa, /tmp/output.gz.ab, etc.
Additionally, to save more space, you can gzip your image midstream, like this (the prefix ends in a dot so that the pieces are named /tmp/output.gz.aa and so on):
dd if=/dev/sdb | gzip -c | split -b 5000m - /tmp/output.gz.
Later, when you want to restore, do this:
cat /tmp/output.gz.* | gzip -dc | dd of=/dev/sdb
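Before trusting this on a 700 GB disk, the same pipeline can be tried on an ordinary file. The sketch below assumes a small file of random data as a stand-in for /dev/sdb and a temporary directory for the pieces; the sizes are arbitrary.

```shell
workdir=$(mktemp -d)

# Stand-in for the disk: 200000 bytes of random (hence incompressible) data.
head -c 200000 /dev/urandom > "$workdir/disk.img"

# Compress midstream and split into 64 KiB pieces named output.gz.aa, .ab, ...
gzip -c "$workdir/disk.img" | split -b 65536 - "$workdir/output.gz."
ls "$workdir"

# Reassemble in name order, decompress, and compare with the original.
cat "$workdir"/output.gz.* | gzip -dc | cmp - "$workdir/disk.img" \
    && echo "round trip OK"
```

The shell expands output.gz.* in sorted order, which matches the order in which split wrote the pieces.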
I have list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate or gzip these files into one gzip file without having to decompress them?
In practice we will use this in a web database (CGI), where the site receives a query from the user, lists all the files matching the query, and presents them to the user as a single batch file.
With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC,
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
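The behaviour described above is easy to check on throwaway data. A minimal sketch, assuming two tiny files in a temporary directory (the file names and contents are made up for the demo):

```shell
workdir=$(mktemp -d)
printf 'first\n'  > "$workdir/file1"
printf 'second\n' > "$workdir/file2"

# Compress each file separately; this leaves file1.gz and file2.gz.
gzip "$workdir/file1" "$workdir/file2"

# Concatenate the compressed files without decompressing them.
cat "$workdir/file1.gz" "$workdir/file2.gz" > "$workdir/allfiles.gz"

# Decompressing yields the contents of both members, back to back.
gzip -dc "$workdir/allfiles.gz"
```

The last command prints the line "first" followed by the line "second", with nothing marking where one member ends and the next begins.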
Since existing tools generally ignore the filename headers for the additional members, it's not easily possible to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP supports some other compression algorithms as well as an option - method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, the same works for files that are already compressed: append file1.gz itself with cat rather than running gzip -c on it again.
Note this part:
gunzip will extract all members at once
So to extract the members individually, you will have to use an additional tool or write one yourself.
However, this is also addressed in man page.
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
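As a sketch of what the man page suggests, the following builds a gzip-compressed tar archive and then extracts a single member from it. The file names and the temporary directory are assumptions for the demo; the options used (-C, -c, -t, -x, -z, -f) are standard in GNU tar.

```shell
workdir=$(mktemp -d)
printf 'one\n' > "$workdir/file1"
printf 'two\n' > "$workdir/file2"

# -C makes tar store the names relative to $workdir; -z compresses with gzip.
tar -C "$workdir" -czf "$workdir/both.tar.gz" file1 file2

# List the members of the archive.
tar -tzf "$workdir/both.tar.gz"

# Extract only file2 into a separate directory.
mkdir "$workdir/out"
tar -C "$workdir/out" -xzf "$workdir/both.tar.gz" file2
```

Because tar records a header per member, file2 can be pulled out by name, which is what plain concatenated gzip members do not offer.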
Just use cat. It is very fast (0.2 seconds for 500 MB for me):
cat *gz > final
mv final final.gz
You can then read the output with zcat to make sure it's pretty:
zcat final.gz
I tried the other answer's 'gzip -c' approach, but I ended up with garbage when using already-gzipped files as input (I guess it compressed them a second time).
PV:
Better yet, if you have it, 'pv' instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.
You can create a tar file of these files and then gzip the tar file to create the new gzip file:
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar