How to compare two tarballs' content - Linux

I want to tell whether two tarball files contain identical files, in terms of file name and file content, not including meta-data like date, user, group.
However, there are some restrictions:
First, I have no control over whether the meta-data is included when the tar file is made; in practice the tar file always contains meta-data, so directly diffing the two tar files doesn't work.
Second, some tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' from within file1/. But usually I cannot afford to untar even one of them.)
Any idea how I can compare the two tar files? It would be better if it could be accomplished with shell scripts. Alternatively, is there any way to get each contained file's checksum without actually untarring the tarball?
Thanks,

Try also pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with a zero code if the packages are unchanged):
pkgdiff PKG-0.tgz PKG-1.tgz

Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum file and store it within the archive itself. Then, when you want to compare two archives, you just extract these checksum files and compare them.
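A minimal sketch of that idea (the directory layout and file names here are just examples):
# when building each archive (names are examples), record a checksum of every file first
( cd mydir && find . -type f -exec md5sum {} + | sort -k 2 ) > CHECKSUMS.md5
tar -czf mydir.tar.gz CHECKSUMS.md5 mydir
# later, compare two archives built this way by extracting only their checksum files
diff <(tar -xOzf a.tar.gz CHECKSUMS.md5) <(tar -xOzf b.tar.gz CHECKSUMS.md5)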
If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences with the contents of the other tar file.
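For example, mirroring the approach from the question (assuming file1.tar has already been extracted into file1/):
cd file1/ && tar --diff -f ../file2.tar   # compares file2.tar against the extracted tree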
One more crude trick, if you are fine with just a comparison of the filenames and their sizes.
Remember, this does not guarantee that the file contents are the same!
Run tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything besides the filename and size columns, and preferably sort the two files too. Then just diff the two lists.
Just remember that this last scheme does not really do a checksum.
Sample tar and output (all files are zero size in this example).
$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/
Command to generate sorted name/size list
$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/
You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.

tarsum is almost what you need. Take its output, run it through sort to get the ordering identical for each archive, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
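A sketch of that pipeline (the script name tarsum.py and its "checksum  member-name" output format are assumptions here):
tarsum.py 1.tar | sort -k 2 > 1.sums   # one checksum line per archive member, sorted by name
tarsum.py 2.tar | sort -k 2 > 2.sums
diff 1.sums 2.sums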

Here is my variant; it checks the Unix permissions too.
It works only if the filenames are shorter than 200 characters.
diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)

EDIT: See the comment by @StéphaneGourichon
I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution that I've implemented outputs the tar to stdout, and pipes it to whichever hash you choose:
tar -xOzf archive.tar.gz | sort | sha1sum
Note that the order of the arguments is important; particularly O which signals to use stdout.
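To compare two archives in one step, the same trick can be applied to both sides (a sketch using the same flags as above):
diff <(tar -xOzf 1.tar.gz | sort | sha1sum) <(tar -xOzf 2.tar.gz | sort | sha1sum)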

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

There is also diffoscope, which is more generic and allows comparing things recursively (including various archive formats).
pip install diffoscope
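Typical usage is just to pass it the two archives, for example:
diffoscope old.tar.gz new.tar.gz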

I propose gtarsum, which I have written in Go; that means it is a standalone executable (no Python or other runtime environment needed).
go get github.com/VonC/gtarsum
It will read a tar file, and:
sort the list of files alphabetically,
compute a SHA256 for each file content,
concatenate those hashes into one giant string
compute the SHA256 of that string
The result is a "global hash" for a tar file, based on the list of files and their content.
It can compare multiple tar files, and return 0 if they are identical, 1 if they are not.

Just throwing this out there since none of the above solutions worked for what I needed.
This function gets the md5 hash of the md5 hashes of all the file-paths matching a given path. If the hashes are the same, the file hierarchy and file lists are the same.
I know it's not as performant as others, but it provides the certainty I needed.
PATH_TO_CHECK="some/path"
for template in $(find build/ -name '*.tar'); do
    tar -xvf $template --to-command=md5sum |
        grep $PATH_TO_CHECK -A 1 |
        grep -v $PATH_TO_CHECK |
        awk '{print $1}' |
        md5sum |
        awk "{print \"$template\",\$1}"
done
*note: An invalid path simply returns nothing.

If you don't need to extract the archives or list the differences, try diff's -q option:
diff -q 1.tar 2.tar
The quiet output will be "1.tar 2.tar differ", or nothing if there are no differences.

There is a tool called archdiff. It is basically a Perl script that can look inside archives.
It takes two archives, or an archive and a directory, and shows a summary of the
differences between them.
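Presumably it is invoked with the two archives as arguments (an assumption based on the description above):
archdiff 1.tar 2.tar   # assumed invocation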

I had a similar question and I resolved it with Python; here is the code.
PS: although this code compares the contents of two zip archives, the approach is similar for tarballs. I hope it helps.
import zipfile
import os
import hashlib
import shutil

# extract zipName into dirName and return the list of member names
def decompressZip(zipName, dirName):
    try:
        zipFile = zipfile.ZipFile(zipName, "r")
        fileNames = zipFile.namelist()
        for file in fileNames:
            zipFile.extract(file, dirName)
        zipFile.close()
        return fileNames
    except Exception,e:
        raise Exception,e

# return the uppercase hex MD5 digest of a file
def md5sum(filename):
    f = open(filename, "rb")
    md5obj = hashlib.md5()
    md5obj.update(f.read())
    hash = md5obj.hexdigest()
    f.close()
    return str(hash).upper()

if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    # map each member name to its content hash
    oldDict = dict()
    newDict = dict()
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldFileMD5 = md5sum(tmpOldFile)
            oldDict[oldFile] = oldFileMD5
    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newFileMD5 = md5sum(tmpNewFile)
            newDict[newFile] = newFileMD5

    # files present only in the new archive, and files whose content changed
    additionList = list()
    modifyList = list()
    for key in newDict:
        if not oldDict.has_key(key):
            additionList.append(key)
        else:
            newMD5 = newDict[key]
            oldMD5 = oldDict[key]
            if not newMD5 == oldMD5:
                modifyList.append(key)

    print "new file list:%s" % additionList
    print "modified file list:%s" % modifyList

    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")


How to preserve timestamp of original file post zip compression?

I have a lot of files on our servers which we compress with a filter so that only files older than x days get compressed.
The zip command compresses the original, creates a filename.zip, and removes the original.
This has a small problem: the timestamp changes, since the compression job runs after x days.
So when we run the job that removes older files (which are by now zip files), not all files get removed, since the timestamp has changed from the original file to the compressed file.
I would like the original timestamp of the file to be retained by the zip archive even though the zipping runs at a later date.
One way of doing this would be to:
Get the timestamp of each original file with the date command
Compress the original and remove the original
Apply the stored timestamp to the new zip file using "touch"
I am looking for a simpler solution.
Some old file I had:
$ ls -l foo
-rw-r--r-- 1 james james 120 Sep 5 07:28 foo
Zip and redate:
$ zip foo.zip foo && touch -d "$(date -R -r foo)" foo.zip
Check it out:
$ ls -l foo.zip
-rw-r--r-- 1 james james 120 Sep 5 07:28 foo.zip
Remove the original:
$ rm -i foo
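For the batch case described in the question, the same pattern can be wrapped in a loop (a sketch; the path and the 30-day threshold are assumptions):
# adjust the path and the age threshold (+30 days) to your setup
find /var/log/myapp -type f -mtime +30 ! -name '*.zip' | while read -r f; do
  zip -q "$f.zip" "$f" && touch -d "$(date -R -r "$f")" "$f.zip" && rm "$f"
done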
Yes, you can unzip a file and preserve the old timestamp from the original time it was created. The steps to do this (on Windows) are as below:
Right-click on filename.zip and choose Properties
In the General tab, the security note says "This file came from another computer and might be blocked to help protect this computer". Click the Unblock check box and click OK
Extract the file and voila, the extracted file has the date/time stamp from when the file was created/modified

BusyBox tar: append workaround given limited disk space?

I'm on a Linux system with limited resources and BusyBox -- this version of tar does not support --append, -r. Is there a workaround that will allow me to [1] append files from directory B to an existing tar of files from directory A after [2] making the B-files appear to have come from directory A? (Later, when someone extracts the files, they should all end up in the same directory A.)
Situation: I have a list of files that I want to tar, but I must process some of these files first. The files might be used by other processes so I don't want to edit them in-place. I want to be conservative when using disk space so my script only copies those files which it needs to change (vs copying them all and then processing some and finally archiving them all with tar -- if I copied them all I might run into disk space issues).
This means the files I want to archive end up in two separate locations. But I want the resulting tar file to appear as if they were all in the same location. Near the end of my script, I end up with two text files listing the A and B files by name.
I think this is straightforward with a full-blown version of tar, but I have to work with the BusyBox version (usage below). Thanks in advance for any ideas!
Usage: tar -[cxtzjaZmvO] [-X FILE] [-f TARFILE] [-C DIR] [FILE]...
Create, extract, or list files from a tar file
Operation:
c Create
x Extract
t List
Options:
f Name of TARFILE ('-' for stdin/out)
C Change to DIR before operation
v Verbose
z (De)compress using gzip
j (De)compress using bzip2
a (De)compress using lzma
Z (De)compress using compress
O Extract to stdout
h Follow symlinks
m Don't restore mtime
exclude File to exclude
X File with names to exclude
T File with names to include
In principle, you just need to append a tar archive containing the additional files to the end of the existing tar file. It is only slightly more difficult than that.
A tar file consists of any number of repetitions of header + file. The header is always a single 512-byte block, and the file data is padded to a multiple of 512 bytes, so you can think of these units as a variable number of 512-byte blocks. Each entry is independent; its header starts with the full pathname of the file. So there is no requirement that files in a directory be tarred together.
There is one complication. At the end of the tar file, there are at least two 512-byte blocks completely filled with 0s. When tar is reading a tar file, it will ignore a single zero-filled header, but the second one will cause it to stop reading the file. If it hits EOF, it will complain, so the terminating empty headers are required.
There might be more than two trailing zero blocks, because tar actually writes its output in records that are a multiple of 512 bytes. GNU tar, for example, by default writes in records of 20 512-byte blocks, so the smallest tar file is normally 10240 bytes.
In order to append new data, you need to first truncate the existing file to eliminate the empty blocks.
I believe that if the tar file was produced by busybox, there will only be two empty blocks, but I haven't inspected the code. That would be easy; you only need to truncate the last 1024 bytes of the file before appending the additional files.
For general tar files, it is trickier. If you knew that the files themselves didn't have NUL bytes in them (i.e. they were all simple text files), you could remove empty headers until you found a block with a non-0 byte in it, which wouldn't be too difficult.
What I would do is:
Truncate the last 1024 bytes of the tar file.
Remember the current size of the tar file.
Append a test tar file consisting of the tar of a file with a simple short message
Verify that tar tf correctly shows the test file
Truncate the file back to the remembered length,
If the tar tf found the test file's name, succeed
If the last 512 bytes of the tar file are all 0s, truncate the last 512 bytes of the file, and return to step 2.
Otherwise fail
If the above procedure succeeds, you can proceed to append the tar data containing the new files.
I don't know if you have a truncate command. If not, you can use dd to copy a file over the top of an old file at a specified offset (see the seek= option); dd will truncate the file automatically at the end of the copy. You can also use dd to read a 512-byte block (see the skip and count options).
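A minimal sketch of that append, assuming a busybox-produced archive with exactly two trailing zero blocks (file names are just examples; b.tar holds the extra files, created with tar -C so their paths look like the A files):
SIZE=$(wc -c < a.tar)
# copy everything except the last 1024 bytes (assumes exactly two zero-filled end blocks)
dd if=a.tar of=combined.tar bs=512 count=$(( SIZE / 512 - 2 ))
# b.tar ends with its own zero blocks, so appending it restores the end-of-archive marker
cat b.tar >> combined.tar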
The best solution is to cut the last 1024 bytes and concatenate a new tar after it. In order to append a tar to an existing tar file, they must be uncompressed.
For files like:
$ find a b
a
a/file1
b
b/file2
You can:
$ tar -C a -czvf a.tar.gz .
$ gunzip -c a.tar.gz | { head -c -$((512*2)); tar -C b -c .; } | gzip > a+b.tar.gz
With the result:
$ tar -tzvf a+b.tar.gz
drwxr-xr-x 0/0 0 2018-04-20 16:11:00 ./
-rw-r--r-- 0/0 0 2018-04-20 16:11:00 ./file1
drwxr-xr-x 0/0 0 2018-04-20 16:11:07 ./
-rw-r--r-- 0/0 0 2018-04-20 16:11:07 ./file2
Or you can create both tar in the same command:
$ tar -C a -c . | { head -c -$((512*2)); tar -C b -c .; } | gzip > a+b.tar.gz
This works for tar files generated by busybox tar. As mentioned in the previous answer, GNU tar adds a multiple of 20 blocks. You need to force the blocking factor to 1 (--blocking-factor=1) in order to know in advance how many blocks to cut:
$ tar --blocking-factor=1 -C a -c . | { head -c -$((512*2)); tar -C b -c .; } | gzip | tar --blocking-factor=1 -tzv
Anyway, GNU tar does have --append. The last --blocking-factor=1 is only needed if you intend to append to the resulting tar again.
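For completeness, with full GNU tar the append case is simply (same a/b layout as above):
tar -cf a.tar -C a .
tar -rf a.tar -C b .    # -r / --append
gzip a.tar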

perl while loop

In this code I parse a file (containing the output of ls -lrt) for each log file's modification date. Then I move all the log files into a new folder with their modification dates added to the filenames, and then make a tar of all those files.
The problem I am getting is in the while loop. Because it reads the data for all the files, the while loop keeps running 15 times. I understand that there is some issue in the code but I can't figure it out.
Inside the while loop I am splitting the ls -lrt records to find each log file's modification date. $file holds the output of the ls command, which I store in the text file /scripts/yagya.txt in order to get the modification dates. But the while loop executes 15 times, since there are 15 log files in the folder which match the pattern.
#!/usr/bin/perl
use File::Find;
use strict;
my @field;
my $filenew;
my $date;
my $file = `ls -lrt /scripts/*log*`;
my $directory="/scripts/*.log";
my $current = localtime;
my $current_time = $current;
$current_time = s/\s+//g;
my $freetime = $current_time;
my $daytime = substr($current_time,0,8);
my $seconddir = "/$freetime/";
system ("mkdir $seconddir");
open (MYFILE,">/scripts/yagya.txt");
print MYFILE "$file";
close (MYFILE);
my $data = "/scripts/yagya.txt";
my $datas = "/scripts/";
my %options = (
    wanted  => \&wanted,
    untaint => 1
);
find (\%options, $datas);

sub wanted {
    if (/[._]log\d*$/){
        my $files;
        my @fields;
        my $fields;
        chomp;
        $files=$_;
        open (MYFILE,$data);
        while(<MYFILE>){
            chop;
            s/#.*//;
            next unless /\S/;
            @fields = (split)[5,6,7];
            $fields = join('',@fields), "\n";
        }
        close (MYFILE);
        system ("mv $files $seconddir$fields$files");
    }
}

system ("tar cvf /$daytime/$daytime.tar.gz /$daytime/*log*");
system ("rm $seconddir*log*");
system ("rm $data");
Your code is very difficult to read. It looks like you have written the program as a single big chunk before you started to test it. That way of working is common but very wrong. You should start by implementing a small part of the program and testing that before you add a little more functionality, test again, and so on. That way you won't be overwhelmed with fixing many problems at once in a large untested program.
It would also help you a lot if you added use warnings to your use strict at the top of the program. It helps to catch simple errors that you may overlook.
Also, are you aware that File::Find will call your wanted callback subroutine every time it encounters a file? It doesn't pass all the files at once.
The problem seems to be that you are reading all the way through the yagya.txt file when you should be stopping when you find the record that matches the current file that File::Find has found. What you need to do is to check whether the current record in the ls output ends with the name of the current file. If you write the loop like this
while (<MYFILE>) {
    if (/\Q$files\E$/) {
        my @fields = (split)[5,6,7];
        $fields = join('', @fields);
        last;
    }
}
then $fields will end up with the modification date of the current file, which is what you want.
But this would be a thousand times easier if you used Perl to read the file modification date for you.
Instead of writing an ls listing to a file and reading it back, you should do something like this
use File::stat;
my $mtime = localtime(stat($files)->mtime);
which will give you a string like Wed Jun 13 11:25:23 2012. The date from my ls output includes only the month name, day of month, and time of day, like Jun 8 12:37. That isn't very specific and you perhaps should at least include a year, but to generate the same string from this $mtime you can write
my $fields = join '', (split ' ', $mtime)[1,2,3];
There is a lot more I could say about your program, but I hope this gets it going for you for now.
Another couple of things I have noticed:
The line $current_time = s/\s+//g should be $current_time =~ s/\s+//g to remove all spaces from the current time string
A value like Sun Jun 3 11:50:54 2012 will be reduced to SunJun311:50:542012, and $daytime will then take the value SunJun31, which is incorrect
I don't usually recommend using bash instead of perl, but sometimes it is much shorter.
This problem has 2 parts:
rename the files into another directory, adding a timestamp to the filenames
archive them by minute, hour, day, etc.
for 1.)
find ./scripts -name \*[_.]log\* -type f -printf "%p\0./logs/%TY%Tm%Td-%TH%Tk%TM-%f\0" | xargs -0 -L 2 mv
The above will find all plain files with [_.]log in their names and rename them into the ./logs directory with a timestamp prefix, e.g.
./scripts/aaa.log12 gets renamed to ./logs/20120403-102233-aaa.log12
2.) archiving
ls logs | sed 's/\(........-....\).*/\1/' | sort -u | while read groupby
do
( cd logs && echo tar cvzf ../$groupby.tgz $groupby* )
done
This will create tar archives grouped by timestamp prefix. (It is assumed that ./logs contains only files with valid, timestamped filenames.)
Of course, the above sed pattern is not nice, but it clearly shows the seconds being deleted from the timestamp - so it creates archives by minute. If you want another grouping, you can use:
sed 's/\(........-..\).*/\1/' - by hours
sed 's/\(........\).*/\1/' - by days
Other notes:
the -printf action for find is supported only in the GNU version of find - common on Linux
it is usually not good practice to work directly in '/', like /scripts, therefore my example uses ./
if the same filename with the same timestamp exists more than once in your ./scripts subtree, the mv will overwrite the first, e.g. both ./scripts/a/a.log and ./scripts/x/a.log with the same timestamp will be renamed to ./logs/TIMESTAMP-a.log

grep but indexable?

I have over 200 MB of source code files that I constantly have to search (I am part of a very big team). I notice that grep does not create an index, so each lookup requires going through the entire source code database.
Is there a command line utility similar to grep which has indexing ability?
The solutions below are rather simple. There are a lot of corner cases that they do not cover:
searching for start of line ^
filenames containing \n or : will fail
filenames containing white space will fail (though that can be fixed by using GNU Parallel instead of xargs)
searching for a string that matches the path of another file will be suboptimal
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow, reading one big file is often faster.
Given those facts the idea is to simply make an index containing all the files with all their content - each line prepended with the filename and the line number:
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . > .index
Use the index:
grep foo .index
Solution 2: one big compressed file
Fact: Harddrives are slow. Seeking is dead slow. Multi core CPUs are normal.
So it may be faster to read a compressed file and decompress it on the fly than reading the uncompressed file - especially if you have RAM enough to cache the compressed file but not enough for the uncompressed file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo
Solution 3: use index for finding potential candidates
Generating the index can be time consuming and you might not want to do that for every single change in the dir.
To speed that up only use the index for identifying filenames that might match and do an actual grep through those (hopefully limited number of) files. This will discover files that no longer match, but it will not discover new files that do match.
The sort -u is needed to avoid grepping the same file multiple times.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Append to the index:
find . -type f -newer .index -print0 | xargs -0 grep -Han . | pbzip2 >> .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
It can be even faster if you use pzstd instead of pbzip2/pbzcat.
Solution 5: use git
git grep can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4.
The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
git init
git add .
Append to the index:
git add .
Use the index:
git grep foo
Solution 6: optimize git
Git puts its data into many small files. This results in seeking. But you can ask git to compress the small files into few, bigger files:
git gc --aggressive
This takes a while, but it packs the index very efficiently in few files.
Now you can do:
find .git -type f | xargs cat >/dev/null
git grep foo
git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc now and then to avoid many small files, and git gc --aggressive to save more disk space, when the system is idle.
git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.
There is the https://code.google.com/p/codesearch/ project, which is capable of creating an index and searching it quickly. Regexps are supported and computed using the index (actually, only a subset of the regexp can use the index to filter the file set, and then the real regexp is re-evaluated on the matched files).
The index from codesearch is usually 10-20% of the source code size, building the index is about as fast as running classic grep 2 or 3 times, and the searching is almost instantaneous.
The ideas used in the codesearch project are from Google's Code Search site (RIP). E.g. the index contains a map from n-grams (3-grams, or every 3-byte sequence found in your sources) to the files; the regexp is translated to 4-grams when searching.
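Typical usage looks roughly like this (a sketch, assuming the project's cindex and csearch tools are installed):
cindex ~/src              # build or update the index (stored in ~/.csearchindex by default)
csearch 'some.*regexp'    # search using the index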
PS: There are also ctags and cscope for navigating C/C++ sources. Ctags can find declarations/definitions; cscope is more capable, but has problems with C++.
PPS: There are also clang-based tools for the C/C++/ObjC languages: http://blog.wuwon.id.au/2011/10/vim-plugin-for-navigating-c-with.html and clang-complete
I notice that grep does not create an index so lookup requires going through the entire source code database each time.
Without addressing the indexing ability part: with Git 2.8 (Q1 2016), git grep will have the ability to run in parallel!
See commit 89f09dd, commit 044b1f3, commit b6b468b (15 Dec 2015) by Victor Leschuk (vleschuk).
(Merged by Junio C Hamano -- gitster -- in commit bdd1cc2, 12 Jan 2016)
grep: add --threads=<num> option and grep.threads configuration
"git grep" can now be configured (or told from the command line) how
many threads to use when searching in the working tree files.
grep.threads:
Number of grep worker threads to use.
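A brief illustration of the new option and the matching configuration:
git grep --threads=4 foo
git config grep.threads 4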
ack is a code searching tool that is optimized for programmers, especially programmers dealing with large heterogeneous source code trees: http://beyondgrep.com/
Are some of your searches cases where you only want to search a certain type of file, like only Java files? Then you can do
ack --java function
ack does not index the source code, but it may not matter depending on what your search patterns are like. In many cases, searching only certain types of files gives the speedup that you need because you're not also searching all those other XML, etc. files.
And if ack doesn't do it for you, here is a list of many tools designed for searching source code: http://beyondgrep.com/more-tools/
We use a tool internally to index very large log files and make efficient searches of them. It has been open-sourced. I don't know how well it scales to large numbers of files, though. It multithreads by default, it searches inside gzipped files, and it caches indexes of previously searched files.
https://github.com/purestorage/4grep
This grep-cache article has a script for caching grep results. The examples were run on Windows with Linux tools installed, so they can easily be used on *nix/Mac with little modification. It's mostly just a Perl script anyway.
Also, the filesystem itself (assuming you're using *nix) often caches recently read data, causing future grep runs to be faster, since grep is then effectively searching memory instead of disk.
You can manually drop this cache via /proc/sys/vm/drop_caches if you want to compare the speed of an uncached versus a cached grep.
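For example (requires root; this drops the page cache, dentries, and inodes):
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches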
Since you mention various kinds of text files that are not really code, I suggest you have a look at GNU ID utils. For example:
cd /tmp
# create index file named 'ID'
mkid -m /dev/null -d text /var/log/messages.*
# query index
gid -r 'spamd|kernel'
These tools focus on tokens, so queries on strings of tokens are not possible. There is minimal integration in emacs for the gid command.
For the more specific case of indexing source code, I prefer to use GNU global, which I find more flexible. For example:
cd sourcedir
# index source tree
gtags .
# look for a definition
global -x main
# look for a reference
global -xr printf
# look for another kind of symbol
global -xs argc
Global natively supports C/C++ and Java, and with a bit of configuration, can be extended to support many more languages. It also has very good integration with emacs: successive queries are stacked, and updating a source file updates the index efficiently. However I'm not aware that it is able to index plain text (yet).

Fast Concatenation of Multiple GZip Files

I have list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate or gzip these files into one gzip file
without having to decompress them?
In practice we will use this in a web database (CGI), where the web app receives
a query from a user, lists all the files matching the query, and presents them
back to the user in one batch file.
With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC,
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
Since existing tools generally ignore the filename headers for the additional members, it's not easily possible to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP supports some other compression algorithms as well as an option - method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, file1 can be replaced by file1.gz.
You must notice this:
gunzip will extract all members at once
So to get the members individually, you will have to use something additional, or write your own tool if you wish to do so.
However, this is also addressed in man page.
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
Just use cat. It is very fast (0.2 seconds for 500 MB for me)
cat *gz > final
mv final final.gz
You can then read the output with zcat to make sure it's pretty:
zcat final.gz
I tried the other answer's 'gzip -c' approach, but I ended up with garbage when using already gzipped files as input (I guess it double compressed them).
PV:
Better yet, if you have it, 'pv' instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.
You can create a tar file of these files and then gzip the tar file to create the new gzip file
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar
