Fast Concatenation of Multiple GZip Files - linux

I have a list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate these files into one gzip file without having to decompress them?
In practice we will use this in a web database (CGI): the server will receive a query from the user, list all the files matching the query, and return them to the user as a single batch file.

With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC (RFC 1952),
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
Since existing tools generally ignore the filename headers of the additional members, it is not easy to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP also supports some other compression methods as options; method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
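A quick way to see this member behavior for yourself (a minimal sketch; the file names are illustrative):
echo hello | gzip > a.gz
echo world | gzip > b.gz
cat a.gz b.gz > both.gz
zcat both.gz
# prints "hello" then "world", as if the inputs had been concatenated first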

Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, file1 can be replaced by file1.gz.
Note this part:
gunzip will extract all members at once
So to get the members back individually, you will have to use an additional tool, or write one yourself. However, this is also addressed in the man page:
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
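For example, a minimal sketch of the tar route (file names are illustrative):
tar -czf archive.tar.gz file1 file2 file3   # create a gzip-compressed archive
tar -tzf archive.tar.gz                     # list the members
tar -xzf archive.tar.gz file2               # extract a single member independently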

Just use cat; it is very fast (0.2 seconds for 500 MB for me):
cat *gz > final
mv final final.gz
You can then read the output with zcat to verify that the result is intact:
zcat final.gz
I tried the gzip -c approach from another answer, but I ended up with garbage when using already-gzipped files as input (I suspect it compressed them a second time).
pv:
Better yet, if you have it, use pv instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.

You can create a tar file of these files and then gzip the tar file to create the new gzip file:
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar

Related

Splitting large tar file into multiple tar files

I have a tar file which is 3.1 TB (terabytes).
File name: Testfile.tar
I would like to split this tar file into 2 parts: Testfile1.tar and Testfile2.tar.
I tried the following so far:
split -b 1T Testfile.tar "Testfile.tar"
What I get is Testfile.taraa (what is "aa"?), and I just stopped my command. I also noticed that the output Testfile.taraa doesn't seem to be a tar file when I do ls in the directory; it seems like a text file. Maybe once the full split is completed it will look like a tar file?
The behavior of split is correct; from the online man page: http://man7.org/linux/man-pages/man1/split.1.html
Output pieces of FILE to PREFIXaa, PREFIXab, ...
Don't stop the command; let it run, and then you can use cat to concatenate (join) the pieces back together again.
Examples can be seen here: https://unix.stackexchange.com/questions/24630/whats-the-best-way-to-join-files-again-after-splitting-them
split -b 100m myImage.iso
# later
cat x* > myImage.iso
UPDATE
Just as clarification, since I believe you have not understood the approach: you split a big file like this to transport it, for example; the pieces are not usable on their own. To use the data again you need to concatenate (join) the pieces back together. If you want usable parts, you need to unpack the archive, split its contents into parts, and pack each part separately. split just cuts the binary file at byte boundaries, so I don't think you can use those parts individually.
You are doing the compression first and the partitioning later.
If you want each part to be a tar file, you should split up the original files first, and then run tar on each part, as sketched below.
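A minimal sketch of that approach, assuming GNU tar (the names and the group size of 1000 are illustrative choices):
# Group the file names into lists, then tar each group, so that every
# part is a valid, independently extractable archive.
find dir/ -type f | split -l 1000 - filelist.
for list in filelist.*; do
    tar -cf "part-${list##*.}.tar" -T "$list"
done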

Recognize files depending on extension

I need to recognize files with different extensions, even when there is a combination of multiple extensions.
So if my cwd has these files:
file-1.zip
file-2.tar
file-3.tar.gz
file-4.gz
file-5.zip.tar
file-6.tar.gz
file-7.gz
I need to tell bash what to do when the extension (in this case) is:
zip
tar
zip.tar
tar.gz
gz
For every extension I need to do different things; this implies that if the extension is .tar (only) or .gz (only) I need to do certain things, but if the extension is .tar.gz I need to run another snippet.
Example: if the filename has a .tar extension, I need to do
# stuff
tar xf filename.tar
# other stuff
If the filename has a .zip.tar extension, I need to run more complex code. (The code is not totally dependent on the extensions; my only objective is to get the full extension of the filename: filename.tar.gz should return .tar.gz instead of .gz or .tar.)
Also, is there any way to do this using gawk?
Use a case statement:
case "$filename" in
*.tar.gz) code for .tar.gz ;;
*.gz) code for .gz ;;
*.zip.tar) code for .zip.tar ;;
*.tar) code for .tar ;;
...
esac
Just make sure you put the combined extensions before the single extensions that they contain, because case executes the statements for the first pattern that matches.
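A concrete sketch of that pattern (the extraction commands are illustrative assumptions, not part of the original answer):
for filename in ./*; do
    case "$filename" in
        *.tar.gz)  tar -xzf "$filename" ;;    # combined extensions first
        *.zip.tar) tar -xf "$filename" ;;     # then handle the inner .zip as needed
        *.tar)     tar -xf "$filename" ;;
        *.gz)      gunzip -k "$filename" ;;   # -k keeps the original (gzip >= 1.6)
        *.zip)     unzip "$filename" ;;
    esac
done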
The file command is a good option to detect file types; you can then build your logic around its output:
file -i test.*
test.gz: application/x-gzip; charset=binary
test.tar: application/x-tar; charset=binary
test.tar.gz: application/x-gzip; charset=binary
test.zip: application/zip; charset=binary
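Note that file inspects content, so test.tar.gz above reports only the outer gzip layer. Since the question also asks about gawk: a minimal sketch that strips everything up to the first dot of the name, leaving the full compound extension (it assumes the stem itself contains no dots):
printf '%s\n' file-3.tar.gz file-4.gz file-5.zip.tar |
gawk '{ sub(/^[^.]*\./, ""); print }'
# tar.gz
# gz
# zip.tar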

Unzip the archive with more than one entry

I'm trying to decompress an ~8 GB .zip file piped from a curl command. Everything I have tried gets interrupted at under 1 GB and returns the message:
... has more than one entry--rest ignored
I've tried funzip, gunzip, gzip -d, zcat, ... also with different arguments; all end up with the above message.
The datafile is public, so it's easy to repro the issue:
curl -L https://archive.org/download/nycTaxiTripData2013/faredata2013.zip | funzip > datafile
Are you sure the mentioned file extracts to a single file? If it extracts to multiple files, you unfortunately cannot unzip it on the fly.
Zip is a container as well as a compression format, and a streaming reader doesn't know where the next entry begins (the central directory that indexes the entries sits at the end of the file). You'll have to download the whole file and then unzip it.
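For the file from the question, that means something like this (the output directory name is an illustrative choice):
curl -L -o faredata2013.zip https://archive.org/download/nycTaxiTripData2013/faredata2013.zip
unzip faredata2013.zip -d faredata2013/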

Difference between .tar.gz and first gzip, then tar

I made two compressed copies of my folder, first by using the command tar czf dir.tar.gz dir.
This gives me an archive of size ~16 kB. Then I tried another method: first I gzipped all files inside the dir and then tarred them up:
gzip ./dir/*
tar cf dir.tar dir/*.gz
But the second method gave me a dir.tar of size ~30 kB (almost double). Why is there so much difference in size?
Because compression in general is more efficient on a big sample than on small files. Say you have gzipped 100 files of 1 kB each. Each file will achieve a certain compression, plus the overhead of the gzip format:
file1 -> file1.gz (say, 30 bytes of headers/footers)
file2 -> file2.gz (say, 30 bytes of headers/footers)
...
file100 -> file100.gz (say, 30 bytes of headers/footers)
------------------------------
30 * 100 = 3 kB of overhead.
But if you compress a single 100 kB tar file (which contains your 100 files), the overhead of the gzip format is added only once (instead of 100 times), and the compression itself can be better.
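You can see the fixed per-member overhead directly (exact byte counts vary with the gzip version and input):
printf 'hello' | gzip | wc -c   # 5 bytes in, ~25 bytes out; most of that is header and trailer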
Overhead comes from the per-file metadata, and from suboptimal compression by gzip when processing files individually: gzip never sees the data in full, so it compresses each file with a dictionary that is reset after every file.
tar cf creates an uncompressed archive, which means the size of your archive should be about the same as your directory, maybe even a bit more.
tar czf will run gzip compression over it.
This can be checked by running man tar at a Linux shell prompt:
-z, --gzip, --gunzip, --ungzip
filter the archive through gzip
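To reproduce the comparison without touching the original directory, a minimal sketch (gzip -r compresses every file in place inside the copy):
cp -r dir dir-copy
tar czf whole.tar.gz dir        # one gzip stream over the whole archive
gzip -r dir-copy                # one gzip stream per file
tar cf pieces.tar dir-copy
ls -l whole.tar.gz pieces.tar   # pieces.tar will typically be noticeably larger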

How to compare two tarballs' content

I want to tell whether two tarball files contain identical files, in terms of file name and file content, not including metadata like date, user, and group.
However, there are some restrictions:
First, I have no control over whether the metadata is included when the tar file is made; the tar file always contains metadata, so directly diffing the two tar files doesn't work.
Second, some tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I could untar file1.tar into file1/, I could compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)
Any idea how I can compare the two tar files? It would be better if it can be accomplished with shell scripts. Alternatively, is there any way to get each sub-file's checksum without actually untarring the tarball?
Thanks,
Try also pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with zero code if unchanged):
pkgdiff PKG-0.tgz PKG-1.tgz
Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum manifest and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract the checksum files and compare them, as sketched below.
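A minimal sketch of that idea, assuming GNU tar (file and manifest names are illustrative):
# When building: store a sorted checksum manifest inside the archive.
find dir -type f -exec md5sum {} + | sort -k 2 > dir/MD5SUMS
tar cf archive1.tar dir
# When comparing: extract only the manifests to stdout and diff them.
diff <(tar -xOf archive1.tar dir/MD5SUMS) <(tar -xOf archive2.tar dir/MD5SUMS)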
If you can afford to extract just one of the tar files, you can use tar's --diff option to look for differences against the contents of the other tar file.
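Something along these lines (a sketch; tmpdir is an illustrative name, and it assumes both archives store the same relative paths):
mkdir tmpdir
tar -xf file1.tar -C tmpdir          # extract the first archive
(cd tmpdir && tar -df ../file2.tar)  # report where the second archive differs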
One more crude trick, if you are fine with just a comparison of the filenames and their sizes.
Remember, this does not guarantee that the file contents are the same!
Execute tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything besides the filename and size columns, preferably sorting the two files too. Finally, just do a diff between the two lists.
Just remember that this last scheme does not really checksum anything.
Sample tar and output (all files are zero size in this example).
$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/
Command to generate sorted name/size list
$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/
You can take two such sorted lists and diff them, as shown below. You can also include the date and time columns if that works for you.
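Putting the two steps together (the same awk filter as above, applied to both archives):
diff <(tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2) \
     <(tar tvfj pack2.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2)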
tarsum is almost what you need. Take its output, run it through sort to get the ordering identical for each archive, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.
Here is my variant; it checks the Unix permissions too.
It works only if the filenames are shorter than 200 characters:
diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)
EDIT: See the comment by @StéphaneGourichon.
I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution I've implemented outputs the tar to stdout and pipes it to whichever hash you choose:
tar -xOzf archive.tar.gz | sort | sha1sum
Note that the order of the arguments is important; particularly O, which signals tar to write each file to stdout.
Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."
There is also diffoscope, which is more generic and allows comparing things recursively (including various formats).
pip install diffoscope
I propose gtarsum, which I have written in Go; it builds to a standalone executable (no Python or other runtime needed).
go get github.com/VonC/gtarsum
It will read a tar file and:
sort the list of files alphabetically,
compute a SHA256 for each file's content,
concatenate those hashes into one giant string,
compute the SHA256 of that string.
The result is a "global hash" for a tar file, based on the list of files and their content.
It can compare multiple tar files, and returns 0 if they are identical and 1 if they are not.
Just throwing this out there, since none of the above solutions worked for what I needed.
This snippet gets the md5 hash of the md5 hashes of all the file paths matching a given path. If the hashes are the same, the file hierarchy and file lists are the same.
I know it's not as performant as others, but it provides the certainty I needed.
PATH_TO_CHECK="some/path"
for template in $(find build/ -name '*.tar'); do
  # -v prints each member name before --to-command pipes its content to md5sum
  tar -xvf "$template" --to-command=md5sum |
    grep -A 1 "$PATH_TO_CHECK" |       # keep matching paths plus the hash line after each
    grep -v "$PATH_TO_CHECK" |         # drop the path lines, leaving only the hashes
    awk '{print $1}' |                 # strip md5sum's trailing "-"
    md5sum |                           # hash the concatenated hashes
    awk "{print \"$template\", \$1}"   # label the result with the archive name
done
Note: an invalid path simply returns nothing.
If you don't want to extract the archives or see the detailed differences, try diff's -q option:
diff -q 1.tar 2.tar
This quiet mode prints "1.tar 2.tar differ", or nothing if there are no differences.
There is a tool called archdiff. It is basically a Perl script that can look inside archives:
Takes two archives, or an archive and a directory, and shows a summary of the
differences between them.
I had a similar question and resolved it with Python; here is the code.
P.S.: although this code is used to compare two zipballs' content, the approach is similar for tarballs. Hope this helps.
import zipfile
import os
import hashlib
import shutil

def decompressZip(zipName, dirName):
    # Extract all entries and return the list of member names.
    zipFile = zipfile.ZipFile(zipName, "r")
    fileNames = zipFile.namelist()
    for name in fileNames:
        zipFile.extract(name, dirName)
    zipFile.close()
    return fileNames

def md5sum(filename):
    # Hash a file's contents and return the hex digest in upper case.
    f = open(filename, "rb")
    md5obj = hashlib.md5()
    md5obj.update(f.read())
    digest = md5obj.hexdigest()
    f.close()
    return str(digest).upper()

if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")

    # Map each member name to the MD5 of its content (directories skipped).
    oldDict = dict()
    newDict = dict()
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldDict[oldFile] = md5sum(tmpOldFile)
    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newDict[newFile] = md5sum(tmpNewFile)

    # Files present only in the new archive, and files whose content changed.
    additionList = list()
    modifyList = list()
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        elif newDict[key] != oldDict[key]:
            modifyList.append(key)

    print("new file list: %s" % additionList)
    print("modified file list: %s" % modifyList)

    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")
