I am trying to zip a directory with pigz, utilizing its multi-core compression. I'm not necessarily trying to compress heavily, but simply to speed up the process of zipping the files.
The issue I've come across is that once pigz is done, tar has archived all the files into a single tar inside the compressed file. Is there any way to put all of the files in the directory into one zip, each file stored individually, without ending up with a single tar file inside?
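For reference, the kind of command being described is presumably something along these lines (mydir is just a placeholder name):

# tar bundles everything into one stream; pigz compresses that stream on multiple cores
tar -cf - mydir | pigz > mydir.tar.gz
# equivalent using GNU tar's --use-compress-program (-I) option
tar -I pigz -cf mydir.tar.gz mydir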
TL;DR
How can I untar a .tgz file, and then selectively gzip the output?
My extracted dir has a few text files and a .nii file. I'd like to gzip the latter.
More details
The first method would be to just do it sequentially. However, I'm dealing with a huge dataset (10k+ tar archives) stored on a BeeGFS file system, and I was told it would be better to do it in memory rather than in two steps, since BeeGFS doesn't handle big directories like this well.
Sequential method:
for tarfile in "${rootdir}"/*.tgz; do
    tarpath="${tarfile%.tgz}"
    tar zxvf "${tarfile}"      # (1) untar directory
    gzip "${tarpath}"/*.nii    # (2) gzip the .nii file
done
Is there a way to combine (1) and (2)? Or do you have any other tips on how to do this process effectively?
Thanks!
You can extract a single file from the archive (if you know the filename), have tar write it to standard output instead of to a file with -O, and then compress that stream and redirect it to a file. Something like
tar xzOf "$tarfile" "$tarpath/foo.nii" | gzip -c > "$tarpath/foo.nii.gz"
You can then extract everything else in the archive with tar xzf "$tarfile" --exclude "*.nii"
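Putting that together with the loop from the question, a rough sketch, assuming each archive contains exactly one .nii file whose path inside the archive is looked up first with tar tzf:

for tarfile in "${rootdir}"/*.tgz; do
    # look up the .nii member's path inside the archive (assumes exactly one)
    niipath=$(tar tzf "$tarfile" | grep '\.nii$' | head -n 1)
    # stream that member straight from the archive into gzip
    mkdir -p "$(dirname "$niipath")"
    tar xzOf "$tarfile" "$niipath" | gzip -c > "${niipath}.gz"
    # extract everything else as before
    tar xzf "$tarfile" --exclude '*.nii'
done

This still reads each archive more than once; if the .nii path is predictable from the archive name, the tar tzf lookup can be skipped.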
I used the following:
gzip -9 -c -r <some_directory> > directory.gz
How do I decompress this directory?
I have tried
gunzip directory.gz
but I am just left with a single file, not a directory structure.
As others have already mentioned, gzip is a file compression tool and not an archival tool. It cannot work with directories. When you run it with -r, it will find all files in a directory hierarchy and compress them, i.e. replacing path/to/file with path/to/file.gz. When you pass -c the gzip output is written to stdout instead of creating files. You have effectively created one big file which contains several gzip-compressed files.
Now, you could look for the gzip file header/magic number, which is 1f 8b, and then reconstruct your files manually.
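If you want to see where the individual gzip members start before attempting that, you can search for the magic bytes; a small sketch, assuming GNU grep with PCRE support (-P):

# print the byte offset of every gzip member header in the blob
# (1f 8b plus 08, the deflate method byte, to cut down on false positives)
grep -aboP '\x1f\x8b\x08' directory.gz | cut -d: -f1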
The sensible thing to do now is to create backups (if you haven't already). Backups always help (especially with problems such as yours). Create a backup of your directory.gz file now. Then read on.
Fortunately, there's an easier way than manually reconstructing all files: using binwalk, a forensics utility which can be used to extract files from within other files. I tried it with a test file, which was created the same way as yours. Running binwalk -e file.gz will create a folder with all extracted files. It even manages to reconstruct the original file names. The hierarchy of the directories is probably lost. But at least you have your file contents and their names back. Good luck!
Remember: backups are essential.
(For completeness' sake: What you probably intended to run: tar czf directory.tar.gz directory and then tar xf directory.tar.gz)
gzip will compress one or more files, though it is not meant to function like an archive utility. The posted command line yields N compressed file images concatenated to stdout, redirected to the named output file; unfortunately, the directory structure is not recorded (file names survive only inside each member's gzip header). A pair of commands like this should work:
(create)
tar -czvf dir.tar.gz <some-dir>
(extract)
tar -xzvf dir.tar.gz
When backing up one or more _very_large_ files using tar with compression (-j or -z), how does GNU tar manage the use of temporary files and memory?
Does it backup and compress the files block by block, file by file, or some other way?
Is there a difference between the way the following two commands use temporary files and memory?
tar -czf data.tar.gz ./data/*
tar -cf - ./data/* | gzip > data.tar.gz
Thank you.
No temporary files are used by either command. tar works completely in a streaming fashion. Packaging and compressing are completely separated from each other, and when the -z or -j option (or similar) is used they are connected by a pipe-like mechanism internally.
For each file tar puts into an archive, it computes a file info header which contains the file's path, its user, permissions, etc., and also its size. The size needs to be known up front (that's why putting the output of a stream into a tar archive isn't easy without using a temp file). After this header, the plain contents of the file follow. Since the size is known and already part of the file info ahead of it, the end of the file is unambiguous, so the next file in the archive can follow directly. In this process no temporary files are needed for anything.
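To see this layout for yourself, you can dump the first header block of a tar stream; a small sketch (the field offsets are those of the POSIX ustar header):

# build a tar stream on the fly and dump its first 512-byte header block;
# bytes 0-99 hold the member's name and bytes 124-135 its size as an octal
# string; the file's contents follow immediately after the header
tar -cf - ./data | head -c 512 | od -A d -c | head -n 12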
This stream of bytes is given to whichever of the implemented compression algorithms you chose, which also do not create any temporary files. I'm going out on a limb a bit here because I don't know all compression algorithms by heart, but all the ones I have ever come in touch with work as streams and do not create temporary files.
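Because of this separation, the two commands from the question produce the same tar stream; a quick way to convince yourself, assuming ./data does not change between the runs:

# only the gzip wrapper (e.g. its embedded timestamp) may differ between the
# two outputs, so compare the decompressed payloads rather than the .gz files
tar -czf a.tar.gz ./data/*
tar -cf - ./data/* | gzip > b.tar.gz
cmp <(zcat a.tar.gz) <(zcat b.tar.gz) && echo "identical tar streams"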
What is the Unix bash command to get the list of files (like ls) from archive file of type .bz2 (without unzipping the archive)?
First, bzip2, gzip, etc. compress only one file, so you probably have a compressed tar file. To list the files you need a command like:
tar tjvf file.bz2
This command decompresses the archive on the fly and lists the contents of the tar.
Note that bzip2 compresses each file individually, and a simple .bz2 file always contains a single file whose name is the archive's name with the ".bz2" part stripped off. When using bzip2 to compress a file, there is no option to specify a different name; the original name is used and .bz2 is appended. So there are no multiple files, only one file. If that file is a tar archive, it can contain many files, and the whole contents of the .tar.bz2 file can be listed with "tar tf file.tar.bz2" without unpacking the archive.
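A quick way to tell which case you have before deciding how to list it, sketched with the standard file utility:

file sample.bz2             # e.g. "bzip2 compressed data"
bzcat sample.bz2 | file -   # e.g. "POSIX tar archive" if a tarball is inside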
I have thousands of files (approximately 20,000) in sample.tgz, and decompressing it with tar -xf takes more than 5 minutes. I want to speed this up to within a minute. The approach I am thinking of is getting all the names of the files in the .tgz and then running tar in parallel on batches of, say, 500 file names using the -T option.
Could somebody suggest a better approach? Please note that I have to use tar only here, and not other utilities like pigz, parallel, etc.
Similarly, if anyone can suggest an approach to compress it faster, that would also be helpful.
Also note that there is no .tgz file inside my sample.tgz file.
Tarballs are linear archives (mimicking the media they are named after; Tape ARchive) and so don't parallelize well until decompressed. Speeding up the decompression by using an algorithm such as LZ4 would help some, but if you're stuck with a gzipped tarball then the only chance you have of speeding it up is to use pigz instead of gzip to decompress it to a .tar file and then extract the files from there.
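A rough sketch of that two-step approach (and the equivalent single pipeline), assuming pigz is available despite the stated restriction:

# decompress in a separate step, then extract the plain tar
pigz -dc sample.tgz > sample.tar
tar -xf sample.tar
# or stream it in one pipeline if the intermediate .tar isn't needed
pigz -dc sample.tgz | tar -xf -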