Tool that improves compression by uncompressing inner archives - linux

There was a compression tool that decompressed the inner gz/bz2/xz/etc. files before storing them in a tar format and archiving them, and I don't remember its name. I'm creating archives that contain very similar rpm/deb/tgz packages, and applying compression only at the end will probably improve the compression ratio significantly.
From what I remember, the tool also stored a metadata file recording which compression options were used, in order to reproduce identical compressed files during decompression.

Found it: https://github.com/schnaader/precomp-cpp
It's not clear yet whether it only recompresses a given archive, or whether it also accepts a list of input files/dirs (some of which may already be compressed), like tar does.
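For comparison, here is a rough sketch of the same idea using only standard tools: unpack each inner archive first, tar everything, and compress once at the end. Unlike precomp, this does not record the original compression options, so the inner .tgz files cannot be reproduced bit-identically (names and paths below are placeholders):

mkdir unpacked
for f in *.tgz; do                          # the same idea applies to rpm/deb after unpacking them
    mkdir "unpacked/${f%.tgz}"
    tar -xzf "$f" -C "unpacked/${f%.tgz}"
done
tar -cf - unpacked | xz -9 > packages.tar.xz   # compress only once, over the now-similar contents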

Related

Is there a way to add or update a file in a tar.gz (tgz) without decompressing it?

I'm looking for a way to update a tgz file.
I know that I can update a tgz file by decompressing it, inserting the file into the directory and re-compressing it.
But I do not want a decompressed file to be created on my disk.
Tar has an 'r' option (--append) that appends files to the end of an archive, but when I use tar with the 'r' option, the console logs 'cannot update compressed archive.' and 'Error is not recoverable: exiting now.'
If it is impossible, so be it, but if there is a way, please let me know.
I modified my question according to Basile Starynkevitch's comment.
This is the command used for testing:
tar -rf myCompressedFile.tgz insert.txt
Result
tar: Cannot update compressed archives
tar: Error is not recoverable: exiting now
vim myCompressedFile.tgz
./file1.txt
./file2.txt
./file3.txt
./directory1/
./directory1/file4.txt
What I want after updating:
./file1.txt
./file2.txt
./file3.txt
./directory1/
./directory1/file4.txt
./insert.txt <<<< I want to add a file like this
My tgz file is a bit more than 100 megabytes.
Is there a way to add or update a file in a tar.gz (tgz) without decompressing it?
No, there is no way, since gzip compression is applied to the entire tar archive (after the tar ball is made). Observe the runtime behavior of your tar command using strace(1) (so read syscalls(2)) or ltrace(1).
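For example, a minimal trace of the failing append (the output file name is arbitrary):

strace -f -o tar.trace tar -rf myCompressedFile.tgz insert.txt   # record every syscall tar makes
less tar.trace                                                   # inspect where it gives up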
Read the documentation of tar(1). Study the source code of GNU tar (it is free software).
On my Debian/Sid/x86-64, the libc6-dev package provides the /usr/include/tar.h header. I invite you to look inside that header file; it describes the format of tar balls.
Consider other approaches: some SQLite database, PostgreSQL, MongoDB or GDBM, or some afio archive (where each file is perhaps compressed before being archived). See also the tardy utility.
I'm looking for a way to update a tgz file
Why?
Did you consider instead using version control (something like git) on the underlying files before archiving them with tar?
But I do not want a decompressed file to be created on my disk.
Why?
The decompressed file might be a temporary one...
Of course, you would need different approaches depending on whether you are dealing with a few megabytes of data or a few petabytes. And of course, if you are developing a critical embedded application (e.g. medical devices, interplanetary satellite software, DO-178C software systems), things could be different.
There are lots of trade-offs to consider (including money, development time, the economic impact of losing data or corrupting the archive, and legal regulations regarding the archive and its data integrity: for a flight recorder you should ensure the data is readable after the crash of an aircraft).
My tgz file is a bit more than 100 megabytes.
This is tiny, and would practically fit in the page cache (on most Linux systems in 2020, even a cheap Raspberry Pi). You might use a file system kept in RAM (e.g. tmpfs).
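A minimal sketch of that tmpfs idea, assuming /dev/shm is a tmpfs mount with enough free space (true on most Linux distributions); the decompressed tarball then only ever exists in RAM:

mkdir /dev/shm/work
gunzip -c myCompressedFile.tgz > /dev/shm/work/archive.tar   # decompress into RAM, not onto disk
tar -rf /dev/shm/work/archive.tar insert.txt                 # append works on the plain tar
gzip -c /dev/shm/work/archive.tar > myCompressedFile.tgz     # recompress over the original
rm -r /dev/shm/work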

Strategy for compressing and navigating large compressed directories

I manage a computer cluster. It is a multi-user system. I have a large directory filled with files (terabytes in size). I'd like to compress it so the user who owns it can save space and still be able to extract files from it.
Challenges with possible solutions:
tar: The directory's size makes it challenging to decompress the subsequent tarball due to tar's poor random-access reads. I'm referring to the canonical way of compressing, i.e. tar cvzf mytarball.tar.gz mybigdir
squashfs: It appears that this would be a great solution, except that mounting it requires root access. I don't really want to be involved in mounting their squashfs file every time they want to access a file.
Compress then tar: I could compress the files first and then use tar to create the archive. This would have the disadvantage that I wouldn't save as much space with compression, and I wouldn't get back any inodes.
Similar questions (here) have been asked before, but the solutions are not appropriate in this case.
QUESTION:
Is there a convenient way to compress a large directory such that it is quick and easy to navigate and doesn't require root permissions?
You added zip to the tags but do not mention it in the question. For me, zip is the simplest way to manage big archives (with many files). Moreover, tar+gzip is really a two-step operation that needs special measures to speed up, and zip is available on a lot of platforms, so you also win in that direction.
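A minimal sketch of that zip workflow (directory and file names are placeholders):

zip -r mybigdir.zip mybigdir                 # create the archive; add -m to delete the originals and free their inodes
unzip -l mybigdir.zip                        # list the contents quickly via the central directory
unzip mybigdir.zip 'mybigdir/sub/somefile'   # extract a single file without unpacking the rest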

tar.gz alternative for archiving with the ability to quickly display the archive's contents

Usually I create archives of data on Linux at the command line with tar & gzip (or pigz, as it uses parallel processing for compression).
However, listing the contents of such an archive is painfully slow because of the sequential format of tar archives. This is especially true if an archive contains many files that are several GB each.
What is an alternative to this combination for creating gzipped tar archives of files on Linux? In particular, I'm looking for something that allows retrieval of the list or tree of files inside the archive, similar to tar, but much more performant.
zip? The zip file format contains a catalog of the contents (at the end, IIRC), which can be retrieved with zipinfo(1).
7zip is probably the best solution nowadays.
The ZIP format is a bit outdated and was designed for the FAT filesystem, which is where many of its limitations come from.
dar might also be an option, but as far as I can tell there is only one developer and no community around it (unlike 7zip, which has several forks and ports made by independent developers).
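For illustration, the corresponding listing/extraction commands (archive and path names are placeholders; both formats keep an index, so listing does not scan the whole archive):

7z a archive.7z mydata            # create a 7z archive
7z l archive.7z                   # list its contents
7z x archive.7z mydata/somefile   # extract a single entry
zipinfo archive.zip               # the zip catalog lookup mentioned above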

convert tar to zip with stdout/stdin

How can I convert a tar file to zip using stdout/stdin?
zip's -@ option takes a list of file names from stdin; tar -t provides such a list but doesn't actually extract the files. Using -xv gives me a list of files but extracts them to disk, and I'm trying to avoid touching the disk.
Something like the following, but obviously the command below will not produce the same file structure.
tar -xf somefile.tar -O | zip somefile.zip -
I can do it by temporarily writing files to disk but I'm trying to avoid that - I'd like to use pipes only.
Basically, I'm not aware of a "standard" utility that does the conversion on the fly.
But a similar question has been discussed over here using Python libraries; the script has a simple structure, so you can probably adapt it to your requirements.
I have never heard of a tool that performs the direct conversion. As a workaround, if your files are not extremely huge, you can extract the tar archive into a tmpfs-mounted directory (or similar), that is, a filesystem that resides completely in memory and does not touch the disk. It should be much faster than extracting into a folder on disk.
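A minimal sketch of that tmpfs workaround, assuming /dev/shm is a tmpfs mount with enough room for the extracted files:

out="$PWD/somefile.zip"
mkdir /dev/shm/tar2zip
tar -xf somefile.tar -C /dev/shm/tar2zip    # the "extraction to disk" never leaves RAM
(cd /dev/shm/tar2zip && zip -qr "$out" .)   # repack as zip with the same relative paths
rm -r /dev/shm/tar2zip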

Is there a way to make zip or other compressed files that extract more quickly?

I'd like to know if there's a way to make a zip file, or any other compressed file (tar, gz, etc.), that will extract as quickly as possible. I'm just trying to move one folder to another computer, so I'm not concerned with the size of the file. However, I'm zipping up a large folder (~100 MB), and I was wondering if there's a method to extract a zip file more quickly, or if another format can decompress files more quickly.
Thanks!
The short answer is that compression is always a trade-off between speed and size, i.e. faster compression usually means a larger file - but unless you're using floppy disks to transfer the data, the time you gain by using a faster compression method costs you more network time to haul the data about. Having said that, the speed and compression ratio of different methods vary depending on the structure of the file(s) you are compressing.
You also have to consider the availability of software - is it worth spending the time downloading and compiling a compression program? I guess if it's worth the time waiting for an answer here, then either you're using an RFC1149 network or you're going to be doing this a lot.
In which case the answer is simple: test the programs yourself using a representative dataset.
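For example, a quick benchmark sketch on a representative copy of the folder (names are placeholders; time both directions, since decompression speed is what matters here):

mkdir -p /tmp/out
time tar -czf test.tgz myfolder ;  time tar -xzf test.tgz -C /tmp/out    # gzip
time tar -cJf test.txz myfolder ;  time tar -xJf test.txz -C /tmp/out    # xz
time zip -qr test.zip myfolder  ;  time unzip -qo test.zip -d /tmp/out   # zip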

Resources