Zip / 7zip Compression Differences

I have around 130 zip files that I need to distribute to users. Each zip file contains a number of similar text, html, xml, and jpg files. Together the zip files come to 146 MB; unzipped, their contents total 551 MB.
I want to distribute all these files together to users in as small a package as possible. I looked into two ways of doing it, each tried with two different compression schemes, zip and 7zip (which I understand uses LZMA or a variant thereof):
1. Compress all the zip files into a single compressed file and send that file (single.zip / single.7z)
2. Compress the unzipped contents of the zip files into a single compressed file and send that file (combined.zip / combined.7z)
For example, say that I have 3 zip files, A.zip, B.zip and C.zip, each of which contains one text file, one html file, and one XML file. With method 1, a single compressed file would be created containing A.zip, B.zip and C.zip. With method 2, a single compressed file would be created containing A.txt, A.html, A.xml, B.txt, B.html, B.xml, C.txt, C.html, and C.xml.
My assumption was that under either compression scheme, the file generated by method 2 would be smaller or at least the same size as the file generated by method 1, as you might be able to exploit efficiencies by considering all the files together. At the very least, method 2 would avoid the overhead of multiple zip files.
The surprising results (the sizes of files generated by the 7zip tool) were as follows:
single.zip - 142 MB
single.7z - 124 MB
combined.zip - 149 MB
combined.7z - 38 MB
I'm not surprised that the 7zip format produced smaller files than the zip format (results 2 and 4 vs. results 1 and 3), as it generally compresses better than zip. What was surprising was that for the zip format, compressing all 130 zip files together produced a smaller output file than compressing all of their uncompressed contents (result 1 vs. result 3).
Why is it more efficient to zip several zip files together, than to zip their unzipped contents together?
The only thing I can think of is that during compression, the 7zip format builds a dictionary across all the file contents, so it can exploit similarities between files, while the zip format builds its dictionary per file. Is that true? And even that still doesn't explain why result 3 was 7 MB larger than result 1.
Thanks for your help.

Both .zip and .7z are lossless compression formats. .7z is newer and is likely to give you a better compression ratio, but it's not as widely supported as .zip, and I think it's somewhat more computationally expensive to compress/decompress.
How much better depends on the types of files you are compressing, but according to the Wikipedia article on 7-Zip:
In 2011, TopTenReviews found that the 7z compression was at least 17% better than ZIP, and 7-Zip's own site has since 2002 reported that while compression ratio results are very dependent upon the data used for the tests, "Usually, 7-Zip compresses to 7z format 30–70% better than to zip format, and 7-Zip compresses to zip format 2–10% better than most other zip-compatible programs."

Why is it more efficient to zip several zip files together, than to zip their unzipped contents together?
Your assumption is correct: 7zip uses solid compression, which zip does not, and it works much like your dictionary idea: common parts of different files are combined into one 'block', which reduces the overall size.
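To see the effect in miniature, here is a sketch using only Python's standard library, with tar + LZMA standing in for 7z's solid mode and a handful of fabricated files that share most of their bytes (an assumption for illustration, not your data set):

# A minimal sketch of why a "solid" archive wins when files share content.
# zip compresses each entry with its own independent deflate stream; tar + LZMA
# compresses one continuous stream, so repetition *between* files is exploited.
import io
import lzma
import os
import tarfile
import zipfile

# Ten files that are 50 KB of identical (incompressible) bytes plus a unique tail.
common = os.urandom(50_000)
files = {f"doc_{i}.txt": common + str(i).encode() for i in range(10)}

# (a) zip: every entry gets its own deflate stream, so bytes shared between
# files cannot be exploited.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# (b) "solid": concatenate everything into one tar, then LZMA over the whole
# stream, so the repeated 50 KB block only has to be encoded once.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
solid = lzma.compress(tar_buf.getvalue())

print("zip, per-entry deflate:", len(zip_buf.getvalue()), "bytes")
print("tar + lzma, solid:     ", len(solid), "bytes")

With files like these, the solid archive should come out close to the size of a single file, because the shared block is encoded only once, while the per-entry zip stays close to the combined input size.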

Related

Why do some compression algorithms not change the size of the archive file, and some do?

This is a list of file sizes that were compressed. They all originally come from /dev/urandom, so they are fairly pseudorandom. The yellow file is the original size (the same for all of them).
The pink files are the compressed sizes; there are 30 in total. The blue files are the compressed sizes of the pink files after they have been randomly shuffled using Python's shuffle command. The compression algorithms were LZMA, BZ2, and ZLIB respectively, from Python's built-in modules.
You can see that there are three general blocks of output sizes. Two blocks show a size difference of exactly zero, whilst the other block shows random-looking file size differences, as I'd expect due to arbitrary runs and correlations that appear when files are randomly shuffled. The BZ2 files changed by up to 930 bytes. I find it mystifying that the LZMA and ZLIB algorithms produced exactly the same file sizes. It's as if there is some code inside specifically looking for no effect.
Q: Why do the BZ2 files significantly change size, whilst shuffling the LZMA and ZLIB files has no effect on their compressibility?
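For reference, a minimal sketch of the experiment as described (the blob size, and shuffling the compressed bytes before re-compressing, are assumptions about the poster's setup):

import bz2
import lzma
import os
import random
import zlib

original = os.urandom(1_000_000)      # stand-in for a file from /dev/urandom

for name, compress in (("lzma", lzma.compress),
                       ("bz2", bz2.compress),
                       ("zlib", zlib.compress)):
    packed = compress(original)       # "pink": compressed size of the random file
    shuffled = bytearray(packed)
    random.shuffle(shuffled)          # destroy any ordering in the compressed bytes
    repacked = compress(bytes(shuffled))   # "blue": compressed size after shuffling
    print(f"{name}: before={len(packed)}  after-shuffle={len(repacked)}  "
          f"delta={len(repacked) - len(packed)}")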

Efficient re-compression method: compress existing files at a higher compression level

We have lots of gz (single-file) and tar.gz (compressed-folder) archives in different folders.
Most of them were compressed at -5 or -9. I was thinking of re-compressing them at the highest compression setting (-11). What is the best solution? (We are limited in storage space but have plenty of CPU power.)
I tried extracting all the gz files in a folder, but we don't have enough free space to do it.
What is the best approach?
Simply do one gz file at a time.
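A minimal sketch of that approach (not a drop-in tool; the /data root is a placeholder, and gzip level 9 stands in for whatever stronger "-11" compressor you actually use). Each archive is streamed into a temporary file and swapped in place before the next one is touched, so the extra disk space needed at any moment is roughly one re-compressed copy of a single file:

import gzip
import os
import shutil
from pathlib import Path

CHUNK = 1024 * 1024                      # stream in 1 MiB chunks; memory stays flat

def recompress_one(path: Path) -> None:
    tmp = path.with_name(path.name + ".tmp")
    with gzip.open(path, "rb") as src, gzip.open(tmp, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst, CHUNK)
    if tmp.stat().st_size < path.stat().st_size:
        os.replace(tmp, path)            # atomic swap, keeps the original name
    else:
        tmp.unlink()                     # no gain; keep the original as-is

for gz in Path("/data").rglob("*.gz"):   # also matches .tar.gz
    recompress_one(gz)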

Zip File - is it possible to paginate through the file data?

Say I have a really large zip file (80 GB) containing one massive CSV file (>200 GB).
Is it possible to fetch a subsection of the 80GB file data, modify the central directory, and extract just that bit of data?
Background on my problem:
I have a cyclic process that does a summing on a certain column of a large zipped CSV file stashed in the cloud.
What I do today is stream the file to my disk, extract it, and then read the extracted file line by line. That makes it a very disk-bound operation. Disk IS the bottleneck, for sure.
Sure, I can leverage other cloud services to get what I need faster but that is not free.
I'm curious whether I can see speed gains by just taking 1 GB subsections of the zip until there's nothing left to read.
What I know:
The Zip file is stored using the deflate compression algorithm (always)
In the API I use to get the file from the cloud, I can specify a byte range to filter to. This means I can seek through the bytes of a file without hitting disk!
According to the zip file spec, there are three major parts to a zip file, in order:
1: A header describing the file and its attributes
2: The raw file data in deflated format
3: The central directory listing where each file starts and stops, and at which bytes (see the sketch below)
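For what it's worth, a minimal sketch of what part 3 enables over ranged reads (the ranged_get helper is hypothetical, and the ZIP64 records an archive this large would actually use are skipped):

import struct

def locate_central_directory(tail: bytes):
    """tail = the last ~64 KB of the zip, fetched with a byte-range request."""
    eocd = tail.rfind(b"PK\x05\x06")                     # End Of Central Directory signature
    if eocd < 0:
        raise ValueError("EOCD not found; fetch a larger tail (or handle ZIP64)")
    (_disk, _cd_disk, _n_disk, n_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<HHHHIIH", tail, eocd + 4)
    # cd_offset / cd_size tell you which byte range to request next; the central
    # directory then gives every entry's local-header offset and compressed size.
    return cd_offset, cd_size, n_entries

# usage sketch, with a hypothetical ranged reader:
# tail = ranged_get(url, start=file_size - 65_536, end=file_size - 1)
# cd_offset, cd_size, n_entries = locate_central_directory(tail)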
What I don't know:
How the deflate algorithm works exactly. Does it jumble the file up, or does it just compress things in the order of the original file? If it does jumble, this approach may not be possible.
Has anyone built a tool like this already?
You can always decompress starting from the beginning, going as far as you like and keeping only the last, say, 1 GB once you get to where you want. You cannot just start decompressing somewhere in the middle, at least not with a normal .zip file that has not been specially prepared for random access.
The central directory has nothing to do with random access within a single entry. All it can do is tell you where an entry starts and how long it is (both compressed and uncompressed).
I would recommend that you reprocess the .zip file into a .zip file with many (~200) entries, each on the order of 1 GB uncompressed. The resulting .zip file will be very close to the same size, but you can then use the central directory to pick one of the 200 entries, randomly access it, and decompress just that one.
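For illustration, a minimal sketch of consuming such a re-packed archive (the archive name, entry index and column name are hypothetical): the central directory lets you open exactly one roughly 1 GB entry and stream just its rows.

import csv
import io
import zipfile

ARCHIVE = "bigdata_repacked.zip"   # hypothetical re-packed archive (~200 entries)
COLUMN = "amount"                  # hypothetical column to sum

total = 0.0
with zipfile.ZipFile(ARCHIVE) as zf:
    # namelist() comes straight from the central directory; picking one entry
    # does not require decompressing any of the others.
    part = zf.namelist()[17]       # e.g. process only part 17 on this cycle
    with zf.open(part) as raw:
        reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
        for row in reader:
            total += float(row[COLUMN])
print(part, total)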

Will gzip'd files use multiple AUs in Data Lake Analytics?

In the EXTRACT documentation there's the (awesome) auto-magic support for gzipped files (which we are using).
But should I assume it won't use more than one AU? If I understand correctly, the files need to be "splittable" to be spread across AUs.
Or will it split across AUs once extracted on the fly, and/or do gzipped files have an index to indicate where they can somehow be split?
Or perhaps I'm muddling the vertex concept with AUs?
This is a good question :).
In general, if the file format is splittable (e.g., basically row-oriented, with rows smaller than the row-size limit, which is currently 4 MB), then large files will be split at about 1 GB per vertex.
However, GZip itself is not a splittable format. Thus we cannot split a GZip file during decompression, and we end up not splitting the processing of the decompressed file either (the current framework does not provide this). As a consequence, we limit the size of a GZip file to 4 GB. If you want to scale out with GZip files, we recommend splitting the data into several GZip files and then using file sets to scale out processing.
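A minimal sketch of the splitting step (the file names and rows-per-part figure are assumptions; tune the part size so each .gz stays comfortably under the 4 GB limit):

import gzip
from pathlib import Path

SOURCE = Path("big_extract.csv")       # hypothetical large row-oriented file
ROWS_PER_PART = 2_000_000              # adjust so each part stays well under 4 GB

def open_part(n):
    return gzip.open(f"extract_part{n:03}.csv.gz", "wt", encoding="utf-8")

part_num, rows = 0, 0
out = open_part(part_num)
with SOURCE.open("r", encoding="utf-8") as src:
    for line in src:                   # if the CSV has a header row, repeat it per part
        if rows >= ROWS_PER_PART:
            out.close()
            part_num, rows = part_num + 1, 0
            out = open_part(part_num)
        out.write(line)
        rows += 1
out.close()

The resulting extract_part*.csv.gz files can then be referenced together as a file set in the EXTRACT statement so the parts are processed in parallel.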

Is the compression used for gzip and zip binary compatible?

I've got a lot of files which are being produced directly in gzip format. I'd like to re-package these into zip archives, grouping many files together.
Given the size and volume of these files (0.5GB uncompressed per file and thousands of files) I really want to avoid decompressing these only to re-compress them if the only technical difference is the file headers.
Given that both file formats use/support the Deflate algorithm, is it possible for me to simply strip the header off a gzip file and patch the compressed data into a zip file without re-encoding (adding an appropriate zip header entry instead)? Or is there a difference in the binary encoding of the compressed data?
Yes, you can use the compressed data from gzip files directly to construct a zip file.
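For illustration, a minimal sketch of that repackaging (not production code): it assumes single-member gzip files and sizes under 4 GiB (no ZIP64 handling), and it carries the raw deflate stream, CRC-32 and uncompressed size from the gzip member straight into a zip local header and central directory entry without re-encoding anything.

import struct

def split_gzip(blob: bytes):
    """Return (deflate_data, crc32, uncompressed_size) from a single-member gzip file."""
    assert blob[:2] == b"\x1f\x8b" and blob[2] == 8     # gzip magic + deflate method
    flg = blob[3]
    pos = 10                                            # fixed 10-byte header
    if flg & 0x04:                                      # FEXTRA
        xlen, = struct.unpack_from("<H", blob, pos)
        pos += 2 + xlen
    if flg & 0x08:                                      # FNAME (NUL-terminated)
        pos = blob.index(b"\x00", pos) + 1
    if flg & 0x10:                                      # FCOMMENT (NUL-terminated)
        pos = blob.index(b"\x00", pos) + 1
    if flg & 0x02:                                      # FHCRC
        pos += 2
    crc32, isize = struct.unpack("<II", blob[-8:])      # gzip trailer: CRC-32, size mod 2**32
    return blob[pos:-8], crc32, isize

def gz_to_zip(gz_paths, zip_path):
    central, offset = b"", 0
    with open(zip_path, "wb") as out:
        for path in gz_paths:
            with open(path, "rb") as f:
                deflated, crc, size = split_gzip(f.read())
            name = path.rsplit("/", 1)[-1].removesuffix(".gz").encode()
            # local file header: version 2.0, no flags, method 8 (deflate), zeroed timestamps
            fixed = struct.pack("<HHHHHIIIHH", 20, 0, 8, 0, 0,
                                crc, len(deflated), size, len(name), 0)
            out.write(b"PK\x03\x04" + fixed + name + deflated)
            # matching central directory entry, remembering where the local header started
            central += (b"PK\x01\x02" + struct.pack("<HHHHHHIIIHHHHHII",
                        20, 20, 0, 8, 0, 0, crc, len(deflated), size,
                        len(name), 0, 0, 0, 0, 0, offset) + name)
            offset += 4 + len(fixed) + len(name) + len(deflated)
        out.write(central)
        # end-of-central-directory record
        out.write(b"PK\x05\x06" + struct.pack("<HHHHIIH", 0, 0,
                  len(gz_paths), len(gz_paths), len(central), offset, 0))

# usage: gz_to_zip(["logs/a.csv.gz", "logs/b.csv.gz"], "combined.zip")

Because the deflate payload is carried over byte for byte, standard zip tools should be able to read the result; only the container metadata changes.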
