How do you prepare deflate streams for PIGZ (parallel gzip)?

I am using the PIGZ library. https://zlib.net/pigz/
I compressed large files using multiple threads per file with this library and now I want to decompress those files using multiple threads per file too. As per the documentation:
Decompression can’t be parallelized, at least not without specially
prepared deflate streams for that purpose.
However, the documentation doesn't specify how to do that, and I'm finding it difficult to find information on this.
How would I create these "specially prepared deflate streams" that PIGZ can utilise for decompression?

pigz does not currently support parallel decompression, so it wouldn't help to specially prepare such a deflate stream.
The main reason this has not been implemented is that, in most situations, decompression is fast enough to be i/o bound, not processor bound. This is not the case for compression, which can be much slower than decompression, and where parallel compression can speed things up quite a bit.
You could write your own parallel decompressor using zlib and pthread. pigz 2.3.4 and later will in fact make a specially prepared stream for parallel decompression by using the --independent (-i) option. That makes the blocks independently decompressible, and puts two sync markers in front of each to make it possible to find them quickly by scanning the compressed data. The uncompressed size of a block is set with --blocksize or -b. You might want to make that size larger than the default, e.g. 1M instead of 128K, to reduce the compression impact of using -i. Some testing will tell you how much your compression is reduced by using -i.
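A minimal sketch of what that looks like on the command line, assuming a pigz new enough to support -i, and assuming -b takes the block size in KiB (the file name is just a placeholder):
# compress with independently decompressible blocks of roughly 1 MiB each
pigz --independent --blocksize 1024 bigfile
# short form of the same thing:
# pigz -i -b 1024 bigfile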
(By the way, pigz is not a library, it is a command-line utility.)

Related

Decompressing large files using multiple computers

We have to deal with extracting gzip/bzip files downloaded over the internet; sometimes they are well over multiple gigabytes (e.g. a 15 GB wiki dump).
Is there a way those can be extracted by multiple computers instead of one? Perhaps each node in the cluster could read the header plus the bytes between X and Y and write the result into a shared folder?
Or is there any other way to accelerate the process?
Have you considered using a parallelized alternative to gzip/bzip?
In the scenario that you are using bzip, pbzip2 is a parallelized alternative that uses pthreads to speed up the (de)compression. In addition, a parallel alternative to gzip is pgzip.
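As a rough sketch (the file name and thread count are placeholders, and note that pbzip2 parallelises decompression best on archives that were themselves created by pbzip2):
# decompress using 8 threads
pbzip2 -d -p8 dump.bz2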

external multithreading sort

I need to implement an external multithreaded sort. I don't have experience with multithreaded programming, and I'm not sure whether my algorithm is good enough; I also don't know how to complete it. My idea is:
A thread reads the next block of data from the input file
Sorts it using a standard algorithm (std::sort)
Writes it to another file
After this I have to merge these files. How should I do that?
If I wait until the input file has been entirely processed before merging, I end up with a lot of temporary files.
If I try to merge files straight after sorting, I can't come up with an algorithm that avoids merging files of very different sizes, which would lead to O(N^2) complexity.
Also, I suppose this is a very common task; however, I cannot find a good ready-made algorithm on the internet. I would be very grateful for such a link, especially with a C++ implementation.
Well, the answer isn't that simple, and it actually depends on many factors, amongst them the number of items you wish to process, and the relative speed of your storage system and CPUs.
But the question is why you would use multithreading here at all. Is the data too big to be held in memory? Are there so many items that even a quicksort can't sort them fast enough? Do you want to take advantage of multiple processors or cores? I don't know.
I would suggest that you first write some test routines to measure the time needed to read and write the input file and the output files, as well as the CPU time needed for sorting. Please note that I/O is generally A LOT slower than CPU execution (they aren't really comparable), and I/O may not be efficient if you read data in parallel: there is one disk head which has to move in and out, so reads are in effect serialized (even a more modern drive is still one device, with its own input and output channels). That is, the additional overhead of reading and writing temporary files may more than cancel out any benefit from multithreading.
So I would say: first try an algorithm that reads the whole file into memory, sorts it, and writes it out, and put in some time counters to check their relative speeds. If I/O is some 30% of the total time (yes, that little!), it's definitely not worth it, because with all the reading, merging, and writing of temporary files that share will rise a lot more, so a solution that processes the whole data set at once would be preferable.
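A crude way to get those relative numbers before writing any threading code, assuming a large test file named input.dat and using GNU sort as a rough stand-in for your own sorting routine:
time dd if=input.dat of=/dev/null bs=1M    # pure sequential read time
time sort -S 2G -o /dev/null input.dat     # single-threaded sort time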
To conclude, I don't see why you would use multithreading here; the only reason, in my opinion, would be if the data actually arrives in blocks, but even then take into account my considerations above about relative I/O and CPU speeds and the additional overhead of reading and writing the temporary files. And a hint: your file access must be very efficient, e.g. reading and writing in larger blocks using application buffers rather than item by item (this saves on system calls); otherwise it may have a detrimental effect, especially if the file(s) are stored on a machine other than yours (e.g. a server).
Hope you find my suggestions useful.

When not to do maximum compression in png?

Intro
When saving PNG images through GIMP, I've always used level 9 (maximum) compression, since I knew it's lossless. Now I have to specify the compression level when saving a PNG image through the GD extension of PHP.
Question
Is there any case when I shouldn't compress a PNG to the maximum level? Any compatibility issues, for example? If there's no problem, why ask the user at all; why not automatically compress to the maximum?
Each additional level of PNG compression requires significantly more memory and processing power to compress (and, to a lesser degree, to decompress).
There is a rapid tail-off in the compression gains from each level, however, so choose one that balances the web server resources available for compression against your need to reduce bandwidth.
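As a rough illustration of that tail-off: PNG uses zlib/deflate internally, so running gzip at a few levels over any large file (the file name here is just a placeholder) gives a feel for the curve of size versus level:
for lvl in 1 3 6 9; do
  echo -n "level $lvl: "
  gzip -"$lvl" -c large-image.ppm | wc -c   # compressed size in bytes
done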

Why is *.tar.gz still much more common than *.tar.xz? [closed]

Whenever I see source packages or binaries compressed with gzip, I wonder if there are still reasons to favor gz over xz (excluding time travel to 2000): the savings of the LZMA compression algorithm are substantial, and decompression isn't magnitudes worse than gzip's.
"Lowest Common Denominator". The extra space saved is rarely worth the loss of interoperability. Most embedded Linux systems have gzip, but not xz. Many old system as well. Gnu Tar which is the industry standard supports flags -z to process through gzip, and -j to process through bzip2, but some old systems don't support the -J flag for xz, meaning it requires 2-step operation (and a lot of extra diskspace for uncompressed .tar unless you use the syntax of |tar xf - - which many people don't know about.) Also, uncompressing the full filesystem of some 10MB from tar.gz on embedded ARM takes some 2 minutes and isn't really a problem. No clue about xz but bzip2 takes around 10-15 minutes. Definitely not worth the bandwidth saved.
The ultimate answer is accessibility, with a secondary answer of purpose. Reasons why XZ is not necessarily as suitable as Gzip:
Embedded and legacy systems are far more likely to lack sufficient available memory to decompress LZMA/LZMA2 archives such as XZ. As an example, if XZ can shave 400 KiB (vs. Gzip) off of a package destined for an OpenWrt router, what good is the minor space savings if the router has 16 MiB of RAM? A similar situation appears with very old computer systems. One might scoff at the thought of downloading and compiling the latest version of Bash on an ancient SparcStation LX with 32MB of RAM, but it happens.
Such systems usually have slow processors, and decompression time increases can be very high. Three seconds extra to decompress on your Core i5 can be severely long on a 200 MHz ARM core or a 50 MHz microSPARC. Gzip compression is extremely fast on such processors when compared to all better compression methods such as XZ or even Bzip2.
Gzip is pretty much universally supported by every UNIX-like system (and nearly every non-UNIX-like system too) created in the past two decades. XZ availability is far more limited. Compression is useless without the ability to decompress it.
Higher compression takes a lot of time. If compression time is more important than compression ratio, Gzip beats XZ. Honestly, lzop is much faster than Gzip and still compresses okay, so applications that need the fastest compression possible and don't require Gzip's ubiquity should look at that instead. I routinely shuffle folders quickly across a trusted LAN connection with commands such as "tar -c * | lzop -1 | socat -u - tcp-connect:192.168.0.101:4444" and Gzip could be used similarly over a much slower link (i.e. doing the same thing I just described through an SSH tunnel over the Internet).
Now, on the flip side, there are situations where XZ compression is vastly superior:
Sending data over slow links. The Linux 3.7 kernel source code is 34 MiB smaller in XZ format than in Gzip format. If you have a super fast connection, choosing XZ could mean saving one minute of download time; on a cheap DSL connection or a 3G cellular connection, it could shave an hour or more off the download time.
Shrinking backup archives. Compressing the source code for Apache's httpd-2.4.2 with "gzip -9" vs. "xz -9e" yields an XZ archive that is 62.7% the size of the Gzip archive. If the same compressibility exists in a data set you currently store as 100 GiB worth of .tar.gz archives, converting to .tar.xz archives would cut a whopping 37.3 GiB off of the backup set. Copying this entire backup data set to a USB 2.0 hard drive (maxing out around 30 MiB/sec transfers) as Gzipped data would take 55 minutes, but XZ compression would make the backup take 20 minutes less. Assuming you'll be working with these backups on a modern desktop system with plenty of CPU power and the one-time-only compression speed isn't a serious problem, using XZ compression generally makes more sense. Why shuffle around extra data if you don't need to?
Distributing large amounts of data that might be highly compressible. As previously mentioned, Linux 3.7 source code is 67 MiB for .tar.xz and 101 MiB for .tar.gz; the uncompressed source code is about 542 MiB and is almost entirely text. Source code (and text in general) is typically highly compressible because of the amount of redundancy in the contents, but compressors like Gzip that operate with a much smaller dictionary don't get to take advantage of redundancy that goes beyond their dictionary size.
Ultimately, it all falls back to a four-way tradeoff: compressed size, compression/decompression speed, copying/transmission speed (reading the data from disk/network), and availability of the compressor/decompressor. The selection is highly dependent on the question "what are you planning to do with this data?"
Also check out this related post from which I learned some of the things I repeat here.
From the author of the Lzip compression utility:
Xz has a complex format, partially specialized in the compression of
executables and designed to be extended by proprietary formats. Of the
four compressors tested here, xz is the only one alien to the Unix
concept of "doing one thing and doing it well". It is the less
appropriate for data sharing, and not appropriate at all for long-term
archiving.
In general, the more complex the format, the less probable that it can
be decoded in the future. But the xz format, just as its infamous
predecessor lzma-alone, is specially badly designed. Xz copies almost
all the defects of gzip and then adds some more, like the fragile
variable-length integers. Just one bit-flip in bit 7 of any byte of
one variable-length integer and the whole xz stream comes tumbling
down like a house of cards. Using xz for anything other than
compressing short-lived executables is not advisable.
Don't interpret me wrong. I am very grateful to Igor Pavlov for
inventing/discovering LZMA, but xz is the third attempt of his
followers to take advantage of the popularity of 7zip and replace gzip
and bzip2 with inappropriate or badly designed formats. In particular,
it is shameful that support for lzma-alone was implemented in both GNU
and Linux.
http://www.nongnu.org/lzip/lzip_benchmark.html
I did my own benchmark on a 1.1 GB Linux installation vmdk image:
rar =260MB comp= 85s decomp= 5s
7z(p7z)=269MB comp= 98s decomp=15s
tar.xz =288MB comp=400s decomp=30s
tar.bz2=382MB comp= 91s decomp=70s
tar.gz =421MB comp=181s decomp= 5s
All compression levels at max. CPU: Intel i7-3740QM, memory: 32 GB 1600, source and destination on a RAM disk.
I generally use rar or 7z for archiving normal files like documents.
For archiving system files I use .tar.gz or .tar.xz, via file-roller or tar with the -z or -J option along with --preserve, to compress natively with tar and preserve permissions (alternatively, .tar.7z or .tar.rar can be used).
Update: since tar only preserves normal permissions and not ACLs anyway, plain .7z plus backing up and restoring permissions and ACLs manually via getfacl and setfacl can also be used. That seems to be the best option for both file archiving and system-file backup, because it fully preserves permissions and ACLs, and offers checksums, integrity testing, and encryption; the only downside is that p7zip is not available everywhere.
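A sketch of that getfacl/setfacl approach (paths and archive name are placeholders):
getfacl -pR /srv/data > data.acl   # save permissions and ACLs; -p keeps absolute paths so --restore works from anywhere
7z a data.7z /srv/data             # archive the files themselves
# ...later, after extracting with "7z x data.7z":
setfacl --restore=data.acl         # put the permissions and ACLs back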
Honestly, I only just got to know the .xz format from some training material, so I used its git repository to run a test. The repository is git://git.free-electrons.com/training-materials.git, and I also compiled the three sets of training slides. The total directory size is 91 MB, with a mixture of text and binary data.
Here are my quick results. Maybe people still favor tar.gz simply because it's much faster to compress? I personally even use plain tar when there isn't much to be gained from compression.
[02:49:32]wujj#WuJJ-PC-Linux /tmp $ time tar czf test.tgz training-materials/
real 0m3.371s
user 0m3.208s
sys 0m0.128s
[02:49:46]wujj#WuJJ-PC-Linux /tmp $ time tar cJf test.txz training-materials/
real 0m34.557s
user 0m33.930s
sys 0m0.372s
[02:50:31]wujj#WuJJ-PC-Linux /tmp $ time tar cf test.tar training-materials/
real 0m0.117s
user 0m0.020s
sys 0m0.092s
[02:51:03]wujj#WuJJ-PC-Linux /tmp $ ll test*
-rw-rw-r-- 1 wujj wujj 91944960 2012-07-09 02:51 test.tar
-rw-rw-r-- 1 wujj wujj 69042586 2012-07-09 02:49 test.tgz
-rw-rw-r-- 1 wujj wujj 60609224 2012-07-09 02:50 test.txz
[02:56:03]wujj#WuJJ-PC-Linux /tmp $ time tar xzf test.tgz
real 0m0.719s
user 0m0.536s
sys 0m0.144s
[02:56:24]wujj#WuJJ-PC-Linux /tmp $ time tar xf test.tar
real 0m0.189s
user 0m0.004s
sys 0m0.108s
[02:56:33]wujj#WuJJ-PC-Linux /tmp $ time tar xJf test.txz
real 0m3.116s
user 0m2.612s
sys 0m0.184s
For the same reason people on Windows use zip files instead of 7zip, and some still use rar instead of other formats... or mp3 is used for music instead of AAC+, and so on.
Each format has its benefits, and people tend to stick with the solution they learned when they began using a computer. Add to this backward compatibility, fast bandwidth, and gigabytes or terabytes of hard-drive space, and the benefits of greater compression won't seem that relevant.
gz is supported everywhere and good for portability.
xz is newer and not as widely or well supported. It is more complex than gzip, with more compression options.
This is not the only reason people might not always use xz, though. xz can take a very long time to compress (not a trivial amount of time), so even if it can produce superior results it might not always be chosen. Another weakness is that it can use a lot of memory, especially for compression. The harder you try to compress something, the longer it takes, with sharply diminishing returns.
However, at compression level 1, for large binary items, in my experience xz can often produce much smaller results in less time than zlib at level 9. This can sometimes be a very significant difference: in the same time as zlib, xz can make a file that is half the size of zlib's.
bzip2 is in a similar situation, but xz has clear advantages over it and a solid range of cases where it performs significantly better all round.
Also, one important point for gzip is that it's interoperable with rsync/zsync. This can be a huge bandwidth benefit in some cases. LZMA/bzip2/xz don't support rsync-friendly output and probably won't any time soon.
One of the characteristics of LZMA is that it uses quite a large window. To make it rsync/zsync-friendly we would probably need to reduce this window, which would degrade its compression performance.
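The rsync-friendly behaviour referred to here usually comes from the --rsyncable option (originally a Debian patch to gzip, and also present in newer GNU gzip releases and in pigz); whether it is available on your system is an assumption worth checking. It periodically resets the compressor so unchanged input keeps producing unchanged compressed blocks that rsync's rolling checksum can match:
gzip --rsyncable -c big.tar > big.tar.gz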
Yeah, the thought I had is that the original question could be reposed these days as "why is tar.gz more common than tar.lz?" (since lz seems to compress slightly better than xz, and xz is said to be a poor choice for archiving, though it does offer some nice features like random access). I suppose the answer is "momentum": people are used to using it, there's good library support, etc. The introduction of lz may also mean that xz will grow less quickly now, FWIW...
That being said, lz appears to decompress more slowly than xz, and there are new things on the horizon like Brotli, so it's unclear what will happen in terms of popularity... but I have seen a few .lz files in the wild, FWIW.

Is there a way to make zip or other compressed files that extract more quickly?

I'd like to know if there's a way to make a zip file, or any other compressed file (tar, gz, etc.), that will extract as quickly as possible. I'm just trying to move one folder to another computer, so I'm not concerned with the size of the file. However, I'm zipping up a large folder (~100 MB), and I was wondering if there's a method to extract a zip file more quickly, or if another format can decompress files faster.
Thanks!
The short answer is that compression is always a trade-off between speed and size, i.e. faster compression usually means less compression (a larger file). But unless you're using floppy disks to transfer the data, the time you gain by using a faster compression method means more network time to haul the data about. Having said that, the speed and compression ratio of different methods vary depending on the structure of the file(s) you are compressing.
You also have to consider the availability of software: is it worth spending the time downloading and compiling a compression program? I guess if it's worth the time waiting for an answer here, then either you're using an RFC 1149 network or you're going to be doing this a lot.
In which case the answer is simple: test the programs yourself using a representative dataset.
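In practice that test can be a few lines of shell against a representative copy of your data (the folder and file names are placeholders); compare the sizes against the decompression timings:
tar cf - myfolder | gzip -9  > t.tgz ; time gzip  -dc t.tgz > /dev/null ; ls -l t.tgz
tar cf - myfolder | bzip2 -9 > t.tbz ; time bzip2 -dc t.tbz > /dev/null ; ls -l t.tbz
tar cf - myfolder | xz -6    > t.txz ; time xz    -dc t.txz > /dev/null ; ls -l t.txz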
