Is there a way to make zip or other compressed files that extract more quickly? - zip

I'd like to know if there's a way to make a zip file, or any other compressed file (tar, gz, etc.) that will extract as quickly as possible. I'm just trying to move one folder to another computer, so I'm not concerned with the size of the file. However, I'm zipping up a large folder (~100 MB), and I was wondering if there's a method to extract a zip file quicker, or if another format can decompress files more quickly.
Thanks!

The short answer is that compression is always a trade-off between speed and size, i.e. faster compression usually means larger files - but unless you're using floppy disks to transfer the data, the time you gain by using a faster compression method means more network time to haul the data about. Having said that, the speed and compression ratio of different methods vary depending on the structure of the file(s) you are compressing.
You also have to consider availability of software - is it worth spending the time downloading and compiling a compression program? I guess if it's worth the time waiting for an answer here then either you're using an RFC1149 network or you're going to be doing this a lot.
In which case the answer is simple: test the programs yourself using a representative dataset.
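If you do end up benchmarking, even a trivial harness is enough. As a sketch, the following C++ program times extraction of the same data compressed with different tools; the archive names and extraction commands are placeholders, and the output directories are assumed to exist already:

    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical archives of the same folder, made with different tools.
        // The output directories (out_zip, out_gz, out_zst) must exist beforehand.
        const std::vector<std::string> commands = {
            "unzip -qo test.zip -d out_zip",
            "tar -xzf test.tar.gz -C out_gz",
            "tar --zstd -xf test.tar.zst -C out_zst",
        };

        for (const auto& cmd : commands) {
            const auto start = std::chrono::steady_clock::now();
            const int rc = std::system(cmd.c_str());   // run the extractor
            const auto stop = std::chrono::steady_clock::now();
            const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
            std::cout << cmd << " -> " << ms.count() << " ms"
                      << (rc == 0 ? "" : " (failed)") << '\n';
        }
    }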

Related

Is there a way to dynamically determine the vhdSize flag?

I am using the MSIX manager tool to convert a *.msix (an application installer) to a *.vhdx so that it can be mounted in an Azure virtual machine. One of the flags that the tool requires is -vhdSize, which is in megabytes. This has proven to be problematic because I have to guess what the size should be based on the MSIX. I have run into numerous creation errors because the vhdSize was too small.
I could set it to an arbitrarily high value in order to get around these failures, but that is not ideal. Alternatively, guessing the correct size is an imprecise science and a chore to do repeatedly.
Is there a way to have the tool dynamically set the vhdSize, or am I stuck guessing a value that is large enough to accommodate the file but not so large as to waste disk space? Or, is there a better way to create a *.vhdx file?
https://techcommunity.microsoft.com/t5/windows-virtual-desktop/simplify-msix-image-creation-with-the-msixmgr-tool/m-p/2118585
The MSIX Hero app can select a size for you: it automatically checks how big the uncompressed files are, adds an extra buffer for safety (currently double the original size), and rounds up to the next 10 MB. Reference: https://msixhero.net/documentation/creating-vhd-for-msix-app-attach/
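For illustration, that heuristic is easy to reproduce yourself. A minimal C++ sketch (the uncompressed size has to be obtained separately, e.g. by listing the .msix with a zip tool, since an MSIX package is a zip container; the function name is made up):

    #include <cstdint>
    #include <iostream>

    // Heuristic described above: double the uncompressed size as a safety
    // buffer, then round up to the next 10 MB. Result is meant for -vhdSize (MB).
    std::uint64_t chooseVhdSizeMb(std::uint64_t uncompressedBytes) {
        const std::uint64_t mb = 1024ULL * 1024ULL;
        const std::uint64_t paddedMb = (uncompressedBytes * 2 + mb - 1) / mb; // double, round up to MB
        return ((paddedMb + 9) / 10) * 10;                                    // round up to 10 MB
    }

    int main() {
        // Example: a package whose files unpack to ~137 MB -> 280 MB.
        std::cout << chooseVhdSizeMb(137ULL * 1024 * 1024) << " MB\n";
    }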

Ext4 on magnetic disk: Is it possible to process an arbitrary list of files in a seek-optimized manner?

I have a deduplicated storage of some million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed by its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks ("randomized" by the hashed directory layout, I assume).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given they are stored on an ext4 partition on a magnetic disk (implying the use of linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don't care too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
compute smaller versions of all JPEG images to process them with a classifier
reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
extract the headlines of all Word documents
recompute all MD5 hashes to verify data integrity
While researching this question, I learned of the FIBMAP ioctl command (e.g. mentioned here), which may be worth a shot, because the files will not be moved around and the results can be stored alongside the metadata. But I suppose that will only work as a sort criterion if the location of a file's inode correlates somewhat with the location of its contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
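For illustration, here is a minimal sketch of what I mean by sorting by physical disk location: it queries the physical block number of each file's first logical block via FIBMAP (requires root/CAP_SYS_RAWIO, Linux-specific, minimal error handling). The output could be piped through sort -n to get a location-ordered list.

    #include <fcntl.h>
    #include <linux/fs.h>   // FIBMAP
    #include <sys/ioctl.h>
    #include <unistd.h>

    #include <cstdio>

    // Returns the physical block number of the file's first block, or -1 on error.
    // The result could be stored alongside the metadata and used as a sort key.
    long first_physical_block(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        int block = 0;                        // in: logical block 0, out: physical block
        long result = -1;
        if (ioctl(fd, FIBMAP, &block) == 0)   // needs CAP_SYS_RAWIO (i.e. root)
            result = block;

        close(fd);
        return result;
    }

    int main(int argc, char** argv) {
        for (int i = 1; i < argc; ++i)
            std::printf("%ld\t%s\n", first_physical_block(argv[i]), argv[i]);
    }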
A file (especially a large one) is scattered across several blocks on the disk (see e.g. the figure on the ext2 Wikipedia page; it is still somewhat relevant for ext4, even if the details differ). More importantly, it could already be in the page cache (so it won't require any disk access at all). So "sorting the file list by disk location" usually does not make much sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
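A minimal sketch of how those hints can be used when reading one file (buffer size and the actual processing are placeholders; real code would check errors more carefully):

    #include <fcntl.h>
    #include <unistd.h>

    // Read one file after telling the kernel the access pattern up front.
    void process_file(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;

        // Hints: prefetch this file (WILLNEED) and expect sequential reads
        // (SEQUENTIAL), so the kernel can read ahead more aggressively.
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            // ... hash the data, parse headers, etc. ...
        }
        close(fd);
    }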
If the files are really small (only hundreds of bytes each), it is probable that using something else (e.g. sqlite, some real RDBMS like PostgreSQL, or gdbm ...) would be faster.
BTW, adding more RAM would enlarge the page cache, and so improve the overall experience. And replacing your HDD with some SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up all your file system (e.g. in some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs) might slightly lower fragmentation, but not that much.
You might optimize your file system settings (block size, cluster size, etc... e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner?
If the list is not too long (otherwise, split it in chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would somehow prefill your page cache (and the kernel could reorder the required IO operations).
(I don't know how effective that is in your case; you need to benchmark.)
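A sketch of that chunked open/readahead/close prefill (chunk size and the actual processing are up to you; readahead(2) is Linux/glibc-specific):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE 1   // for readahead(2)
    #endif
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <string>
    #include <vector>

    // Prefill the page cache for one chunk of files (a few hundred at a time),
    // then do the real work; by then the data is hopefully already cached.
    void prefill_and_process(const std::vector<std::string>& paths) {
        for (const auto& p : paths) {
            int fd = open(p.c_str(), O_RDONLY);
            if (fd < 0) continue;
            struct stat st;
            if (fstat(fd, &st) == 0)
                readahead(fd, 0, st.st_size);   // pull the file into the page cache
            close(fd);
        }
        for (const auto& p : paths) {
            (void)p;   // open(p) again and process it; reads should now hit the page cache
        }
    }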
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the CHS addressing (used by the kernel) is "logical" addressing handled by the disk controller and is not much related to the physical geometry any more. Read about LBA, TCQ, NCQ (so today, the kernel has no direct influence on the actual mechanical movements of a hard disk head). I/O scheduling mostly happens in the hard disk itself (not so much in the kernel).

external multithreading sort

I need to implement an external multithreaded sort. I don't have experience in multithreaded programming, and I'm not sure if my algorithm is good enough; I also don't know how to complete it. My idea is:
A thread reads the next block of data from the input file
Sorts it using a standard algorithm (std::sort)
Writes it to another file
After this I have to merge such files. How should I do this?
If I wait until the input file has been entirely processed before merging, I end up with a lot of temporary files.
If I try to merge files straight after sorting, I cannot come up with an algorithm that avoids merging files of quite different sizes, which will lead to O(N^2) complexity.
Also, I suppose this is a very common task; however, I cannot find a good ready-made algorithm on the internet. I would be very grateful for such a link, especially for its C++ implementation.
Well, the answer isn't that simple, and it actually depends on many factors, amongst them the number of items you wish to process, and the relative speed of your storage system and CPUs.
But the question is why use multithreading at all here. Is the data too big to be held in memory? Are there so many items that even a quicksort can't sort them fast enough? Do you want to take advantage of multiple processors or cores? We don't know.
I would suggest that you first write some test routines to measure the time needed to read and write the input file and the output files, as well as the CPU time needed for sorting. Please note that I/O is generally A LOT slower than CPU execution (they really aren't even comparable), and I/O may not be efficient if you read data in parallel (there is one disk head which has to move in and out, so reads are in effect serialized - even if it's a solid-state drive, it is still a single device with its own input and output channels). That is, the additional overhead of reading/writing temporary files may more than eliminate any benefit from multithreading. So I would say: first try an algorithm that reads the whole file into memory, sorts it and writes it out, and put in some time counters to check their relative speed. If I/O is some 30% of the total time (yes, that little!), it's definitely not worth it, because with all that reading/merging/writing of temporary files this share will rise a lot further, so a solution processing the whole data at once would be preferable.
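For instance, a rough sketch of such time counters (purely for illustration, it assumes the records are whitespace-separated integers in input.txt; adapt the reading/writing to the real record format):

    #include <algorithm>
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        using clock = std::chrono::steady_clock;

        // 1) Read the whole input into memory.
        auto t0 = clock::now();
        std::ifstream in("input.txt");
        std::vector<long> data;
        for (long v; in >> v; ) data.push_back(v);
        auto t1 = clock::now();

        // 2) Sort in memory.
        std::sort(data.begin(), data.end());
        auto t2 = clock::now();

        // 3) Write the result.
        std::ofstream out("output.txt");
        for (long v : data) out << v << '\n';
        out.flush();
        auto t3 = clock::now();

        auto ms = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };
        std::cout << "read: "  << ms(t0, t1) << " ms, sort: " << ms(t1, t2)
                  << " ms, write: " << ms(t2, t3) << " ms\n";
    }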
In conclusion, I don't see why you would use multithreading here; the only reason, IMO, would be if the data is actually delivered in blocks, but then again take into account my considerations above about relative I/O vs. CPU speeds and the additional overhead of reading/writing the temporary files. And a hint: your file access must be very efficient, e.g. reading/writing in larger blocks using application buffers, not item by item (which saves on system calls); otherwise this may have a detrimental effect if the file(s) are stored on a machine other than yours (e.g. a server).
Hope you find my suggestions useful.
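As for the merge step the question actually asks about: the usual technique is a single k-way merge of all sorted runs using a min-heap, rather than repeatedly merging pairs of files of different sizes. A rough sketch (the value type, one-value-per-line format and file names are assumptions):

    #include <cstddef>
    #include <fstream>
    #include <functional>
    #include <queue>
    #include <string>
    #include <vector>

    // Merge already-sorted runs (one value per line) into one sorted output file.
    void merge_runs(const std::vector<std::string>& runFiles, const std::string& outFile) {
        struct Item {
            long value;
            std::size_t src;                       // which run the value came from
            bool operator>(const Item& o) const { return value > o.value; }
        };

        std::vector<std::ifstream> runs;
        for (const auto& name : runFiles) runs.emplace_back(name);

        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        for (std::size_t i = 0; i < runs.size(); ++i) {
            long v;
            if (runs[i] >> v) heap.push({v, i});   // prime the heap: one value per run
        }

        std::ofstream out(outFile);
        while (!heap.empty()) {
            Item top = heap.top();
            heap.pop();
            out << top.value << '\n';
            long v;
            if (runs[top.src] >> v) heap.push({v, top.src});  // refill from the same run
        }
    }

Each element passes through the heap exactly once, so merging N items from k runs costs O(N log k), not the O(N^2) the question worries about.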

Strategies for playing (long) audio files from disk

I wanted to start a thread on this. A lot of people are wondering how to do it in a specific context or with a specific language, but I was wondering what the best strategy is in general.
I see two main practices:
load small chunks (like 2048 samples) of the file into a buffer. It seems the most straightforward, but it involves using the disk a lot, so I suspect it is not the best.
load the whole file into a big buffer. Gentler on the hard drive, but it needs a lot of RAM if you use several long files. And if your file is very long, or has a lot of channels, I imagine the index variable could overflow. For example, if it's a 16-bit integer maybe it cannot reach the end of the file (or am I being paranoid?)
and I'm thinking about hybrid approaches, like:
using very big buffers without loading the whole file
storing the file in a custom format on the hard drive, in a way that is optimized for accessing it quickly.
So, what do you think, how do you deal with this ?
I don't really care what's the "best", I'm more wondering about the pros and cons of each.
Answering part of my own question (the part about hybrid solutions).
Audacity uses a custom BlockFiles format for storage and playback. It encapsulates both the idea of big (bigger-than-callback) buffers, which are around 1 MB, and the idea of a custom file type (.aup).
"BlockFiles balance two conflicting forces. We can insert and delete audio without excessive copying, and during playback we are guaranteed to get reasonably large chunks of audio with each request to the disk. The smaller the blocks, the more potential disk requests to fetch the same amount of audio data; the larger the blocks, the more copying on insertions and deletions." (from : http://www.aosabook.org/en/audacity.html)
From what I've read, it was primarily designed to speed up the editing of very long files (for example, inserting data at the beginning without having to move everything after it).
But for playback of relatively short audio data (< 1 hour) I guess putting everything in RAM is just fine.
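For completeness, here is roughly what the first option from the question (reading small chunks from disk) looks like, assuming raw 16-bit PCM in a plain file and leaving out the audio API that would consume the chunks (the file name, chunk size and sample format are placeholders):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    int main() {
        // 2048 frames of stereo 16-bit PCM per read, as in the "small chunks" option.
        constexpr std::size_t kFrames = 2048;
        constexpr std::size_t kChannels = 2;
        std::vector<std::int16_t> chunk(kFrames * kChannels);

        std::ifstream file("audio.raw", std::ios::binary);
        while (file) {
            file.read(reinterpret_cast<char*>(chunk.data()),
                      static_cast<std::streamsize>(chunk.size() * sizeof(std::int16_t)));
            const std::streamsize bytesRead = file.gcount();
            if (bytesRead == 0) break;                  // end of file
            const std::size_t samplesRead =
                static_cast<std::size_t>(bytesRead) / sizeof(std::int16_t);
            // ... hand `samplesRead` samples to the ring buffer / audio callback ...
            (void)samplesRead;
        }
        // Note: stream positions and size_t indices are 64-bit here, so the
        // "16-bit index" worry from the question does not arise.
    }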

How to speed up reading of a fixed set of small files on linux?

I have 100,000 1 kB files, and a program that reads them - it is really slow.
My best idea for improving performance is to put them on ramdisk.
But this is a fragile solution; every restart needs to set up the ramdisk again.
(and file copying is slow as well)
My second best idea is to concatenate the files and work with that. But it is not trivial.
Is there a better solution?
Note: I need to avoid dependencies in the program, even Boost.
You can optimize by storing the files contiguously on disk.
On a disk with ample free room, the easiest way would be to read a tar archive instead.
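For what it's worth, iterating over a tar archive sequentially needs no external dependency (which matters given the no-Boost constraint). A sketch for plain ustar archives (no GNU/PAX long-name or sparse-file support, minimal error handling):

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <string>
    #include <vector>

    // Walk a plain ustar archive and visit each entry's contents in the order
    // they are stored, i.e. one sequential read of the whole archive.
    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::ifstream tar(argv[1], std::ios::binary);

        char header[512];
        while (tar.read(header, sizeof header) && header[0] != '\0') {
            std::string name(header, 100);                 // bytes 0..99: file name
            name.resize(std::strlen(name.c_str()));        // trim at the first NUL

            char sizeField[13] = {0};                      // bytes 124..135: size, octal ASCII
            std::memcpy(sizeField, header + 124, 12);
            const std::size_t size = std::strtoull(sizeField, nullptr, 8);

            std::vector<char> data(size);
            if (size > 0)
                tar.read(data.data(), static_cast<std::streamsize>(size));
            tar.ignore(static_cast<std::streamsize>((512 - size % 512) % 512)); // 512-byte alignment

            if (header[156] == '0' || header[156] == '\0')  // typeflag: regular file
                std::printf("%s (%zu bytes)\n", name.c_str(), size);
        }
    }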
Other than that, there is/used to be a debian package for 'readahead'.
You can use that tool to
profile a normal run of your software
edit the list of files accessed (detected by readahead)
You can then call readahead with that file list (it will order the files in disk order so that throughput is maximized and seek times are minimized)
Unfortunately, it has been a while since I used these, so I hope you can google for the respective packages.
This is what I seem to have found now:
sudo apt-get install readahead-fedora
Good luck
If your files are static, I agree: just tar them up and then place that in a RAM disk. It would probably be faster to read directly out of the tar file, but you can test that.
edit: instead of tar, you could also try creating a squashfs volume.
If you don't want to do that, or still need more performance then:
put your data on an SSD.
start investigating some FS performance tests, starting with ext4, XFS, etc.
