Performance: Better to unzip from hard disk? - zip

I'm about to make a design decision that could potentially have visible performance implications. Generally speaking, how do libraries handle unzipping; is it cheaper to unzip a file from memory or from hard disk?
I imagine this varies from library to library, but take zlib as an example of a popular one: when it extracts from hard disk, does it first copy the data into memory anyway (meaning there's no performance difference between the two approaches), or can it extract directly from the hard disk?

By default, zlib will read a file "chunk by chunk", depending on a predefined buffer size; this allows it to compress or uncompress data larger than the available system memory.
Since reads from disk are expensive compared to reads from memory, loading a file into memory first can improve performance for files larger than the default buffer size and smaller than available memory. The improvement grows the more buffer-sized chunks the file spans and the less fragmented the file is on disk.
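As a rough illustration (not zlib's internals), this is the kind of chunked loop zlib's convenience gzread() API lets you write; the CHUNK size and function name are just assumptions:

```c
#include <stdio.h>
#include <zlib.h>

/* Sketch: decompress a .gz file chunk by chunk with a fixed-size buffer,
 * so memory use stays bounded regardless of the uncompressed size.
 * CHUNK is an assumed buffer size. */
#define CHUNK 16384

int decompress_to_stdout(const char *path)
{
    gzFile in = gzopen(path, "rb");          /* zlib pulls data from disk in chunks */
    if (!in)
        return -1;

    char buf[CHUNK];
    int n;
    while ((n = gzread(in, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);   /* process one chunk at a time */

    gzclose(in);
    return n < 0 ? -1 : 0;
}
```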

Related

How can mmap make large file processing faster?

What I know is that mmap can map a process's virtual memory pages to the pages of a file on disk. We can read and write that memory in a program, and the changes are reflected in the file's content.
How can this machinery make sequential reading (and perhaps processing) of a file faster than, for instance, the regular read syscall? How can it make searching (binary search, if the file is sorted) faster?
I've got it from several sources that mmap does accomplish what I said, but I couldn't find any elaboration on that.
Since the limiting factor is reading from disk, it probably isn't faster. With both methods you can configure read-ahead to speed up sequential reading, which is probably the best you can do.
mmap()-ing a file, however, has other advantages compared to read()ing it: you do not have to care about memory management. If the file is very large (exceeding the memory you wish to use in your process), you would otherwise have to manage yourself which parts of the file you keep and which you discard. With mmap, the OS's usual memory-management routines decide which parts of your file remain in memory and which are discarded under memory pressure, keeping an eye on the memory usage of the whole system and not only your process. If you decide that some parts must always remain in memory, you can mlock() them.
But I do not see a big performance gain in the general case.
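To make the comparison concrete, here is a minimal sketch (the function name and the byte-summing work are just placeholders) of scanning a file through mmap() with a sequential-access hint:

```c
#define _DEFAULT_SOURCE          /* for madvise() on glibc */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: map a file read-only and scan it sequentially, letting the
 * kernel's page cache and readahead do the memory management. */
int sum_bytes(const char *path, unsigned long *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    /* Hint that we will read front to back, so the kernel can read ahead
     * more aggressively; the read() path has a similar hint via
     * posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL). */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];             /* plain memory access; page faults pull data in */

    *out = sum;
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```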

Using a hard disk without filesystem for big data

I'm working on a web crawler and have to handle big data (about 160 TB of raw data in trillions of data files).
The data should be stored sequentially as one big bz2 file on the magnetic hard disk. An SSD is used to hold the metadata. The most important operation on the hard disk is a sequential read over all 4 TB of the disk, which should happen at the full maximum speed of 150 MB/s.
I want to avoid the overhead of a file system and instead use the "/dev/file" devices directly. Does this access use the OS block buffer? Are the access operations queued or synchronous, in a FIFO style?
Is it better to use /dev/file or to write your own user-level file system?
Does anyone have experience with this?
If you don't use any file system but read your disk device (e.g. /dev/sdb) directly, you are losing all the benefit of file system cache. I am not at all sure it is worthwhile.
Remember that you could use syscalls like readahead(2) or posix_fadvise(2) or madvise(2) to give hints to the kernel to improve performance.
Also, when making your file system, you might use a larger-than-usual block size. And don't forget to use big blocks (e.g. 64 to 256 KB) when read(2)-ing data. You could also use mmap(2) to get the data from disk.
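For illustration, a rough sketch of those hints in practice (not a complete crawler; the 256 KiB block size and the scan() name are just assumptions):

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_fadvise() */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: open the data source (a big file or a raw device such as
 * /dev/sdb), tell the kernel the access will be sequential, and read
 * in large blocks. */
int scan(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    /* Hint for aggressive readahead over the whole file/device. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    const size_t blk = 256 * 1024;           /* assumed block size */
    char *buf = malloc(blk);
    if (!buf) { close(fd); return -1; }

    ssize_t n;
    while ((n = read(fd, buf, blk)) > 0) {
        /* ... hand the block to the crawler's processing stage ... */
    }

    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
```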
I would recommend against "coding your own file system". Existing file systems are quite well tuned (and some are used on petabytes of storage). You may want to choose big blocks when making them (e.g. -b with mke2fs(8)...).
BTW, choosing between a filesystem and raw disk data is mostly a configuration issue (you specify a /dev/sdb path if you want the raw disk, and /home/somebigfile if you want a file). You could code the web crawler to be able to do both, then benchmark both approaches. Very likely, performance will depend upon the actual system and hardware.
As a case in point, relational database engines often used raw disk partitions in the previous century (e.g. the 1990s) but seem to use big files more often today.
Remember that the real bottleneck is the hardware (i.e. the disk): CPU time used by filesystems is often insignificant and barely even measurable.
PS. I do not have much recent real-world experience with these issues.

Read file without disk caching in Linux

I have a C program that runs only weekly and reads a large number of files only once. Since Linux also caches everything that's read, the files fill up the cache needlessly, and this slows the system down a lot unless it has an SSD drive.
So how do I open and read from a file without filling up the disk cache?
Note:
By disk caching I mean that when you read a file twice, the second time it's read from RAM, not from disk. I.e. data once read from the disk is left in RAM, so subsequent reads of the same file will not need to reread the data from disk.
I believe passing O_DIRECT to open() should help:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.
There are further detailed notes on O_DIRECT towards the bottom of the man page, including a fun quote from Linus.
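For illustration, a minimal sketch of reading with O_DIRECT; note the alignment requirement (the 4 KiB alignment and 1 MiB transfer size here are assumptions, the real requirement depends on the device and filesystem):

```c
#define _GNU_SOURCE              /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: O_DIRECT requires the buffer, file offset and transfer size
 * to be aligned (typically to the logical block size), so a plain
 * malloc'd buffer will not do. */
int read_direct(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 1 << 20) != 0) { close(fd); return -1; }

    ssize_t n;
    while ((n = read(fd, buf, 1 << 20)) > 0) {
        /* ... process n bytes; these reads bypass the page cache ... */
    }

    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
```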
You can use posix_fadvise() with the POSIX_FADV_DONTNEED advice to request that the system free the pages you've already read.
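A minimal sketch of that approach (the consume callback is just a placeholder): read the file normally, then advise the kernel to drop the cached pages:

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

/* Sketch: read a file once, then tell the kernel its pages will not be
 * needed again so they can be evicted from the page cache.
 * (0, 0) means "the whole file". */
int read_once(const char *path, void (*consume)(const char *, ssize_t))
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        consume(buf, n);

    /* Ask the kernel to drop this file's pages from the page cache. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return n < 0 ? -1 : 0;
}
```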

how to indicate to ext4 the size of a file before doing the write?

I'm curious if there is a way to do this? My understanding of ext4 is limited, but I do believe it has the capability to allocate contiguous ranges of disk space.
I'm writing a file from RAM and know its size before the open(). Is there a way I can indicate this to the filesystem? Are there performance benefits?
It seems that you're looking for posix_fallocate.
Using this allows the filesystem to allocate blocks up-front, which can reduce the fragmentation of the file. In particular, this matters for applications that randomly write chunks of the file (think bittorrent clients). For an application that writes a file sequentially, it's probably not worth it.
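For illustration, a minimal sketch (the function name and error handling are placeholders) of reserving the known size before writing:

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_fallocate() */
#include <fcntl.h>
#include <unistd.h>

/* Sketch: the size is known before writing, so reserve the blocks up
 * front and then write the in-memory buffer sequentially. */
int write_known_size(const char *path, const char *data, off_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    /* Let the filesystem allocate (ideally contiguous) blocks for the
     * full size now; posix_fallocate returns 0 on success. */
    if (posix_fallocate(fd, 0, len) != 0) { close(fd); return -1; }

    ssize_t written = 0;
    while (written < len) {
        ssize_t n = write(fd, data + written, len - written);
        if (n < 0) { close(fd); return -1; }
        written += n;
    }

    close(fd);
    return 0;
}
```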

Why does MongoDB's memory mapped files cause programs like top to show larger numbers than normal?

I am trying to wrap my head around the internals of mongodb, and I keep reading about this
http://www.theroadtosiliconvalley.com/technology/mongodb-mongo-nosql-db/
Why does this happen?
So the way memory-mapped files work is that addresses in memory are mapped byte for byte to a file on disk. This makes access really fast, but the mapping is really large: imagine the file on disk for your data taking up that much of your address space.
Why it's awesome
In practice, this rocks because reading and writing memory directly, instead of issuing a system call (think context switch), is fast. Also, in practice, the fact that this huge memory-mapped chunk doesn't fit in your physical RAM is fine. Why? You only need the working set of data to fit in RAM, because the unused pages are not loaded and are just kept on disk. If they are needed, a page fault happens and they get loaded up. (I believe the portion that has been loaded is referred to as resident memory.)
Why it kind of sucks
Files mapped into memory need to be page-aligned, so if you don't use up the memory space exactly on a page boundary you waste space (a small tradeoff).
Summary (tl;dr)
It may look like it's taking up a lot of resources because it's mapping the entirety of your data to memory addresses, but it doesn't really matter, as that data isn't actually all being held in RAM. Mongo will pull in data as it needs it and use memory effectively to maintain a performant working set.
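If you want to see this yourself, here is a small, self-contained sketch (not MongoDB code) that maps a large file and lets you watch VIRT stay large in top while RES only grows as pages are touched:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: right after mmap(), the whole file size counts toward the
 * process's virtual size (VIRT in top), but resident memory (RES)
 * only grows as pages are faulted in. */
int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0) return 1;

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    printf("mapped %lld bytes; check VIRT vs RES in top now\n",
           (long long)st.st_size);
    getchar();                       /* pause: RES is still small */

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];                 /* touch one byte per page: RES grows */

    printf("touched every page (sum=%lu); check RES again\n", sum);
    getchar();

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```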
