Linux file system automatically backed by disk but hosted entirely in memory? - linux

I have to do lots of small random accesses to a whole bunch of files. I have more than enough main memory to hold all of the data.
When I copy the data over to a temporary ramfs filesystem and process it there, this takes only a small fraction of the time that waiting for disk access would take.
Is there a Linux file system which holds all of its data in main memory, writes any changes to a backing disk, but never touches the disk for any reads?
If not, can, say, ext3 caches be tuned so that they are guaranteed to hold 100% of data and metadata?

If you are only reading data, then you can indeed tune caching such that all data will be cached in RAM - see /usr/src/linux/Documentation/sysctl/fs.txt vm.txt for details of what you can tweak here. The problem arises when you write data, particularly if you use fsync() or similar to ensure the data has been commited to the actual disk.
As the OS has to update the disk in the case of a fsync(), there's not much you can do if you still want to ensure your data is consistant and wouldn't be lost in a power cut.
One problem you might be running into is the atime or access time - by default every time a file is accessed the access time is updated in the inode. This will cause disk writes even when you think you are just performing reads. This can be a particular problem in your scenario where you are accessing many small files. If you don't care about tracking the access time you can mount your filesystem with the noatime to disable this 'feature'.

Why don't you try to create a RAID mirror between a ramdisk and a physical disk ?
Not sure if it's efficient though. If the mirror must always be synchronized, it will have to wait for the disk anyway when you write, but for reading you should gain something.
But yeah, to me it looks very much a complicated, wheel reinvented square IO caching :)
Would be a nice experiment, though.

take a look at this :
http://freecode.com/articles/virtual-filesystem-building-a-linux-filesystem-from-an-ordinary-file
You can mount a file as a FS into a RAMdisk, then backup it as a file.
Don't sure you want to backup frequently, but it is a good solution to save all the virtual disk in only one time.

Related

Hey could someone help me understand sync syscall usage?

like said in the title, I don't really understand the usage of this syscall. I was writing some program that write some data in a file, and the tutorial I've seen told me to use sys_sync syscall. But my problem is why and when should we use this? The data isn't already written on the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, it is theoretically in your file, just not on disk and therefore if you lose electricity, you could lose the data, especially if you have a lot of RAM and many writes in a raw, it may privilege the writes to cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the Kernel. You don't directly send it to the disk. Some kernel driver is then responsible to write the data to disk. In my days on Apple 2 and Amiga computers, I would actually directly read/write to disk. And at least the Amiga had a DMA so you could setup a buffer, then tell the disk I/O to do a read or a write and it would send you an interrupt when done. On the Apple 2, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
Although you could, of course, directly access the disk (but with a Kernel like Linux, you'd have to make sure the kernel gives you hands free to do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task to write data to disk. That task wakes up as soon as data appears in the cache and ends once all the caches are transferred to disk. This task works in parallel. You can have one such task per drive (which is especially useful when you have a system such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" wrt. power outages before you do something else (like sent a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the IO entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty_writeback_centisecs defaults to 1500 (15 seconds) before the kernel starts write-back, unless it's running low on free pages. (Heuristics for what "low" means use other tunable values).
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before even deciding where to put it, allowing better decisions that reduce fragmentation. e.g. a small file can go into a gap that would have been a bad place to start a potentially-large file. (It just have to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

How can I temporally write a file to the filesystem but actually hold it on RAM?

I have a server that receives a file in a HTTP request, I'd like to make that file available to another process but I don't want the I/O overhead of writing that file to disk.
Are there any directories in linux that are actually mapped to RAM, so the process I start can access a path as is were a normal file?
I know that if I do this in a normal file, then there is a good chance that the file wont actually be flushed to disk because of cache, but that's not what I'm looking for.
There are no guaranteed locations that are backed by RAM, but it's not particularly hard to convert /tmp to be backed by RAM if you have enough of RAM to spare. Given that /tmp is cleaned out on boot anyway, it's an ideal choice for a RAM disk, since data loss due to power loss doesn't matter; the data would have been cleaned on boot anyway.
You can use the following to create a RAM disk (as per these instructions):
mkdir /mnt/ramdisk
mount -t ramfs -o size=512m ramfs /mnt/ramdisk
In case you want to make sure a specific file is/stays in RAM, you can use vmtouch (https://hoytech.com/vmtouch/). However, the file will be written to disk.

stop file caching for a process and its children

I have a process that reads thousands of small files ONE TIME. The cached data is not needed after this. The process proceeds at full speed until most memory is consumed by the file cache and then it slows down. I don't understand the slowdown, since freeing cache memory and allocating space for the next file should be a matter of microseconds. Hard page faults also increase when this threshold is reached. The OS is vanilla Ubuntu 16.04.
I would like to limit the file caching for this process only.
This is a user process, so using a privileged shell command to purge the cache is not a solution. Using fadvise on a per-file level is not a solution, since the files are being read my multiple library programs depending on the file type.
What I need is a process-level option: do not cache, or set a low size limit like 100 MB. I have searched for this and found nothing. Is this really the case? Seems like something big that is missing.
Any insight on the apparent memory management performance issue?
Here's the strict answer to your question. If you are mmap-ing your files, the way to do this is using madvise() and MADV_DONTNEED:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the ker‐
nel can free resources associated with it.) Subsequent
accesses of pages in this range will succeed, but will result
either in reloading of the memory contents from the underlying
mapped file (see mmap(2)) or zero-fill-on-demand pages for
mappings without an underlying file.
There is to my knowledge no way of doing it with files that are simply opened, read (using read() or similar) and closed.
However, it sounds to me like this is not in fact the issue. Are you sure it's buffer / cache that is growing here, and not something else? (e.g. perhaps you are reading them into RAM and not freeing that RAM, or not closing them, or similar)
You can tell by doing:
echo 3 > /proc/sys/vm/drop_caches
if you don't get all the memory back, then it's your program which is leaking something.
I am convinced there is no way to stop file caching on a per-process level. The program must have direct control over file I/O, with access to the file descriptors so that madvise() can be used. You cannot do this if library functions are doing all the file reading and you are not willing to modify them. This does look like a design gap that should be filled.
HOWEVER: My assertion of some performance issue with memory management was wrong. The reason for the process slow-down as the file cache grows and free memory shrinks was something else: disk seek distances were growing during the process. Other tests have verified that allocating memory does not significantly slow down as the file cache grows and free memory shrinks.

Using a hard disk without filesystem for big data

I'm working on a web crawler and have to handle big data (about 160 TB raw data in trillions of data files).
The data should be stored sequencial as one big bz2 file on the magnetic hard disk. A SSD is used to hold the meta data. THe most important operation on the hard disk is a squential read over all of the 4 TB off the disk, which should happen with full maximum speed of 150 MB/s.
I want to not waste the overhead of a file system an instead use the "/dev/file" devices directly. Does this access use the os block buffer? Are the access operations queued or synchronous in a FIFO style?
Is it better to use /dev/file or write your own user level file system?
Has anyone experience with it.
If you don't use any file system but read your disk device (e.g. /dev/sdb) directly, you are losing all the benefit of file system cache. I am not at all sure it is worthwhile.
Remember that you could use syscalls like readahead(2) or posix_fadvise(2) or madvise(2) to give hints to the kernel to improve performance.
Also, you might when making your file system use a larger than usual block size. And don't forget to use big blocks (e.g. of 64 to 256 Kbytes) when read(2)-ing data. You could also use mmap(2) to get the data from disk.
I would recommend against "coding your own file system". Existing file systems are quite tuned (and some are used on petabytes of storage). You may want to chose big blocks when making them (e.g. -b with mke2fs(8)...)
BTW, choosing between filesystem and raw disk data is mostly a configuration issue (you specify a /dev/sdb path if you want raw disk, and /home/somebigfile if you want a file). You could code a webcrawler to be able to do both, then benchmark both approaches. Very likely, performance could depend upon actual system and hardware.
As a case in point, relational database engines used often raw disk partitions in the previous century (e.g. 1990s) but seems to often use big files today.
Remember that the real bottleneck is the hardware (i.e. disk): CPU time used by filesystems is often insignificant and cannot even be measured.
PS. I have not much real recent experience with these issues.

Read file without disk caching in Linux

I have a C program that runs only weekly, and reads a large amount of files only once. Since Linux also caches everything that's read, they fill up the cache needlessly and this slows down the system a lot unless it has an SSD drive.
So how do I open and read from a file without filling up the disk cache?
Note:
By disk caching I mean that when you read a file twice, the second time it's read from RAM, not from disk. I.e. data once read from the disk is left in RAM, so subsequent reads of the same file will not need to reread the data from disk.
I believe passing O_DIRECT to open() should help:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes at an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC that data and necessary metadata are transferred. To guarantee synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
There are further detailed notes on O_DIRECT towards the bottom of the man page, including a fun quote from Linus.
You can use posix_fadvise() with the POSIX_FADV_DONTNEED advice to request that the system free the pages you've already read.

Resources