Does link/rm/mv sync dentry metadata immediately when finished? - linux

Does link/rm/mv sync dentry metadata to permanent storage immediately when finished? If not, when?

From https://www.kernel.org/doc/Documentation/sysctl/vm.txt, I get this:
drop_caches
Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes. Once dropped,
their memory becomes free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches
This is a non-destructive operation and will not free any dirty
objects. To increase the number of objects freed by this operation,
the user may run `sync' prior to writing to /proc/sys/vm/drop_caches.
This will minimize the number of dirty objects on the system and
create more candidates to be dropped.
This file is not a means to control the growth of the various kernel
caches (inodes, dentries, pagecache, etc...) These objects are
automatically reclaimed by the kernel when memory is needed elsewhere
on the system.
Use of this file can cause performance problems. Since it discards
cached objects, it may cost a significant amount of I/O and CPU to
recreate the dropped objects, especially if they were under heavy use.
Because of this, use outside of a testing or debugging environment is
not recommended.
You may see informational messages in your kernel log when this file
is used:
cat (1234): drop_caches: 3
These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 3) into drop_caches.

Related

What cause kernel to eat CPU on page_fault?

hw/os: linux 4.9, 64G RAM.
16 daemons running. Each reading random short (100 bytes) pieces of 5GiB file accessing it as a memory mapped via mmap() at daemon startup. Each daemon reads its own file, so 16 5GiB files total.
Each daemon making maybe 10 reads per second. Not too much, disk load is rather small.
Sometimes (1 event in 5 minutes, no period, totally random) some random daemon stuck in kernel code with the followind stack (see picture) for 300 milliseconds. This does not corellate with major-faults: the major-faults go at constant rate about 100...200 per second. Disk reads are also constant.
What can cause this?
Text of the image: __list_del_entry isolate_lru_pages.isra.48 shrink_inactive_list shrink_node_memcg shrink_node node_reclaim get_page_from_freelist enqueue_task_fair sched_clock __alloc_pages_nodemask alloc_pages_vma handle_mm_fault __do_page_fault page_fault
You have shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
So your 64 GB RAM system has situation when there is no free memory left. This amount of memory is enough to hold a copy of 12 files of 5 GB each, and your daemons uses 16 files. Linux may read more data from files than it was required by application with Readahead technique ("Linux readahead: less tricks for more", ols 2007 pp273-284, man 2 readahead). MADV_SEQUENTIAL may also turn on this mechanism, https://man7.org/linux/man-pages/man2/madvise.2.html
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in
the given range can be aggressively read ahead, and may be
freed soon after they are accessed.)
MADV_RANDOM
Expect page references in random order. (Hence, read ahead
may be less useful than normally.)
Not sure how your daemons did open and read files, was MADV_SEQUENTIAL active for them or not (or was this flag added by glibc or any other library). Also there can be some effect from THP - Transpartent huge pages https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. Normal 4.9 kernel is from 2016 and thp expansion for filesystems was planned in 2019 https://lwn.net/Articles/789159/, but if you use RHEL/CentOS, some features may be backported into fork of 4.9 kernel.
You should check free and cat /proc/meminfo output periodically to check how your daemons and linux kernel readahead uses memory.

How can I shrink the Linux page cache from within kernel space?

I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly and therefore my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 X 4MB) using dma_alloc_coherent and then map them contiguously into user space using remap_pfn_range. This works very well most of the time.
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures where one of the dma_alloc_coherent calls in my driver fails which causes my application layer software to crash. I was finally able to track down this problem and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that the page cache "fullness" was causing dma_alloc_coherent calls to fail. To test this theory I manually emptied the page cache using:
# echo 1 > /proc/sys/vm/drop_caches
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing so I would prefer not to have to completely empty it every time I run out of space when allocating DMA buffers. My questions is this: how can I dynamically shrink the page cache in kernel space such that if a call to dma_alloc_coherent fails I can recover just enough space so that I can retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!
TL;DR : Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need. (or you finally run out of references to active superblocks.)
How to write kernel code to dynamically shrink the fs page-cache,
to recover just enough space so that a subsequent call to dma_alloc_coherent() succeeds?
To answer this question, let us take a look at what the "drop_caches operation" did to reduce the fs page-cache from 27GB to 94MB on your system.
echo 1 > /proc/sys/vm/drop_caches
invokes
drop_caches_sysctl_handler()
which in turn invokes iterate_supers() and
passes it the pointer to the function drop_pagecache_sb().
What happens next is that iterate_supers() scans for active superblocks and everytime it finds one, it calls drop_pagecache_sb(), passing it a reference to the active superblock.
This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. This is a non-destructive operation and will only free blocks that are completely unused. Dirty-objects will continue to be in use until written out to disk and are not free-able. If you run sync first to flush them out to disk, the "drop_caches operation" tends to free more memory.
Since you are interested in running this process to reclaim a limited/known amount of memory i.e. what is soon going to be requested using dma_alloc_coherent(), you simply need to implement the above functionality with an additional check at the end of each iteration and abort the superblock scan immediately once the amount of free system memory crosses the desired level.
A couple of points to keep in mind to further optimise this procedure :
Is there a preference for certain block devices over others?
You may want to iterate over active superblocks of the block devices that you do not care about first. If enough memory is not reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache unless absolutely necessary to reclaim required memory. get_active_super() might be of help here.
iterate_supers_type() seems interesting
It allows one to iterate over superblocks of specific file_system_type
Please note that this is a speculative solution based purely on the analysis of existing code within the Linux kernel that you have observed to already solve your problem. Once the above approach is implemented, it will only allow you to control the same i.e. attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.
Technically when certain allocation fails then Kernel will try to free memory.Depending upon memory failures(soft failure/hard failure). Hard failures causes Kernel to enter into direct reclaim path. Direct reclaim is costly operation which might take undefined time to complete and even after that allocation might fail.
Here you have two options:
1) Play with VM settings like dirty_ratio,dirty_background_ratio etc to maintain free ram. see : https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon, which calls kernel function which handles drop_cache (because drop_cache migh sleep).

Drop cache does not work

I am currently working on optimizing the memory management of a large program. For some pupose, I want to drop the page cache in my main memory.
I used sync && echo 3 > /proc/sys/vm/drop_caches as widely suggested by the internet, but it does not drop the cache to the level where it was before the program starts. This means there are some undroppable cache in the main memory after the program starts.
But isn't echo 3 means to free pagecache, dentries and inodes in cache memory? Is there any other kinds of cache that cannot be freed by this command?
Yes, there are some types of caches that cannot be dropped. For instance, tmpfs filesystems are stored in page cache. But these could not be flushed while in use. You can get better picture of how much memory you really have available by using free command, and checking available column. You'll notice that available memory is smaller than free + buffers + caches. Sometimes much smaller.
For more information on tmpfs using caches see this answer.
Collect output of cat /proc/vmstat before and after you issue drop cache.
It will give nr_inactive_file,nr_active_file ,nr_file_pages,nr_isolated_file. If drop cache works then total of above 4 should be less than before issuing drop cache.

What is the difference between buffer and cache memory in Linux?

To me it's not clear what's the difference between the two Linux memory concepts : buffer and cache. I've read through this post and it seems to me that the difference between them is the expiration policy:
buffer's policy is first-in, first-out
cache's policy is Least Recently Used.
Am I right?
In particular, I'm looking at the two commands: free and vmstat
james#utopia:~$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 173 67 912 0 0 19 59 75 1087 24 4 71 1
james#utopia:~$ free -m
total used free shared buffers cached
Mem: 2007 1834 172 0 67 914
-/+ buffers/cache: 853 1153
Swap: 2859 0 2859
Buffers are associated with a specific block device, and cover caching
of filesystem metadata as well as tracking in-flight pages. The cache
only contains parked file data. That is, the buffers remember what's
in directories, what file permissions are, and keep track of what
memory is being written from or read to for a particular block device.
The cache only contains the contents of the files themselves.
quote link
Cited answer (for reference):
Short answer: Cached is the size of the page cache. Buffers is the size of in-memory block I/O buffers. Cached matters; Buffers is largely irrelevant.
Long answer: Cached is the size of the Linux page cache, minus the memory in the swap cache, which is represented by SwapCached (thus the total page cache size is Cached + SwapCached). Linux performs all file I/O through the page cache. Writes are implemented as simply marking as dirty the corresponding pages in the page cache; the flusher threads then periodically write back to disk any dirty pages. Reads are implemented by returning the data from the page cache; if the data is not yet in the cache, it is first populated. On a modern Linux system, Cached can easily be several gigabytes. It will shrink only in response to memory pressure. The system will purge the page cache along with swapping data out to disk to make available more memory as needed.
Buffers are in-memory block I/O buffers. They are relatively short-lived. Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cacheā€”i.e., not file data. The Buffers metric is thus of minimal importance. On most systems, Buffers is often only tens of megabytes.
"Buffers" represent how much portion of RAM is dedicated to cache disk blocks. "Cached" is similar like "Buffers", only this time it caches pages from file reading.
quote from:
https://web.archive.org/web/20110207101856/http://www.linuxforums.org/articles/using-top-more-efficiently_89.html
It's not 'quite' as simple as this, but it might help understand:
Buffer is for storing file metadata (permissions, location, etc). Every memory page is kept track of here.
Cache is for storing actual file contents.
Explained by Red Hat:
Cache Pages:
A cache is the part of the memory which transparently stores data so that future requests for that data can be served faster. This memory is utilized by the kernel to cache disk data and improve i/o performance.
The Linux kernel is built in such a way that it will use as much RAM as it can to cache information from your local and remote filesystems and disks. As the time passes over various reads and writes are performed on the system, kernel tries to keep data stored in the memory for the various processes which are running on the system or the data that of relevant processes which would be used in the near future. The cache is not reclaimed at the time when process get stop/exit, however when the other processes requires more memory then the free available memory, kernel will run heuristics to reclaim the memory by storing the cache data and allocating that memory to new process.
When any kind of file/data is requested then the kernel will look for a copy of the part of the file the user is acting on, and, if no such copy exists, it will allocate one new page of cache memory and fill it with the appropriate contents read out from the disk.
The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere in the disk. When some data is requested, the cache is first checked to see whether it contains that data. The data can be retrieved more quickly from the cache than from its source origin.
SysV shared memory segments are also accounted as a cache, though they do not represent any data on the disks. One can check the size of the shared memory segments using ipcs -m command and checking the bytes column.
Buffers:
Buffers are the disk block representation of the data that is stored under the page caches. Buffers contains the metadata of the files/data which resides under the page cache.
Example: When there is a request of any data which is present in the page cache, first the kernel checks the data in the buffers which contain the metadata which points to the actual files/data contained in the page caches. Once from the metadata the actual block address of the file is known, it is picked up by the kernel for processing.
buffer and cache.
A buffer is something that has yet to be "written" to disk.
A cache is something that has been "read" from the disk and stored for later use.
I think this page will help understanding the difference between buffer and cache deeply. http://www.tldp.org/LDP/sag/html/buffer-cache.html
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Since memory is, unfortunately, a finite, nay, scarce resource, the buffer cache usually cannot be big enough (it can't hold all the data one ever wants to use). When the cache fills up, the data that has been unused for the longest time is discarded and the memory thus freed is used for the new data.
Disk buffering works for writes as well. On the one hand, data that is written is often soon read again (e.g., a source code file is saved to a file, then read by the compiler), so putting data that is written in the cache is a good idea. On the other hand, by only putting the data into the cache, not writing it to disk at once, the program that writes runs quicker. The writes can then be done in the background, without slowing down the other programs.
Seth Robertson's Link 2 said "For thorough understanding of those terms, refer to Linux kernel book like Linux Kernel Development by Robert M. Love."
I found some contents about 'buffer' in the 2nd edition of the book.
Although the physical device itself is addressable at the sector level, the kernel performs all disk operations in terms of blocks.
When a block is stored in memory (say, after a read or pending a write), it is stored in a 'buffer'. Each 'buffer' is associated with exactly one block. The 'buffer' serves as the object that represents a disk block in memory.
A 'buffer' is the in-memory representation of a single physical disk block.
Block I/O operations manipulate a single disk block at a time. A common block I/O operation is reading and writing inodes. The kernel provides the bread() function to perform a low-level read of a single block from disk. Via 'buffers', disk blocks are mapped to their associated in-memory pages. "
Quote from the book:
Introduction to Information Retrieval
Cache
We want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
Buffer
Operating systems generally read and write entire blocks. Thus, reading a single byte from disk can take as much time as reading the entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.
Buffer is an area of memory used to temporarily store data while it's being moved from one place to another.
Cache is a temporary storage area used to store frequently accessed data for rapid access. Once the data is stored in the cache, future use can be done by accessing the cached copy rather than re-fetching the original data, so that the average access time is shorter.
Note: buffer and cache can be allocated on disk as well
Buffer contains metadata which helps improve write performance
Cache contains the file content itself (sometimes yet to write to disk) which improves read performance
For starters the general concept would be helpful, a buffer is an area of memory used to temporarily store data while being moved from one place to another. On the other hand, a cache is a temporary storage area to store frequently accessed data for rapid access.
In Linux:
The cache in Linux is called Page Cache. It is that certain amount of system memory that the kernel reserves for caching the file system disk accesses. This is to make overall performance faster. During Linux read system calls, the kernel checks if the cache contains the requested blocks of data. If it does, then that would be a successful cache hit. The cache returns this data without doing any I/O to the disk system. The Linux cache approach is called a write-back cache. This means first, the data is written to cache memory and marked as dirty until synchronized to disk. Then, the kernel maintains the internal data structure to optimize which data to evict from the cache when the cache demands any additional space. For example, when memory usage reaches certain thresholds, background tasks start writing dirty data to disk, thereby emptying the memory cache.
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Cache: This is a place acquired by kernel on physical RAM to store pages in caches. Now we need some sort of index to get the address of pages from caches. Here we need the buffer for page caches which keeps metadata of page cache.
From the man page for free:
DESCRIPTION
free displays the total amount of free and used physical and swap memory in the system, as well as the buffers and caches used by the
kernel. The information is gathered by parsing /proc/meminfo. The displayed columns are:
total Total installed memory (MemTotal and SwapTotal in /proc/meminfo)
used Used memory (calculated as total - free - buffers - cache)
free Unused memory (MemFree and SwapFree in /proc/meminfo)
shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
buffers
Memory used by kernel buffers (Buffers in /proc/meminfo)
cache Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache
Sum of buffers and cache
available
Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache
or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will be reclaimed due to
items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as
free)

How to clean caches used by the Linux kernel

I want to force the Linux kernel to allocate more memory to applications after the cache starts taking up too much memory (as can be seen by the output of 'free').
I've run
sudo sync; sudo sysctl -w vm.drop_caches=3; free
(to free both disc dentry/inode cache and page cache) and I see that only about half of the used cache was freed - the rest remains. How can I tell what is taking up the rest of the cache and force it to be freed?
You may want to increase vfs_cache_pressure as well as set swappiness to 0.
Doing that will make the kernel reclaim cache faster, while giving processes equal or more favor when deciding what gets paged out.
You may only want to do this if processes you care about do very little disk I/O.
If a network I/O bound process has to swap in to serve requests, that's a problem and the real solution is to put it on a less competitive server.
With the default swappiness setting, the kernel is almost always going to favour keeping FS related cache in real memory.
As such, if you increase the cache pressure, be sure to equally adjust swappiness.
The contents of /proc/meminfo tell you what the kernel uses RAM for.
You can use /proc/sys/vm/vfs_cache_pressure to force the kernel to reclaim memory that is used for filesystem-related caches more lazily or eagerly.
Note that your application may only benefit from tuning this parameter if it does little or no disk I/O.
You might find John Nilsson's answer to my Question useful for purging the cache in order to test whether that is related to your problem:
sync && echo 1 > /proc/sys/vm/drop_caches
Though I'm guessing the only real difference is 1 vs 3

Resources