Swap space and dirty pages - pagination

I can't understand the utility of dirty bit, that should be useful during pages replacement, to mark dirty pages.
Swap space is a disk portion where OS puts pages that don't fit in primary memory. So, why a not-dirty page shouldn't be written on disk?
Let's take for example a page swapped out from memory to the disk. At this point let's imagine that it is first moved to primary memory again and then it is moved back to disk again.
When it is moved to primary memory, I don't think the disk will retain a copy of it.
Therefore, even if this page does not get dirty in primary memory, why it should not be rewritten on the disk when it is freed again from primary memory?

When the page is swapped back into memory (loaded into RAM from disk) the bits in the swap file are not invalidated or erased - they still contain the same values that were written out when the page was swapped from RAM to disk. So at the point when it is swapped from disk to RAM the pages in RAM and disk are identical. If no writes are performed then the RAM and disk (swap) versions of the page remain identical. If the kernel decides to swap this page out of RAM again, there is no need to write it to disk (swap) because the correct contents of the page is already on disk. So the page can simply be freed and use for some other purpose. But if a write has been performed then the version of the page on disk and in swap is different, and in this case the dirty bit is set indicating that the page must be written to disk before it can be reused.

Processors that use a dirty bit set that bit whenever a write is made to a page.
If the bit is clear, that means the page has not been changed. If the operating system needs to page out that page,it know that it does not have to write that page (with a clear dirty bit) back to the paging file.

Related

Linux memory management (caching)

I'm having a hard time telling the difference between the different cache areas (OS). I'd love a brief explanation about disk\buffer\swap\page cache. Where do they reside? What are the main differences between them?
From what I understand the page cache is part of the main memory that stores pages brought from an I/O device.
Are buffer cache and disk cache the same? Do they "live" at the I/O device?
Many thanks!!
In linux two caches were distinct: Files were in the page cache, disk blocks were in the buffer cache. Given that most files are represented by a filesystem on a disk, data was represented twice, once in each of the caches. Many Unix systems follow a similar pattern.
The buffer cache remains, however, as the kernel still needs to perform block I/O in terms of blocks, not pages. As most blocks represent file data, most of the buffer cache is represented by the page cache. But a small amount of block data isn't file backed—metadata and raw block I/O for example—and thus is solely represented by the buffer cache.
Disk cache/Buffer cache
This cache caches disk blocks to optimize block I/O.
It is the RAM used for faster access to disk.It is embedded in the disk or it can be portion of Main memory set aside.
Swap cache/Page cache
This cache caches pages of files to optimize file I/O
The swap cache is a list of page table entries. This page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file, so that when has to be brought back again we will be having its location in swap file.
It resides on disk.

Virtually contiguous vs. physically contiguous memory

Is virtually contiguous memory also always physically contiguous? If not, how is virtually continuous memory allocated and memory-mapped over physically non-contiguous RAM blocks? A detailed answer is appreciated.
Short answer: You need not care (unless you're a kernel/driver developer). It is all the same to you.
Longer answer: On the contrary, virtually contiguous memory is usually not physically contiguous (only in very small amounts). Except by coincidence, or shortly after the machine has just booted. That isn't necessary, however.
The only way of allocating larger amounts of physically contiguous RAM is by using large pages (since the memory within one page needs to be contiguous). It is however a useless endeavor, since there is no observable difference for your process whether memory of which you think that it is contiguous is actually contiguous, but there are disadvantages to using large pages.
Memory mapping over phyically non-contiuous RAM works in no particularly "special" way. It follows the same method which all memory management follows.
The OS divides virtual memory in "pages" and creates page table entries for your process. When you access a memory in some location, either the corresponding page does not exist at all, or it exists and corresponds to a real page in RAM, or it exists but doesn't correspond to a real page in RAM.
If the page exists in RAM, nothing happens at all1. Otherwise a fault is generated and some operating system code is run. If it turns out the page doesn't exist at all (or does not have the correct access rights), your process is killed with a segmentation fault.
Otherwise, the OS chooses an arbitrary page that isn't used (or it swaps out the one it thinks is the least important one), and loads the data from disk into that page. In the case of a memory mapping, the data comes from the mapped file, otherwise it comes from swap (and for completely new allocated memory, the zero page is copied). The OS then returns control back to your process. You never know this happened.
If you access another location in a "contiguous" (or so you think!) memory area which lies in a different page, the exact same procedure runs.
1 In reality, it is a little more complicated, since a page may exist in RAM but not exist "officially", being part of a list of pages that are to be recycled or such. But this gets too complicated.
No, it doesn't have to. Any page of the virtual memory can be mapped to an arbitrary physical page. Therefore you can have adjacent pages of your virtual memory pointing to non-adjacent physical pages. This mapping is maintained by the OS and is used by the MMU unit of CPU.

Write a cached page before it is reclaimed

everyone. I am stuck on the following question.
I am working on a hybrid storage system which uses an ssd as a cache layer for hard disk. To this end, the data read from the hard disk should be written to the ssd to boost the subsequent reads of this data. Since Linux caches data read from disk in the page cache, the writing of data to the ssd can be delayed; however, the pages caching the data may be freed, and accessing the freed pages is not recommended. Here is the question: I have "struct page" pointers pointing to the pages to be written to the ssd. Is there any way to determine whether the page represented by the pointer is valid or not (by valid I mean the cached page can be safely written to the ssd? What will happen if a freed page is accessed via the pointer? Is the data of the freed page the same as that before freeing?
Are you using cleancache module? You should only get valid pages from it and it should remain valid until your callback function finished.
Isn't this a cleancache/frontswap reimplementation? (https://www.kernel.org/doc/Documentation/vm/cleancache.txt).
The benefit of existing cleancache code is that it calls your code only just before it frees a page, so before the page resides in RAM, and when there is no space left in RAM for it the kernel calls your code to back it up in tmem (transient memory).
Searching I also found an existing project that seems to do exactly this: http://bcache.evilpiepirate.org/:
Bcache is a Linux kernel block layer cache. It allows one or more fast
disk drives such as flash-based solid state drives (SSDs) to act as a
cache for one or more slower hard disk drives.
Bcache patches for the Linux kernel allow one to use SSDs to cache
other block devices. It's analogous to L2Arc for ZFS, but Bcache also
does writeback caching (besides just write through caching), and it's
filesystem agnostic. It's designed to be switched on with a minimum of
effort, and to work well without configuration on any setup. By
default it won't cache sequential IO, just the random reads and writes
that SSDs excel at. It's meant to be suitable for desktops, servers,
high end storage arrays, and perhaps even embedded.
What you are trying to achieve looks like the following:
Before the page is evicted from the pagecache, you want to cache it. This, in concept, is called a Victim cache. You can look for papers around this.
What you need is a way to "pin" the pages targeted for eviction for the duration of the IO. Post IO, you can free the pagecache page.
But, this will delay the eviction, which is possibly needed during memory pressure to create more un-cached pages.
So, one possible solution is to start your caching algorithm a bit before the pagecache eviction starts.
A second possible solution is to set aside a bunch of free pages and exchange the page being evicted form the page cache with a page from the free pool, and cache the evicted page in the background. But, you need to now synchronize with file block deletes, etc

Why does MongoDB's memory mapped files cause programs like top to show larger numbers than normal?

I am trying to wrap my head around the internals of mongodb, and I keep reading about this
http://www.theroadtosiliconvalley.com/technology/mongodb-mongo-nosql-db/
Why does this happen?
So the way memorry mapped files work is that the addresses in memory are mapped byte for byte with a file on disk. This makes it really fast and but really large. Imagine a file on disk for your data taking up that size of memory.
Why it's awesome
In practice, this rocks because writing and reading from memory directly instead of issuing a system call (think context switch) is fast. Also, in practice, the fact that this huge memory mapped chunk doesn't fit in your physical ram is fine. Why? You only need the working set of data to fit in ram because the non-used pages are not loaded and just kept on disk. If they are needed a page fault happens and it gets loaded up. (I believe the portion that has been loaded is referred to as resident memory)
Why it it kind of sucks
Files mapped in memory needs to be page aligned so if you don't use up the memory space on the page boundary exactly you waste space (small tradoff)
Summary (tldnr)
It may look like its taking up a lot of resources because its mapping the entirety of your data to memory addresses but it doesn't really matter as that data isn't actually all being held in RAM. Mongo will pull in data as it needs it and use memory effectively to maintain a performant working set.

What is the difference between buffer and cache memory in Linux?

To me it's not clear what's the difference between the two Linux memory concepts : buffer and cache. I've read through this post and it seems to me that the difference between them is the expiration policy:
buffer's policy is first-in, first-out
cache's policy is Least Recently Used.
Am I right?
In particular, I'm looking at the two commands: free and vmstat
james#utopia:~$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 173 67 912 0 0 19 59 75 1087 24 4 71 1
james#utopia:~$ free -m
total used free shared buffers cached
Mem: 2007 1834 172 0 67 914
-/+ buffers/cache: 853 1153
Swap: 2859 0 2859
Buffers are associated with a specific block device, and cover caching
of filesystem metadata as well as tracking in-flight pages. The cache
only contains parked file data. That is, the buffers remember what's
in directories, what file permissions are, and keep track of what
memory is being written from or read to for a particular block device.
The cache only contains the contents of the files themselves.
quote link
Cited answer (for reference):
Short answer: Cached is the size of the page cache. Buffers is the size of in-memory block I/O buffers. Cached matters; Buffers is largely irrelevant.
Long answer: Cached is the size of the Linux page cache, minus the memory in the swap cache, which is represented by SwapCached (thus the total page cache size is Cached + SwapCached). Linux performs all file I/O through the page cache. Writes are implemented as simply marking as dirty the corresponding pages in the page cache; the flusher threads then periodically write back to disk any dirty pages. Reads are implemented by returning the data from the page cache; if the data is not yet in the cache, it is first populated. On a modern Linux system, Cached can easily be several gigabytes. It will shrink only in response to memory pressure. The system will purge the page cache along with swapping data out to disk to make available more memory as needed.
Buffers are in-memory block I/O buffers. They are relatively short-lived. Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache—i.e., not file data. The Buffers metric is thus of minimal importance. On most systems, Buffers is often only tens of megabytes.
"Buffers" represent how much portion of RAM is dedicated to cache disk blocks. "Cached" is similar like "Buffers", only this time it caches pages from file reading.
quote from:
https://web.archive.org/web/20110207101856/http://www.linuxforums.org/articles/using-top-more-efficiently_89.html
It's not 'quite' as simple as this, but it might help understand:
Buffer is for storing file metadata (permissions, location, etc). Every memory page is kept track of here.
Cache is for storing actual file contents.
Explained by Red Hat:
Cache Pages:
A cache is the part of the memory which transparently stores data so that future requests for that data can be served faster. This memory is utilized by the kernel to cache disk data and improve i/o performance.
The Linux kernel is built in such a way that it will use as much RAM as it can to cache information from your local and remote filesystems and disks. As the time passes over various reads and writes are performed on the system, kernel tries to keep data stored in the memory for the various processes which are running on the system or the data that of relevant processes which would be used in the near future. The cache is not reclaimed at the time when process get stop/exit, however when the other processes requires more memory then the free available memory, kernel will run heuristics to reclaim the memory by storing the cache data and allocating that memory to new process.
When any kind of file/data is requested then the kernel will look for a copy of the part of the file the user is acting on, and, if no such copy exists, it will allocate one new page of cache memory and fill it with the appropriate contents read out from the disk.
The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere in the disk. When some data is requested, the cache is first checked to see whether it contains that data. The data can be retrieved more quickly from the cache than from its source origin.
SysV shared memory segments are also accounted as a cache, though they do not represent any data on the disks. One can check the size of the shared memory segments using ipcs -m command and checking the bytes column.
Buffers:
Buffers are the disk block representation of the data that is stored under the page caches. Buffers contains the metadata of the files/data which resides under the page cache.
Example: When there is a request of any data which is present in the page cache, first the kernel checks the data in the buffers which contain the metadata which points to the actual files/data contained in the page caches. Once from the metadata the actual block address of the file is known, it is picked up by the kernel for processing.
buffer and cache.
A buffer is something that has yet to be "written" to disk.
A cache is something that has been "read" from the disk and stored for later use.
I think this page will help understanding the difference between buffer and cache deeply. http://www.tldp.org/LDP/sag/html/buffer-cache.html
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Since memory is, unfortunately, a finite, nay, scarce resource, the buffer cache usually cannot be big enough (it can't hold all the data one ever wants to use). When the cache fills up, the data that has been unused for the longest time is discarded and the memory thus freed is used for the new data.
Disk buffering works for writes as well. On the one hand, data that is written is often soon read again (e.g., a source code file is saved to a file, then read by the compiler), so putting data that is written in the cache is a good idea. On the other hand, by only putting the data into the cache, not writing it to disk at once, the program that writes runs quicker. The writes can then be done in the background, without slowing down the other programs.
Seth Robertson's Link 2 said "For thorough understanding of those terms, refer to Linux kernel book like Linux Kernel Development by Robert M. Love."
I found some contents about 'buffer' in the 2nd edition of the book.
Although the physical device itself is addressable at the sector level, the kernel performs all disk operations in terms of blocks.
When a block is stored in memory (say, after a read or pending a write), it is stored in a 'buffer'. Each 'buffer' is associated with exactly one block. The 'buffer' serves as the object that represents a disk block in memory.
A 'buffer' is the in-memory representation of a single physical disk block.
Block I/O operations manipulate a single disk block at a time. A common block I/O operation is reading and writing inodes. The kernel provides the bread() function to perform a low-level read of a single block from disk. Via 'buffers', disk blocks are mapped to their associated in-memory pages. "
Quote from the book:
Introduction to Information Retrieval
Cache
We want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
Buffer
Operating systems generally read and write entire blocks. Thus, reading a single byte from disk can take as much time as reading the entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.
Buffer is an area of memory used to temporarily store data while it's being moved from one place to another.
Cache is a temporary storage area used to store frequently accessed data for rapid access. Once the data is stored in the cache, future use can be done by accessing the cached copy rather than re-fetching the original data, so that the average access time is shorter.
Note: buffer and cache can be allocated on disk as well
Buffer contains metadata which helps improve write performance
Cache contains the file content itself (sometimes yet to write to disk) which improves read performance
For starters the general concept would be helpful, a buffer is an area of memory used to temporarily store data while being moved from one place to another. On the other hand, a cache is a temporary storage area to store frequently accessed data for rapid access.
In Linux:
The cache in Linux is called Page Cache. It is that certain amount of system memory that the kernel reserves for caching the file system disk accesses. This is to make overall performance faster. During Linux read system calls, the kernel checks if the cache contains the requested blocks of data. If it does, then that would be a successful cache hit. The cache returns this data without doing any I/O to the disk system. The Linux cache approach is called a write-back cache. This means first, the data is written to cache memory and marked as dirty until synchronized to disk. Then, the kernel maintains the internal data structure to optimize which data to evict from the cache when the cache demands any additional space. For example, when memory usage reaches certain thresholds, background tasks start writing dirty data to disk, thereby emptying the memory cache.
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Cache: This is a place acquired by kernel on physical RAM to store pages in caches. Now we need some sort of index to get the address of pages from caches. Here we need the buffer for page caches which keeps metadata of page cache.
From the man page for free:
DESCRIPTION
free displays the total amount of free and used physical and swap memory in the system, as well as the buffers and caches used by the
kernel. The information is gathered by parsing /proc/meminfo. The displayed columns are:
total Total installed memory (MemTotal and SwapTotal in /proc/meminfo)
used Used memory (calculated as total - free - buffers - cache)
free Unused memory (MemFree and SwapFree in /proc/meminfo)
shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
buffers
Memory used by kernel buffers (Buffers in /proc/meminfo)
cache Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache
Sum of buffers and cache
available
Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache
or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will be reclaimed due to
items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as
free)

Resources