Writeback of dirty pages in Linux

I have a question regarding the writeback of dirty pages. If only a portion of a page is modified, will writeback write the whole page to disk, or only the modified part of the page?

The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to tell which bytes of a 4096-byte page have actually changed and which have not.
Theoretically, the disk driver layer could compare old and new contents and skip writing the 512-byte blocks that have not changed.
However, this would mean that - if the blocks are no longer in the disk cache - the page must first be read back from disk and compared before writing.
I do not think Linux does it that way, because reading the page back from disk would cost too much time.
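If you want to confirm the page granularity on your own machine, here is a minimal sketch using the standard POSIX call (nothing here is specific to writeback; it just reports the unit in which the page cache tracks dirtiness):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* the page cache tracks dirty state at this granularity */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}

On typical x86 systems this prints 4096.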

On EACH hardware interrupt, the CPU would like to transfer as much data as the hard disk controller can handle - this size is what we call the block size (or ONE sector, in Linux terms):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting on a SINGLE interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (like 512 bytes) so that the CPU can handle other tasks while each 512-byte chunk goes down. Therefore, whether you changed one byte or 511 bytes, as long as the change is within a single block, the whole block gets written at the same time. And throughout the Linux kernel, flagging a block as dirty (to be written) or not goes by a single unique identifier, the sector number, so anything smaller than the sector size is too difficult to manage efficiently.
All that said, don't forget that the hard disk controller itself also has a minimum block size for write operations.
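If you want to see those sizes for a given device, here is a small sketch using the standard block-layer ioctls (run it against a device node you can open, e.g. as root; /dev/sda is only an example):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";  /* example device path */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;            /* sector size the kernel addresses I/O with */
    unsigned int physical = 0;  /* smallest write the drive performs internally */
    if (ioctl(fd, BLKSSZGET, &logical) == 0)
        printf("logical sector size:  %d bytes\n", logical);
    if (ioctl(fd, BLKPBSZGET, &physical) == 0)
        printf("physical sector size: %u bytes\n", physical);

    close(fd);
    return 0;
}

The same numbers are exposed in sysfs under /sys/block/<dev>/queue/logical_block_size and physical_block_size.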

Related

Storing and writing to text file in STM32 F4 internal flash memory

I need to store a text file in the STM32 F446RE internal flash memory. This text file will contain log data that needs to be written to and updated consistently. I know there are a couple of ways of writing to it, including embedding the text as constant string/data in the source code or implementing a file system like FatFs (not suitable for STM32 F4 flash due to its sector orientation). The flash has 8 sectors (0-7) that vary in size: sectors 0-3 each contain 16 kB, sector 4 contains 64 kB, and sectors 5-7 each contain 128 kB. This translates to a total of 512 kB of flash memory. These are not sufficient for what I'm looking for, and I was wondering if anyone has ideas? I'm using the STM32CubeIDE.
Writing to FLASH memory requires an erase operation first. It seems that you already know that erase operations must be performed on whole sectors. Note also that FLASH memory wears out with repeated erase/write cycles.
I suggest one of three approaches depending upon how much data you must store and your coding abilities.
An "in-chip" approach is to implement a circular buffer in RAM and maintain your log there. If power is lost then you need code to commit that RAM buffer to FLASH. On power-up you need code to restore the RAM buffer from FLASH. This implies that your design does not suffer frequent power cycles and that you can maintain power to the microcontroller long enough to save the buffer from RAM to FLASH.
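Here is a minimal sketch of that "in-chip" idea, assuming one reserved flash sector for the committed copy; flash_erase_sector() and flash_program() are placeholders for whatever HAL routines your part uses, and all names and sizes are illustrative only:

#include <stdint.h>

#define LOG_BUF_SIZE 4096u               /* size of the RAM log, arbitrary for this sketch */

static uint8_t  log_buf[LOG_BUF_SIZE];
static uint32_t log_head;                /* next write position */
static uint8_t  log_wrapped;             /* set once the buffer has wrapped around */

/* Append a record to the RAM ring buffer. */
void log_append(const uint8_t *data, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++) {
        log_buf[log_head] = data[i];
        log_head = (log_head + 1u) % LOG_BUF_SIZE;
        if (log_head == 0u)
            log_wrapped = 1u;
    }
}

/* Placeholders for your flash driver / HAL calls. */
extern void flash_erase_sector(uint32_t sector);
extern void flash_program(uint32_t addr, const uint8_t *src, uint32_t len);

/* Call this from a power-fail interrupt or on orderly shutdown:
 * erase the reserved sector, then program the whole RAM buffer into it. */
void log_commit(uint32_t sector, uint32_t flash_addr)
{
    flash_erase_sector(sector);
    flash_program(flash_addr, log_buf, LOG_BUF_SIZE);
}

On power-up you would do the reverse: copy the committed sector back into log_buf before logging resumes.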
Next option is to use an external memory chip. EEPROMs are not terribly fast and are also subject to wear. FRAM is fast and suffers trillions of writes before wear issues. It is available as I2C or SPI so you can implement a number of chips to provide a reasonable buffer size and handle the chip to memory mapping in your handler code. FRAM is not cheap though.
Finally, there is an option to add an SSD drive. These devices include "wear levelling" to maintain their active lives. However, you need a suitable interface such as USB or PCI.
HTH

how to choose chunk size when reading a large file?

I know that reading a file with a chunk size that is a multiple of the filesystem block size is better.
1) Why is that the case? I mean, let's say the block size is 8 KB and I read 9 KB. That means it has to go and get 16 KB and then throw away the extra 7 KB.
Yes, it did do some extra work, but does that make much of a difference unless your block size is really huge?
I mean, yes, if I am reading a 1 TB file then this definitely makes a difference.
The other reason I can think of is that the block size refers to a group of sectors on the hard disk (please correct me). So it could be pointing to 8 or 16 or 32 sectors, or just one sector. So your hard disk would essentially have to do more work if the block points to a lot more sectors? Am I right?
2) So let's say the block size is 8 KB. Do I now read 16 KB at a time? 1 MB? 1 GB? What should I use as a chunk size?
I know available memory is a limitation, but apart from that, what other factors affect my choice?
Thanks a lot in advance for all the answers.
Theoretically, the fastest I/O occurs when the buffer is page-aligned and its size is a multiple of the system block size.
If the file were stored contiguously on the hard disk, the fastest throughput would be attained by reading cylinder by cylinder. (There might even be no rotational latency then, since when you read a whole track you don't need to start at the beginning; you can start in the middle and wrap around.) Unfortunately, nowadays that is nearly impossible to do, since the hard disk firmware hides the physical layout of the sectors and may use replacement sectors that require seeks even while reading a single track. The OS file system may also try to spread the file blocks all over the disk (or at least all over a cylinder group) to avoid long seeks over big files when accessing small files.
So instead of considering physical tracks, you can try to take the hard disk's buffer size into account. Most hard disks have a buffer size of 8 MB, some 16 MB, so reading the file in chunks of up to 1 MB or 2 MB should let the disk firmware optimize the throughput without stalling its buffer.
But then, if there are a lot of layers above, e.g. a RAID, all bets are off.
Really, the best you can do is to benchmark your particular circumstances.
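For example, here is a small benchmarking sketch along those lines: it times a plain read() loop for one chunk size passed on the command line. (Drop the page cache between runs, e.g. by writing 3 to /proc/sys/vm/drop_caches as root, or you will mostly be measuring memory speed.)

/* chunk_bench.c - time reading a file with a given chunk size.
 * Usage: ./chunk_bench <file> <chunk_bytes> */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file chunk_bytes\n", argv[0]);
        return 1;
    }

    size_t chunk = (size_t)atol(argv[2]);
    char *buf = malloc(chunk);
    int fd = open(argv[1], O_RDONLY);
    if (!buf || fd < 0) { perror("setup"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)   /* sequential read in fixed-size chunks */
        total += n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.3f s (%.1f MB/s) with %zu-byte chunks\n",
           total, secs, total / secs / 1e6, chunk);

    close(fd);
    free(buf);
    return 0;
}

Run it with chunk sizes from a few kilobytes up to a few megabytes and watch where the throughput curve flattens out.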

Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes >4KB at a time (even MBs in size).
My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed to the block device driver, which needs to physically fill each block with data from the device (e.g., for a write).
I am also aware that the kernel page size plays a role in this limitation, as it is set to 4KB.
As an experiment, I want to find out if there is a way to actually increase this block size, so that we can save some time (instead of doing multiple 4KB writes, we could do one write with a bigger block size).
Is there any FS or any existing project that I can look at for this?
If not, what is needed to do this experiment - which parts of Linux need to be modified?
I'm trying to find out the level of difficulty and the resources needed. Or, whether it is even impossible and/or there is any reason why we don't even need to do this. Any comment is appreciated.
Thanks.
The 4k limitation is due to the page cache. The main issue is that if you have a 4k page size, but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000, and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if the block is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
Also consider that even if you did hire a Linux VM expert to do this work, (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDD's with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup who is hoping to sell your SSD product into the enterprise market --- you might as well give up now with this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large cloud company that makes its own hardware (a la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but for that reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space; maybe you could collaborate with them to see if together you could do this kind of development work and try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that it not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private change to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.
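If you prefer a programmatic variant of that dd test, here is a sketch that issues one large read against the device with O_DIRECT, so the page cache neither splits nor satisfies the request (the device path and request size are only examples; O_DIRECT requires the buffer, offset, and length to be suitably aligned):

/* big_read.c - issue one large read directly against a block device,
 * bypassing the page cache, to see what request sizes actually reach
 * the driver (watch e.g. iostat -x while it runs). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* example device path */
    size_t len = 1024 * 1024;                           /* 1 MiB request, example size */

    void *buf;
    if (posix_memalign(&buf, 4096, len)) { perror("posix_memalign"); return 1; }

    int fd = open(dev, O_RDONLY | O_DIRECT);            /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = pread(fd, buf, len, 0);                 /* one large read at offset 0 */
    printf("read %zd bytes in one call\n", n);

    close(fd);
    free(buf);
    return 0;
}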

What is the difference between buffer and cache memory in Linux?

To me it's not clear what the difference is between the two Linux memory concepts: buffer and cache. I've read through this post and it seems to me that the difference between them is the expiration policy:
buffer's policy is first-in, first-out
cache's policy is Least Recently Used.
Am I right?
In particular, I'm looking at the two commands: free and vmstat
james@utopia:~$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 0 0 173 67 912 0 0 19 59 75 1087 24 4 71 1
james@utopia:~$ free -m
total used free shared buffers cached
Mem: 2007 1834 172 0 67 914
-/+ buffers/cache: 853 1153
Swap: 2859 0 2859
Buffers are associated with a specific block device, and cover caching of filesystem metadata as well as tracking in-flight pages. The cache only contains parked file data. That is, the buffers remember what's in directories, what file permissions are, and keep track of what memory is being written from or read to for a particular block device. The cache only contains the contents of the files themselves.
Cited answer (for reference):
Short answer: Cached is the size of the page cache. Buffers is the size of in-memory block I/O buffers. Cached matters; Buffers is largely irrelevant.
Long answer: Cached is the size of the Linux page cache, minus the memory in the swap cache, which is represented by SwapCached (thus the total page cache size is Cached + SwapCached). Linux performs all file I/O through the page cache. Writes are implemented as simply marking as dirty the corresponding pages in the page cache; the flusher threads then periodically write back to disk any dirty pages. Reads are implemented by returning the data from the page cache; if the data is not yet in the cache, it is first populated. On a modern Linux system, Cached can easily be several gigabytes. It will shrink only in response to memory pressure. The system will purge the page cache along with swapping data out to disk to make available more memory as needed.
Buffers are in-memory block I/O buffers. They are relatively short-lived. Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache—i.e., not file data. The Buffers metric is thus of minimal importance. On most systems, Buffers is often only tens of megabytes.
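If you want to see the raw numbers that free and vmstat summarise, here is a small sketch that pulls the relevant lines straight from /proc/meminfo:

/* meminfo.c - print the Buffers, Cached and SwapCached lines from /proc/meminfo */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        /* keep only the fields discussed above */
        if (!strncmp(line, "Buffers:", 8) ||
            !strncmp(line, "Cached:", 7) ||
            !strncmp(line, "SwapCached:", 11))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}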
"Buffers" represent how much portion of RAM is dedicated to cache disk blocks. "Cached" is similar like "Buffers", only this time it caches pages from file reading.
quote from:
https://web.archive.org/web/20110207101856/http://www.linuxforums.org/articles/using-top-more-efficiently_89.html
It's not 'quite' as simple as this, but it might help understand:
Buffer is for storing file metadata (permissions, location, etc). Every memory page is kept track of here.
Cache is for storing actual file contents.
Explained by Red Hat:
Cache Pages:
A cache is the part of memory which transparently stores data so that future requests for that data can be served faster. This memory is utilized by the kernel to cache disk data and improve I/O performance.
The Linux kernel is built in such a way that it will use as much RAM as it can to cache information from your local and remote filesystems and disks. As time passes and various reads and writes are performed on the system, the kernel tries to keep data stored in memory for the various processes running on the system, or for relevant processes that will use that data in the near future. The cache is not reclaimed when a process stops or exits; however, when other processes require more memory than is freely available, the kernel runs heuristics to reclaim memory by dropping cached data and allocating that memory to the new process.
When any kind of file/data is requested, the kernel will look for a copy of the part of the file the user is acting on and, if no such copy exists, it will allocate one new page of cache memory and fill it with the appropriate contents read from the disk.
The data stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere on the disk. When some data is requested, the cache is first checked to see whether it contains that data; the data can be retrieved more quickly from the cache than from its original source.
SysV shared memory segments are also accounted as a cache, though they do not represent any data on the disks. One can check the size of the shared memory segments using ipcs -m command and checking the bytes column.
Buffers:
Buffers are the disk-block representation of the data that is stored in the page cache. Buffers contain the metadata of the files/data that reside in the page cache.
Example: when there is a request for data that is present in the page cache, the kernel first checks the buffers, which contain the metadata pointing to the actual files/data in the page cache. Once the actual block address of the file is known from the metadata, the kernel picks it up for processing.
buffer and cache.
A buffer is something that has yet to be "written" to disk.
A cache is something that has been "read" from the disk and stored for later use.
I think this page will help you understand the difference between buffer and cache more deeply: http://www.tldp.org/LDP/sag/html/buffer-cache.html
Reading from a disk is very slow compared to accessing (real) memory. In addition, it is common to read the same part of a disk several times during relatively short periods of time. For example, one might first read an e-mail message, then read the letter into an editor when replying to it, then make the mail program read it again when copying it to a folder. Or, consider how often the command ls might be run on a system with many users. By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
Since memory is, unfortunately, a finite, nay, scarce resource, the buffer cache usually cannot be big enough (it can't hold all the data one ever wants to use). When the cache fills up, the data that has been unused for the longest time is discarded and the memory thus freed is used for the new data.
Disk buffering works for writes as well. On the one hand, data that is written is often soon read again (e.g., a source code file is saved to a file, then read by the compiler), so putting data that is written in the cache is a good idea. On the other hand, by only putting the data into the cache, not writing it to disk at once, the program that writes runs quicker. The writes can then be done in the background, without slowing down the other programs.
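To make that write-behind behaviour concrete, here is a minimal sketch: write() returns as soon as the data sits in the page cache as dirty pages, and fsync() is what actually waits for the writeback to reach the disk:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "hello, page cache\n";
    if (write(fd, msg, strlen(msg)) < 0)   /* completes once the data is a dirty page in cache */
        perror("write");
    if (fsync(fd) < 0)                     /* blocks until the dirty page is written back */
        perror("fsync");

    close(fd);
    return 0;
}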
Seth Robertson's Link 2 said "For thorough understanding of those terms, refer to Linux kernel book like Linux Kernel Development by Robert M. Love."
I found some content about 'buffer' in the 2nd edition of the book.
Although the physical device itself is addressable at the sector level, the kernel performs all disk operations in terms of blocks.
When a block is stored in memory (say, after a read or pending a write), it is stored in a 'buffer'. Each 'buffer' is associated with exactly one block. The 'buffer' serves as the object that represents a disk block in memory.
A 'buffer' is the in-memory representation of a single physical disk block.
Block I/O operations manipulate a single disk block at a time. A common block I/O operation is reading and writing inodes. The kernel provides the bread() function to perform a low-level read of a single block from disk. Via 'buffers', disk blocks are mapped to their associated in-memory pages.
Quote from the book:
Introduction to Information Retrieval
Cache
We want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
Buffer
Operating systems generally read and write entire blocks. Thus, reading a single byte from disk can take as much time as reading the entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.
Buffer is an area of memory used to temporarily store data while it's being moved from one place to another.
Cache is a temporary storage area used to store frequently accessed data for rapid access. Once the data is stored in the cache, future use can be done by accessing the cached copy rather than re-fetching the original data, so that the average access time is shorter.
Note: buffer and cache can be allocated on disk as well
Buffer contains metadata, which helps improve write performance.
Cache contains the file content itself (sometimes not yet written to disk), which improves read performance.
For starters, the general concept is helpful: a buffer is an area of memory used to temporarily store data while it is being moved from one place to another. A cache, on the other hand, is a temporary storage area used to store frequently accessed data for rapid access.
In Linux:
The cache in Linux is called the page cache. It is the portion of system memory that the kernel uses to cache file system and disk accesses, to make overall performance faster. During a Linux read system call, the kernel checks whether the cache contains the requested blocks of data; if it does, that is a cache hit, and the cache returns this data without doing any I/O to the disk. The Linux cache approach is called a write-back cache: data is first written to cache memory and marked as dirty until it is synchronized to disk. The kernel maintains internal data structures to decide which data to evict from the cache when the cache needs additional space; for example, when memory usage reaches certain thresholds, background tasks start writing dirty data to disk, thereby freeing cache memory.
Cache: this is space the kernel sets aside in physical RAM to store pages. We then need some sort of index to find the addresses of pages in the cache; that is where the buffers for the page cache come in, holding the metadata of the page cache.
From the man page for free:
DESCRIPTION
free displays the total amount of free and used physical and swap memory in the system, as well as the buffers and caches used by the kernel. The information is gathered by parsing /proc/meminfo. The displayed columns are:
total - Total installed memory (MemTotal and SwapTotal in /proc/meminfo)
used - Used memory (calculated as total - free - buffers - cache)
free - Unused memory (MemFree and SwapFree in /proc/meminfo)
shared - Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
buffers - Memory used by kernel buffers (Buffers in /proc/meminfo)
cache - Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache - Sum of buffers and cache
available - Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache or free fields, this field takes into account the page cache and also that not all reclaimable memory slabs will be reclaimed due to items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as free)

What are the exact conditions based on which Linux swaps a process's memory from RAM to a swap file?

My server has 8 GB of RAM and 8 GB configured as a swap file. I have memory-intensive apps running. These apps have peak loads during which we see swap usage increase; approximately 1 GB of swap is used.
I have another server with 4 GB of RAM and 8 GB of swap, with similar memory-intensive apps running on it. But here swap usage is very negligible, around 100 MB.
I was wondering what the exact conditions are, or a rough formula, based on which Linux will swap out a process's memory from RAM to the swap file.
I know it's based on the swappiness factor. What else is it based on? Swap file size? Any pointers to Linux kernel documentation/source code explaining this would be great.
I've seen a lot of people posting subjective explanations of what this does. Here is hopefully a more complete answer.
In the split LRU (post-2.6.28 Linux), swappiness is a multiplier used to arbitrarily modify the fraction that is calculated to determine the pressure built up in both LRUs.
So, for example, on a system with no free memory left, the value of the existing memory you have is measured based on how much memory is listed as 'Active' and how often pages are promoted to active after falling into the inactive list.
An LRU with many promotions/demotions of pages between active and inactive is in heavy use.
Typically, file-backed storage is cheaper and safer to evict when you're running out of memory, and it is automatically given a modifier of 200 (this makes file-backed memory 200 times more worthless than swap-backed memory, which has a value of 0, when this fraction is multiplied).
What swappiness does is modify this value by deducting the swappiness number you gave (default 60) from file memory and adding the swappiness value you gave to anon memory. Thus the default swappiness leaves you with anonymous memory being 80 times more valuable than file memory (200-60 for file, 0+60 for anon). Thus, on a typical Linux system that has used up all its memory, the page cache would have to be 80 TIMES more active than anonymous memory for anonymous memory to be swapped out in favour of page cache.
If you set swappiness to 100, anon gets a modifier of 100 and file memory gets a modifier of 100 as well (200 - 100), leaving both LRUs equally weighted. Thus, on a file-heavy system that wants page cache, if the anon memory is not as active as the page cache, anon memory will be swapped to disk to make space for extra page cache.
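To make the arithmetic described above concrete, here is a tiny illustration (it mirrors the answer's description of the weighting, not the exact kernel code):

#include <stdio.h>

int main(void)
{
    /* per the description above: file pages get (200 - swappiness),
     * anon pages get (0 + swappiness) as their multiplier */
    for (int swappiness = 0; swappiness <= 100; swappiness += 20) {
        int file_weight = 200 - swappiness;
        int anon_weight = swappiness;
        printf("swappiness=%3d -> file weight %3d, anon weight %3d\n",
               swappiness, file_weight, anon_weight);
    }
    return 0;
}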
Linux (or any other OS) divides memory up into pages (typically 4 KB). Each of these pages represents a chunk of memory. Usage information for these pages is maintained, which basically contains info about whether the page is free or in use (part of some process), whether it has been accessed recently, what kind of data it contains (process data, executable code, etc.), the owner of the page, and so on. These pages can also be broadly divided into two categories - filesystem pages, or the page cache (in which all data read from/written to your filesystem resides), and pages belonging to processes.
When the system is running low on memory, the kernel starts swapping out pages based on their usage. Using a list of pages sorted with respect to recency of access is a common way of determining which pages can be swapped out (the Linux kernel has such a list too).
During swapping, the Linux kernel needs to decide what to trade off when evicting pages from memory and sending them to swap. If it swaps filesystem pages too aggressively, more reads are required from the filesystem to bring those pages back when they are needed. However, if it swaps out process pages too aggressively, it can hurt interactivity, because when the user tries to use the swapped-out processes they will have to be read back from the disk. See a nice discussion here on this.
By setting swappiness = 0, you are telling the Linux kernel not to swap out pages belonging to processes. When setting swappiness = 100 instead, you tell the kernel to swap out pages belonging to processes more aggressively. To tune your system, try changing the swappiness parameter in steps of 10, monitoring performance and pages being swapped in/out at each setting using the "vmstat" command. Keep the setting that gives you the best results. Remember to do this testing during peak usage hours. :)
For database applications, swappiness = 0 is generally recommended. (Even then, test different settings on your systems to arrive at a good value.)
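For completeness, the swappiness knob can be read and changed at runtime through /proc/sys/vm/swappiness (the same value that sysctl vm.swappiness reports); here is a small sketch, noting that a change made this way does not persist across reboots:

#include <stdio.h>

int main(int argc, char **argv)
{
    /* read the current value */
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    int current = -1;
    if (f) { if (fscanf(f, "%d", &current) != 1) current = -1; fclose(f); }
    printf("current swappiness: %d\n", current);

    /* optionally write a new value, e.g. ./a.out 10 (requires root) */
    if (argc > 1) {
        f = fopen("/proc/sys/vm/swappiness", "w");
        if (!f) { perror("changing swappiness requires root"); return 1; }
        fprintf(f, "%s\n", argv[1]);
        fclose(f);
    }
    return 0;
}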
References:
http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness/
http://www.pythian.com/news/1913/
