Do Kernel pages get swapped out? - linux

Pertaining to the Linux kernel, do kernel pages ever get swapped out? Also, do user-space pages ever get to reside in ZONE_NORMAL?

No, kernel memory is unswappable.

Kernel pages are not swappable, but they can be freed.
User-space pages can reside in ZONE_NORMAL.
A Linux system can be configured to use ZONE_HIGHMEM or not.
If ZONE_HIGHMEM is configured, user-space processes get their memory from ZONE_HIGHMEM; otherwise they get it from ZONE_NORMAL.
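To illustrate the zone selection (a minimal kernel-module sketch, not from this thread; the module name and messages are made up), the zone an allocation comes from follows its GFP flags: GFP_KERNEL allocations come from lowmem (ZONE_DMA/ZONE_NORMAL), while GFP_HIGHUSER allocations may land in ZONE_HIGHMEM on a 32-bit kernel built with CONFIG_HIGHMEM:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/highmem.h>

static int __init zones_demo_init(void)
{
	/* Kernel-internal allocation: lowmem, permanently mapped, never swapped. */
	struct page *kpage = alloc_page(GFP_KERNEL);
	/* Typical flags for a user page: comes from ZONE_HIGHMEM if configured,
	 * otherwise falls back to ZONE_NORMAL. */
	struct page *upage = alloc_page(GFP_HIGHUSER);

	if (kpage && upage) {
		void *va = kmap(upage); /* highmem pages need a temporary mapping */
		pr_info("kernel pfn=%lu, user-style pfn=%lu mapped at %p\n",
			page_to_pfn(kpage), page_to_pfn(upage), va);
		kunmap(upage);
	}
	if (kpage)
		__free_page(kpage);
	if (upage)
		__free_page(upage);
	return 0;
}

static void __exit zones_demo_exit(void) { }

module_init(zones_demo_init);
module_exit(zones_demo_exit);
MODULE_LICENSE("GPL");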

Under normal circumstances kernel pages (i.e., memory residing in the kernel for kernel usage) are not swappable; in fact, if a fault on such memory is detected (see the page fault handler source code), the kernel will explicitly crash itself.
See this:
http://lxr.free-electrons.com/source/arch/x86/mm/fault.c
and the function:
/*
 * This routine handles page faults. It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 *
 * This function must have noinline because both callers
 * {,trace_}do_page_fault() have notrace on. Having this an actual function
 * guarantees there's a function trace entry.
 */
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
		unsigned long address)
{
And the detection here:
 *
 * This verifies that the fault happens in kernel space
 * (error_code & 4) == 0, and that the fault was not a
 * protection error (error_code & 9) == 0.
 */
if (unlikely(fault_in_kernel_space(address))) {
	if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
		if (vmalloc_fault(address) >= 0)
			return;

		if (kmemcheck_fault(regs, address, error_code))
			return;
	}
But the same page fault handler, which can detect a page fault arising from non-existent user-mode memory (all hardware page fault detection is done in the kernel), will explicitly retrieve the data from swap space if it exists, or start a memory allocation routine to give the process more memory.
That said, the kernel does swap out kernel structures/memory/task lists during software suspend and hibernation:
https://www.kernel.org/doc/Documentation/power/swsusp.txt
During the resume phase it restores the kernel memory from the swap file.

Related

Where are struct page* initialized in Linux?

As we know, a struct page in Linux is associated with a physical 4 KB page and mapped to a PFN. This forms the backbone of memory allocation in Linux.
struct page is defined in include/linux/mm_types.h and contains a variety of information about the page. I want to know: when are these struct page structures allocated during boot, who initializes them, and where in Linux?
I was able to get some idea of where the pages are stored from this answer: https://stackoverflow.com/a/63893944/13286624, but I'm unable to find how these structures are allocated and initialized.
Maybe try mm_init(), called by start_kernel() in init/main.c, and the corresponding mem_init() in arch/x86/mm/init_64.c:
void __init mem_init(void)
{
	pci_iommu_alloc();

	/* clear_bss() already clear the empty_zero_page */

	/* this will put all memory onto the freelists */
	memblock_free_all();
	after_bootmem = 1;
	x86_init.hyper.init_after_bootmem();

	/*
	 * Must be done after boot memory is put on freelist, because here we
	 * might set fields in deferred struct pages that have not yet been
	 * initialized, and memblock_free_all() initializes all the reserved
	 * deferred pages for us.
	 */
	register_page_bootmem_info();
You should take a look at the include/linux/gfp.h file. It has a bunch of alloc_page type functions. From there, using cscope or your favorite IDE, you can see every function that calls these functions and that will give you a sense of how the functions are used.
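For instance (a hedged sketch of that API in a module context; struct_page_demo is a made-up name), alloc_pages() hands back a pointer into the pre-initialized struct page array rather than building a new descriptor:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/printk.h>

/* Grab 2^2 = 4 contiguous pages, inspect their struct page, free them. */
static void struct_page_demo(void)
{
	struct page *page = alloc_pages(GFP_KERNEL, 2);

	if (!page)
		return;
	/* The allocator only returns a pointer into the memmap/vmemmap array
	 * that was set up at boot; no struct page is allocated here. */
	pr_info("pfn=%lu, mapped at %p\n", page_to_pfn(page), page_address(page));
	__free_pages(page, 2);
}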

Understanding IRQs usage

In the last couple of days, I've been trying to implement a simple interrupt handler in C. So far so good; I think I've achieved my initial task. My ultimate goal is to inject some faults into the kernel by bit-flipping some CPU registers. After reading this, this and this, I chose IRQ number 10, since it seems to be a free, "open" interrupt. My question is: how can I know if this is the "best" IRQ on which to call my interrupt handler? How can I decide this?
Thanks and all the best,
João
P.S.: As far as I know, there are 16 IRQ lines, but Linux has 256 IDT entries, as stated in /include/asm/irq_vectors.h (please see below). This also puzzles me.
/*
* Linux IRQ vector layout.
*
* There are 256 IDT entries (per CPU - each entry is 8 bytes) which can
* be defined by Linux. They are used as a jump table by the CPU when a
* given vector is triggered - by a CPU-external, CPU-internal or
* software-triggered event.
*
* Linux sets the kernel code address each entry jumps to early during
* bootup, and never changes them. This is the general layout of the
* IDT entries:
*
* Vectors 0 ... 31 : system traps and exceptions - hardcoded events
* Vectors 32 ... 127 : device interrupts
* Vector 128 : legacy int80 syscall interface
* Vectors 129 ... 237 : device interrupts
* Vectors 238 ... 255 : special interrupts
*
* 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
*
* This file enumerates the exact layout of them:
*/
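For reference (a hedged sketch, not an answer from the thread; "irq_demo" is a made-up name and line 10 is simply the question's example), a handler is attached to an IRQ line with request_irq() and detached with free_irq():

#include <linux/interrupt.h>
#include <linux/module.h>

#define DEMO_IRQ 10 /* availability of this line is hardware-specific */

static irqreturn_t demo_handler(int irq, void *dev_id)
{
	/* Keep this short: it runs in interrupt context. */
	pr_info("IRQ %d fired\n", irq);
	return IRQ_HANDLED;
}

static int __init irq_demo_init(void)
{
	/* IRQF_SHARED allows sharing the line; dev_id must then be non-NULL. */
	return request_irq(DEMO_IRQ, demo_handler, IRQF_SHARED, "irq_demo",
			   &demo_handler);
}

static void __exit irq_demo_exit(void)
{
	free_irq(DEMO_IRQ, &demo_handler);
}

module_init(irq_demo_init);
module_exit(irq_demo_exit);
MODULE_LICENSE("GPL");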

what does "malloc_trim(0)" really mean?

The manual page tells me a lot, and from it I learned much of the background of glibc's memory management.
But I'm still confused. Does malloc_trim(0) (note the zero argument) mean (1) all the memory in the heap section is returned to the OS, or (2) just all "unused" memory at the topmost region of the heap is returned to the OS?
If the answer is (1), what about the memory in the heap that is still in use? If the heap has used memory somewhere in the middle, will it be eliminated, or will the call simply fail?
And if the answer is (2), what about the "holes" at places other than the top of the heap? They're unused memory too, but if the topmost region of the heap is still in use, will this call achieve anything useful?
Thanks.
The man page of malloc_trim was committed here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and, as I understand it, was written from scratch in 2012 by the man-pages project maintainer, kerrisk: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
Grepping glibc's git shows there are no man pages in glibc, and no commit to a malloc_trim man page documenting this behavior. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
There are malloc_trim comments in malloc/malloc.c, under "Additional functions":

/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of a chunk is not documented as text in malloc/malloc.c and not documented in the man-pages project. The man page from 2012 may be the first man page of the function, written not by the authors of glibc. The info page of glibc only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also doesn't document memusage/memusagestat/libmemusage.so).
In December 2007 there was commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper (it is part of glibc 2.9 and newer) which changed the mtrim implementation (but it didn't change any documentation or man page, as there are no man pages in glibc):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
Unused parts of chunks (anywhere, including chunks in the middle of the heap), page-aligned and at least a page in size, may now be marked MADV_DONTNEED: https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=malloc/malloc.c;h=c54c203cbf1f024e72493546221305b4fd5729b7;hp=1e716089a2b976d120c304ad75dd95c63737ad75;hb=68631c8eb92ff38d9da1ae34f6aa048539b199cc;hpb=52386be756e113f20502f181d780aecc38cbb66a
INTERNAL_SIZE_T size = chunksize (p);
if (size > psm1 + sizeof (struct malloc_chunk))
  {
    /* See whether the chunk contains at least one unused page.  */
    char *paligned_mem = (char *) (((uintptr_t) p
                                    + sizeof (struct malloc_chunk)
                                    + psm1) & ~psm1);
    assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
    assert ((char *) p + size > paligned_mem);
    /* This is the size we could potentially free.  */
    size -= paligned_mem - (char *) p;
    if (size > psm1)
      madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
  }
These are the only two usages of madvise with MADV_DONTNEED in glibc now: one for the top part of heaps (shrink_heap) and the other marking any chunk (mtrim): http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535  __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
We can test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>

int main()
{
    int *m1, *m2, *m3, *m4;
    printf("%s\n", "Test started");
    m1 = (int*)malloc(20000);
    m2 = (int*)malloc(40000);
    m3 = (int*)malloc(80000);
    m4 = (int*)malloc(10000);
    // check that all arrays are allocated on the heap and not with mmap
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    // free 40000 bytes in the middle
    free(m2);
    // call trim (same result with 2000 or 2000000 argument)
    malloc_trim(0);
    // call some syscall to find this point in the strace output
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim
strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
...
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there was a madvise with MADV_DONTNEED for 9 pages (36864 bytes / 4096 bytes per page) after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.
The man page for malloc_trim says it releases free memory, so if there is allocated memory in the heap, it won't release the whole heap. The parameter is there in case you know you're still going to need a certain amount of memory, in which case freeing more than that would cause glibc to do unnecessary work later.
As for holes, this is a standard problem with memory management and returning memory to the OS. The primary low-level heap management calls available to the program are brk and sbrk, and all they can do is extend or shrink the heap area by moving its top. So there's no way for them to return holes to the operating system; once the program has called sbrk to allocate more heap, that space can only be returned if the top of that space is free and can be handed back.
Note that there are other, more complex ways to allocate memory (with anonymous mmap, for example), which may have different constraints than sbrk-based allocation.
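For example (a minimal sketch, not part of the original answers), an allocation backed by its own anonymous mmap can be handed back with munmap() regardless of what else sits in the address space, which is one reason glibc serves large requests with mmap rather than sbrk:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20; /* 1 MiB */

    /* A private anonymous mapping: its pages go back to the kernel on munmap,
     * even if other allocations sit above and below it in the address space. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    p[0] = 1;       /* touch a page so it becomes resident */
    munmap(p, len); /* immediately returns the memory to the OS */
    return EXIT_SUCCESS;
}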

What is RSS and VSZ in Linux memory management

What are RSS and VSZ in Linux memory management? In a multithreaded environment, how can both of these be managed and tracked?
RSS is the Resident Set Size and is used to show how much memory is allocated to that process and is in RAM. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
VSZ is the Virtual Memory Size. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.
So if process A has a 500K binary and is linked to 2500K of shared libraries, has 200K of stack/heap allocations of which 100K is actually in memory (rest is swapped or unused), and it has only actually loaded 1000K of the shared libraries and 400K of its own binary then:
RSS: 400K + 1000K + 100K = 1500K
VSZ: 500K + 2500K + 200K = 3200K
Since part of the memory is shared, many processes may use it, so if you add up all of the RSS values you can easily end up with more space than your system has.
The memory that is allocated also may not be in RSS until it is actually used by the program. So if your program allocated a bunch of memory up front, then uses it over time, you could see RSS going up and VSZ staying the same.
There is also PSS (proportional set size). This is a newer measure which tracks the shared memory as a proportion used by the current process. So if there were two processes using the same shared library from before:
PSS: 400K + (1000K/2) + 100K = 400K + 500K + 100K = 1000K
Threads all share the same address space, so the RSS, VSZ and PSS for each thread are identical to those of all the other threads in the process. Use ps or top to view this information in linux/unix.
There is way more to it than this; to learn more, check the following references:
http://manpages.ubuntu.com/manpages/en/man1/ps.1.html
https://web.archive.org/web/20120520221529/http://emilics.com/blog/article/mconsumption.html
Also see:
A way to determine a process's "real" memory usage, i.e. private dirty RSS?
RSS is Resident Set Size (physically resident memory - this is currently occupying space in the machine's physical memory), and VSZ is Virtual Memory Size (address space allocated - this has addresses allocated in the process's memory map, but there isn't necessarily any actual memory behind it all right now).
Note that in these days of commonplace virtual machines, physical memory from the machine's view point may not really be actual physical memory.
Minimal runnable example
For this to make sense, you have to understand the basics of paging: How does x86 paging work? and in particular that the OS can allocate virtual memory via page tables / its internal memory bookkeeping (VSZ virtual memory) before it actually has backing storage in RAM or on disk (RSS resident memory).
Now to observe this in action, let's create a program that:
allocates more RAM than our physical memory with mmap
writes one byte on each page to ensure that each of those pages goes from virtual only memory (VSZ) to actually used memory (RSS)
checks the memory usage of the process with one of the methods mentioned at: Memory usage of current process in C
main.c
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    unsigned long size, resident, share, text, lib, data, dt;
} ProcStatm;

/* https://stackoverflow.com/questions/1558402/memory-usage-of-current-process-in-c/7212248#7212248 */
void ProcStat_init(ProcStatm *result) {
    const char *statm_path = "/proc/self/statm";
    FILE *f = fopen(statm_path, "r");
    if (!f) {
        perror(statm_path);
        abort();
    }
    if (7 != fscanf(
        f,
        "%lu %lu %lu %lu %lu %lu %lu",
        &(result->size),
        &(result->resident),
        &(result->share),
        &(result->text),
        &(result->lib),
        &(result->data),
        &(result->dt)
    )) {
        perror(statm_path);
        abort();
    }
    fclose(f);
}

int main(int argc, char **argv) {
    ProcStatm proc_statm;
    char *base, *p;
    char system_cmd[1024];
    long page_size;
    size_t i, nbytes, print_interval, bytes_since_last_print;
    int snprintf_return;

    /* Decide how many bytes to allocate. */
    if (argc < 2) {
        nbytes = 0x10000;
    } else {
        nbytes = strtoull(argv[1], NULL, 0);
    }
    if (argc < 3) {
        print_interval = 0x1000;
    } else {
        print_interval = strtoull(argv[2], NULL, 0);
    }
    page_size = sysconf(_SC_PAGESIZE);

    /* Allocate the memory. */
    base = mmap(
        NULL,
        nbytes,
        PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_ANONYMOUS,
        -1,
        0
    );
    if (base == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    /* Write to all the allocated pages. */
    i = 0;
    p = base;
    bytes_since_last_print = 0;
    /* Produce the ps command that lists only our VSZ and RSS. */
    snprintf_return = snprintf(
        system_cmd,
        sizeof(system_cmd),
        "ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == \"%ju\") print}'",
        (uintmax_t)getpid()
    );
    assert(snprintf_return >= 0);
    assert((size_t)snprintf_return < sizeof(system_cmd));
    bytes_since_last_print = print_interval;
    do {
        /* Modify a byte in the page. */
        *p = i;
        p += page_size;
        bytes_since_last_print += page_size;
        /* Print process memory usage every print_interval bytes.
         * We count memory using a few techniques from:
         * https://stackoverflow.com/questions/1558402/memory-usage-of-current-process-in-c */
        if (bytes_since_last_print > print_interval) {
            bytes_since_last_print -= print_interval;
            printf("extra_memory_committed %lu KiB\n", (i * page_size) / 1024);
            ProcStat_init(&proc_statm);
            /* Check /proc/self/statm */
            printf(
                "/proc/self/statm size resident %lu %lu KiB\n",
                (proc_statm.size * page_size) / 1024,
                (proc_statm.resident * page_size) / 1024
            );
            /* Check ps. */
            puts(system_cmd);
            system(system_cmd);
            puts("");
        }
        i++;
    } while (p < base + nbytes);

    /* Cleanup. */
    munmap(base, nbytes);
    return EXIT_SUCCESS;
}
GitHub upstream.
Compile and run:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
echo 1 | sudo tee /proc/sys/vm/overcommit_memory
sudo dmesg -c
./main.out 0x1000000000 0x200000000
echo $?
sudo dmesg
where:
0x1000000000 == 64GiB: 2x my computer's physical RAM of 32GiB
0x200000000 == 8GiB: print the memory every 8GiB, so we should get 4 prints before the crash at around 32GiB
echo 1 | sudo tee /proc/sys/vm/overcommit_memory: required for Linux to allow us to make a mmap call larger than physical RAM: Maximum memory which malloc can allocate
Program output:
extra_memory_committed 0 KiB
/proc/self/statm size resident 67111332 768 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 1648
extra_memory_committed 8388608 KiB
/proc/self/statm size resident 67111332 8390244 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 8390256
extra_memory_committed 16777216 KiB
/proc/self/statm size resident 67111332 16778852 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 16778864
extra_memory_committed 25165824 KiB
/proc/self/statm size resident 67111332 25167460 KiB
ps -o pid,vsz,rss | awk '{if (NR == 1 || $1 == "29827") print}'
PID VSZ RSS
29827 67111332 25167472
Killed
Exit status:
137
which by the 128 + signal number rule means we got signal number 9, which man 7 signal says is SIGKILL, which is sent by the Linux out-of-memory killer.
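That arithmetic can be checked with a throwaway snippet (137 is hard-coded from the run above):

#include <signal.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int shell_status = 137;           /* as reported by echo $? */
    if (shell_status > 128) {
        int sig = shell_status - 128; /* shells report fatal signals as 128 + N */
        printf("killed by signal %d (%s)\n", sig, strsignal(sig));
    }
    return 0;
}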
Output interpretation:
VSZ (virtual memory) remains constant at 67111332 KiB == 0x40009A4 KiB ~= 64 GiB (ps values are in KiB) after the mmap.
RSS "real memory usage" increases lazily only as we touch the pages. For example:
on the first print, we have extra_memory_committed 0, which means we haven't yet touched any pages. RSS is a small 1648 KiB which has been allocated for normal program startup like text area, globals, etc.
on the second print, we have written to 8388608 KiB == 8 GiB worth of pages. As a result, RSS increased by exactly 8 GiB to 8390256 KiB == 8388608 KiB + 1648 KiB
RSS continues to increase in 8GiB increments. The last print shows about 24 GiB of memory, and before 32 GiB could be printed, the OOM killer killed the process
See also: https://unix.stackexchange.com/questions/35129/need-explanation-on-resident-set-size-virtual-size
OOM killer logs
Our dmesg commands have shown the OOM killer logs.
An exact interpretation of those has been asked at:
Understanding the Linux oom-killer's logs but let's have a quick look here.
https://serverfault.com/questions/548736/how-to-read-oom-killer-syslog-messages
The very first line of the log was:
[ 7283.479087] mongod invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
So we see that, interestingly, it was the MongoDB daemon that always runs on my laptop in the background that first triggered the OOM killer, presumably when the poor thing was trying to allocate some memory.
However, the OOM killer does not necessarily kill the one who awoke it.
After the invocation, the kernel prints a table of processes including the oom_score_adj:
[ 7283.479292] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 7283.479303] [ 496] 0 496 16126 6 172032 484 0 systemd-journal
[ 7283.479306] [ 505] 0 505 1309 0 45056 52 0 blkmapd
[ 7283.479309] [ 513] 0 513 19757 0 57344 55 0 lvmetad
[ 7283.479312] [ 516] 0 516 4681 1 61440 444 -1000 systemd-udevd
and further ahead we see that our own little main.out actually got killed on the previous invocation:
[ 7283.479871] Out of memory: Kill process 15665 (main.out) score 865 or sacrifice child
[ 7283.479879] Killed process 15665 (main.out) total-vm:67111332kB, anon-rss:92kB, file-rss:4kB, shmem-rss:30080832kB
[ 7283.479951] oom_reaper: reaped process 15665 (main.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:30080832kB
This log mentions the score 865 which that process had, presumably the highest (worst) OOM killer score as mentioned at: https://unix.stackexchange.com/questions/153585/how-does-the-oom-killer-decide-which-process-to-kill-first
Also interestingly, everything apparently happened so fast that before the freed memory was accounted, the OOM killer was woken again, this time by the DeadlineMonitor process:
[ 7283.481043] DeadlineMonitor invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
and this time it killed a Chromium process, which is usually my computer's normal memory hog:
[ 7283.481773] Out of memory: Kill process 11786 (chromium-browse) score 306 or sacrifice child
[ 7283.481833] Killed process 11786 (chromium-browse) total-vm:1813576kB, anon-rss:208804kB, file-rss:0kB, shmem-rss:8380kB
[ 7283.497847] oom_reaper: reaped process 11786 (chromium-browse), now anon-rss:0kB, file-rss:0kB, shmem-rss:8044kB
Tested in Ubuntu 19.04, Linux kernel 5.0.0.
Linux kernel docs
https://github.com/torvalds/linux/blob/v5.17/Documentation/filesystems/proc.rst has some points. The term "VSZ" is not used there but "RSS" is, and there's nothing too enlightening (surprise?!)
Instead of VSZ, the kernel seems to use the term VmSize, which appears e.g. on /proc/$PID/status.
Some quotes of interest:
The first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. Following lines show the size of the mapping (size); the size of each page allocated when backing a VMA (KernelPageSize), which is usually the same as the size in the page table entries; the page size used by the MMU when backing a VMA (in most cases, the same as KernelPageSize); the amount of the mapping that is currently resident in RAM (RSS); the process' proportional share of this mapping (PSS); and the number of clean and dirty shared and private pages in the mapping.
The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500.
Note that even a page which is part of a MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used by only one process, is accounted as private and not as shared.
So we can guess a few more things:
shared libraries used by a single process are accounted to that process's RSS as private pages; if more than one process has them mapped, they are counted as shared instead
PSS was mentioned by jmh, and has a more proportional approach between "I'm the only process that holds the shared library" and "there are N process holding the shared library, so each one holds memory/N on average"
VSZ - Virtual Set Size
The Virtual Set Size is the memory size assigned to a process (program) during its initial execution. It is simply a number stating how much memory a process has available for its execution.
RSS - Resident Set Size (kinda RAM)
As opposed to VSZ (Virtual Set Size), RSS is the memory currently used by a process. It is an actual number, in kilobytes, of how much RAM the current process is using.
Source
I think much has already been said about RSS vs VSZ. From an administrator/programmer/user perspective, when I design/code applications I am more concerned about RSS (resident memory): as you keep pulling in more and more (heap) variables you will see this value shooting up. Try a simple program that malloc()s space in a loop, and make sure you fill data into that malloc'd space; RSS keeps moving up.
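A sketch of that experiment (not from the answer; sizes and timing are arbitrary):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Allocate 100 MiB in 1 MiB steps and touch every byte; watch RSS climb
     * from another terminal with: ps -o pid,vsz,rss -p <pid> */
    for (int i = 0; i < 100; i++) {
        char *block = malloc(1 << 20);
        if (!block)
            break;
        memset(block, 1, 1 << 20); /* filling the pages makes them resident */
        sleep(1);                  /* leave time to observe the growth */
    }
    pause(); /* keep the process alive (and the memory held) for inspection */
    return 0;
}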
As far as VSZ is concerned, it's more the virtual memory mapping that Linux does, one of its core features derived from conventional operating system concepts. VSZ management is done by the kernel's virtual memory management; for more info on VSZ, see Robert Love's description of mm_struct and vm_area_struct, which hang off the basic task_struct data structure in the kernel.
To summarize @jmh's excellent answer:
In Linux, a process's memory comprises:
its own binary
its shared libs
its stack and heap
Due to paging, not all of those are always fully in memory; only the useful, most recently used parts (pages) are. Other parts are paged out (or swapped out) to make room for other processes.
The table below, taken from @jmh's answer, shows an example of what Resident and Virtual memory are for a specific process.
+-------------+-------------------------+------------------------+
| portion | actually in memory | total (allocated) size |
|-------------+-------------------------+------------------------|
| binary | 400K | 500K |
| shared libs | 1000K | 2500K |
| stack+heap | 100K | 200K |
|-------------+-------------------------+------------------------|
| | RSS (Resident Set Size) | VSZ (Virtual Set Size) |
|-------------+-------------------------+------------------------|
| | 1500K | 3200K |
+-------------+-------------------------+------------------------+
To summarize: resident memory is what is actually in physical memory right now, and virtual size is the total size of all the components, whether or not they are currently loaded into physical memory.
Of course, numbers don't add up, because libraries are shared between multiple processes and their memory is counted for each process separately, even if they are loaded only once.
They are not managed, but measured and possibly limited (see the getrlimit system call in getrlimit(2)).
RSS means resident set size (the part of your virtual address space sitting in RAM).
You can query the virtual address space of process 1234 using proc(5) with cat /proc/1234/maps, and its status (including memory consumption) through cat /proc/1234/status.
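A process can also read its own numbers; for example (a small sketch; the VmSize/VmRSS field names are documented in proc(5)):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* VmSize corresponds to VSZ, VmRSS to RSS (see proc(5)). */
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("/proc/self/status");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}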

Where to start learning about linux DMA / device drivers / memory allocation

I'm porting / debugging a device driver (that is used by another kernel module) and I've hit a dead end because dma_sync_single_for_device() fails with a kernel oops.
I have no clue what that function is supposed to do, and googling does not really help, so I probably need to learn more about this stuff in general.
The question is: where to start?
Oh yeah, in case it is relevant, the code is supposed to run on a PowerPC (and the Linux is OpenWrt).
EDIT:
On-line resources preferable (books take a few days to be delivered :)
On-line:
Anatomy of the Linux slab allocator
Understanding the Linux Virtual Memory Manager
Linux Device Drivers, Third Edition
The Linux Kernel Module Programming Guide
Writing device drivers in Linux: A brief tutorial
Books:
Linux Kernel Development (2nd Edition)
Essential Linux Device Drivers ( Only the first 4 - 5 chapters )
Useful Resources:
the Linux Cross Reference ( Searchable Kernel Source for all Kernels )
API changes in the 2.6 kernel series
dma_sync_single_for_device calls dma_sync_single_range_for_cpu, which sits a little further up in the file; this is its source documentation (I assume that even though this is for ARM, the interface and behavior are the same):
/**
 * dma_sync_single_range_for_cpu
 * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
 * @handle: DMA address of buffer
 * @offset: offset of region to start sync
 * @size: size of region to sync
 * @dir: DMA transfer direction (same as passed to dma_map_single)
 *
 * Make physical memory consistent for a single streaming mode DMA
 * translation after a transfer.
 *
 * If you perform a dma_map_single() but wish to interrogate the
 * buffer using the cpu, yet do not wish to teardown the PCI dma
 * mapping, you must call this function before doing so. At the
 * next point you give the PCI dma address back to the card, you
 * must first perform a dma_sync_for_device, and then the
 * device again owns the buffer.
 */
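To put that ownership protocol into code (a hedged sketch of the streaming DMA API, not your driver's actual code; dev, buf and len stand in for a real device and buffer):

#include <linux/dma-mapping.h>

/* Sketch: CPU fills a buffer, inspects it, then hands it to the device. */
static void dma_roundtrip(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, handle))
		return;

	/* Give ownership back to the CPU to look at the buffer... */
	dma_sync_single_for_cpu(dev, handle, len, DMA_TO_DEVICE);
	/* ...and return it to the device before the hardware may touch it.
	 * Oopses here often mean handle/len/direction don't match the mapping. */
	dma_sync_single_for_device(dev, handle, len, DMA_TO_DEVICE);

	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
}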
Understanding The Linux Kernel?
The chapters of the Linux Device Drivers book (in the same series as Understanding the Linux Kernel, recommended by @Matthew Flaschen) might be useful.
You can download the individual chapters from the LWN website. Chapter 16 deals with DMA.
