Getting pid of pages that are being swapped out

My goal is to find the process ID that owns each page being swapped out. The Linux kernel function swap_writepage() takes a pointer to struct page as one of its formal arguments when it writes a page to the backing store. All swap-out operations are done by the "kswapd" process. I need to find the pid(s) of the processes whose page is passed as the argument to swap_writepage(). To get there, I was able to find all page table entries associated with that page using the rmap structures.
How can I get the pid from a pte or from a struct page? I have used SystemTap to get the value of the struct page pointer that swap_writepage() receives as its argument. However, SystemTap's pid() function prints the pid of the currently running process, which is always kswapd, not the pid of the process to which the page belongs.

Here is the example of how reverse mapping used in modern Linux (copied from lxr):
static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
    int ret = SWAP_AGAIN;

    anon_vma = page_lock_anon_vma(page);
    if (!anon_vma)
        return ret;

    list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long address;

        /*
         * During exec, a temporary VMA is setup and later moved.
         * The VMA is moved under the anon_vma lock but not the
         * page tables leading to a race where migration cannot
         * find the migration ptes. Rather than increasing the
         * locking requirements of exec(), migration skips
         * temporary VMAs until after exec() completes.
         */
        if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
                is_vma_temporary_stack(vma))
            continue;

        address = vma_address(page, vma);
        if (address == -EFAULT)
            continue;
        ret = try_to_unmap_one(page, vma, address, flags);
        if (ret != SWAP_AGAIN || !page_mapped(page))
            break;
    }

    page_unlock_anon_vma(anon_vma);
    return ret;
}
This example shows rmap being used to unmap pages. Each anonymous page holds an anon_vma object in its ->mapping field. The anon_vma holds a list of the VMAs the page is mapped into. Having a vma you have the mm (vma->vm_mm), and having the mm you have a task_struct. That's it. If you have any doubts, there is an illustration in:
Daniel P. Bovet, Marco Cesati, Understanding the Linux Kernel, chapter 17.2

Related

How to read /proc/<pid>/pagemap in a kernel driver?

I am trying to read /proc/<pid>/pagemap in a kernel driver like this:
uint64_t page;
uint64_t va = 0x7FFD1BF46530;
loff_t pos = va / PAGE_SIZE * sizeof(uint64_t);
struct file * filp = filp_open("/proc/19030/pagemap", O_RDONLY, 0);
ssize_t nread = kernel_read(filp, &page, sizeof(page), &pos);
I get error -22 in nread (EINVAL, invalid argument) and
"kernel read not supported for file /19030/pagemap (pid: 19030 comm: tester)" in dmesg.
0x7FFD1BF46530 is a virtual address in a user space process pid 19030 (tester). I assume that pos is the offset into the file like in lseek64.
Doing exactly the same thing with sudo and the same values in a user space process, i.e. reading /proc/19030/pagemap, works fine and produces a correct physical address.
The actual thing I am trying to do here is to find the physical address of a user space virtual address. I need the physical address for a device DMA transfer operation, and a user space app needs to access this memory. This app allocates 1GB of DMA memory with anonymous mmap from THP (Transparent Huge Pages). I am trying to avoid the need for sudo by reading /proc/<pid>/pagemap in a kernel driver via ioctl instead.
I would be happy to allocate huge page DMA memory in the driver but don't know how to do that. dma_alloc_coherent is limited to max 4MB allocations. Is there a way to get those allocated as contiguous physical memory? I need hundreds of MB or many GB of DMA memory.
The problem with anonymous mmap is that it can only allocate a huge page of at most 1GB as physically contiguous memory. Allocating more works, but the memory is not physically contiguous and is unusable for DMA.
Any good ideas or alternative ways of allocating huge pages as DMA memory?
Tried reading file /proc/<pid>/pagemap in a kernel driver. Expected same results as when reading the file in a user space application which works ok.

"kernel read not supported for file …"
Indeed, as we see in __kernel_read()
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported(file, "read");
the read fails if f_op->read_iter is not wired up (implemented) or if f_op->read is; both are the case for a pagemap file.
You could try pagemap_read() instead. (Not feasible, for reasons given in the comments.)
When I had the problem of getting the physical address for a virtual address in a driver, I included and copied some kernel code (not that I recommend this, but I saw no other solution); here's an extract.
static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr
        , unsigned long sz)
{ return NULL; }
void p4d_clear_bad(p4d_t *p4d) { p4d_ERROR(*p4d); p4d_clear(p4d); }
#include "mm/pagewalk.c"
static int pte(pte_t *pte, unsigned long addr
        , unsigned long next, struct mm_walk *walk)
{
    *(pte_t **)walk->private = pte;
    return 1;
}
/* Scan the real Linux page tables and return a PTE pointer for
 * a virtual address in a context.
 * Returns true (1) if PTE was found, zero otherwise. The pointer to
 * the PTE pointer is unmodified if PTE is not found.
 */
int
get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp)
{
    struct mm_walk walk = { .pte_entry = pte, .mm = mm, .private = ptep };
    return walk_page_range(addr, addr+PAGE_SIZE, &walk);
}
/* Find physical address for this virtual address. Normally used by
 * I/O functions, but anyone can call it.
 */
static inline unsigned long iopa(unsigned long addr)
{
    unsigned long pa;
    /* I don't know why this won't work on PMacs or CHRP. It
     * appears there is some bug, or there is some implicit
     * mapping done not properly represented by BATs or in page
     * tables.......I am actively working on resolving this, but
     * can't hold up other stuff. -- Dan
     */
    pte_t *pte;
    struct mm_struct *mm;
#if 0
    /* Check the BATs */
    phys_addr_t v_mapped_by_bats(unsigned long va);
    pa = v_mapped_by_bats(addr);
    if (pa)
        return pa;
#endif
    /* Allow mapping of user addresses (within the thread)
     * for DMA if necessary.
     */
    if (addr < TASK_SIZE)
        mm = current->mm;
    else
        mm = &init_mm;
    /* ATTENTION: I needed the current address space.
     * You'd have to use mm = file->private_data instead.
     */
    pa = 0;
    if (get_pteptr(mm, addr, &pte, NULL))
        pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK);
    return(pa);
}

Why does the Linux kernel have to do the check if (skb_shinfo(head)->frag_list) at line 640 of ip_fragment.c?

In Linux kernel v2.6.23, when the receiver executes the ip_frag_reasm(struct ipq *qp, struct net_device *dev) function to reassemble an IP packet from its fragments, there is a check as follows:
614 static struct sk_buff *ip_frag_reasm(struct ipq *qp, struct net_device *dev)
615 {
... ...
637 /* If the first fragment is fragmented itself, we split
638 * it to two chunks: the first with data and paged part
639 * and the second, holding only fragments. */
640 if (skb_shinfo(head)->frag_list) {
... ...
I don't understand why the kernel checks this condition at line 640. As far as I know, each IP fragment received from the network is an independent packet, so skb_shinfo(head)->frag_list should obviously be NULL. Can anyone give me a context in which skb_shinfo(head)->frag_list is not NULL when the skb is received? Thanks a lot.

Completely Fair Scheduler (CFS): vruntime of long running processes

If vruntime is counted from the creation of a process, how does such a process ever get a processor when it is competing with a newly created processor-bound process that is younger by, let's say, days?
As I've read, the rule is simple: pick the leftmost leaf, which is the process with the lowest vruntime.
Thanks!

The kernel documentation for CFS kind of glosses over what would be the answer to your question, but mentions it briefly:
In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
So, vruntime is actually normalized. But the documentation does not go into detail.
How is it actually done?
Normalization happens by means of a min_vruntime value. This min_vruntime value is recorded in the CFS runqueue (struct cfs_rq). The min_vruntime value is the smallest vruntime of all tasks in the rbtree. The value is also used to track all the work done by the cfs_rq.
You can observe an example of normalization being performed in CFS' enqueue_entity() code:
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    ...
}
You can also observe in update_curr() how vruntime and min_vruntime are kept updated:
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    ...

    curr->exec_start = now;
    ...
    curr->sum_exec_runtime += delta_exec;
    ...
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_min_vruntime(cfs_rq);
    ...
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}
The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
    u64 vruntime = cfs_rq->min_vruntime;

    if (cfs_rq->curr)
        vruntime = cfs_rq->curr->vruntime;

    if (cfs_rq->rb_leftmost) {
        struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                           struct sched_entity,
                                           run_node);

        if (!cfs_rq->curr)
            vruntime = se->vruntime;
        else
            vruntime = min_vruntime(vruntime, se->vruntime);
    }

    /* ensure we never gain time by being placed backwards. */
    cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
    ...
}
By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)
So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)

Can malloc_trim() release memory from the middle of the heap?

I am confused about the behaviour of malloc_trim as implemented in the glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I now look up the source of malloc_trim() (in malloc/malloc.c) I see that it calls mtrim() which is utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?

There are now two uses of madvise with MADV_DONTNEED in glibc: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c, line 643: __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c, line 4535: __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was a commit by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer), https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc :
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
The implementation of mTRIm (now mtrim) was changed. The unused part of a free chunk, aligned to the page size and at least one page long, may now be marked with MADV_DONTNEED:
    /* See whether the chunk contains at least one unused page. */
    char *paligned_mem = (char *) (((uintptr_t) p
                                    + sizeof (struct malloc_chunk)
                                    + psm1) & ~psm1);
    assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
    assert ((char *) p + size > paligned_mem);
    /* This is the size we could potentially free. */
    size -= paligned_mem - (char *) p;
    if (size > psm1)
        madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
The man page of malloc_trim is here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and it was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can grep glibc's git, there are no man pages in glibc, and there is no commit to the malloc_trim man page that documents this patch. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);

Freeing from the middle of the chunk is not documented as text in malloc/malloc.c (the malloc_trim description in the comment was not updated in 2007), nor is it documented in the man-pages project. The man page from 2012 may be the first man page for the function, and it was not written by the authors of glibc. The info page of glibc only mentions M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and doesn't list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (it also doesn't document memusage/memusagestat/libmemusage.so).
You may ask Drepper and the other glibc developers again, as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them. (Only wrong answers from other users, like https://sourceware.org/ml/libc-help/2015-05/msg00007.html https://sourceware.org/ml/libc-help/2015-05/msg00008.html)
Or you may test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
    int *m1,*m2,*m3,*m4;
    printf("%s\n","Test started");
    m1=(int*)malloc(20000);
    m2=(int*)malloc(40000);
    m3=(int*)malloc(80000);
    m4=(int*)malloc(10000);
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    free(m2);
    malloc_trim(0); // 20000, 2000000
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
Compile with gcc test_malloc_trim.c -o test_malloc_trim and run strace ./test_malloc_trim:
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So there is a madvise call with MADV_DONTNEED covering 9 pages (36864 bytes) after the malloc_trim(0) call, when there was a hole of 40008 bytes in the middle of the heap.

... utilizing madvise(x, MADV_DONTNEED) to release memory back to the
operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.

How to identify read or write operations of page fault when using sigaction handler on SIGSEGV?(LINUX)

I use sigaction to handle page fault exceptions, and the handler function is defined like this:
void sigaction_handler(int signum, siginfo_t *info, void *_context)
So it's easy to get the page fault address by reading info->si_addr.
The question is: how do I know whether the faulting operation was a memory READ or WRITE?
I found that the type of the _context parameter is ucontext_t, defined in /usr/include/sys/ucontext.h
There is a cr2 field defined in mcontext_t, but unfortunately it is only available when x86_64 is not defined, so I could not use cr2 to identify read/write operations.
On the other hand, there is a struct named sigcontext defined in /usr/include/bits/sigcontext.h
This struct contains a cr2 field, but I don't know where to get it from.

You can check this in x86_64 by referring to the ucontext's mcontext struct and the err register:
void pf_sighandler(int sig, siginfo_t *info, ucontext_t *ctx) {
...
if (ctx->uc_mcontext.gregs[REG_ERR] & 0x2) {
// Write fault
} else {
// Read fault
}
...
}

Here is where the kernel generates SIGSEGV, in the __bad_area_nosemaphore() function of arch/x86/mm/fault.c:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L760
    tsk->thread.cr2 = address;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = X86_TRAP_PF;

    force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
There is an error_code field, and its values are defined in arch/x86/mm/fault.c too:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L23
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 */
enum x86_pf_error_code {

    PF_PROT  = 1 << 0,
    PF_WRITE = 1 << 1,
    PF_USER  = 1 << 2,
    PF_RSVD  = 1 << 3,
    PF_INSTR = 1 << 4,
};
So the exact information about the access type is stored in thread_struct.error_code: http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/include/asm/processor.h#L470
As far as I can see, the error_code field is not exported into the siginfo_t struct (it is defined in
http://man7.org/linux/man-pages/man2/sigaction.2.html .. search for si_signo).
So you can:
1. Hack the kernel to export tsk->thread.error_code (or check whether it is already exported somewhere, for example in ptrace).
2. Get the memory address, read /proc/self/maps, parse it and check the access bits of the page. If the page is present and read-only, the only possible fault is from writing; if the page is not present, both kinds of access are possible; and if... there should be no write-only pages.
You can also try to find the address of the failed instruction, read it and disassemble it.

The error_code information can be accessed through:
err = ((ucontext_t*)context)->uc_mcontext.gregs[REG_ERR]
The hardware pushes the error code onto the stack, and the kernel hands it on to the signal handler because it passes the entire `frame'. Then
bool write_fault = (err & 0x2) != 0;
will be true if the access was a write access, and false otherwise.
