Completely Fair Scheduler (CFS): vruntime of long-running processes - Linux

If vruntime is counted from the creation of a process, how does such a process ever get the processor when it is competing with a newly created, processor-bound process that is younger by, let's say, days?
As I've read, the rule is simple: pick the leftmost node of the rbtree, which is the process with the lowest vruntime.
Thanks!

The kernel documentation for CFS kind of glosses over what would be the answer to your question, but mentions it briefly:
In practice, the virtual runtime of a task is its actual runtime normalized to the total number of running tasks.
So, vruntime is actually normalized. But the documentation does not go into detail.
How is it actually done?
Normalization happens by means of a min_vruntime value, which is recorded in the CFS runqueue (struct cfs_rq). min_vruntime tracks the smallest vruntime of all tasks in the rbtree, and it is also used to account for the total work done by the cfs_rq.
You can observe an example of normalization being performed in CFS' enqueue_entity() code:
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        /*
         * Update the normalized vruntime before updating min_vruntime
         * through calling update_curr().
         */
        if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);
        ...
}
You can also observe in update_curr() how vruntime and min_vruntime are kept updated:
static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        ...
        curr->exec_start = now;
        ...
        curr->sum_exec_runtime += delta_exec;
        ...
        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);
        ...
        account_cfs_rq_runtime(cfs_rq, delta_exec);
}
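The calc_delta_fair() call above is where the task's weight (its nice level) comes in: the real runtime delta is scaled by the ratio of the nice-0 weight to the task's weight, so heavier tasks accumulate vruntime more slowly. A rough conceptual sketch follows (the kernel's __calc_delta() uses precomputed fixed-point inverse weights rather than a plain division):

#include <linux/types.h>

/* Sketch only: conceptual equivalent of calc_delta_fair().
 * For a nice-0 task (weight == nice_0_weight) vruntime advances
 * exactly as fast as real time; heavier tasks advance more slowly. */
static u64 calc_delta_fair_sketch(u64 delta_exec, unsigned long weight,
                                  unsigned long nice_0_weight)
{
        return delta_exec * nice_0_weight / weight;
}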
The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
        u64 vruntime = cfs_rq->min_vruntime;

        if (cfs_rq->curr)
                vruntime = cfs_rq->curr->vruntime;

        if (cfs_rq->rb_leftmost) {
                struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                                   struct sched_entity,
                                                   run_node);

                if (!cfs_rq->curr)
                        vruntime = se->vruntime;
                else
                        vruntime = min_vruntime(vruntime, se->vruntime);
        }

        /* ensure we never gain time by being placed backwards. */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        ...
}
By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)
So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)
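To make this concrete, here is a toy user-space sketch (not kernel code, and deliberately simplified) of what the enqueue-time normalization shown above buys you: a new task is placed relative to the queue's min_vruntime rather than starting at zero, so it cannot starve a task that has been accumulating vruntime for days.

#include <stdio.h>

struct task { const char *name; unsigned long long vruntime; };

int main(void)
{
        /* The queue has been running for "days": min_vruntime is huge. */
        unsigned long long min_vruntime = 9000000000ULL;
        struct task old_hog = { "old CPU hog", 9000000123ULL }; /* already normalized */
        struct task newborn = { "newborn task", 0ULL };

        /* On enqueue the new task is normalized against min_vruntime,
         * cf. enqueue_entity(): se->vruntime += cfs_rq->min_vruntime; */
        newborn.vruntime += min_vruntime;

        /* CFS picks the entity with the smallest vruntime (leftmost in the rbtree). */
        struct task *next = newborn.vruntime <= old_hog.vruntime ? &newborn : &old_hog;
        printf("next to run: %s (%llu vs %llu)\n",
               next->name, newborn.vruntime, old_hog.vruntime);
        return 0;
}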

Related

A thread that is spinning and trying to get a spinlock can't be preempted?

When a thread on Linux is spinning and trying to acquire a spinlock, is there no chance that this thread can be preempted?
EDIT:
I just want to make sure of something. Consider a UP system where no interrupt handler will ever access this spinlock. If the thread that is spinning and trying to acquire the spinlock could be preempted, then I think the critical section the spinlock protects could call sleep, since the thread holding the spinlock could be re-scheduled back onto the CPU.
No, it cannot be preempted: see the code (taken from the Linux sources): http://lxr.free-electrons.com/source/include/linux/spinlock_api_smp.h?v=2.6.32#L241
241 static inline unsigned long __spin_lock_irqsave(spinlock_t *lock)
242 {
243         unsigned long flags;
244 
245         local_irq_save(flags);
246         preempt_disable();
247         spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
248         /*
249          * On lockdep we dont want the hand-coded irq-enable of
250          * _raw_spin_lock_flags() code, because lockdep assumes
251          * that interrupts are not re-enabled during lock-acquire:
252          */
253 #ifdef CONFIG_LOCKDEP
254         LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
255 #else
256         _raw_spin_lock_flags(lock, &flags);
257 #endif
258         return flags;
259 }
260 
[...]
349 static inline void __spin_unlock(spinlock_t *lock)
350 {
351         spin_release(&lock->dep_map, 1, _RET_IP_);
352         _raw_spin_unlock(lock);
353         preempt_enable();
354 }
See lines 246 and 353.
By the way, it is generally a bad idea to sleep while holding a lock (spinlock or not).
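For completeness, this is the usage pattern those preempt_disable()/preempt_enable() calls imply for spinlock users; a sketch, showing only that the critical section runs with preemption (and, in this variant, local interrupts) disabled, so it must be short and must never sleep:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static int shared_counter;

static void update_counter(void)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);      /* preempt_disable() + IRQs off  */
        shared_counter++;                        /* keep it short, never sleep    */
        spin_unlock_irqrestore(&my_lock, flags); /* preempt_enable() + IRQs back  */
}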

Getting pid of pages that are being swapped out

My goal is to find out the process id(s) of pages that are being swapped out. The Linux kernel function swap_writepage() takes a pointer to struct page as a formal argument when writing a page to the backing store. All swap-out operations are done by the "kswapd" process. I need to find out the pid(s) of the processes whose page is passed as the argument to swap_writepage(). To get there, I was able to find all page table entries associated with that page using the rmap structures.
How can I get a pid from a pte or from a struct page? I have used SystemTap to read the value of the struct page pointer received as an argument in swap_writepage(). However, SystemTap's pid() function prints the pid of the currently running process, not the pid of the process to which the page belongs, so it always reports the kswapd process.
Here is an example of how reverse mapping is used in modern Linux (copied from lxr):
static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
        struct anon_vma *anon_vma;
        struct anon_vma_chain *avc;
        int ret = SWAP_AGAIN;

        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
                return ret;

        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
                struct vm_area_struct *vma = avc->vma;
                unsigned long address;

                /*
                 * During exec, a temporary VMA is setup and later moved.
                 * The VMA is moved under the anon_vma lock but not the
                 * page tables leading to a race where migration cannot
                 * find the migration ptes. Rather than increasing the
                 * locking requirements of exec(), migration skips
                 * temporary VMAs until after exec() completes.
                 */
                if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
                    is_vma_temporary_stack(vma))
                        continue;

                address = vma_address(page, vma);
                if (address == -EFAULT)
                        continue;
                ret = try_to_unmap_one(page, vma, address, flags);
                if (ret != SWAP_AGAIN || !page_mapped(page))
                        break;
        }

        page_unlock_anon_vma(anon_vma);
        return ret;
}
This example shows rmap being used to unmap pages. Each anonymous page holds an anon_vma object in its ->mapping field, and the anon_vma holds a list of the VMAs the page is mapped into. Having a vma you have the mm, and having the mm you have a task_struct. That's it. If you have any doubts, see the illustration in Daniel P. Bovet, Marco Cesati, Understanding the Linux Kernel, chapter 17.2.
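As a hedged sketch of how this chain could be turned into code for the swap_writepage() case: the loop below mirrors try_to_unmap_anon() above, uses the kernel APIs of the same era, and assumes CONFIG_MM_OWNER so that mm->owner points at an owning task (newer kernels replace the anon_vma list with an interval tree, and page_lock_anon_vma() became page_lock_anon_vma_read()).

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/sched.h>

/* Sketch only: print the pid of every process that maps an anonymous page. */
static void print_pids_mapping_page(struct page *page)
{
        struct anon_vma *anon_vma;
        struct anon_vma_chain *avc;

        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
                return;

        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
                struct mm_struct *mm = avc->vma->vm_mm;

                if (mm && mm->owner)                    /* needs CONFIG_MM_OWNER */
                        printk(KERN_INFO "page %p mapped by pid %d\n",
                               page, task_pid_nr(mm->owner));
        }

        page_unlock_anon_vma(anon_vma);
}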

Using the DMA API in the Linux kernel, but a channel is never available

I am trying to use dmatest.c to test DMA on an Intel Xeon server and on a regular laptop with an i7 processor. It is never able to get a channel; I found this out by debugging dmatest.c itself. Line 854 below is always executed (I put my own printk there).
Is there anything I should do before running the test to get this API to work (such as loading DMA modules)?
Or am I using the wrong API set?
On the Xeon server, I did some research and found that it has an ioatdma.ko module that can be loaded:
modprobe ioatdma
After that, some entries are available under /sys/class/dma, such as dma0chan0, dma1chan0, and so on.
However, running the dmatest code, it still can't get any channel.
Any help or hint is appreciated.
836 static void request_channels(struct dmatest_info *info,
837                              enum dma_transaction_type type)
838 {
839         dma_cap_mask_t mask;
840 
841         dma_cap_zero(mask);
842         dma_cap_set(type, mask);
843         for (;;) {
844                 struct dmatest_params *params = &info->params;
845                 struct dma_chan *chan;
846 
847                 chan = dma_request_channel(mask, filter, params);
848                 if (chan) {
849                         if (dmatest_add_channel(info, chan)) {
850                                 dma_release_channel(chan);
851                                 break; /* add_channel failed, punt */
852                         }
853                 } else
854                         break; /* no more channels available */
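For reference, the channel request above boils down to the standard dmaengine client calls. A minimal, self-contained sketch (a hypothetical test module, not part of dmatest) that asks for any DMA_MEMCPY-capable channel would look roughly like this; if this also finds no channel, the problem is on the provider side (no dmaengine driver registered for your hardware) rather than in how dmatest is invoked:

#include <linux/module.h>
#include <linux/dmaengine.h>

static int __init dma_probe_init(void)
{
        dma_cap_mask_t mask;
        struct dma_chan *chan;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);

        /* NULL filter: accept any channel with the requested capability. */
        chan = dma_request_channel(mask, NULL, NULL);
        if (!chan) {
                pr_info("no DMA_MEMCPY channel available (is ioatdma loaded?)\n");
                return -ENODEV;
        }

        pr_info("got channel %s\n", dma_chan_name(chan));
        dma_release_channel(chan);
        return 0;
}

static void __exit dma_probe_exit(void)
{
}

module_init(dma_probe_init);
module_exit(dma_probe_exit);
MODULE_LICENSE("GPL");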
The test commands that I used (following the dmatest.txt document in the kernel documentation):
% echo dma0chan0 > /sys/kernel/debug/dmatest/channel
% echo 2000 > /sys/kernel/debug/dmatest/timeout
% echo 1 > /sys/kernel/debug/dmatest/iterations
% echo 1 > /sys/kernel/debug/dmatest/run

Can malloc_trim() release memory from the middle of the heap?

I am confused about the behaviour of malloc_trim as implemented in glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I look up the source of malloc_trim() (in malloc/malloc.c), I see that it calls mtrim(), which uses madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?
There are two uses of madvise with MADV_DONTNEED in glibc now: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
arena.c:643    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
malloc.c:4535  __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was the commit https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
The mTRIm (now mtrim) implementation was changed: unused parts of free chunks that are page-aligned and at least one page in size may now be handed to madvise(MADV_DONTNEED):
/* See whether the chunk contains at least one unused page. */
char *paligned_mem = (char *) (((uintptr_t) p
                                + sizeof (struct malloc_chunk)
                                + psm1) & ~psm1);

assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);

/* This is the size we could potentially free. */
size -= paligned_mem - (char *) p;

if (size > psm1)
  madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
The man page of malloc_trim is here: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and it was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As far as I can tell from grepping glibc's git, there are no man pages in glibc itself, and no commit to the malloc_trim man page to document this patch. The best and only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
/*
  malloc_trim(size_t pad);

  If possible, gives memory back to the system (via negative
  arguments to sbrk) if there is unused memory at the `high' end of
  the malloc pool. You can call this after freeing large blocks of
  memory to potentially reduce the system-level memory requirements
  of a program. However, it cannot guarantee to reduce memory. Under
  some allocation patterns, some large free blocks of memory will be
  locked between two used chunks, so they cannot be given back to
  the system.

  The `pad' argument to malloc_trim represents the amount of free
  trailing space to leave untrimmed. If this argument is zero,
  only the minimum amount of memory to maintain internal data
  structures will be left (one page or less). Non-zero arguments
  can be supplied to maintain enough trailing space to service
  future expected allocations without having to re-obtain memory
  from the system.

  Malloc_trim returns 1 if it actually released any memory, else 0.
  On systems that do not support "negative sbrks", it will always
  return 0.
*/
int __malloc_trim(size_t);
Freeing from the middle of the heap is not documented in the text of malloc/malloc.c (the malloc_trim description in the comment was not updated in 2007), and it is not documented in the man-pages project either. The man page from 2012 may be the first man page of the function, and it was not written by the glibc authors. The glibc info page only mentions the M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and it does not list the malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (it also does not document memusage/memusagestat/libmemusage.so).
You may ask Drepper and other glibc developers again, as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them (only incorrect answers from other users, e.g. https://sourceware.org/ml/libc-help/2015-05/msg00007.html and https://sourceware.org/ml/libc-help/2015-05/msg00008.html).
Or you may test malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>

int main()
{
    int *m1, *m2, *m3, *m4;

    printf("%s\n", "Test started");
    m1 = (int *) malloc(20000);
    m2 = (int *) malloc(40000);
    m3 = (int *) malloc(80000);
    m4 = (int *) malloc(10000);
    printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
    free(m2);
    malloc_trim(0); // 20000, 2000000
    sleep(1);
    free(m1);
    free(m3);
    free(m4);
    // malloc_stats(); malloc_info(0, stdout);
    return 0;
}
Compile with gcc test_malloc_trim.c -o test_malloc_trim and run strace ./test_malloc_trim:
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there is a madvise(..., MADV_DONTNEED) call for 9 pages (36864 bytes) after the malloc_trim(0) call, even though the 40008-byte hole left by free(m2) was in the middle of the heap.
... utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.
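A small user-space test (a sketch, not part of the original discussion) makes that semantics concrete for an anonymous private mapping: after MADV_DONTNEED the range is still mapped and accessible, but its contents are gone and read back as zero-filled pages.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 'x', len);
    printf("before madvise: p[0] = '%c'\n", p[0]);  /* prints 'x' */

    madvise(p, len, MADV_DONTNEED);
    printf("after madvise:  p[0] = %d\n", p[0]);    /* prints 0: zero-fill on demand */

    munmap(p, len);
    return 0;
}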

Get the PTE of a page in Linux

I'm trying to write code that takes a page and returns its PTE (page table entry) in the Linux kernel.
The prototype of the function should be something like this:
static pte_t getPteOfPage(struct page *page);
I tried to find the PTE in the struct page description itself, but it is more complicated than that.
Can anyone show how to do it?
A good starting point is to look at the walk_page_range function:
/**
 * walk_page_range - walk a memory map's page tables with a callback
 * @addr: starting address
 * @end: ending address
 * @walk: set of callbacks to invoke for each level of the tree
 *
 * Recursively walk the page table for the memory area in a VMA,
 * calling supplied callbacks. Callbacks are called in-order (first
 * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
 * etc.). If lower-level callbacks are omitted, walking depth is reduced.
 *
 * Each callback receives an entry pointer and the start and end of the
 * associated range, and a copy of the original mm_walk for access to
 * the ->private or ->mm fields.
 *
 * Usually no locks are taken, but splitting transparent huge page may
 * take page table lock. And the bottom level iterator will map PTE
 * directories from highmem if necessary.
 *
 * If any callback returns a non-zero value, the walk is aborted and
 * the return value is propagated back to the caller. Otherwise 0 is returned.
 *
 * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry
 * is !NULL.
 */
See the walk_page_range function implementation for a good example.
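As a hedged sketch (using the older walk_page_range() API that matches the comment above, where struct mm_walk carries the ->mm and ->private fields), finding the PTE for a user virtual address could look roughly like this. Note that the original question starts from a struct page, so you would first need an owning mm and virtual address, e.g. via the rmap walk shown in the swap-out question above; the caller must hold mm->mmap_sem for read.

#include <linux/mm.h>
#include <linux/sched.h>

/* Callback invoked for each PTE in the walked range: copy it out and stop. */
static int pte_cb(pte_t *pte, unsigned long addr, unsigned long next,
                  struct mm_walk *walk)
{
        *(pte_t *)walk->private = *pte;
        return 1;                       /* non-zero aborts the walk */
}

/* Sketch only: return the PTE mapping 'addr' in 'mm' (zero PTE if none found). */
static pte_t get_pte_of_addr(struct mm_struct *mm, unsigned long addr)
{
        pte_t result = __pte(0);
        struct mm_walk walk = {
                .pte_entry = pte_cb,
                .mm        = mm,
                .private   = &result,
        };

        walk_page_range(addr, addr + PAGE_SIZE, &walk);
        return result;
}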
