I'm trying to write code that takes a page and returns its PTE (page table entry) in the Linux kernel.
The prototype of the function should be something like this:
static pte_t getPteOfPage(struct page *page);
I looked for the PTE in the definition of struct page, but it turns out to be more complicated than that.
Can anyone show how to do it?
A good starting point is the walk_page_range function:
/**
 * walk_page_range - walk a memory map's page tables with a callback
 * @addr: starting address
 * @end: ending address
 * @walk: set of callbacks to invoke for each level of the tree
 *
 * Recursively walk the page table for the memory area in a VMA,
 * calling supplied callbacks. Callbacks are called in-order (first
 * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
 * etc.). If lower-level callbacks are omitted, walking depth is reduced.
 *
 * Each callback receives an entry pointer and the start and end of the
 * associated range, and a copy of the original mm_walk for access to
 * the ->private or ->mm fields.
 *
 * Usually no locks are taken, but splitting transparent huge page may
 * take page table lock. And the bottom level iterator will map PTE
 * directories from highmem if necessary.
 *
 * If any callback returns a non-zero value, the walk is aborted and
 * the return value is propagated back to the caller. Otherwise 0 is returned.
 *
 * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry
 * is !NULL.
 */
See the implementation of walk_page_range for a good example.
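To make the PGD, PUD, PMD, PTE order in that comment more concrete, here is a minimal user-space sketch of how a virtual address selects one entry at each level. It is not kernel code. It assumes x86-64 with 4-level paging and 4 KiB pages (9 index bits per level, 12 offset bits); struct vaddr_parts and split_vaddr are hypothetical names made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct vaddr_parts {
    unsigned pgd;    /* index into the top-level table (PGD) */
    unsigned pud;    /* index into the PUD */
    unsigned pmd;    /* index into the PMD */
    unsigned pte;    /* index into the PTE table */
    unsigned offset; /* byte offset inside the 4 KiB page */
};

/*
 * Assumption: x86-64, 4-level paging, 4 KiB pages.
 * Bits 47..39 -> PGD, 38..30 -> PUD, 29..21 -> PMD, 20..12 -> PTE,
 * bits 11..0 -> offset within the page.
 */
static struct vaddr_parts split_vaddr(uint64_t va)
{
    struct vaddr_parts p;

    p.offset = (unsigned)(va & 0xfff);
    p.pte    = (unsigned)((va >> 12) & 0x1ff);
    p.pmd    = (unsigned)((va >> 21) & 0x1ff);
    p.pud    = (unsigned)((va >> 30) & 0x1ff);
    p.pgd    = (unsigned)((va >> 39) & 0x1ff);
    return p;
}
```

For the address 0x00007f1234567abc this yields PGD index 0xfe, PUD index 0x48, PMD index 0x1a2, PTE index 0x167 and page offset 0xabc. A real walk additionally needs the mm of the process that maps the page and the appropriate locking, which is what walk_page_range handles.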
If vruntime is counted since the creation of a process, how come such a process ever gets the processor when it is competing with a newly created processor-bound process that is younger by, let's say, days?
As I've read, the rule is simple: pick the leftmost leaf, which is the process with the lowest runtime.
Thanks!
The kernel documentation for CFS kind of glosses over what would be the answer to your question, but mentions it briefly:
In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
So, vruntime is actually normalized. But the documentation does not go into detail.
How is it actually done?
Normalization happens by means of a min_vruntime value. This min_vruntime value is recorded in the CFS runqueue (struct cfs_rq). The min_vruntime value is the smallest vruntime of all tasks in the rbtree. The value is also used to track all the work done by the cfs_rq.
You can observe an example of normalization being performed in CFS' enqueue_entity() code:
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    ...
}
You can also observe in update_curr() how vruntime and min_vruntime are kept updated:
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    ...

    curr->exec_start = now;
    ...
    curr->sum_exec_runtime += delta_exec;
    ...
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_min_vruntime(cfs_rq);
    ...
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}
The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
    u64 vruntime = cfs_rq->min_vruntime;

    if (cfs_rq->curr)
        vruntime = cfs_rq->curr->vruntime;

    if (cfs_rq->rb_leftmost) {
        struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                           struct sched_entity,
                                           run_node);

        if (!cfs_rq->curr)
            vruntime = se->vruntime;
        else
            vruntime = min_vruntime(vruntime, se->vruntime);
    }

    /* ensure we never gain time by being placed backwards. */
    cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
    ...
}
By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)
So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)
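The effect on the days-old versus brand-new task example can be sketched with a toy user-space model. This is only an illustration under simplifying assumptions, not the scheduler's code: struct toy_entity, struct toy_rq and the toy_* functions are hypothetical stand-ins. They mirror just two things from the snippets above, the `se->vruntime += cfs_rq->min_vruntime` step in enqueue_entity() and the never-go-backwards update at the end of update_min_vruntime().

```c
#include <stdint.h>

/* Hypothetical stand-ins for struct sched_entity and struct cfs_rq. */
struct toy_entity { uint64_t vruntime; };
struct toy_rq     { uint64_t min_vruntime; };

/* Wraparound-safe maximum: compare via the signed difference. */
static uint64_t toy_max_vruntime(uint64_t max, uint64_t v)
{
    return ((int64_t)(v - max) > 0) ? v : max;
}

/*
 * Mirrors the last step of update_min_vruntime(): min_vruntime only
 * ever moves forward, towards the smallest vruntime still queued.
 */
static void toy_update_min(struct toy_rq *rq, uint64_t smallest_queued)
{
    rq->min_vruntime = toy_max_vruntime(rq->min_vruntime, smallest_queued);
}

/*
 * Mirrors the quoted line of enqueue_entity(): a vruntime stored
 * relative to the queue becomes absolute by adding min_vruntime.
 */
static void toy_enqueue(const struct toy_rq *rq, struct toy_entity *se)
{
    se->vruntime += rq->min_vruntime;
}
```

In this model, if the only queued task has accumulated five days of vruntime (432000000000000 ns), min_vruntime follows it there, and a new task enqueued with a relative vruntime of 0 ends up with that same absolute value. The newcomer starts level with the old task instead of days behind it.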
My goal is to find the process IDs of pages that are being swapped out. The Linux kernel function swap_writepage() takes a pointer to struct page as one of its arguments when swapping a page out to the backing store. All swap-out operations are done by the "kswapd" process. I need to find the pid(s) of the processes whose page is passed to swap_writepage(). To get there, I was able to find all page table entries associated with that page using the rmap structures.
How can I get the pid from a PTE or from a struct page? I used SystemTap to get the value of the struct page pointer received as an argument in swap_writepage(). However, the pid() function prints the pid of the currently running process, which is always kswapd, not the pid of the process the page belongs to.
Here is an example of how reverse mapping is used in modern Linux (copied from LXR):
static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
    int ret = SWAP_AGAIN;

    anon_vma = page_lock_anon_vma(page);
    if (!anon_vma)
        return ret;

    list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long address;

        /*
         * During exec, a temporary VMA is setup and later moved.
         * The VMA is moved under the anon_vma lock but not the
         * page tables leading to a race where migration cannot
         * find the migration ptes. Rather than increasing the
         * locking requirements of exec(), migration skips
         * temporary VMAs until after exec() completes.
         */
        if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
                is_vma_temporary_stack(vma))
            continue;

        address = vma_address(page, vma);
        if (address == -EFAULT)
            continue;
        ret = try_to_unmap_one(page, vma, address, flags);
        if (ret != SWAP_AGAIN || !page_mapped(page))
            break;
    }

    page_unlock_anon_vma(anon_vma);
    return ret;
}
This example shows rmap being used to unmap pages. Each anonymous page holds an anon_vma object in its ->mapping field. The anon_vma holds a list of the VMAs the page is mapped into. Having the vma you have the mm, and having the mm you have a task_struct. That's it. If you have any doubts, see the illustration in Daniel P. Bovet, Marco Cesati, Understanding the Linux Kernel, chapter 17.2.
It often happens to me when debugging or playing around in GHCi that I happen to know the actual ThreadId number (for example from using Debug.Trace), but that's all I have.
The problem is that all thread APIs, such as killThread require a ThreadId and not an Int.
I've tried Hoogle but came up empty. Is there a way to do this? I'm concerned mostly with debugging, so I don't mind if it's a nasty hack or if it's through a GHC-only library.
You can't. ThreadId is abstract. The Int you have is actually nothing more than a counter (source):
static StgThreadID next_thread_id = 1;
...
StgTSO *
createThread(Capability *cap, W_ size)
{
    StgTSO *tso;
    ...
    ACQUIRE_LOCK(&sched_mutex);
    tso->id = next_thread_id++; // while we have the mutex
    ...
    RELEASE_LOCK(&sched_mutex);
    ...
}
...
int
rts_getThreadId(StgPtr tso)
{
    return ((StgTSO *)tso)->id;
}
It's rts_getThreadId that gets called in ThreadId's Show instance. There's no mapping back to the actual TSO. If you want to know what ThreadId belongs to what Int, you need to keep track of them yourself. You could, for example, parse the Int and fill a Map.
Assume we have a blank computer without any OS and we are installing Linux. Where in the kernel is the code that identifies the processors and the cores and gets information about them and from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure (cpuinfo_x86 on x86).
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which takes a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we see a show_cpuinfo function. This function is in the same file, on line 57.
Here you can see:
seq_printf(m, "processor\t: %u\n"
           "vendor_id\t: %s\n"
           "cpu family\t: %d\n"
           "model\t\t: %u\n"
           "model name\t: %s\n",
           cpu,
           c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
           c->x86,
           c->x86_model,
           c->x86_model_id[0] ? c->x86_model_id : "unknown");
The variable c is declared on the first line as struct cpuinfo_x86. This structure is defined in arch/x86/include/asm/processor.h. If you search for references to that structure, you will find the function cpu_detect, which calls cpuid, which finally resolves to native_cpuid, which looks like this:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
                                unsigned int *ecx, unsigned int *edx)
{
    /* ecx is often an input as well as an output. */
    asm volatile("cpuid"
        : "=a" (*eax),
          "=b" (*ebx),
          "=c" (*ecx),
          "=d" (*edx)
        : "0" (*eax), "2" (*ecx)
        : "memory");
}
Here you see the assembler instruction cpuid. This little thing does the real work.
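The same instruction can be tried from user space. The following is a minimal sketch, assuming GCC or Clang on x86, whose <cpuid.h> header provides __get_cpuid(). The helper name read_cpu_vendor is hypothetical. It reads leaf 0, where the 12-character vendor string (the value shown as vendor_id in /proc/cpuinfo) is returned in EBX, EDX, ECX, in that order.

```c
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang wrapper around the cpuid instruction */
#endif

/*
 * Hypothetical helper: writes the NUL-terminated vendor string into
 * out[] and returns 1, or returns 0 if cpuid is unavailable
 * (for example when built for a non-x86 target).
 */
static int read_cpu_vendor(char out[13])
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0: highest supported leaf in EAX, vendor in EBX/EDX/ECX. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;

    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 1;
#else
    (void)out;
    return 0;
#endif
}
```

Calling read_cpu_vendor(buf) with a 13-byte buffer and printing buf on success should show the same string the kernel stores in x86_vendor_id, since both come from the same cpuid leaf.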
This information comes from the BIOS plus a hardware database. You can get the info directly with dmidecode, for example (if you need more, check the dmidecode source code):
sudo dmidecode -t processor
I am trying to write a function in Linux kernel space that walks over a page cache, and searches for a page that contains a specific block.
I don't know how to get the pages in the page-cache one-by-one.
I saw that find_get_page is a function that can help me, but I don't know how to get the first page offset and how to continue.
As I said, I am trying to do something like this:
for(every page in struct address_space *mapping)
{
    for(every struct buffer_head in current_page->buffers)
    {
        check if(my_sector == current_buffer_head->b_blocknr)
        ...
    }
}
Can anyone help me find out how to walk over the whole page cache?
I believe there is code in the Linux kernel that does something like this (for example, when there is a write to a page and the page is looked up in the cache), but I couldn't find it...
Thanks!
The address_space structure holds all of its pages in a radix tree (mapping->page_tree in your case). So all you need to do is iterate over that tree. The Linux kernel has a radix tree API (see here), including the for_each iterators. For example:
/**
 * radix_tree_for_each_chunk_slot - iterate over slots in one chunk
 *
 * @slot: the void** variable, at the beginning points to chunk first slot
 * @iter: the struct radix_tree_iter pointer
 * @flags: RADIX_TREE_ITER_*, should be constant
 *
 * This macro is designed to be nested inside radix_tree_for_each_chunk().
 * @slot points to the radix tree slot, @iter->index contains its index.
 */
#define radix_tree_for_each_chunk_slot(slot, iter, flags) \
    for (; slot ; slot = radix_tree_next_slot(slot, iter, flags))