How is run queue length computed in the Linux proc filesystem?

I'm trying to obtain the number of runnable processes from the Linux kernel. sar -q gives this information readily, but I want to get the value from the /proc filesystem. No file in /proc gives this value directly, so how is runq-sz computed?
The Wikipedia page http://en.wikipedia.org/wiki/Load_(computing) gives some insight into how run queue length relates to the ldavg values, but it is unclear.
Can someone provide more pointers on this? Cheers

As gcla said, you can use
cat /proc/loadavg
to read the load average from the kernel, but strictly speaking it is not a queue length.
Take a look at
grep procs_running /proc/stat
and
grep procs_blocked /proc/stat
The first is the actual run queue (the number of runnable processes) and the second is the number of processes blocked on disk I/O. The load average is a function of the sum of both.
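To read both counters from a program rather than with grep, a minimal C sketch could look like the following. The helper names find_counter and read_proc_stat_counter are made up for illustration, and the 64 KiB buffer is an assumption about the size of /proc/stat, not a documented limit.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find a line "<key> <number>" in /proc/stat-style
 * text and return the number, or -1 if no line starts with the key. */
long find_counter(const char *text, const char *key)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* The key must start the line and be followed by a space. */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ' ')
            return strtol(line + keylen, NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Read /proc/stat into a buffer and look up one counter (-1 on failure). */
long read_proc_stat_counter(const char *key)
{
    static char buf[65536];     /* assumed large enough for /proc/stat */
    FILE *fp = fopen("/proc/stat", "r");
    size_t n;

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return find_counter(buf, key);
}
```

Calling read_proc_stat_counter("procs_running") would then give the instantaneous number of runnable tasks, and read_proc_stat_counter("procs_blocked") the number blocked on I/O.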

Here is the function in the sysstat daemon that provides the information sar prints out:
https://github.com/sysstat/sysstat/blob/master/rd_stats.c#L392
if ((fp = fopen(LOADAVG, "r")) == NULL)
    return;
/* Read load averages and queue length */
fscanf(fp, "%d.%d %d.%d %d.%d %ld/%d %*d\n",
       &load_tmp[0], &st_queue->load_avg_1,
       &load_tmp[1], &st_queue->load_avg_5,
       &load_tmp[2], &st_queue->load_avg_15,
       &st_queue->nr_running,
       &st_queue->nr_threads);
It reads from /proc/loadavg, which is populated by this kernel function
http://lxr.free-electrons.com/source/fs/proc/loadavg.c#L13
static int loadavg_proc_show(struct seq_file *m, void *v)
{
    unsigned long avnrun[3];
    get_avenrun(avnrun, FIXED_1/200, 0);
    seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
               LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
               LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
               LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
               nr_running(), nr_threads,
               task_active_pid_ns(current)->last_pid);
    return 0;
}
The nr_running() function provides the total of both currently running tasks and tasks that are ready to run on a CPU; it's an instantaneous measure. I believe this will line up with the sar runq-sz variable.
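As a rough illustration of pulling that instantaneous value out of a /proc/loadavg line, here is a small C sketch. It parses the load averages as floating-point numbers for brevity, which is a simplification and not how sysstat does it; the struct and function names are hypothetical.

```c
#include <stdio.h>

/* Fields of one /proc/loadavg line; ok is 1 when parsing succeeded. */
struct loadavg_info {
    double load1, load5, load15;
    long nr_running;    /* runnable tasks at the moment of the read */
    long nr_threads;    /* total tasks known to the scheduler */
    int ok;
};

/* Parse a line such as "0.20 0.18 0.12 1/80 11206". */
struct loadavg_info parse_loadavg(const char *line)
{
    struct loadavg_info info = {0};

    if (sscanf(line, "%lf %lf %lf %ld/%ld",
               &info.load1, &info.load5, &info.load15,
               &info.nr_running, &info.nr_threads) == 5)
        info.ok = 1;
    return info;
}
```

Feeding it the single line read from /proc/loadavg gives nr_running in the fourth field, before the slash.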
Graham

Related

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there a way to learn it at runtime?
The answer comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
    int cpus = libbpf_num_possible_cpus();
    if (cpus < 0) {
        p_err("Can't get # of possible cpus: %s", strerror(-cpus));
        exit(-1);
    }
    return cpus;
}
int libbpf_num_possible_cpus(void)
{
    static const char *fcpu = "/sys/devices/system/cpu/possible";
    static int cpus;
    int err, n, i, tmp_cpus;
    bool *mask;
    /* ---8<--- snip */
}
So that's how they do it!
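The file named in fcpu contains a CPU list such as 0-127. A rough sketch of counting the CPUs in such a list is below. It is not libbpf's actual implementation (that part is snipped above), and count_cpu_list and possible_cpus are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Count CPUs in a kernel CPU-list string such as "0-127" or "0-3,8-11".
 * Returns -1 on empty or malformed input. */
int count_cpu_list(const char *s)
{
    int count = 0;

    while (*s != '\0' && *s != '\n') {
        char *end;
        long first = strtol(s, &end, 10);
        long last = first;

        if (end == s)
            return -1;          /* no digits where a number was expected */
        if (*end == '-') {      /* a range "first-last" */
            s = end + 1;
            last = strtol(s, &end, 10);
            if (end == s || last < first)
                return -1;
        }
        count += (int)(last - first + 1);
        if (*end == ',')
            end++;              /* move on to the next list element */
        s = end;
    }
    return count > 0 ? count : -1;
}

/* Read the possible-CPU list from sysfs and count it (-1 on failure). */
int possible_cpus(void)
{
    char buf[256];
    FILE *fp = fopen("/sys/devices/system/cpu/possible", "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return count_cpu_list(buf);
}
```

In a real program, calling libbpf_num_possible_cpus() directly is simpler. As far as I know, the buffer passed to bpf_map_lookup_elem() for a per-CPU map needs one value slot per possible CPU, with each slot's size rounded up to a multiple of 8 bytes.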

Linux OS: /proc/[pid]/smaps vs /proc/[pid]/statm

I would like to calculate the memory usage of a single process. After a little research I came across smaps and statm.
First of all, what are smaps and statm, and what is the difference between them?
statm has an RSS field, and in smaps I sum up all the Rss values. But the two results differ for the same process. I know that statm measures in pages, so for comparison I converted that value to kB as in smaps. The values are still not equal.
Why do these two values differ, even though they represent the RSS value for the same process?
statm
232214 80703 7168 27 0 161967 0 (measured in pages, pages size is 4096)
smaps
Rss 1956
My aim is to calculate the memory usage for a single process. I am interested in two values: USS and PSS.
Can I get those two values using smaps alone? Are they accurate?
Also, I would like to return that value as a percentage.
I think statm is an approximated simplification of smaps, which is more expensive to get. I came to this conclusion after I looked at the source:
smaps
The information you see in smaps is defined in /fs/proc/task_mmu.c:
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
(...)
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
.mm = vma->vm_mm,
.private = &mss,
};
memset(&mss, 0, sizeof mss);
walk_page_vma(vma, &smaps_walk);
show_map_vma(m, vma, is_pid);
seq_printf(m,
(...)
"Rss: %8lu kB\n"
(...)
mss.resident >> 10,
The information in mss is used by walk_page_vma defined in /mm/pagewalk.c. However, the mss member resident is not filled in by walk_page_vma itself; instead, walk_page_vma calls the callback specified in smaps_walk:
.pmd_entry = smaps_pte_range,
.private = &mss,
like this:
if (walk->pmd_entry)
err = walk->pmd_entry(pmd, addr, next, walk);
So what does our callback, smaps_pte_range in /fs/proc/task_mmu.c, do?
It calls smaps_pte_entry and, in some circumstances, smaps_pmd_entry, both of which call smaps_account(), which in turn updates the resident size. All of these functions are defined in the already linked task_mmu.c, so I didn't post the relevant code snippets; they can easily be seen in the linked sources.
PTE stands for Page Table Entry and PMD is Page Middle Directory. So basically we iterate through the page entries associated with given process and update RAM usage depending on the circumstances.
statm
The information you see in statm is defined in /fs/proc/array.c:
int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
struct mm_struct *mm = get_task_mm(task);
if (mm) {
size = task_statm(mm, &shared, &text, &data, &resident);
mmput(mm);
}
seq_put_decimal_ull(m, 0, size);
seq_put_decimal_ull(m, ' ', resident);
seq_put_decimal_ull(m, ' ', shared);
seq_put_decimal_ull(m, ' ', text);
seq_put_decimal_ull(m, ' ', 0);
seq_put_decimal_ull(m, ' ', data);
seq_put_decimal_ull(m, ' ', 0);
seq_putc(m, '\n');
return 0;
}
This time, resident is filled by task_statm. It has two implementations, one in /fs/proc/task_mmu.c and a second in /fs/proc/task_nomm.c. Since they're almost surely mutually exclusive, I'll focus on the implementation in task_mmu.c (which also contains the smaps code). In this implementation we see that
unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
*shared = get_mm_counter(mm, MM_FILEPAGES);
(...)
*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}
it queries some counters, namely, MM_FILEPAGES and MM_ANONPAGES. These counters are modified during different operations on memory such as do_wp_page defined at /mm/memory.c. All of the modifications seem to be done by the files located in /mm/ and there seem to be quite a lot of them, so I didn't include them here.
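To compare this counter-based value with smaps, the page count from statm has to be converted to kB. A small sketch follows, with statm_resident_kb as a hypothetical helper and the page size passed in by the caller (for example from sysconf(_SC_PAGESIZE)).

```c
#include <stdio.h>

/* Parse the second statm field (resident pages) and convert it to kB.
 * Returns -1 if the line does not start with two numbers. */
long statm_resident_kb(const char *statm_line, long page_size_bytes)
{
    unsigned long size, resident;

    if (sscanf(statm_line, "%lu %lu", &size, &resident) != 2)
        return -1;
    /* Page sizes are multiples of 1024, so divide first to avoid overflow. */
    return (long)(resident * (unsigned long)(page_size_bytes / 1024));
}
```

With the statm line from the question and 4096-byte pages, the resident field of 80703 pages works out to 322812 kB.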
Conclusion
smaps does complicated iteration through all referenced memory regions and updates resident size using the collected information. statm uses data that was already calculated by someone else.
The most important part is that while smaps collects the data each time in an independent manner, statm uses counters that get incremented or decremented during the process life cycle. There are a lot of places that need to do the bookkeeping, and perhaps some places don't update the counters like they should. That's why IMO statm is inferior to smaps, even if it takes fewer CPU cycles to complete.
Please note that this is the conclusion I drew based on common sense, but I might be wrong - perhaps there are no internal inconsistencies in counter decrementing and incrementing, and instead, they might count some pages differently than smaps. At this point I believe it'd be wise to take it to some experienced kernel maintainers.
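On the question of getting USS and PSS from smaps alone: a sketch of one way to do it is to sum the per-mapping fields. The code below assumes USS means Private_Clean + Private_Dirty, which is a common definition but still an assumption here; all struct and function names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Totals in kB; USS is taken here as Private_Clean + Private_Dirty. */
struct smaps_totals {
    long rss_kb;
    long pss_kb;
    long uss_kb;
};

/* If `line` starts with `name`, store the number that follows and return 1. */
static int field_kb(const char *line, const char *name, long *kb)
{
    size_t len = strlen(name);

    if (strncmp(line, name, len) != 0)
        return 0;
    *kb = strtol(line + len, NULL, 10);
    return 1;
}

/* Add one smaps line to the running totals; other lines are ignored. */
void account_smaps_line(struct smaps_totals *t, const char *line)
{
    long kb;

    if (field_kb(line, "Rss:", &kb))
        t->rss_kb += kb;
    else if (field_kb(line, "Pss:", &kb))
        t->pss_kb += kb;
    else if (field_kb(line, "Private_Clean:", &kb) ||
             field_kb(line, "Private_Dirty:", &kb))
        t->uss_kb += kb;
}

/* Sum an in-memory copy of smaps text, one line at a time. */
struct smaps_totals sum_smaps_text(const char *text)
{
    struct smaps_totals t = {0, 0, 0};
    const char *line = text;

    while (line != NULL && *line != '\0') {
        account_smaps_line(&t, line);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return t;
}

/* Sum a smaps file such as /proc/self/smaps; returns 0 on success. */
int sum_smaps_file(const char *path, struct smaps_totals *out)
{
    char line[512];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    memset(out, 0, sizeof *out);
    while (fgets(line, sizeof line, fp) != NULL)
        account_smaps_line(out, line);
    fclose(fp);
    return 0;
}
```

For a percentage, the resulting pss_kb or uss_kb could be divided by the MemTotal value from /proc/meminfo, which is also reported in kB.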

Confusing result from counting page fault in linux

I was writing programs to count page faults on a Linux system; more precisely, the number of times the kernel executes the function __do_page_fault.
To do this I added two global variables, pfcount_at_beg and pfcount_at_end, each of which is incremented once per call, at different locations in __do_page_fault.
To illustrate, the modified function looks like this:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;
static void __kprobes
__do_page_fault(...)
{
    struct vm_area_struct *vma;
    ... // VARIABLES DEFINITION
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
    pfcount_at_beg++; // I added THIS
    ...
    ...
    // ORIGINAL CODE OF THE FUNCTION
    ...
    pfcount_at_end++; // I added THIS
}
I expected the value of pfcount_at_end to be smaller than the value of pfcount_at_beg.
My reasoning is that every time the kernel executes pfcount_at_end++, it must already have executed pfcount_at_beg++ (every function starts at the very beginning of its code).
In addition, there are many conditional returns between these two lines of code.
However, the result is the opposite: the value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall, and I wrote a user-level program to call it.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
    printk(KERN_INFO "total pf_at_beg%lu\ntotal pf_at_end%lu\n", pfcount_at_beg, pfcount_at_end);
    return 0;
}
// user-level program
#include <linux/unistd.h>
#include <sys/syscall.h>
#define __NR_mysyscall 223
int main()
{
    syscall(__NR_mysyscall);
    return 0;
}
Does anybody know what exactly is happening here?
I have since modified the code to make pfcount_at_beg and pfcount_at_end static. However, the result did not change, i.e. the value of pfcount_at_end is still larger than the value of pfcount_at_beg.
So possibly it is caused by the increment not being atomic. Would it be better to use a read-write lock?
The ++ operator is not guaranteed to be atomic, so your counters may suffer from concurrent access and hold incorrect values. You should protect the increment as a critical section, or use the atomic_t type defined in <asm/atomic.h> (<linux/atomic.h> on newer kernels) and its related atomic_set() and atomic_add() functions (and a lot more).
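To make the lost-update problem concrete, here is a small user-space sketch. It is not kernel code: the first function simulates, in a single thread, two CPUs interleaving the load and store halves of counter++, and the second uses C11 <stdatomic.h> as a stand-in for the kernel's atomic_t operations. Both function names are invented for the example.

```c
#include <stdatomic.h>

/* Simulate two CPUs executing "counter++" as separate load and store
 * steps, interleaved so that one of the two increments is lost. */
long simulate_lost_update(void)
{
    long counter = 0;

    long cpu0 = counter;    /* CPU 0 loads 0 */
    long cpu1 = counter;    /* CPU 1 loads 0 before CPU 0 has stored */
    counter = cpu0 + 1;     /* CPU 0 stores 1 */
    counter = cpu1 + 1;     /* CPU 1 also stores 1: one update is lost */
    return counter;
}

/* With an atomic read-modify-write, each increment is indivisible, so
 * no interleaving can make two increments collapse into one. */
long atomic_two_increments(void)
{
    atomic_long counter = 0;

    atomic_fetch_add(&counter, 1);
    atomic_fetch_add(&counter, 1);
    return atomic_load(&counter);
}
```

Since the two counters in the question are incremented independently, each can lose a different number of updates, which would explain how pfcount_at_end can end up larger than pfcount_at_beg.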
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

Accessing large memory (32 GB) using /dev/zero

I want to use /dev/zero for storing lots of temporary data (32 GB or around that). I am doing this:
fd = open("/dev/zero", O_RDWR );
// <Exit on error>
vbase = (uint64_t*) mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
// <Exit on error>
ftruncate(fd, (off_t) MEMSIZE);
I am changing MEMSIZE from 1GB to 32 GB (performing a memtest) to see if I can really access all that range. I am running out of memory at 1 GB.
Is there something I am missing? Am I mmap'ing correctly?
Or am I running into some system limit? How can I check if this is happening?
P.S.: I run many programs that generate many gigabytes of data within a single file, so I don't know if there is an artificial upper limit, just that I seem to be running into something.
I have to admit I'm confused about what you're actually trying to do. Anyway, a couple of reasons why what you do might not work:
From the mmap(2) manpage: "MAP_ANONYMOUS: The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored;"
From the null(4) manpage: "Data written to a null or zero special file is discarded."
So anyway, before MAP_ANONYMOUS, mmap'ing /dev/zero was sometimes used to get anonymous (i.e. not backed by any file) memory. No need to do both. In either case, actually writing to all that memory implies that you need some kind of backing store for it, either physical memory or swap space. If you cannot guarantee that, maybe it's better to mmap() a real file on a filesystem with enough space?
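A minimal sketch of the MAP_ANONYMOUS-only variant is below: no /dev/zero, no ftruncate, and fd passed as -1. The function name anon_map_roundtrip is made up, and touching only the first and last byte is just a cheap check, not a full memtest.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of anonymous memory, check that it reads as zero, write to
 * the first and last byte, then unmap. Returns 0 on success, -1 on error. */
int anon_map_roundtrip(size_t bytes)
{
    unsigned char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ok;

    if (p == MAP_FAILED)
        return -1;              /* errno says why, e.g. ENOMEM */
    ok = (p[0] == 0 && p[bytes - 1] == 0);
    p[0] = 0xAA;
    p[bytes - 1] = 0x55;
    ok = ok && p[0] == 0xAA && p[bytes - 1] == 0x55;
    munmap(p, bytes);
    return ok ? 0 : -1;
}
```

If the call fails for a large size, errno (for example ENOMEM) gives a first hint about which limit was hit.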
Look at the Linux kernel mmap implementation:
vm_mmap -> vm_mmap_pgoff -> do_mmap_pgoff -> mmap_region -> file->f_op->mmap(file, vma)
The function do_mmap_pgoff checks max_map_count:
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
root> sysctl -a | grep map_count
vm.max_map_count = 65530
The function mmap_region checks the process's virtual address space limit (unless it is unlimited):
int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
    unsigned long cur = mm->total_vm; /* pages */
    unsigned long lim;
    lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
    if (cur + npages > lim)
        return 0;
    return 1;
}
root> ulimit -a | grep virtual
virtual memory (kbytes, -v) unlimited
In the Linux kernel, the init task has these rlimit settings by default:
[RLIMIT_AS] = { RLIM_INFINITY, RLIM_INFINITY }, \
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
#endif
To verify this, use the test_mem program:
tmp> ./test_mem
RLIMIT_AS limit got sucessfully:
soft_limit=4294967295, hard_limit=4294967295
RLIMIT_DATA limit got sucessfully:
soft_limit=4294967295, hard_limit=4294967295
struct rlimit rl;
int ret;
ret = getrlimit(RLIMIT_AS, &rl);
if (ret == 0) {
printf("RLIMIT_AS limit got sucessfully:\n");
printf("soft_limit=%lld, hard_limit=%lld\n", (long long)rl.rlim_cur, (long long)rl.rlim_max);
}
This shows that "unlimited" is reported as 0xFFFFFFFF (4294967295) for a 32-bit application on a 64-bit OS. If you change the shell's virtual address limit, the program reflects it correctly:
root> ulimit -v 1024000
tmp> ./test_mem
RLIMIT_AS limit got sucessfully:
soft_limit=1048576000, hard_limit=1048576000
RLIMIT_DATA limit got sucessfully:
soft_limit=4294967295, hard_limit=4294967295
In mmap_region there is also an accounting check:
accountable_mapping -> security_vm_enough_memory_mm -> cap_vm_enough_memory -> __vm_enough_memory -> overcommit/swap/admin and user reserve handling.
Check these three limits in turn to see which one you are hitting.
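For the RLIMIT_AS check, a sketch that prints the limit in a more readable way than the raw number might look like this. format_rlim and print_as_limit are hypothetical helpers; the only real APIs used are getrlimit(2) and the RLIM_INFINITY constant.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Render an rlimit value into buf, showing RLIM_INFINITY as "unlimited". */
const char *format_rlim(rlim_t value, char *buf, size_t buflen)
{
    if (value == RLIM_INFINITY)
        snprintf(buf, buflen, "unlimited");
    else
        snprintf(buf, buflen, "%llu", (unsigned long long)value);
    return buf;
}

/* Print the soft and hard address-space limits of the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
int print_as_limit(void)
{
    struct rlimit rl;
    char soft[32], hard[32];

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;
    printf("RLIMIT_AS soft=%s hard=%s\n",
           format_rlim(rl.rlim_cur, soft, sizeof soft),
           format_rlim(rl.rlim_max, hard, sizeof hard));
    return 0;
}
```

Comparing against RLIM_INFINITY avoids having to recognize values like 4294967295 by eye, since the numeric value of "unlimited" depends on whether the program is built as 32-bit or 64-bit.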

Writing a syscall to count context switches of a process

I have to write a system call that counts the voluntary and involuntary context switches of a process. I already know the steps to add a new system call to the Linux kernel, but I have no clue where to start for the context-switch counting. Any ideas?
If your syscall should only report statistics, you can use the context switch counting code that is already in the kernel.
The wait3 and getrusage syscalls already report context switch counts in struct rusage fields:
struct rusage {
...
long ru_nvcsw; /* voluntary context switches */
long ru_nivcsw; /* involuntary context switches */
};
You can try it by running:
$ /usr/bin/time -v /bin/ls -R
....
Voluntary context switches: 1669
Involuntary context switches: 207
where "/bin/ls -R" is any program.
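From inside a program, the same two fields can be read with getrusage(2). A minimal sketch for the calling process follows; the wrapper name self_context_switches is invented for the example.

```c
#include <sys/resource.h>

/* Fetch the calling process's context-switch counts via getrusage(2).
 * Returns 0 on success, -1 on error. */
int self_context_switches(long *voluntary, long *involuntary)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *voluntary = ru.ru_nvcsw;       /* gave up the CPU, e.g. blocked on I/O */
    *involuntary = ru.ru_nivcsw;    /* was preempted by the scheduler */
    return 0;
}
```

Passing RUSAGE_CHILDREN instead of RUSAGE_SELF would report the totals for waited-for children, which is what /usr/bin/time relies on.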
By searching for "struct rusage" in the kernel sources, you can find accumulate_thread_rusage in kernel/sys.c, which updates the rusage struct. It reads the fields t->nvcsw and t->nivcsw from struct task_struct *t:
static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
{
    r->ru_nvcsw += t->nvcsw; // <<=== here
    r->ru_nivcsw += t->nivcsw;
    r->ru_minflt += t->min_flt;
    r->ru_majflt += t->maj_flt;
    ...
}
Then search for nvcsw and nivcsw in the kernel folder to find how the kernel updates them.
asmlinkage void __sched schedule(void):
4124 if (likely(prev != next)) { // <= if we are switching between different tasks
4125 sched_info_switch(prev, next);
4126 perf_event_task_sched_out(prev, next);
4127
4128 rq->nr_switches++;
4129 rq->curr = next;
4130 ++*switch_count; // <= increment nvcsw or nivcsw via pointer
4131
4132 context_switch(rq, prev, next); /* unlocks the rq */
The pointer switch_count is set at line 4091 or line 4111 of the same file.
PS: Link from perreal is great: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html (search context_swtch)
This already exists: the virtual file /proc/NNNN/status (where NNNN is the decimal process ID of the process you want to know about) contains, among other things, counts of both voluntary and involuntary context switches. Unlike getrusage this allows you to learn the context switch counts for any process, not just children. See the proc(5) manpage for more details.
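As a sketch of reading those counts from a program, the code below looks up the voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines of /proc/NNNN/status. The helper names are hypothetical and the 16 KiB buffer is an assumed upper bound on the file size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the number after "<field>:" in /proc/<pid>/status-style text,
 * or -1 if no line starts with that field name. */
long status_field(const char *text, const char *field)
{
    size_t len = strlen(field);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, field, len) == 0 && line[len] == ':')
            return strtol(line + len + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Read /proc/<pid>/status and look up one numeric field (-1 on failure). */
long read_status_field(int pid, const char *field)
{
    char path[64];
    static char buf[16384];     /* assumed large enough for the file */
    FILE *fp;
    size_t n;

    snprintf(path, sizeof path, "/proc/%d/status", pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return status_field(buf, field);
}
```

For example, read_status_field(pid, "voluntary_ctxt_switches") and read_status_field(pid, "nonvoluntary_ctxt_switches") would return the two counts for any process whose status file is readable.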
A process makes a context switch when it blocks, when its time quantum expires, on interrupts, and so on. Eventually the schedule() function is called. Since you want a separate count for each process, you have to keep a new per-process variable for the number of context switches and update it in schedule() for the current process each time. Your system call can then read this value. Here is a snippet of the schedule function from Pintos:
static void
schedule (void)
{
    struct thread *cur = running_thread ();
    struct thread *next = next_thread_to_run ();
    struct thread *prev = NULL;
    ASSERT (intr_get_level () == INTR_OFF);
    ASSERT (cur->status != THREAD_RUNNING);
    ASSERT (is_thread (next));
    if (cur != next)
        prev = switch_threads (cur, next); /* <== here you can update the count of "cur" */
    thread_schedule_tail (prev);
}
Total number of context switches:
cat /proc/PID/sched | grep nr_switches
Voluntary context switches:
cat /proc/PID/sched | grep nr_voluntary_switches
Involuntary context switches:
cat /proc/PID/sched | grep nr_involuntary_switches
where PID is the process ID of the process you wish to monitor.
However, if you want to get these statistics by patching (creating a hook in) the Linux source, the scheduling code is in the
kernel/sched/
folder of the source tree.
In particular,
kernel/sched/core.c contains the schedule() function, which is the core of the Linux scheduler.
The code of CFS (the Completely Fair Scheduler), one of several schedulers present in Linux and the most commonly used, is in
/kernel/sched/fair.c
schedule() is executed whenever the TIF_NEED_RESCHED flag is set, so find out all the places where this flag is set (use cscope on the Linux source); this will give you insight into the types of context switches occurring for a process.
