Writing a syscall to count context switches of a process - linux

I have to write a system call that counts the voluntary and involuntary context switches of a process. I already know the steps to add a new system call to a Linux kernel, but I have no clue where I should start for the context-switch counting. Any idea?

If your syscall only needs to report statistics, you can use the context switch counting code that is already in the kernel.
The wait3 and getrusage syscalls already report context switch counts in these struct rusage fields:
struct rusage {
    ...
    long ru_nvcsw;  /* voluntary context switches */
    long ru_nivcsw; /* involuntary context switches */
};
You can try it by running:
$ /usr/bin/time -v /bin/ls -R
....
Voluntary context switches: 1669
Involuntary context switches: 207
where "/bin/ls -R" is any program.
By searching for "struct rusage" in the kernel sources, you can find accumulate_thread_rusage in kernel/sys.c, which updates the rusage struct. It reads the fields t->nvcsw and t->nivcsw from struct task_struct *t:
static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
{
    r->ru_nvcsw += t->nvcsw; // <<=== here
    r->ru_nivcsw += t->nivcsw;
    r->ru_minflt += t->min_flt;
    r->ru_majflt += t->maj_flt;
Next, search for nvcsw and nivcsw in the kernel folder to find where the kernel updates them.
In asmlinkage void __sched schedule(void):
4124 if (likely(prev != next)) { // <= if we are switching between different tasks
4125 sched_info_switch(prev, next);
4126 perf_event_task_sched_out(prev, next);
4127
4128 rq->nr_switches++;
4129 rq->curr = next;
4130 ++*switch_count; // <= increment nvcsw or nivcsw via pointer
4131
4132 context_switch(rq, prev, next); /* unlocks the rq */
The switch_count pointer is assigned at line 4091 or line 4111 of the same file, which is where it is pointed at either nivcsw or nvcsw.
PS: Link from perreal is great: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html (search context_swtch)

This already exists: the virtual file /proc/NNNN/status (where NNNN is the decimal process ID of the process you want to know about) contains, among other things, counts of both voluntary and involuntary context switches. Unlike getrusage this allows you to learn the context switch counts for any process, not just children. See the proc(5) manpage for more details.

A process makes a context switch when it blocks, when its time quantum expires, on interrupts, and so on. Eventually the schedule() function is called. Since you want to count switches for each process separately, you have to keep a new variable per process for the number of context switches, and update it in schedule() for the current process each time. Your system call can then read this value. Here is a snippet of the schedule function from Pintos:
static void
schedule (void)
{
  struct thread *cur = running_thread ();
  struct thread *next = next_thread_to_run ();
  struct thread *prev = NULL;

  ASSERT (intr_get_level () == INTR_OFF);
  ASSERT (cur->status != THREAD_RUNNING);
  ASSERT (is_thread (next));

  if (cur != next)
    prev = switch_threads (cur, next); /* <== here you can update the count of "cur" */
  thread_schedule_tail (prev);
}

Total number of context switches
cat /proc/PID/sched|grep nr_switches
Voluntary context switches
cat /proc/PID/sched | grep nr_voluntary_switches
Involuntary context switches
cat /proc/PID/sched|grep nr_involuntary_switches
where PID is process ID of the process you wish to monitor.
However, if you want to get these statistics by patching (creating a hook in) the Linux source, the code related to scheduling is in the
kernel/sched/
folder of the source tree.
In particular,
kernel/sched/core.c contains the schedule() function, which is the core of the Linux scheduler.
The code of CFS (the Completely Fair Scheduler), which is one of several schedulers present in Linux and the most commonly used, is in
kernel/sched/fair.c
schedule() is executed whenever the TIF_NEED_RESCHED flag is set, so find every place where this flag is set (use cscope on the Linux source); that will give you insight into the types of context switches occurring for a process.

Related

where is the context switching finally happening in the linux kernel source?

In Linux, process scheduling occurs after all interrupts (the timer interrupt and other interrupts) or when a process relinquishes the CPU (by explicitly calling schedule()). Today I was trying to see where context switching occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now. I was looking at the sparc arch then.)
I started from main_timer_handler (in arch/x86_64/kernel/time.c), but couldn't find it there.
Finally I found it in ./arch/x86_64/kernel/entry.S.
ENTRY(common_interrupt)
XCPT_FRAME
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
cli
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
exit_intr:
GET_THREAD_INFO(%rcx)
testl $3,CS-ARGOFFSET(%rsp)
je retint_kernel
...(omit)
GET_THREAD_INFO(%rcx)
jmp retint_check
#ifdef CONFIG_PREEMPT
/* Returning to kernel space. Check if we need preemption */
/* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
cmpl $0,threadinfo_preempt_count(%rcx)
jnz retint_restore_args
bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
call preempt_schedule_irq
jmp exit_intr
#endif
CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq, which is defined in kernel/sched.c as below (it calls schedule() in the middle).
/*
* this is the entry point to schedule() from kernel preemption
* off of irq context.
* Note, that this is called and return with irqs disabled. This will
* protect us against recursive calling from irq.
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
struct task_struct *task = current;
int saved_lock_depth;
#endif
/* Catch callers which need to be fixed */
BUG_ON(ti->preempt_count || !irqs_disabled());
need_resched:
add_preempt_count(PREEMPT_ACTIVE);
/*
* We keep the big kernel semaphore locked, but we
* clear ->lock_depth so that schedule() doesnt
* auto-release the semaphore:
*/
#ifdef CONFIG_PREEMPT_BKL
saved_lock_depth = task->lock_depth;
task->lock_depth = -1;
#endif
local_irq_enable();
schedule();
local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
task->lock_depth = saved_lock_depth;
#endif
sub_preempt_count(PREEMPT_ACTIVE);
/* we could miss a preemption opportunity between schedule and now */
barrier();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}
So I found where the scheduling occurs, but my question is: where in the source code does the actual context switch happen? For a context switch, the stack, mm settings, and registers should be switched, and the PC (program counter) should be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch() function, which calls switch_to() (kernel/sched.c).
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
prepare_task_switch(rq, prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_enter_lazy_cpu_mode();
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next);
} else
switch_mm(oldmm, mm, next);
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
/*
* Since the runqueue lock will be released by the next
* task (which is an invalid locking op but in the case
* of the scheduler it's an obvious special-case), so we
* do an early lockdep release here:
*/
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev); // <---- this line
barrier();
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack
* frame will be invalid.
*/
finish_task_switch(this_rq(), prev);
}
switch_to is assembly code under include/asm-x86_64/system.h.
My question is: is the processor switched to the new task inside switch_to()? If so, are the lines 'barrier(); finish_task_switch(this_rq(), prev);' run at some later time? By the way, this was in interrupt context, so if switch_to() is just the end of this ISR, who finishes this interrupt? Or, if finish_task_switch() runs, how is the CPU occupied by the new task?
I would really appreciate it if someone could explain and clarify things for me.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state onto the stack of "current", the currently running process. Calling do_sched_yield just changes the value of current, so the return simply restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all the prev task state and sets up the next task state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code, it is just setting up the hardware for userspace.

What is the purpose of putting a thread on a wait queue with a condition when only one thread is allowed to enter?

On this request
ssize_t foo_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
{
    foo_dev_t *foo_dev = filp->private_data;

    if (down_interruptible(&foo_dev->sem))
        return -ERESTARTSYS;
    foo_dev->intr = 0;
    outb(DEV_FOO_READ, DEV_FOO_CONTROL_PORT);
    wait_event_interruptible(foo_dev->wait, (foo_dev->intr == 1));
    if (put_user(foo_dev->data, buf))
        return -EFAULT;
    up(&foo_dev->sem);
    return 1;
}
With this completion
irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
foo->data = inb(DEV_FOO_DATA_PORT);
foo->intr = 1;
wake_up_interruptible(&foo->wait);
return 1;
}
Assuming foo_dev->sem is initially 1, only one thread is allowed to execute the section after down_interruptible(&foo_dev->sem), and it makes sense for threads waiting for that semaphore to be put in a queue. (As I understand it, making foo_dev->sem greater than one would be a problem in that code.)
So if only one thread ever gets through, what is the use of the foo_dev->wait queue? Isn't it possible to suspend the current thread, save its pointer as a global *curr, and wake it up when its request completes?
Yes, it is possible to put a single thread to sleep (using set_current_state() and schedule()) and resume it later (using wake_up_process()).
But this requires writing code to check the wakeup condition and to handle the case where there is no thread to wake up.
Wait queues provide ready-made functions and macros for waiting on a condition and waking the waiter later, so the resulting code becomes much shorter: the single macro wait_event_interruptible() checks for the event and puts the thread to sleep, and the single macro wake_up_interruptible() resumes the thread if there is one.

ptrace one thread from another

Experimenting with the ptrace() system call, I am trying to trace another thread of the same process. According to the man page, both the tracer and the tracee are specific threads (not processes), so I don't see a reason why it should not work. So far, I have tried the following:
use PTRACE_TRACEME from the clone()d child: the call succeeds, but does not do what I want, probably because the parent of the to-be-traced thread is not the thread that called clone()
use PTRACE_ATTACH or PTRACE_SEIZE from the parent thread: this always fails with EPERM, even if the process runs as root and with prctl(PR_SET_DUMPABLE, 1)
In all cases, waitpid(-1, &status, __WALL) fails with ECHILD (same when passing the child pid explicitly).
What should I do to make it work?
If it is not possible at all, is it by design or a bug in the kernel (I am using version 3.8.0)? In the former case, could you point me to the right bit of the documentation?
As @mic_e pointed out, this is a known fact about the kernel - not quite a bug, but not quite correct either. See the kernel mailing list thread about it. To provide an excerpt from Linus Torvalds:
That "new" (last November) check isn't likely going away. It solved
so many problems (both security and stability), and considering that
(a) in a year, only two people have ever even noticed
(b) there's a work-around as per above that isn't horribly invasive
I have to say that in order to actually go back to the old behaviour,
we'd have to have somebody who cares deeply, go back and check every
single special case, deadlock, and race.
The solution is to actually start the process that is being traced in a subprocess - you'll need to make the ptracing process be the parent of the other.
Here's an outline of doing this based on another answer that I wrote:
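#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // do work for main thread
    return 0;
}

int main(int argc, char *argv[]) {
    char *vstack = malloc(STACK_SIZE);
    pid_t v;
    if (vstack == NULL) {
        perror("failed to allocate child stack");
        return 2;
    }
    // the stack grows down, so pass the top of the allocation
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags
        perror("failed to spawn child task");
        return 3;
    }
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        return 1;
    }
    // do actual ptrace work
    return 0;
}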

How is run queue length computed in linux proc filesystem

I'm trying to obtain the number of runnable processes from the Linux kernel. sar -q gives this information readily. However, I'm trying to get this value from the /proc filesystem. There is no file in /proc that gives this value directly, so how is runq-sz computed?
The wiki page http://en.wikipedia.org/wiki/Load_(computing) provides some insight into how run queue length is computed based on ldavg values, but it is unclear.
Can someone provide more pointers on this? Cheers
As gcla said, you can use
cat /proc/loadavg
to read the load average from the kernel, but strictly speaking it is not a queue length.
Take a look at
grep procs_running /proc/stat
and
grep procs_blocked /proc/stat
The first is the actual run queue length and the second is the number of processes blocked on disk I/O. The load average is a function of the sum of both.
Here is the function in the sysstat daemon that provides the info sar prints out:
https://github.com/sysstat/sysstat/blob/master/rd_stats.c#L392
if ((fp = fopen(LOADAVG, "r")) == NULL)
return;
/* Read load averages and queue length */
fscanf(fp, "%d.%d %d.%d %d.%d %ld/%d %*d\n",
&load_tmp[0], &st_queue->load_avg_1,
&load_tmp[1], &st_queue->load_avg_5,
&load_tmp[2], &st_queue->load_avg_15,
&st_queue->nr_running,
&st_queue->nr_threads);
It reads from /proc/loadavg, which is populated by this kernel function
http://lxr.free-electrons.com/source/fs/proc/loadavg.c#L13
static int loadavg_proc_show(struct seq_file *m, void *v)
{
unsigned long avnrun[3];
get_avenrun(avnrun, FIXED_1/200, 0);
seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
nr_running(), nr_threads,
task_active_pid_ns(current)->last_pid);
return 0;
}
The nr_running() function provides the total of both currently running tasks and tasks that are ready to run on a CPU; it's an instantaneous measure. I believe this will line up with the sar runq-sz variable.
Graham

Prevent file descriptors inheritance during Linux fork

How do you prevent a file descriptor from being copy-inherited across fork() system calls (without closing it, of course)?
I am looking for a way to mark a single file descriptor as NOT to be (copy-)inherited by children at fork(), something like a FD_CLOEXEC-like hack but for forks (so a FD_DONTINHERIT feature if you like). Anybody did this? Or looked into this and has a hint for me to start with?
Thank you
UPDATE:
I could use libc's __register_atfork
__register_atfork(NULL, NULL, fdcleaner, NULL)
to close the fds in child just before fork() returns. However, the FDs are still being copied so this sounds like a silly hack to me. Question is how to skip the dup()-ing in child of unneeded FDs.
I'm thinking of some scenarios when a fcntl(fd, F_SETFL, F_DONTINHERIT) would be needed:
fork() will copy an event FD (e.g. epoll()); sometimes this isn't wanted, for example FreeBSD is marking the kqueue() event FD as being of a KQUEUE_TYPE and these types of FDs won't be copied across forks (the kqueue FDs are skipped explicitly from being copied, if one wants to use it from a child it must fork with shared FD table)
fork() will copy 100k unneeded FDs to fork a child for doing some CPU-intensive tasks (suppose the need for a fork() is probabilistically very low and programmer won't want to maintain a pool of children for something that normally wouldn't happen)
Some descriptors we want to be copied (0, 1, 2), some (most of them?) not. I think full FD table duping is here for historic reasons but I am probably wrong.
How silly does this sound:
patch fcntl() to support the dontinherit flag on file descriptors (not sure if the flag should be kept per-FD or in a FD table fd_set, like the close-on-exec flags are being kept)
modify dup_fd() in kernel to skip copying of dontinherit FDs, same as FreeBSD does for kq FDs
consider the program
#include <stdio.h>
#include <unistd.h>
#include <err.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include <sys/wait.h>

#ifndef NUMFDS
#define NUMFDS 100 /* override with -DNUMFDS=... */
#endif

/* glibc-internal hook, not declared in any public header */
extern int __register_atfork(void (*prepare)(void), void (*parent)(void),
                             void (*child)(void), void *dso_handle);

static int fds[NUMFDS];
clock_t t1;

static void cleanup(int i)
{
    /* close fds[i-1] down to fds[0]; never index fds[-1] */
    while (i-- > 0)
        close(fds[i]);
}

void clk_start(void)
{
    t1 = clock();
}

void clk_end(void)
{
    double tix = (double)clock() - t1;
    double sex = tix / CLOCKS_PER_SEC;
    printf("fork_cost(%d fds)=%fticks(%f seconds)\n",
           NUMFDS, tix, sex);
}

int main(int argc, char **argv)
{
    pid_t pid;
    int i;

    /* clk_start runs before fork(), clk_end in the parent after it */
    __register_atfork(clk_start, clk_end, NULL, NULL);
    for (i = 0; i < NUMFDS; i++) {
        fds[i] = open("/dev/null", O_RDONLY);
        if (fds[i] == -1) {
            cleanup(i);
            errx(EXIT_FAILURE, "open_fds:");
        }
    }
    t1 = clock();
    pid = fork();
    if (pid < 0) {
        errx(EXIT_FAILURE, "fork:");
    }
    if (pid == 0) {
        cleanup(NUMFDS);
        exit(0);
    } else {
        wait(&i);
        cleanup(NUMFDS);
    }
    exit(0);
    return 0;
}
of course, can't consider this a real bench but anyhow:
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100 fds)=0.000000ticks(0.000000 seconds)
real 0m0.004s
user 0m0.000s
sys 0m0.000s
root@pinkpony:/home/cia/dev/kqueue# gcc -DNUMFDS=100000 -o forkit forkit.c
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100000 fds)=10000.000000ticks(0.010000 seconds)
real 0m0.287s
user 0m0.010s
sys 0m0.240s
root@pinkpony:/home/cia/dev/kqueue# gcc -DNUMFDS=100 -o forkit forkit.c
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100 fds)=0.000000ticks(0.000000 seconds)
real 0m0.004s
user 0m0.000s
sys 0m0.000s
forkit ran on a Dell Inspiron 1520 Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz with 4GB RAM; average_load=0.00
If you fork with the purpose of calling an exec function, you can use fcntl with FD_CLOEXEC to have the file descriptor closed once you exec:
int fd = open(...);
fcntl(fd, F_SETFD, FD_CLOEXEC);
Such a file descriptor will survive a fork but not functions of the exec family.
No. Close them yourself, since you know which ones need to be closed.
There's no standard way of doing this to my knowledge.
If you're looking to implement it properly, probably the best way to do it would be to add a system call to mark the file descriptor as close-on-fork, and to intercept the sys_fork system call (syscall number 2) to act on those flags after calling the original sys_fork.
If you don't want to add a new system call, you might be able to get away with intercepting sys_ioctl (syscall number 54) and just adding a new command to it for marking a file descriptor close-on-fork.
Of course, if you can control what your application is doing, then it might be better to maintain user-level tables of all file descriptors you want closed on fork and call your own myfork instead. This would fork, then go through the user-level table closing those file descriptors so marked.
You wouldn't have to fiddle around in the Linux kernel then, a solution that's probably only necessary if you don't have control over the fork process (say, if a third party library is doing the fork() calls).
