Where does the context switch finally happen in the Linux kernel source?

In Linux, process scheduling occurs after all interrupts (the timer interrupt and other interrupts) or when a process relinquishes the CPU by explicitly calling schedule(). Today I was trying to see where context switching occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now. I was looking at the sparc arch then.)
I started from main_timer_handler (in arch/x86_64/kernel/time.c), but couldn't find it there.
Finally I found it in ./arch/x86_64/kernel/entry.S.
ENTRY(common_interrupt)
XCPT_FRAME
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
cli
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
exit_intr:
GET_THREAD_INFO(%rcx)
testl $3,CS-ARGOFFSET(%rsp)
je retint_kernel
...(omit)
GET_THREAD_INFO(%rcx)
jmp retint_check
#ifdef CONFIG_PREEMPT
/* Returning to kernel space. Check if we need preemption */
/* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
cmpl $0,threadinfo_preempt_count(%rcx)
jnz retint_restore_args
bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
call preempt_schedule_irq
jmp exit_intr
#endif
CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq! preempt_schedule_irq is defined in kernel/sched.c as shown below (it calls schedule() in the middle).
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage void __sched preempt_schedule_irq(void)
{
    struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
    struct task_struct *task = current;
    int saved_lock_depth;
#endif
    /* Catch callers which need to be fixed */
    BUG_ON(ti->preempt_count || !irqs_disabled());
need_resched:
    add_preempt_count(PREEMPT_ACTIVE);
    /*
     * We keep the big kernel semaphore locked, but we
     * clear ->lock_depth so that schedule() doesnt
     * auto-release the semaphore:
     */
#ifdef CONFIG_PREEMPT_BKL
    saved_lock_depth = task->lock_depth;
    task->lock_depth = -1;
#endif
    local_irq_enable();
    schedule();
    local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
    task->lock_depth = saved_lock_depth;
#endif
    sub_preempt_count(PREEMPT_ACTIVE);
    /* we could miss a preemption opportunity between schedule and now */
    barrier();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}
So I found where the scheduling occurs, but my question is: where in the source code does the actual context switch happen? For a context switch, the stack, the mm settings, and the registers have to be switched, and the PC (program counter) has to be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch function, which calls switch_to() (kernel/sched.c).
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;
    prepare_task_switch(rq, prev, next);
    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_enter_lazy_cpu_mode();
    if (unlikely(!mm)) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);
    if (unlikely(!prev->mm)) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev); // <---- this line
    barrier();
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
switch_to is assembly code defined in include/asm-x86_64/system.h.
My question is: is the processor switched to the new task inside switch_to()? If so, are the statements 'barrier(); finish_task_switch(this_rq(), prev);' run at some later time? By the way, this was in interrupt context, so if switch_to() is just the end of this ISR, who finishes this interrupt? Or, if finish_task_switch runs, how does the new task get the CPU?
I would really appreciate it if someone could explain and clarify this for me.

Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state on the stack of "current", the currently running process. Calling do_sched_yield just changes the value of current, so the return just restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all of the prev task's state and sets up the next task's state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code; it is just setting up the hardware for userspace.
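The ordering part of the question (does the code after switch_to() run "later"?) can be illustrated in user space. What follows is only a sketch under stated assumptions: it uses the POSIX ucontext calls (getcontext, makecontext, swapcontext) as a stand-in for switch_to(), and the names note, task_body, and run_switch_demo are made up for this illustration. It is not how the kernel implements the switch, but it shows the same effect: each context continues right after its own swap call once it is switched back in.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical names for illustration only; none of this is kernel code. */
static ucontext_t main_ctx, task_ctx;
static char trace[64];
static char task_stack[64 * 1024];   /* private stack for context B */

/* Append one event to the trace so the execution order can be inspected. */
static void note(const char *s)
{
    assert(strlen(trace) + strlen(s) < sizeof trace);
    strcat(trace, s);
}

static void task_body(void)
{
    note("B1 ");
    swapcontext(&task_ctx, &main_ctx);  /* B's "switch_to": B stops here */
    note("B2 ");                        /* runs only once B is switched back in */
}

const char *run_switch_demo(void)
{
    trace[0] = '\0';
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;       /* resume A when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    note("A1 ");
    swapcontext(&main_ctx, &task_ctx);  /* A -> B; A is frozen at this call */
    note("A2 ");                        /* A continues after its own swap */
    swapcontext(&main_ctx, &task_ctx);  /* A -> B; B continues after its swap */
    note("A3");
    return trace;
}
```

Calling run_switch_demo() returns the trace "A1 B1 A2 B2 A3". The statement after each swapcontext() call plays the role of 'barrier(); finish_task_switch(...)': it does run, but only when that context gets the CPU again.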

Related

Does wake_up cause a race condition?

I was looking at the wake_up function here from the linux kernel code
https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154
It's line 154
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
If it's waking up all the threads, couldn't this cause a race condition? Let's say all the threads are waiting for the same data structure or something, so once the wake_up is called, aren't all the threads racing for the same thing?

How to make forward progress with mprotect() after handling a page fault exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory for read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer,pagesize,PROT_NONE);
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p;
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access rights of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again, modify the write permissions, and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
char *buffer;
int flag=0;
static void handler(int sig, siginfo_t *si, void *unused)
{
printf("Got SIGSEGV at address: 0x%lx\n",(long) si->si_addr);
printf("Implements the handler only\n");
flag=1;
//exit(EXIT_FAILURE);
}
int main(int argc, char *argv[])
{
char *p; char a;
int pagesize;
struct sigaction sa;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = handler;
if (sigaction(SIGSEGV, &sa, NULL) == -1)
handle_error("sigaction");
pagesize=4096;
/* Allocate a buffer aligned on a page boundary;
initial protection is PROT_READ | PROT_WRITE */
buffer = memalign(pagesize, 4 * pagesize);
if (buffer == NULL)
handle_error("memalign");
printf("Start of region: 0x%lx\n", (long) buffer);
printf("Start of region: 0x%lx\n", (long) buffer+pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+2*pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+3*pagesize);
//if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
handle_error("mprotect");
//for (p = buffer ; ; )
if(flag==0)
{
p = buffer+pagesize/2;
printf("It comes here before reading memory\n");
a = *p; //trying to read the memory
printf("It comes here after reading memory\n");
}
else
{
if (mprotect(buffer + pagesize * 0, pagesize,PROT_READ) == -1)
handle_error("mprotect");
a = *p;
printf("Now i can read the memory\n");
}
/* for (p = buffer;p<=buffer+4*pagesize ;p++ )
{
//a = *(p);
*(p) = 'a';
printf("Writing at address %p\n",p);
}*/
printf("Loop completed\n"); /* Should never happen */
exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.
When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-signal-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext structure is defined in <ucontext.h>. Within the ucontext the field uc_mcontext contains the machine context, and within that, the array gregs contains the general register context. So in your signal handler:
ucontext_t *u = (ucontext_t *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP];
will give you the pc where the exception occurred (REG_RIP is the x86-64 name; with glibc it is only visible when _GNU_SOURCE is defined). You can read it to figure out what instruction it was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.
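To make the "call mprotect in the handler" approach concrete, here is a minimal sketch. It assumes Linux, where mprotect is a plain system call (POSIX does not list it as async-signal-safe, as noted above), and it uses mmap instead of memalign to get a page-aligned region. The names guarded_page, on_segv, and read_guarded_page are invented for this example. The handler unprotects the page and returns, so the faulting load is re-executed and succeeds.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names; a sketch that assumes Linux behaviour. */
static char *guarded_page;                 /* page we protect on purpose */
static size_t page_size;
static volatile sig_atomic_t fault_count;  /* how often the handler ran */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)guarded_page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + page_size)
        _exit(1);   /* fault outside our page: a real bug, give up */
    fault_count++;
    /* Not async-signal-safe per POSIX; assumed to be a plain syscall here. */
    if (mprotect(guarded_page, page_size, PROT_READ) == -1)
        _exit(2);
    /* Returning re-executes the faulting load, which now succeeds. */
}

/* Returns how many times the handler ran (expected: once), or -1 on error. */
int read_guarded_page(void)
{
    struct sigaction sa, old;
    volatile char *p;
    char value;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    guarded_page = mmap(NULL, page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guarded_page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = on_segv;
    if (sigaction(SIGSEGV, &sa, &old) == -1)
        return -1;

    fault_count = 0;
    p = guarded_page + page_size / 2;
    value = *p;             /* first attempt faults; the retry reads the byte */
    assert(value == 0);     /* anonymous mappings are zero-filled */

    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    munmap(guarded_page, page_size);
    return (int)fault_count;
}
```

The handler checks si_addr first so that an unrelated crash is not silently "repaired", and it avoids printf, which is not safe to call from a handler either.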
You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.
You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.
You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.
There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h)
http://www.mail-archive.com/arch-general@archlinux.org/msg13853.html

What is the meaning of the instruction {interrupt do_IRQ} in linux kernel?

What is the meaning of the instruction {interrupt do_IRQ} in the Linux kernel file arch/x86/kernel/entry_64.S? Is interrupt an instruction or a macro? Where is it defined, and how is it used?
common_interrupt:
XCPT_FRAME
addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */
interrupt do_IRQ
/* 0(%rsp): old_rsp-ARGOFFSET */
It's declared a short distance above:
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
/* reserve pt_regs for scratch regs and rbp */
subq $ORIG_RAX-RBP, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
call save_args
PARTIAL_FRAME 0
call \func
.endm
I don't know what that does, though. :-)
An interrupt suspends whatever is currently running on the interrupted CPU core so that the work associated with the interrupt can run. That work is done by the handler routine registered for the interrupt.
Interrupts may be generated by hardware or software, and there are basically two types: 1) soft interrupts and 2) hard interrupts.
So whenever a particular interrupt is generated, its handler routine is called. The call is tied to the parameter passed to do_IRQ(struct pt_regs *regs), a pointer to a pt_regs structure, which basically stores the register values, for example:
struct pt_regs{
    unsigned long r0;
    unsigned long r1;
    ...
    ...
};
For more info you can follow this link: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Hardware_interrupts.html

Is there something wrong with my spin lock?

Here is my implementation of a spin lock, but it does not seem to protect the critical section. Is there something wrong with my implementation?
static __inline__ int xchg_asm(int* lock, int val)
{
int ret;
__asm__ __volatile__(
LOCK "movl (%1),%%eax;
xchg (%1),%2;
movl %%eax, %0" :"=m" (ret) :"d"(lock), "c"(val)
);
return ret;
}
void spin_init(spinlock_t* sl)
{
sl->val = 0;
}
void spin_lock(spinlock_t* sl)
{
int ret;
do {
ret = xchg_asm(&(sl->val), 1);
} while ( ret==0 );
}
void spin_unlock(spinlock_t* sl)
{
xchg_asm(&(sl->val), 0);
}
Your code is equivalent to:
static __inline__ int xchg_asm(int* lock, int val) {
    int save_old_value_at_eax;
    save_old_value_at_eax = *lock; /* with a wrong lock prefix */
    /* xchg *lock with val and discard the original value of *lock */
    return save_old_value_at_eax; /* but this is not the real original value of *lock */
}
As you can see from the code, save_old_value_at_eax is not the real original value at the moment the CPU performs the xchg. You should get the old value from the xchg instruction itself, not by saving it before performing the xchg. ("Not the real original value" means that if another CPU takes the lock after this CPU saves the value but before this CPU performs the xchg instruction, this CPU gets the wrong old value and thinks it took the lock successfully, so two CPUs enter the critical section at the same time.) You have split a read-modify-write instruction into three instructions, and the three instructions together are not atomic (even if you move the lock prefix to the xchg).
I guess you thought the lock prefix would lock all three instructions, but a lock prefix applies only to the single instruction it is attached to (and not every instruction accepts one).
Also, we don't need a lock prefix on SMP for xchg. Quoting linux_kernel_src/arch/x86/include/asm/cmpxchg.h:
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
* Since this is generally used to protect other memory information, we
* use "asm volatile" and "memory" clobbers to prevent gcc from moving
* information around.
*/
My suggestions:
DON'T REPEAT YOURSELF: please use the spin lock of the Linux kernel.
DON'T REPEAT YOURSELF: please use xchg() and cmpxchg() from the Linux kernel if you do want to implement a spin lock.
Learn more about the instructions. You can also look at how the Linux kernel implements it.
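For user-space experiments, the point about taking the old value from the exchange itself can be sketched with C11 atomics instead of hand-written assembly. This is an assumption-laden illustration, not the kernel's spin lock: it relies on <stdatomic.h>, and spin_trylock is a helper added here that was not in the question. Note that the lock loop keeps spinning while the old value is non-zero, that is, while somebody else holds the lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* A user-space sketch with C11 atomics; not the kernel's implementation. */
typedef struct {
    atomic_int val;   /* 0 = free, 1 = held */
} spinlock_t;

static inline void spin_init(spinlock_t *sl)
{
    atomic_init(&sl->val, 0);
}

/* One atomic exchange: store 1 and learn the previous value in one step. */
static inline int spin_trylock(spinlock_t *sl)
{
    return atomic_exchange_explicit(&sl->val, 1, memory_order_acquire) == 0;
}

static inline void spin_lock(spinlock_t *sl)
{
    while (!spin_trylock(sl))
        ;   /* old value was 1: someone else holds the lock, keep spinning */
}

static inline void spin_unlock(spinlock_t *sl)
{
    assert(atomic_load(&sl->val) == 1);   /* only a held lock may be released */
    atomic_store_explicit(&sl->val, 0, memory_order_release);
}
```

The acquire and release orderings keep the critical section's memory accesses from moving outside the lock/unlock pair, which is the job the "memory" clobber does in the kernel's inline assembly.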

Writing a syscall to count context switches of a process

I have to write a system call to count the voluntary and involuntary context switches of a process. I already know the steps to add a new system call to the Linux kernel, but I have no clue where I should start for the context-switch part. Any ideas?
If your syscall should only report statistics, you can use the context switch counting code that is already in the kernel.
The wait3 and getrusage syscalls already report context switch counts in these struct rusage fields:
struct rusage {
    ...
    long ru_nvcsw; /* voluntary context switches */
    long ru_nivcsw; /* involuntary context switches */
};
You can try it by running:
$ /usr/bin/time -v /bin/ls -R
....
Voluntary context switches: 1669
Involuntary context switches: 207
where "/bin/ls -R" is any program.
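The same two counters can also be read from inside a program. A minimal sketch, assuming a POSIX system with getrusage(2); the struct ctx_switches type and the function name are invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper type for this example. */
struct ctx_switches {
    long voluntary;     /* copied from ru_nvcsw */
    long involuntary;   /* copied from ru_nivcsw */
};

/* Fill *out with the calling process's counters; returns 0, or -1 on error. */
int read_own_ctx_switches(struct ctx_switches *out)
{
    struct rusage ru;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;
    out->voluntary = ru.ru_nvcsw;
    out->involuntary = ru.ru_nivcsw;
    return 0;
}
```

Passing RUSAGE_CHILDREN instead of RUSAGE_SELF reports the totals for terminated, waited-for children, which is roughly the figure that /usr/bin/time prints.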
By searching for "struct rusage" in the kernel sources, you can find accumulate_thread_rusage in kernel/sys.c, which updates the rusage struct. It reads the fields t->nvcsw and t->nivcsw from struct task_struct *t:
static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
{
    r->ru_nvcsw += t->nvcsw; // <<=== here
    r->ru_nivcsw += t->nivcsw;
    r->ru_minflt += t->min_flt;
    r->ru_majflt += t->maj_flt;
Then you should search for nvcsw and nivcsw in the kernel folder to find how the kernel updates them.
asmlinkage void __sched schedule(void):
4124 if (likely(prev != next)) { // <= if we are switching between different tasks
4125 sched_info_switch(prev, next);
4126 perf_event_task_sched_out(prev, next);
4127
4128 rq->nr_switches++;
4129 rq->curr = next;
4130 ++*switch_count; // <= increment nvcsw or nivcsw via pointer
4131
4132 context_switch(rq, prev, next); /* unlocks the rq */
Pointer switch_count is from line 4091 or line 4111 of the same file.
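The pointer trick behind ++*switch_count can be shown with a toy model. This is not the kernel's code: struct toy_task and toy_account_switch are hypothetical names, and the rule used here is an assumption stated for illustration, namely that a switch counts as voluntary when the outgoing task is in a non-running (about to sleep) state and is not being preempted, and as involuntary otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; the field names mirror task_struct but nothing else does. */
struct toy_task {
    long state;            /* 0 = runnable, non-zero = about to sleep */
    unsigned long nvcsw;   /* voluntary context switches */
    unsigned long nivcsw;  /* involuntary context switches */
};

/* Pick the counter through a pointer first, then increment it in one place. */
void toy_account_switch(struct toy_task *prev, bool preempted)
{
    unsigned long *switch_count;

    assert(prev != NULL);
    switch_count = &prev->nivcsw;          /* default: involuntary */
    if (prev->state != 0 && !preempted)
        switch_count = &prev->nvcsw;       /* task gave up the CPU itself */
    ++*switch_count;
}
```

Choosing the counter up front means the code that performs the switch does not need to know why the switch is happening.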
PS: Link from perreal is great: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html (search context_swtch)
This already exists: the virtual file /proc/NNNN/status (where NNNN is the decimal process ID of the process you want to know about) contains, among other things, counts of both voluntary and involuntary context switches. Unlike getrusage this allows you to learn the context switch counts for any process, not just children. See the proc(5) manpage for more details.
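Reading those counters from a program is a matter of parsing "key: value" lines. A small sketch, assuming a Linux /proc file system and the field names voluntary_ctxt_switches and nonvoluntary_ctxt_switches documented in proc(5); the function name read_status_counter is made up for this example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the numeric value of "key:" in a /proc/<pid>/status style file,
 * or -1 if the file cannot be opened or the key is not present.
 */
long read_status_counter(const char *path, const char *key)
{
    char line[256];
    long value = -1;
    size_t key_len;
    FILE *f;

    assert(path != NULL && key != NULL);
    key_len = strlen(key);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* The key must start the line and be followed directly by ':'. */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            if (sscanf(line + key_len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

For example, read_status_counter("/proc/self/status", "voluntary_ctxt_switches") reports the calling process's own count; substituting another PID in the path reads a different process, subject to the usual /proc permissions.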
A process makes a context switch when it blocks, when its time quantum expires, on interrupts, and so on. Eventually the schedule() function is called. Since you want to count switches for each process separately, you have to keep a new per-process variable for the number of context switches and update it in schedule() for the current process each time. Your system call can then read this value. Here is a snippet of the schedule function from Pintos:
static void
schedule (void)
{
  struct thread *cur = running_thread ();
  struct thread *next = next_thread_to_run ();
  struct thread *prev = NULL;
  ASSERT (intr_get_level () == INTR_OFF);
  ASSERT (cur->status != THREAD_RUNNING);
  ASSERT (is_thread (next));
  if (cur != next)
    prev = switch_threads (cur, next); /* <== here you can update count of "cur" */
  thread_schedule_tail (prev);
}
Total number of context switches
cat /proc/PID/sched|grep nr_switches
Voluntary context switches
cat /proc/PID/sched | grep nr_voluntary_switches
Involuntary context switches
cat /proc/PID/sched|grep nr_involuntary_switches
where PID is the process ID of the process you wish to monitor.
However, if you want to get these statistics by patching (creating a hook in) the Linux source, the code related to scheduling is in the
kernel/sched/
folder of the source tree.
In particular,
kernel/sched/core.c contains the schedule() function, which is the code of the Linux scheduler.
The code of CFS (the Completely Fair Scheduler), which is one of several schedulers present in Linux and the most commonly used, is in
kernel/sched/fair.c
schedule() is executed whenever the TIF_NEED_RESCHED flag is set, so find all the places where this flag is set (use cscope on the Linux source); that will give you insight into the types of context switches occurring for a process.

Resources