How can sys_sigsuspend be atomic in Linux kernel 2.6.11? - linux

I'm reading Linux 2.6.11. The implementation of sys_sigsuspend is the following:
/*
 * Atomically swap in the new signal mask, and wait for a signal.
 */
asmlinkage int
sys_sigsuspend(int history0, int history1, old_sigset_t mask)
{
    struct pt_regs * regs = (struct pt_regs *) &history0;
    sigset_t saveset;

    mask &= _BLOCKABLE;
    spin_lock_irq(&current->sighand->siglock);
    saveset = current->blocked;
    siginitset(&current->blocked, mask);
    recalc_sigpending();
    spin_unlock_irq(&current->sighand->siglock);

    regs->eax = -EINTR;
    while (1) {
        current->state = TASK_INTERRUPTIBLE;
        schedule();
        if (do_signal(regs, &saveset))
            return -EINTR;
    }
}
In ULK3 the author says:
the sigsuspend() system call does not allow signals to be sent after unblocking and before the schedule() invocation, because other processes cannot grab the CPU during that time interval.
But between spin_unlock_irq and schedule the syscall can be interrupted and preempted, so another process has enough time to send an unblocked signal to this process.
In that case the signal would be lost, because the process only calls schedule after the signal has already been delivered.
That is why sigsuspend should be atomic, but according to its implementation it is NOT.

The sigsuspend implementation is correct, but the explanation in ULK seems misleading.
While a process executes kernel code, that execution is never interrupted by user signals. Instead, such signals accumulate in the current task structure. At the moment the process leaves kernel mode and returns to user mode, all accumulated (and not blocked) signals are fired.
The kernel's schedule() function checks whether any signals have accumulated. If they have, and current->state is TASK_INTERRUPTIBLE, schedule() returns. So signals collected before the schedule() call are not lost.
Atomicity of the sigsuspend() system call means that if a signal temporarily unblocked by the call is emitted, the call is guaranteed to see it and return. That atomicity is achieved simply by placing both the unblocking and the signal check inside the same kernel function.
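To see why that guarantee matters, compare it with the racy userspace sequence that sigsuspend() exists to replace. A minimal sketch (illustrative only, not from the kernel source):

#include <signal.h>
#include <unistd.h>

/*
 * Racy alternative to sigsuspend(): if the awaited signal arrives in the
 * window between sigprocmask() and pause(), its handler runs first and
 * pause() then sleeps forever -- the wakeup is lost.
 */
void racy_wait(void)
{
    sigset_t newmask, oldmask;

    sigemptyset(&newmask);
    sigprocmask(SIG_SETMASK, &newmask, &oldmask); /* unblock signals */
    /* <-- a signal delivered here is consumed before pause() starts */
    pause();                                      /* may block forever */
    sigprocmask(SIG_SETMASK, &oldmask, NULL);     /* restore old mask */
}

sigsuspend(&newmask) performs the unblocking and the wait as a single kernel-side step, which closes that window.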

Related

Does wake_up cause a race condition?

I was looking at the wake_up function in the Linux kernel code:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154
It's at line 154:
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
If it's waking up all the threads, couldn't this cause a race condition? Let's say all the threads are waiting for the same data structure or something, so once the wake_up is called, aren't all the threads racing for the same thing?
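For context, waking all waiters is safe as long as each woken thread re-acquires the lock that protects the shared structure and re-checks its condition; the losers of the race simply go back to waiting. A user-space analogy with pthreads (hypothetical sketch, not the kernel path):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

/*
 * Even if pthread_cond_broadcast() wakes every waiter at once, each
 * thread must re-acquire the mutex and re-check the predicate before
 * touching shared state, so only one consumer proceeds at a time.
 */
static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!data_ready)                  /* race losers loop back to sleep */
        pthread_cond_wait(&cond, &lock);
    data_ready = 0;                      /* consume under the lock */
    pthread_mutex_unlock(&lock);
    return NULL;
}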

Where does the context switch finally happen in the Linux kernel source?

In Linux, process scheduling occurs after interrupts (the timer interrupt and other interrupts) or when a process relinquishes the CPU (by explicitly calling schedule()). Today I was trying to see where the context switch occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now... I was looking at the sparc arch then.)
I looked for it starting from main_timer_handler (in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in ./arch/x86_64/kernel/entry.S:
ENTRY(common_interrupt)
    XCPT_FRAME
    interrupt do_IRQ
    /* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
    cli
    TRACE_IRQS_OFF
    decl %gs:pda_irqcount
    leaveq
    CFI_DEF_CFA_REGISTER rsp
    CFI_ADJUST_CFA_OFFSET -8
exit_intr:
    GET_THREAD_INFO(%rcx)
    testl $3,CS-ARGOFFSET(%rsp)
    je retint_kernel
    ...(omit)
    GET_THREAD_INFO(%rcx)
    jmp retint_check

#ifdef CONFIG_PREEMPT
    /* Returning to kernel space. Check if we need preemption */
    /* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
    cmpl $0,threadinfo_preempt_count(%rcx)
    jnz retint_restore_args
    bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
    jnc retint_restore_args
    bt $9,EFLAGS-ARGOFFSET(%rsp)    /* interrupts off? */
    jnc retint_restore_args
    call preempt_schedule_irq
    jmp exit_intr
#endif

    CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq! preempt_schedule_irq is defined in kernel/sched.c as below (it calls schedule() in the middle):
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage void __sched preempt_schedule_irq(void)
{
    struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
    struct task_struct *task = current;
    int saved_lock_depth;
#endif
    /* Catch callers which need to be fixed */
    BUG_ON(ti->preempt_count || !irqs_disabled());

need_resched:
    add_preempt_count(PREEMPT_ACTIVE);
    /*
     * We keep the big kernel semaphore locked, but we
     * clear ->lock_depth so that schedule() doesnt
     * auto-release the semaphore:
     */
#ifdef CONFIG_PREEMPT_BKL
    saved_lock_depth = task->lock_depth;
    task->lock_depth = -1;
#endif
    local_irq_enable();
    schedule();
    local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
    task->lock_depth = saved_lock_depth;
#endif
    sub_preempt_count(PREEMPT_ACTIVE);

    /* we could miss a preemption opportunity between schedule and now */
    barrier();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}
So I found where scheduling occurs, but my question is: where in the source code does the actual context switch happen? For a context switch, the stack, the mm settings, and the registers must be switched, and the PC (program counter) must be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch function, which calls switch_to() (kernel/sched.c):
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);
    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_enter_lazy_cpu_mode();

    if (unlikely(!mm)) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (unlikely(!prev->mm)) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev); // <---- this line

    barrier();
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
switch_to is assembly code in include/asm-x86_64/system.h.
My question is: does the processor switch to the new task inside switch_to()? If so, is the code 'barrier(); finish_task_switch(this_rq(), prev);' run at some other, later time? Also, this was in interrupt context, so if switch_to() effectively ends this ISR, who finishes the interrupt? Or, if finish_task_switch does run, how does the new task come to occupy the CPU?
I would really appreciate it if someone could explain and clarify these things.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state on the stack of "current", the currently running process. Calling do_sched_yield just changes the value of current, so the return restores the state of a different task.
Preemption is trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves the prev task's state and sets up the next task's state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code; it is just setting up the hardware for userspace.
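As a user-space analogy for what switch_to accomplishes, the ucontext(3) API does the same trick at a smaller scale: swapcontext() saves the current registers and stack pointer and resumes another saved context, and execution later continues exactly where the old context stopped. A minimal sketch (illustrative, not kernel code):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];

static void task_body(void)
{
    printf("running on the other stack\n");
    swapcontext(&task_ctx, &main_ctx);   /* "switch_to(main)" */
}

int main(void)
{
    /* build a context that runs task_body on its own stack */
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof(task_stack);
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&main_ctx, &task_ctx);   /* "switch_to(task)" */
    printf("back on the original stack\n");
    return 0;
}

In the kernel the analogous save/restore happens on kernel stacks, and finish_task_switch() then runs in the context of the task that was just switched in.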

Is Entrance into a Windows Critical Section an atomic operation?

I wrote an FFI for critical sections, and I wrote a test for it in Haxe.
Tests run in the order defined (public functions are tests)
This test test_critical_section will intermittently hang and fail:
1 var criticalSection:CriticalSection;
2
3 #if master
4 public function test_init_critical_section() {
5 return assert(attempt({
6 criticalSection = synch.SynchLib.critical_section_init(SPIN_COUNT);
7 trace('criticalSection: $criticalSection');
8 }));
9 }
10 var criticalValue = 0;
11 var done = 0;
12 var numThreads = 50;
13 function work_in_critical_section(ID:Int, a:AssertionBuffer) {
14 sys.thread.Thread.create(() -> {
15 inline function threadMsg(msg:String)
16 trace('Thread ID $ID: $msg');
17
18
19 threadMsg("Attempting to enter critical section");
20 criticalSection.critical_section_enter();
21 threadMsg("Entering crtiical section. Doing work.");
22 Sys.sleep(Std.random(100)/500); // simulate work in section
23 criticalValue+= 10;
24 done++;
25 a.assert(criticalValue == done * 10);
26 threadMsg("Leaving critical section. Work done. done: " + done);
27 criticalSection.critical_section_leave();
28 if (done == numThreads) {
29 a.assert(criticalValue == numThreads * 10);
30 a.done();
31
32 }
33 });
34 }
35 @:timeout(30000)
36 public function test_critical_section() {
37 var a = new AssertionBuffer();
38 for (i in 0...numThreads)
39 work_in_critical_section(i, a);
40 return a;
41 }
But when I add Sys.sleep(ID/5); just before entrance into the critical section (on the blank line 18), the test passes every single time (with any number of threads). Without it, the test fails randomly (more often with a higher number of threads).
My conclusion from this test is that entrance to a critical section is not atomic, and multiple threads simultaneously attempting to enter may leave the critical section in an undefined state (leading to undefined/hanging behavior).
Is this the right conclusion or am I simply mis-using critical sections (and thus, the test needs to be re-written)? And if it is the right conclusion.. does this not mean that entrance into the critical section needs its own atomic locking/synchronization mechanism..? (and further, if that is the case.. what is the point of critical sections, why would I not just use whatever that atomic synchronization mechanism is?)
To me, this seems problematic, for example, consider 10 threads meet at a synchronization barrier (with a capacity of 10), and then all 10 need to proceed through a critical section immediately after the 10th thread arrives, does that mean I'd have to synchronize/serialize access to the critical section entrance method (for instance, by sleeping such as to ensure only one thread attempts to enter the section at a given tick, as done to fix the failing test above)?
The FFI is written on top of synchapi.h (see EnterCriticalSection).
You read done outside the critical section. That is a race condition. If you want to look at the value of done, you need to do it before you leave the critical section.
You might see a write to done from another thread, triggering the assert before the write to criticalValue is visible to the thread that saw the write to done.
If the critical section protects criticalValue and done, then it is an error to access either of them without being in the critical section unless you are sure every thread that might access them has terminated. Your code violates this rule.
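In other words, entering the critical section is atomic; the bug is reading the shared counters after leaving it. A minimal sketch of the corrected pattern against the Win32 API directly (worker logic only, names illustrative):

#include <windows.h>

static CRITICAL_SECTION cs;   /* call InitializeCriticalSection(&cs) once first */
static int criticalValue = 0;
static int done = 0;
static const int numThreads = 50;

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    EnterCriticalSection(&cs);
    criticalValue += 10;
    done++;
    /* snapshot everything needed while still holding the lock */
    int finished = (done == numThreads);
    int value = criticalValue;
    LeaveCriticalSection(&cs);

    if (finished) {
        /* safe: value was read under the lock, after the final increment */
        /* e.g. assert(value == numThreads * 10); */
    }
    return 0;
}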

What is the meaning of the instruction {interrupt do_IRQ} in the Linux kernel?

What is the meaning of the instruction {interrupt do_IRQ} in the Linux kernel file arch/x86/kernel/entry_64.S? Is interrupt an instruction or a macro? Where is it defined? How is it used?
common_interrupt:
    XCPT_FRAME
    addq $-0x80,(%rsp)    /* Adjust vector to [-256,-1] range */
    interrupt do_IRQ
    /* 0(%rsp): old_rsp-ARGOFFSET */
It's declared a short distance above:
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
    /* reserve pt_regs for scratch regs and rbp */
    subq $ORIG_RAX-RBP, %rsp
    CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
    call save_args
    PARTIAL_FRAME 0
    call \func
.endm
I don't know what that does, though. :-)
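A rough reading of the macro body, though: interrupt is a GNU assembler macro, so interrupt do_IRQ expands in place, roughly to (CFI annotations omitted):

    subq $ORIG_RAX-RBP, %rsp   /* reserve pt_regs space for scratch regs and rbp */
    call save_args             /* spill the scratch registers into that frame */
    call do_IRQ                /* \func is substituted with do_IRQ */

i.e. it builds a partial pt_regs frame on the stack and then hands control to the C interrupt handler.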
Interrupts are basically used to suspend the processes currently running on the interrupted CPU core and to run the work related to the generated interrupt instead. That work is done by the handler routine (function) registered for the interrupt.
An interrupt may be generated by hardware or software, and there are basically two types: 1) soft interrupts and 2) hard interrupts.
So whenever a particular interrupt is generated, its handler routine is called, and that call is made with the parameter of do_IRQ(struct pt_regs *regs), which is of type struct pt_regs and basically stores the register values, along the lines of:
struct pt_regs {
    unsigned long r0;
    unsigned long r1;
    ...
    ...
};
For more info you can follow this link: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Hardware_interrupts.html

Writing a syscall to count context switches of a process

I have to write a system call to count the voluntary & involuntary context switches of a process. I already know the steps to add a new system call to the Linux kernel, but I have no clue where I should start for the context-switch counting. Any ideas?
If your syscall only needs to report statistics, you can use the context switch counting code that is already in the kernel.
The wait3 and getrusage syscalls already report the context switch counts in struct rusage fields:
struct rusage {
    ...
    long ru_nvcsw;  /* voluntary context switches */
    long ru_nivcsw; /* involuntary context switches */
};
You can try it by running:
$ /usr/bin/time -v /bin/ls -R
....
Voluntary context switches: 1669
Involuntary context switches: 207
where "/bin/ls -R" is any program.
By searching for "struct rusage" in the kernel sources, you can find accumulate_thread_rusage in kernel/sys.c, which updates the rusage struct. It reads the fields t->nvcsw and t->nivcsw from struct task_struct *t:
static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
{
    r->ru_nvcsw += t->nvcsw;    // <<=== here
    r->ru_nivcsw += t->nivcsw;
    r->ru_minflt += t->min_flt;
    r->ru_majflt += t->maj_flt;
Then you can search for nvcsw and nivcsw in the kernel folder to find out how the kernel updates them.
asmlinkage void __sched schedule(void):
4124     if (likely(prev != next)) {    // <= if we are switching between different tasks
4125         sched_info_switch(prev, next);
4126         perf_event_task_sched_out(prev, next);
4127
4128         rq->nr_switches++;
4129         rq->curr = next;
4130         ++*switch_count;    // <= increment nvcsw or nivcsw via pointer
4131
4132         context_switch(rq, prev, next); /* unlocks the rq */
The pointer switch_count is set at line 4091 or line 4111 of the same file.
PS: Link from perreal is great: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html (search context_swtch)
This already exists: the virtual file /proc/NNNN/status (where NNNN is the decimal process ID of the process you want to know about) contains, among other things, counts of both voluntary and involuntary context switches. Unlike getrusage, this allows you to learn the context switch counts of any process, not just your children. See the proc(5) manpage for more details.
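For example (exact counts will vary):
$ grep ctxt /proc/self/status
voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     0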
A process makes a context switch when it blocks, when its time quantum expires, on interrupts, and so on; eventually the schedule() function is called. Since you want to count switches for each process separately, you have to keep a new per-process variable for the count and update it in schedule() for the current process each time. Your system call can then read this value. Here is a snippet of the schedule function of Pintos:
static void
schedule (void)
{
    struct thread *cur = running_thread ();
    struct thread *next = next_thread_to_run ();
    struct thread *prev = NULL;

    ASSERT (intr_get_level () == INTR_OFF);
    ASSERT (cur->status != THREAD_RUNNING);
    ASSERT (is_thread (next));

    if (cur != next)
        prev = switch_threads (cur, next);   /* <== here you can update the count of "cur" */
    thread_schedule_tail (prev);
}
Total number of context switches:
cat /proc/PID/sched | grep nr_switches
Voluntary context switches:
cat /proc/PID/sched | grep nr_voluntary_switches
Involuntary context switches:
cat /proc/PID/sched | grep nr_involuntary_switches
where PID is the process ID of the process you wish to monitor.
However, if you want to get these statistics by patching (creating a hook in) the Linux source, the code related to scheduling is in the
kernel/sched/
folder of the source tree.
In particular, kernel/sched/core.c contains the schedule() function, which is the core of the Linux scheduler.
The code of CFS (the Completely Fair Scheduler), one of the several schedulers present in Linux and the most commonly used, is in
kernel/sched/fair.c
schedule() is executed whenever the TIF_NEED_RESCHED flag is set, so find all the places this flag is set (use cscope on the Linux source); that will give you insight into the types of context switches occurring for a process.
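If you do go the patching route, a hypothetical sketch of such a hook (the field name and the exact increment site are illustrative, not an existing kernel API):

/* include/linux/sched.h -- add a hypothetical per-process counter */
struct task_struct {
    /* ... existing fields ... */
    unsigned long my_ctxt_switches;   /* hypothetical new field */
};

/* kernel/sched/core.c, in the schedule() path, at the point where the
 * running task actually changes (cf. the snippet above): bump the
 * counter of the task being switched out.
 */
if (likely(prev != next)) {
    prev->my_ctxt_switches++;         /* hypothetical hook */
    rq->nr_switches++;
    rq->curr = next;
    context_switch(rq, prev, next);   /* unlocks the rq */
}

Your new syscall can then copy current->my_ctxt_switches (or the value for a looked-up PID) back to userspace.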
