why schedule() does not lead to deadlock while using the default prepare_arch_switch() - linux

In Linux 2.6.11.12, before the schedule() function selects the "next" task to run, it locks the runqueue:
spin_lock_irq(&rq->lock);
and then, before calling context_switch() to perform the context switch, it calls prepare_arch_switch(), which is a no-op by default:
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while (0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
# define task_running(rq, p) ((rq)->curr == (p))
#endif
That is, it holds rq->lock until switch_to() returns, and then the macro finish_arch_switch() actually releases the lock.
Suppose there are tasks A, B, and C, and A calls schedule() and switches to B (so rq->lock is now locked). Sooner or later, B calls schedule(). At this point, how can B get rq->lock, since it is held by A?
There are also arch-dependent implementations, such as this one for IA-64:
/*
* On IA-64, we don't want to hold the runqueue's lock during the low-level context-switch,
* because that could cause a deadlock. Here is an example by Erich Focht:
*
* Example:
* CPU#0:
* schedule()
* -> spin_lock_irq(&rq->lock)
* -> context_switch()
* -> wrap_mmu_context()
* -> read_lock(&tasklist_lock)
*
* CPU#1:
* sys_wait4() or release_task() or forget_original_parent()
* -> write_lock(&tasklist_lock)
* -> do_notify_parent()
* -> wake_up_parent()
* -> try_to_wake_up()
* -> spin_lock_irq(&parent_rq->lock)
*
* If the parent's rq happens to be on CPU#0, we'll wait for the rq->lock
* of that CPU which will not be released, because there we wait for the
* tasklist_lock to become available.
*/
#define prepare_arch_switch(rq, next) \
do { \
    spin_lock(&(next)->switch_lock); \
    spin_unlock(&(rq)->lock); \
} while (0)
#define finish_arch_switch(rq, prev) spin_unlock_irq(&(prev)->switch_lock)
In this case, I'm quite sure this version does things right, since it unlocks rq->lock before calling context_switch().
But what about the default implementation? How can it do things right?

I found a comment in context_switch() of Linux 2.6.32.68 that tells the story behind the code:
/*
* Since the runqueue lock will be released by the next
* task (which is an invalid locking op but in the case
* of the scheduler it's an obvious special-case), so we
* do an early lockdep release here:
*/
So yes, we do switch to the next task with the lock held; the next task unlocks it, because it resumes inside its own schedule() call and releases the lock on the finish path there. And if the next task is newly created, the function ret_from_fork() will also eventually call finish_task_switch() to unlock rq->lock.
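To make the hand-off concrete, here is a small userspace analogy of my own (POSIX ucontext, not kernel code): the "lock" taken before the switch is released by the context we switch to, which is exactly what the finish path does with rq->lock. A plain flag stands in for the spinlock, because a pthread mutex may not be unlocked by a thread that doesn't own it.
```
#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_a, ctx_b;
static int rq_locked;                   /* stand-in for rq->lock */

static void task_b(void)
{
    /* B resumes here "inside its own schedule()": it releases the
     * lock that A acquired -- the finish_arch_switch() step. */
    rq_locked = 0;
    printf("B: released the lock A took (rq_locked=%d)\n", rq_locked);
    swapcontext(&ctx_b, &ctx_a);        /* switch back to A */
}

int main(void)
{
    static char stack_b[64 * 1024];

    getcontext(&ctx_b);
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = sizeof stack_b;
    ctx_b.uc_link = &ctx_a;
    makecontext(&ctx_b, task_b, 0);

    rq_locked = 1;                      /* spin_lock_irq(&rq->lock) in A */
    printf("A: holding the lock, switching to B\n");
    swapcontext(&ctx_a, &ctx_b);        /* switch_to(A, B) */
    printf("A: resumed, rq_locked=%d\n", rq_locked);
    return 0;
}
```
The point is that rq->lock is never held by a task that is waiting for it: whoever is switched to releases it almost immediately, so when B later calls schedule(), the lock has long since been freed.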

Related

How to resolve this mistake in Peterson's algorithm for process synchronization

I was reading A. S. Tanenbaum's book Modern Operating Systems, and there was a code snippet introducing Peterson's algorithm for process synchronization, which is implemented in software.
Here's the snippet.
```
#define FALSE 0
#define TRUE 1
#define N 2                      /* number of processes */

int turn;                        /* whose turn is it? */
int interested[N];               /* all values initially 0 (FALSE) */

void enter_region(int process)   /* process is 0 or 1 */
{
    int other;                   /* number of the other process */

    other = 1 - process;         /* the opposite of process */
    interested[process] = TRUE;  /* show that you are interested */
    turn = process;              /* set flag */
    while (turn == process && interested[other] == TRUE); /* null statement */
}

void leave_region(int process)   /* process: who is leaving */
{
    interested[process] = FALSE; /* indicate departure from critical region */
}
```
The question is: isn't there a mistake? [Edit] Mustn't it be turn = other, or maybe there is another mistake?
It looks to me like this version of the algorithm violates the rules of mutual exclusion.
[Edit]
I think this version violates the rules of mutual exclusion: if the first process sets its interested flag and then stops, and the other process runs, the second process will busy-wait after setting its interested and turn variables without any need, since no process is in the critical section.
Any answer and help is appreciated. Thanks!
If process 0 sets interested[0], and then process 1 runs enter_region up until the loop, then process 0 will be able to exit the loop because turn == process is no longer true for it. turn, in this case, really means "turn to wait", and protects against exactly the situation you described. In contrast, if the code did turn = other then process 0 would not be able to exit the loop until process 1 started waiting.
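To see the answer's reasoning hold up in practice, here is a quick userspace check of my own (not from the book): the same logic with the shared variables made C11 seq_cst atomics, since on real hardware Peterson's algorithm additionally needs sequentially consistent accesses, and plain int variables would also let the compiler optimize the busy-wait loop away.
```
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define N 2
static atomic_int turn;
static atomic_int interested[N];
static int counter;                      /* protected by Peterson's lock */

static void enter_region(int process)
{
    int other = 1 - process;
    atomic_store(&interested[process], 1);
    atomic_store(&turn, process);        /* "turn to wait", as in the answer */
    while (atomic_load(&turn) == process && atomic_load(&interested[other]))
        ;                                /* busy wait */
}

static void leave_region(int process)
{
    atomic_store(&interested[process], 0);
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < 100000; i++) {
        enter_region(id);
        counter++;                       /* critical region */
        leave_region(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %d (expect 200000)\n", counter);
    return 0;
}
```
Both threads increment the counter 100,000 times each; if mutual exclusion were violated, the final count would come up short.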

Does wake_up cause a race condition?

I was looking at the wake_up function in the Linux kernel code, at line 154 of kernel/sched/wait.c:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154
/**
* __wake_up - wake up threads blocked on a waitqueue.
* @wq_head: the waitqueue
* @mode: which threads
* @nr_exclusive: how many wake-one or wake-many threads to wake up
* @key: is directly passed to the wakeup function
*
* If this function wakes up a task, it executes a full memory barrier before
* accessing the task state.
*/
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
    __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
If it's waking up all the threads, couldn't this cause a race condition? Let's say all the threads are waiting for the same data structure or something; once wake_up is called, aren't all the threads racing for the same thing?
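The scenario in the question can be sketched in userspace with the classic wait-loop pattern (my own illustration, not kernel code): a broadcast wakes every waiter, but each one re-acquires the lock and re-checks the condition before touching the shared data, so the woken threads contend in an orderly way rather than corrupting anything.
```
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int items;                        /* the shared "data structure" */

static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (items == 0)                   /* re-check after every wakeup */
        pthread_cond_wait(&cond, &lock);
    items--;                             /* only a winner sees items > 0 */
    printf("thread %ld got an item\n", (long)arg);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    for (long i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);

    pthread_mutex_lock(&lock);
    items = 1;                           /* only one item to hand out */
    pthread_cond_broadcast(&cond);       /* "wake up all the threads" */
    pthread_mutex_unlock(&lock);

    /* Hand out two more items so every thread eventually finishes. */
    for (int i = 0; i < 2; i++) {
        pthread_mutex_lock(&lock);
        items++;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```
Waking more threads than there are items only costs some spurious wakeups; each loser re-checks items == 0 and goes back to sleep.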

where is the context switching finally happening in the linux kernel source?

In Linux, process scheduling occurs after interrupts (the timer interrupt and other interrupts) or when a process relinquishes the CPU (by explicitly calling schedule()). Today I was trying to see where context switching occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now... I was looking at the sparc arch then.)
I looked for it starting from main_timer_handler() (in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in arch/x86_64/kernel/entry.S:
ENTRY(common_interrupt)
    XCPT_FRAME
    interrupt do_IRQ
    /* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
    cli
    TRACE_IRQS_OFF
    decl %gs:pda_irqcount
    leaveq
    CFI_DEF_CFA_REGISTER rsp
    CFI_ADJUST_CFA_OFFSET -8
exit_intr:
    GET_THREAD_INFO(%rcx)
    testl $3,CS-ARGOFFSET(%rsp)
    je retint_kernel
    ...(omit)
    GET_THREAD_INFO(%rcx)
    jmp retint_check
#ifdef CONFIG_PREEMPT
    /* Returning to kernel space. Check if we need preemption */
    /* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
    cmpl $0,threadinfo_preempt_count(%rcx)
    jnz retint_restore_args
    bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
    jnc retint_restore_args
    bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
    jnc retint_restore_args
    call preempt_schedule_irq
    jmp exit_intr
#endif
    CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq! And preempt_schedule_irq is defined in kernel/sched.c as below (it calls schedule() in the middle):
/*
* this is the entry point to schedule() from kernel preemption
* off of irq context.
* Note, that this is called and return with irqs disabled. This will
* protect us against recursive calling from irq.
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
    struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
    struct task_struct *task = current;
    int saved_lock_depth;
#endif
    /* Catch callers which need to be fixed */
    BUG_ON(ti->preempt_count || !irqs_disabled());

need_resched:
    add_preempt_count(PREEMPT_ACTIVE);
    /*
     * We keep the big kernel semaphore locked, but we
     * clear ->lock_depth so that schedule() doesnt
     * auto-release the semaphore:
     */
#ifdef CONFIG_PREEMPT_BKL
    saved_lock_depth = task->lock_depth;
    task->lock_depth = -1;
#endif
    local_irq_enable();
    schedule();
    local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
    task->lock_depth = saved_lock_depth;
#endif
    sub_preempt_count(PREEMPT_ACTIVE);

    /* we could miss a preemption opportunity between schedule and now */
    barrier();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}
So I found where the scheduling occurs, but my question is: where in the source code does the actual context switching happen? For a context switch, the stack, the mm settings, and the registers should be switched, and the PC (program counter) should be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch() function, which calls switch_to() (kernel/sched.c):
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);
    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_enter_lazy_cpu_mode();

    if (unlikely(!mm)) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (unlikely(!prev->mm)) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev); // <---- this line

    barrier();
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
switch_to is assembly code under include/asm-x86_64/system.h.
My question is: is the processor switched to the new task inside switch_to()? Then, is the code 'barrier(); finish_task_switch(this_rq(), prev);' run at some later time? By the way, this was in interrupt context, so if switch_to() is the end of this ISR, who finishes the interrupt? Or, if finish_task_switch() runs, how is the CPU occupied by the new task?
I would really appreciate if someone could explain and clarify things to me.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state on the stack of "current", the currently running process. Calling do_sched_yield just changes the value of current, so the return simply restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all the prev task's state and sets up the next task's state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code; it is just setting up the hardware for userspace.
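As an illustration of "no magic jumps", here is a tiny userspace imitation of my own (x86-64 SysV ABI only; a sketch, not the kernel's actual switch_to): save the callee-saved registers and stack pointer of prev, load those of next, and ret. Execution simply continues wherever next last saved its state, and a brand-new task is just a hand-built stack whose topmost slot is the address ret will pop.
```
#include <stdio.h>

struct ctx { void *rsp; };

void ctx_switch(struct ctx *prev, struct ctx *next);
__asm__(
    ".text\n"
    ".globl ctx_switch\n"
    "ctx_switch:\n"
    "    pushq %rbp\n"            /* save prev's callee-saved registers */
    "    pushq %rbx\n"
    "    pushq %r12\n"
    "    pushq %r13\n"
    "    pushq %r14\n"
    "    pushq %r15\n"
    "    movq  %rsp, (%rdi)\n"    /* prev->rsp = current stack pointer */
    "    movq  (%rsi), %rsp\n"    /* switch to next's stack */
    "    popq  %r15\n"            /* restore next's callee-saved registers */
    "    popq  %r14\n"
    "    popq  %r13\n"
    "    popq  %r12\n"
    "    popq  %rbx\n"
    "    popq  %rbp\n"
    "    retq\n");                /* "returns" into the next task */

static struct ctx main_ctx, task_ctx;
static char task_stack[64 * 1024] __attribute__((aligned(16)));

static void task(void)
{
    printf("task: running on its own stack\n");
    ctx_switch(&task_ctx, &main_ctx);   /* switch back to main */
}

int main(void)
{
    /* Build an initial stack so the first switch "returns" into task():
     * six zeroed callee-saved slots, then the address retq will pop. */
    void **sp = (void **)(task_stack + sizeof task_stack - 8);
    *--sp = (void *)task;
    for (int i = 0; i < 6; i++)
        *--sp = 0;
    task_ctx.rsp = sp;

    printf("main: switching to task\n");
    ctx_switch(&main_ctx, &task_ctx);
    printf("main: back again\n");
    return 0;
}
```
This also shows why the barrier(); finish_task_switch(...) after switch_to() runs "later": that code executes only when some other task eventually switches back to prev, in prev's own context.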

Elixir: Scheduled jobs not running Mix task after the first call

I'm using Quantum to handle cron jobs. The setting is the following:
application.ex
def start
  ...
  children = [
    ...
    worker(MyApp.Scheduler, [])
  ]
  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end
config.exs
config :My_app, MyApp.Scheduler,
  jobs: [
    {"*/5 * * * *", fn -> Mix.Task.run "first_mix_task" end},
    {"*/5 * * * *", fn -> Mix.Task.run "second_mix_task" end},
    {"*/5 * * * *", fn -> Mix.Task.run "third_mix_task" end},
    {"*/5 * * * *", fn -> Mix.Task.run "fourth_mix_task" end}
  ]
The problem is that, for some reason, the Mix tasks run only the first time after the cron jobs are added. Later, although I can see in the logs that the crons are started and ended (according to Quantum), the Mix tasks are never triggered.
I'm not including the Mix tasks here because they work fine on the first run and also when called from the console. So I think the issue has to be in the settings I'm including here. But if you have a good reason to look there, just let me know.
Mix.Task.run/1 only executes a task the first time it's called, unless it is re-enabled.
Runs a task with the given args.
If the task was not yet invoked, it runs the task and returns the
result.
If there is an alias with the same name, the alias will be invoked instead of the original task.
If the task or alias were already invoked, it does not run them again
and simply aborts with :noop.
https://hexdocs.pm/mix/Mix.Task.html#run/2
You can use Mix.Task.rerun/1 instead of Mix.Task.run/1 to re-enable and invoke the task again:
...
{"*/5 * * * *", fn -> Mix.Task.rerun "first_mix_task" end},
...

mutex unlocking and request_module() behaviour

I've observed the following code pattern in the Linux kernel, for example in net/sched/act_api.c and in many other places as well:
rtnl_lock();
rtnetlink_rcv_msg(skb, ...);
replay:
    ret = process_msg(skb);
    ...
    /* try to obtain symbol which is in module. */
    /* if fail, try to load the module, otherwise use the symbol */
    a = get_symbol();
    if (a == NULL) {
        rtnl_unlock();
        request_module();
        rtnl_lock();
        /* now verify that we can obtain symbols from requested module and return EAGAIN. */
        a = get_symbol();
        module_put();
        return -EAGAIN;
    }
    ...
    if (ret == -EAGAIN)
        goto replay;
    ...
rtnl_unlock();
After request_module() has succeeded, the symbol we are interested in becomes available in kernel memory space, and we can use it. However, I don't understand why we return EAGAIN and re-read the symbol; why can't we just continue right after request_module()?
If you look at the current implementation in the Linux kernel, there is a comment right after the second call equivalent to get_symbol() in your code above (it is tc_lookup_action_n()) that explains exactly why:
rtnl_unlock();
request_module("act_%s", act_name);
rtnl_lock();

a_o = tc_lookup_action_n(act_name);

/* We dropped the RTNL semaphore in order to
 * perform the module load. So, even if we
 * succeeded in loading the module we have to
 * tell the caller to replay the request. We
 * indicate this using -EAGAIN.
 */
if (a_o != NULL) {
    err = -EAGAIN;
    goto err_mod;
}
Even though the module could be requested and loaded, the RTNL semaphore was dropped in order to load it (module loading is an operation that can sleep, which is not the "standard way" this function executes). Anything protected by that lock may have changed in the meantime, so the function returns EAGAIN to tell the caller to replay the request.
EDIT for clarification:
If we look at the call sequence when a new action is added (which could cause a required module to be loaded), we have: tc_ctl_action() -> tcf_action_add() -> tcf_action_init() -> tcf_action_init_1().
Now if we follow the EAGAIN error back up to tc_ctl_action(), in the case RTM_NEWACTION:, we see that with the EAGAIN return value the call to tcf_action_add() is repeated.
