I was looking at the __wake_up() function in the Linux kernel source, at kernel/sched/wait.c line 154:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154
/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
               int nr_exclusive, void *key)
{
        __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}
If it's waking up all the threads, couldn't this cause a race condition? Say all the threads are waiting on the same data structure: once wake_up is called, aren't they all racing for the same thing?
Related
In Linux, process scheduling occurs after interrupts (the timer interrupt and others) or when a process relinquishes the CPU by calling schedule() explicitly. Today I was trying to see where context switching occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now; I was looking at the sparc arch then.)
I looked for it in main_timer_handler (in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in arch/x86_64/kernel/entry.S:
ENTRY(common_interrupt)
        XCPT_FRAME
        interrupt do_IRQ
        /* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
        cli
        TRACE_IRQS_OFF
        decl %gs:pda_irqcount
        leaveq
        CFI_DEF_CFA_REGISTER rsp
        CFI_ADJUST_CFA_OFFSET -8
exit_intr:
        GET_THREAD_INFO(%rcx)
        testl $3,CS-ARGOFFSET(%rsp)
        je retint_kernel
        ...(omitted)
        GET_THREAD_INFO(%rcx)
        jmp retint_check

#ifdef CONFIG_PREEMPT
        /* Returning to kernel space. Check if we need preemption */
        /* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
        cmpl $0,threadinfo_preempt_count(%rcx)
        jnz retint_restore_args
        bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
        jnc retint_restore_args
        bt $9,EFLAGS-ARGOFFSET(%rsp)    /* interrupts off? */
        jnc retint_restore_args
        call preempt_schedule_irq
        jmp exit_intr
#endif

        CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq! And preempt_schedule_irq is defined in kernel/sched.c as below (it calls schedule() in the middle):
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage void __sched preempt_schedule_irq(void)
{
        struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
        struct task_struct *task = current;
        int saved_lock_depth;
#endif
        /* Catch callers which need to be fixed */
        BUG_ON(ti->preempt_count || !irqs_disabled());

need_resched:
        add_preempt_count(PREEMPT_ACTIVE);
        /*
         * We keep the big kernel semaphore locked, but we
         * clear ->lock_depth so that schedule() doesnt
         * auto-release the semaphore:
         */
#ifdef CONFIG_PREEMPT_BKL
        saved_lock_depth = task->lock_depth;
        task->lock_depth = -1;
#endif
        local_irq_enable();
        schedule();
        local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
        task->lock_depth = saved_lock_depth;
#endif
        sub_preempt_count(PREEMPT_ACTIVE);

        /* we could miss a preemption opportunity between schedule and now */
        barrier();
        if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
                goto need_resched;
}
So I found where the scheduling occurs, but my question is: where in the source code does the actual context switching happen? For a context switch, the stack, the mm settings, and the registers must be switched, and the PC (program counter) must be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch() function, which calls switch_to() (kernel/sched.c):
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
        struct mm_struct *mm, *oldmm;

        prepare_task_switch(rq, prev, next);
        mm = next->mm;
        oldmm = prev->active_mm;
        /*
         * For paravirt, this is coupled with an exit in switch_to to
         * combine the page table reload and the switch backend into
         * one hypercall.
         */
        arch_enter_lazy_cpu_mode();

        if (unlikely(!mm)) {
                next->active_mm = oldmm;
                atomic_inc(&oldmm->mm_count);
                enter_lazy_tlb(oldmm, next);
        } else
                switch_mm(oldmm, mm, next);

        if (unlikely(!prev->mm)) {
                prev->active_mm = NULL;
                rq->prev_mm = oldmm;
        }
        /*
         * Since the runqueue lock will be released by the next
         * task (which is an invalid locking op but in the case
         * of the scheduler it's an obvious special-case), so we
         * do an early lockdep release here:
         */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
        spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

        /* Here we just switch the register state and the stack. */
        switch_to(prev, next, prev); /* <---- this line */

        barrier();
        /*
         * this_rq must be evaluated again because prev may have moved
         * CPUs since it called schedule(), thus the 'rq' on its stack
         * frame will be invalid.
         */
        finish_task_switch(this_rq(), prev);
}
switch_to is assembly code in include/asm-x86_64/system.h.
My question is: is the processor switched to the new task inside switch_to()? If so, is the code 'barrier(); finish_task_switch(this_rq(), prev);' run at some other, later time? Also, this was in interrupt context, so if switch_to() is effectively the end of this ISR, who finishes the interrupt? Or, if finish_task_switch() does run, how is the CPU occupied by the new task?
I would really appreciate it if someone could explain and clarify this for me.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state onto the stack of "current", the currently running process. Calling do_sched_yield just changes the value of current, so the return simply restores the state of a different task.
Preemption is trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow; that's why non-RT kernels avoid preemption where they can. The arch-specific switch_to code is what saves the prev task's state and sets up the next task's state so that SYSRET will run the next task correctly. There are no magic jumps in the code; it is just setting up the hardware for userspace.
Consider this read request:
ssize_t foo_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
{
        foo_dev_t *foo_dev = filp->private_data;

        if (down_interruptible(&foo_dev->sem))
                return -ERESTARTSYS;
        foo_dev->intr = 0;
        outb(DEV_FOO_READ, DEV_FOO_CONTROL_PORT);
        wait_event_interruptible(foo_dev->wait, (foo_dev->intr == 1));
        if (put_user(foo_dev->data, buf))
                return -EFAULT;
        up(&foo_dev->sem);
        return 1;
}
And this interrupt handler that completes it:
irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        foo->data = inb(DEV_FOO_DATA_PORT);
        foo->intr = 1;
        wake_up_interruptible(&foo->wait);
        return IRQ_HANDLED;
}
Assuming foo_dev->sem is initially 1, only one thread at a time is allowed to execute the section after down_interruptible(&foo_dev->sem), so it makes sense for threads waiting on that semaphore to be queued. (As I understand it, making foo_dev->sem greater than one would be a problem in this code.)
So if only one thread ever passes, what's the use of the foo_dev->wait queue? Isn't it possible to suspend the current thread, save its pointer in a global *curr, and wake it up when its request completes?
Yes, it is possible to put a single thread to sleep (using set_current_state() and schedule()) and resume it later (using wake_up_process()).
But that requires writing code to check the wakeup condition and to handle the case where there is no thread to wake up.
Waitqueues provide ready-made functions and macros for waiting on a condition and waking the waiter up later, so the resulting code is much shorter: the single macro wait_event_interruptible() handles checking the event and putting the thread to sleep, and the single call wake_up_interruptible() handles resuming a possibly absent thread.
I have a double-barrier multi-threaded program working, but I don't know how to create a fair mechanism (using POSIX mutexes, condition variables, and barrier functions),
meaning: groups of threads should enter the first barrier in order of arrival at the barrier.
Pseudocode for what I have so far (summarized; the original code has more validation, but I hope it's clear enough):
pthread_mutex_lock(&_barrier->m_mutex);
++_barrier->m_predicate;

/* block all threads (except the last thread) -
   pending at the barrier rendezvous point */
if (_barrier->m_predicate != _barrier->m_barrierSize)
{
        pthread_cond_wait(&_barrier->m_cond, &_barrier->m_mutex);
}
else
{
        /* Unblock all threads (in scheduling-policy order)
           currently blocked on the cond variable in the barrier.
           Reset: predicate goes back to 0 --> a new batch of threads
           can enter the 1st barrier */
        pthread_cond_broadcast(&_barrier->m_cond);
        ResetBarrier(_barrier);
}
/* end of critical code block */
pthread_mutex_unlock(&_barrier->m_mutex);
As stated in https://lwn.net/Articles/308545/ hrtimer callbacks run in hard interrupt context with irqs disabled.
But what about SMP?
Can a second callback for another hrtimer run on another core while a first callback is already running, or do they exclude each other across all cores, so that no locking is needed between them?
edit:
When a handler for a "regular" hardware IRQ (let's call it X) is running on a core, all IRQs are disabled on only that core, but IRQ X is disabled on the whole system, so two handlers for X never run concurrently.
How do hrtimer interrupts behave in this regard?
Do they all share the same quasi IRQ, or is there one IRQ per hrtimer?
edit:
Did some experiments with two timers A and B:
// start timer A to fire as soon as possible...
A_ktime = ktime_set(0, 1); // 1 ns
hrtimer_start(&A, A_ktime, HRTIMER_MODE_REL);

// ...and timer B to fire 10 us later
B_ktime = ktime_set(0, 10000); // 10 us
hrtimer_start(&B, B_ktime, HRTIMER_MODE_REL);
Put some printks into the callbacks, and a huge delay into the one for timer A:
// fires after 1 ns
enum hrtimer_restart A(struct hrtimer *timer)
{
        int i;

        printk("timer A: %lu\n", jiffies);
        for (i = 0; i < 10000; i++)  // delay 10 seconds (1000 jiffies with HZ=100)
                udelay(1000);
        printk("end timer A: %lu\n", jiffies);
        return HRTIMER_NORESTART;
}

// fires after 10 us
enum hrtimer_restart B(struct hrtimer *timer)
{
        printk("timer B: %lu\n", jiffies);
        return HRTIMER_NORESTART;
}
The result was reproducibly something like:
[ 6.217393] timer A: 4294937914
[ 16.220352] end timer A: 4294938914
[ 16.224059] timer B: 4294938915
So timer B fired 1000 jiffies after the start of timer A's callback, even though it was set up to fire less than one jiffy after timer A.
Pushing this further and increasing the delay to 70 seconds, I got 7000 jiffies between the start of timer A's callback and timer B's callback.
[ 6.218258] timer A: 4294937914
[ 76.220058] end timer A: 4294944914
[ 76.224192] timer B: 4294944915
edit:
Locking is probably required, because hrtimers just get enqueued on any CPU. If two of them happen to be enqueued on the same CPU they may delay each other, but there is no guarantee of that.
from hrtimer.h:
* On SMP it is possible to have a "callback function running and enqueued"
* status. It happens for example when a posix timer expired and the callback
* queued a signal. Between dropping the lock which protects the posix timer
* and reacquiring the base lock of the hrtimer, another CPU can deliver the
* signal and rearm the timer.
In Linux 2.6.11.12, before the schedule() function selects the "next" task to run, it locks the runqueue:
spin_lock_irq(&rq->lock);
and then, before calling context_switch() to perform the context switch, it calls prepare_arch_switch(), which is a no-op by default:
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while (0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
# define task_running(rq, p) ((rq)->curr == (p))
#endif
That is, it holds rq->lock until switch_to() returns, and then the macro finish_arch_switch() actually releases the lock.
Suppose there are tasks A, B, and C, and A calls schedule() and switches to B (so rq->lock is now held). Sooner or later, B calls schedule(). At this point, how can B take rq->lock, since it is still held by A?
There are also arch-dependent implementations, such as:
/*
* On IA-64, we don't want to hold the runqueue's lock during the low-level context-switch,
* because that could cause a deadlock. Here is an example by Erich Focht:
*
* Example:
* CPU#0:
* schedule()
* -> spin_lock_irq(&rq->lock)
* -> context_switch()
* -> wrap_mmu_context()
* -> read_lock(&tasklist_lock)
*
* CPU#1:
* sys_wait4() or release_task() or forget_original_parent()
* -> write_lock(&tasklist_lock)
* -> do_notify_parent()
* -> wake_up_parent()
* -> try_to_wake_up()
* -> spin_lock_irq(&parent_rq->lock)
*
* If the parent's rq happens to be on CPU#0, we'll wait for the rq->lock
* of that CPU which will not be released, because there we wait for the
* tasklist_lock to become available.
*/
#define prepare_arch_switch(rq, next) \
do { \
spin_lock(&(next)->switch_lock); \
spin_unlock(&(rq)->lock); \
} while (0)
#define finish_arch_switch(rq, prev) spin_unlock_irq(&(prev)->switch_lock)
In this case, I'm fairly sure this version does the right thing, since it unlocks rq->lock before calling context_switch().
But what about the default implementation? How can it do the right thing?
I found a comment in context_switch() of Linux 2.6.32.68 that tells the story:
/*
* Since the runqueue lock will be released by the next
* task (which is an invalid locking op but in the case
* of the scheduler it's an obvious special-case), so we
* do an early lockdep release here:
*/
So we do switch to the next task with the lock held; the next task then unlocks it. And if the next task is newly created, ret_from_fork() will also eventually call finish_task_switch() to release rq->lock.