How does SIGSTOP work in Linux kernel? - linux

I am wondering how SIGSTOP works inside the Linux kernel. How is it handled? And how does the kernel stop the target process from running when the signal is handled?
I am familiar with the kernel code base, so referencing kernel functions is fine; in fact, that is what I want. I am not looking for a high-level description from a user's perspective.
I have already instrumented get_signal_to_deliver() with printk() statements (it is compiling right now), but I would like someone to explain things in more detail.

It's been a while since I touched the kernel, but I'll try to give as much detail as possible. I had to look up some of this stuff in various other places, so some details might be a little messy, but I think this gives a good idea of what happens under the hood.
When a signal is raised, the kernel sets the TIF_SIGPENDING flag in the target task's thread_info. Before returning to user mode, the kernel tests this flag with test_thread_flag(TIF_SIGPENDING), which will return true because a signal is pending.
The exact details of where this happens are architecture-dependent, but you can see an example for um:
void interrupt_end(void)
{
    struct pt_regs *regs = &current->thread.regs;

    if (need_resched())
        schedule();
    if (test_thread_flag(TIF_SIGPENDING))
        do_signal(regs);
    if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
        tracehook_notify_resume(regs);
}
Anyway, it ends up calling arch_do_signal(), which is also architecture dependent and is defined in the corresponding signal.c file (see the example for x86):
void arch_do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;

    if (get_signal(&ksig)) {
        /* Whee! Actually deliver the signal. */
        handle_signal(&ksig, regs);
        return;
    }

    /* Did we come from a system call? */
    if (syscall_get_nr(current, regs) >= 0) {
        /* Restart the system call - no handlers present */
        switch (syscall_get_error(current, regs)) {
        case -ERESTARTNOHAND:
        case -ERESTARTSYS:
        case -ERESTARTNOINTR:
            regs->ax = regs->orig_ax;
            regs->ip -= 2;
            break;

        case -ERESTART_RESTARTBLOCK:
            regs->ax = get_nr_restart_syscall(regs);
            regs->ip -= 2;
            break;
        }
    }

    /*
     * If there's no signal to deliver, we just put the saved sigmask
     * back.
     */
    restore_saved_sigmask();
}
As you can see, arch_do_signal() calls get_signal(), which lives in the generic kernel/signal.c.
The bulk of the work happens inside get_signal(); it's a huge function, but eventually it processes the special case of SIGSTOP here:
if (sig_kernel_stop(signr)) {
    /*
     * The default action is to stop all threads in
     * the thread group. The job control signals
     * do nothing in an orphaned pgrp, but SIGSTOP
     * always works. Note that siglock needs to be
     * dropped during the call to is_orphaned_pgrp()
     * because of lock ordering with tasklist_lock.
     * This allows an intervening SIGCONT to be posted.
     * We need to check for that and bail out if necessary.
     */
    if (signr != SIGSTOP) {
        spin_unlock_irq(&sighand->siglock);

        /* signals can be posted during this window */

        if (is_current_pgrp_orphaned())
            goto relock;

        spin_lock_irq(&sighand->siglock);
    }

    if (likely(do_signal_stop(ksig->info.si_signo))) {
        /* It released the siglock. */
        goto relock;
    }

    /*
     * We didn't actually stop, due to a race
     * with SIGCONT or something like that.
     */
    continue;
}
See the full function in kernel/signal.c.
do_signal_stop() does the processing needed to handle SIGSTOP; you can also find it in signal.c. It sets the task state to TASK_STOPPED with set_special_state(TASK_STOPPED), a macro defined in include/linux/sched.h that updates the current process descriptor's state (see the relevant line in signal.c). Further down, it calls freezable_schedule(), which in turn calls schedule(). schedule() calls __schedule() (both live in kernel/sched/core.c) in a loop until an eligible task is found. __schedule() attempts to find the next task to schedule (next in the code), while the current task is prev. The state of prev is checked, and because it was changed to TASK_STOPPED, deactivate_task() is called, which takes the task off the run queue:
} else {
    ...
    deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
    ...
}
deactivate_task() (also in kernel/sched/core.c) removes the process from the runqueue by setting the on_rq field of the task_struct to 0 and calling dequeue_task(), which takes the task off the runqueue's data structures. The stopped task then simply stays off the runqueue until something wakes it up.
Then schedule() checks the number of runnable processes and selects the next task to run on the CPU according to the scheduling policy in effect (which is a bit out of scope at this point).
At the end of the day, SIGSTOP takes a process off the run queue and leaves it stopped until that process receives SIGCONT.
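
To see the effect from user space, here is a minimal sketch (my own illustration, not taken from the question or the kernel sources) that stops and continues a child and observes the transitions with waitpid():

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        for (;;)
            pause();                /* child just idles */
    }

    int status;
    kill(pid, SIGSTOP);             /* kernel dequeues the child from the runqueue */
    waitpid(pid, &status, WUNTRACED);
    if (WIFSTOPPED(status))
        printf("child stopped by signal %d\n", WSTOPSIG(status));

    kill(pid, SIGCONT);             /* kernel makes the child runnable again */
    waitpid(pid, &status, WCONTINUED);
    if (WIFCONTINUED(status))
        printf("child continued\n");

    kill(pid, SIGKILL);             /* clean up */
    waitpid(pid, &status, 0);
    return 0;
}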

Nearly every time there is an interrupt, the kernel suspends some process from running and switches to running the interrupt handler (the only exception being when there is no process running). Likewise, the kernel will suspend processes that run too long without giving up the CPU (and technically that's the same thing: it just originates from the timer interrupt or possibly an IPI). Ordinarily in these cases, the kernel then puts the suspended process back on the run queue and when the scheduling algorithm decides the time is right, it is resumed.
In the case of SIGSTOP, the same basic thing happens: the affected processes are suspended due to the reception of the stop signal. They just don't get put back on the run queue until SIGCONT is sent. Nothing extraordinary here: SIGSTOP is just instructing the kernel to make a process non-runnable until further notice.
[One note: you seemed to imply that the kernel stops running with SIGSTOP. That is of course not the case. Only the SIGSTOPped processes stop running.]


What happens when ptrace SINGLESTEP is called in aarch64, Linux kernel?
Linux reference for this question: 5.15.5 (latest stable in November 2021).
Nomenclature: tracer (the process which traces) and tracee (the traced process).
With static analysis plus ftracing of the Linux kernel, I tried to reconstruct what precisely happens upon a ptrace SINGLESTEP call. I tried to implement what I understood in a learning OS, but failed to obtain the same behaviour. Before asking my question, let me summarize the process I tried to reconstruct:
Single stepping is enabled in debug-monitors.c:
/* ptrace API */
void user_enable_single_step(struct task_struct *task)
{
    struct thread_info *ti = task_thread_info(task);

    if (!test_and_set_ti_thread_flag(ti, TIF_SINGLESTEP))
        set_regs_spsr_ss(task_pt_regs(task));
}
On arm64, this consists of setting the single-step bit SPSR_EL1.SS (bit 21):
/*
 * Single step API and exception handling.
 */
static void set_user_regs_spsr_ss(struct user_pt_regs *regs)
{
    regs->pstate |= DBG_SPSR_SS;
}
This, I imagine, should raise a "debug exception" (user level, EL0), caught and handled in entry-common.c:
static void noinstr el0_dbg(struct pt_regs *regs, unsigned long esr)
{
    /* Only watchpoints write FAR_EL1, otherwise its UNKNOWN */
    unsigned long far = read_sysreg(far_el1);

    enter_from_user_mode(regs);
    do_debug_exception(far, esr, regs);
    local_daif_restore(DAIF_PROCCTX);
    exit_to_user_mode(regs);
}
do_debug_exception() is defined in fault.c and, since this is a software step, it should call the early_brk64 function from the debug_fault_info data structure:
/*
 * __refdata because early_brk64 is __init, but the reference to it is
 * clobbered at arch_initcall time.
 * See traps.c and debug-monitors.c:debug_traps_init().
 */
static struct fault_info __refdata debug_fault_info[] = {
    { do_bad,      SIGTRAP, TRAP_HWBKPT, "hardware breakpoint" },
    { do_bad,      SIGTRAP, TRAP_HWBKPT, "hardware single-step" },
    { do_bad,      SIGTRAP, TRAP_HWBKPT, "hardware watchpoint" },
    { do_bad,      SIGKILL, SI_KERNEL,   "unknown 3" },
    { do_bad,      SIGTRAP, TRAP_BRKPT,  "aarch32 BKPT" },
    { do_bad,      SIGKILL, SI_KERNEL,   "aarch32 vector catch" },
    { early_brk64, SIGTRAP, TRAP_BRKPT,  "aarch64 BRK" },
    { do_bad,      SIGKILL, SI_KERNEL,   "unknown 7" },
};
The latter is defined in traps.c; it calls the bug_handler function, which in turn calls arm64_skip_faulting_instruction(). That function advances the PC (of the tracee) by 4 bytes (AArch64 instructions are 32 bits wide):
void arm64_skip_faulting_instruction(struct pt_regs *regs, unsigned long size)
{
    regs->pc += size;

    /*
     * If we were single stepping, we want to get the step exception after
     * we return from the trap.
     */
    if (user_mode(regs))
        user_fastforward_single_step(current);

    if (compat_user_mode(regs))
        advance_itstate(regs);
    else
        regs->pstate &= ~PSR_BTYPE_MASK;
}
Finally, user_fastforward_single_step() is called, which in turn calls clear_user_regs_spsr_ss(), which just resets the SS bit previously set:
static void clear_user_regs_spsr_ss(struct user_pt_regs *regs)
{
    regs->pstate &= ~DBG_SPSR_SS;
}
If this chain of calls is correct, I can't understand where and how the context switch (from tracer to tracee) happens. Indeed, from these calls, it seems that points 2-6 concern the tracer. I can't see a context switch in this chain, but there should be one.
I have tried to replicate all these steps in a learning OS. To be clear, I failed to generate an exception after setting the SPSR.SS bit (point 2), but I forced a hardware debug exception by setting the SS, MDE and KDE bits in the MDSCR_EL1 register.
Once I did this trick, my steps were verified, but the exception was taken by the tracer (and not by the tracee, as it should be): the PC of the tracer was updated, and so on.
I think the steps I retrieved through static analysis of the code and ftrace are not completely correct. Can you help me identify where they go wrong?
You want to look inside the signal delivery code, eventually in kernel/signal.c. This is initiated in do_debug_exception, with its call to arm64_notify_die. You can trace this down to force_sig_fault and then we are in generic architecture-independent signal code.
The breakpoint exception causes SIGTRAP to be delivered to the traced process, and since it's being traced, every signal causes the tracee to stop, much as if it had been sent SIGSTOP. At this point the tracer is notified, in the same way a process is notified when a child exits or stops: if it's blocked on waitpid then waitpid will now return; if it's not then the next call to waitpid will return immediately. The tracer can make further ptrace calls to inspect or alter the tracee's state, and call ptrace(PTRACE_CONT) (analogous to SIGCONT) when it is ready for the tracee to take another step. It will set the appropriate flag so that the tracee ignores the SIGTRAP, which otherwise would typically terminate it.
So it's much the same flow as if a parent is in waitpid(pid, &status, WUNTRACED) and its child receives a SIGSTOP or SIGTSTP (e.g. Ctrl-Z hit in the terminal); the relevant kernel code isn't specific to debugging. In particular, there is not an explicit context switch from tracee directly to tracer. Rather, the tracee becomes stopped and the tracer becomes ready to run. The CPU then enters the scheduler. The tracer may be the next process picked to run, either by luck or if all other processes are sleeping; otherwise it has to wait for a timeslice like everyone else. The tracer may even be running on a different core, which is also fine.
Your analysis goes off course in step 4. The breakpoint handler is initialized to early_brk64 at boot, but as the comment on early_brk64 suggests, during boot, debug_traps_init hooks single_step_handler in its place. Then we go through send_user_sigtrap to arm64_force_sig_fault to force_sig_fault which is in generic architecture-independent kernel code.
Note that all of this is in the context of the tracee.
In particular, bug_handler() is not called during this process. That function is meant to handle a kernel bug, possibly by OOPS-killing the process, or panicking the kernel. early_brk64 calls it unconditionally, I think because it is only installed as a handler in early boot, before any userspace processes exist, and where the kernel should not under any circumstances be taking debug exceptions.
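For reference, here is a minimal sketch of the tracer's side of this dance (my own illustrative example, not kernel code): each PTRACE_SINGLESTEP makes the tracee runnable for one instruction, and the following waitpid() blocks until the SIGTRAP stop described above.

#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* tracee: allow tracing */
        execl("/bin/true", "true", (char *)NULL); /* stops with SIGTRAP at exec */
        _exit(1);
    }

    int status, steps = 0;
    waitpid(pid, &status, 0);                   /* wait for the exec stop */
    while (WIFSTOPPED(status)) {
        /* resume the tracee for one instruction; on arm64 this goes
         * through user_enable_single_step() and sets SPSR_EL1.SS */
        if (ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL) < 0)
            break;
        /* the tracee executes one instruction, takes the step exception,
         * receives SIGTRAP and stops; we are woken up via waitpid() */
        waitpid(pid, &status, 0);
        steps++;
    }
    printf("tracee ran for %d single steps\n", steps);
    return 0;
}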

Linux: Real-Time Signals

I am testing different scenarios with real-time signals and am unable to find the meaning of each signal, such as SIGRTMIN+1 and SIGRTMIN+13.
I am able to send and receive the signals, but I am trying to understand the meaning of all the SIGRTMIN+n signals. For example, I send a number of signals from my program, and one of them kills a process:
/* The child process executes this function. */
void
child_function (void)
{
    /* Perform initialization. */
    printf ("I'm here!!! My pid is %d.\n", (int) getpid ());
    /* Let parent know you're done. */
    kill (getppid (), SIGUSR1);
    /* Continue with execution. */
    puts ("Bye, now....");
    exit (0);
}
I want to understand signals like SIGRTMIN+13: how does it work if I use it to kill a process forcefully?
Real-time signals are designed to be used for application-defined purposes (it's up to you to give them a meaning). They have the advantage of being queued and of carrying accompanying data, which is not possible with standard signals. To use them effectively, send them with sigqueue(), not kill(), and handle them with sigaction(), not signal().
I can really recommend reading the signal(7) manual page. SIGRTMIN defines the lowest number you can choose for a user-defined real-time signal. By default, an unhandled real-time signal terminates the receiving process, so if you have not installed a handler for it, sending it will kill the target.
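
To illustrate the sigqueue()/sigaction() pairing, here is a small self-contained sketch (my own example, not from the question): a child queues SIGRTMIN+1 to its parent together with an integer payload, and the parent's SA_SIGINFO handler reads it back from si_value:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

static volatile sig_atomic_t payload;

static void handler(int sig, siginfo_t *info, void *ucontext)
{
    payload = info->si_value.sival_int;    /* the queued data */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;              /* deliver siginfo_t with the payload */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN + 1, &sa, NULL);

    sigset_t block, old;                   /* block the signal to avoid a race */
    sigemptyset(&block);
    sigaddset(&block, SIGRTMIN + 1);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid_t parent = getpid();
    if (fork() == 0) {
        union sigval v = { .sival_int = 42 };
        sigqueue(parent, SIGRTMIN + 1, v); /* queued, unlike kill() */
        _exit(0);
    }

    sigsuspend(&old);                      /* atomically unblock and wait */
    printf("got SIGRTMIN+1 with value %d\n", (int)payload);
    wait(NULL);
    return 0;
}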

Interrupting open() with SIGALRM

We have a legacy embedded system which uses SDL to read images and fonts from an NFS share.
If there's a network problem, TTF_OpenFont() and IMG_Load() hang essentially forever. A test application reveals that open() behaves in the same way.
It occurred to us that a quick fix would be to call alarm() before the calls which open files on the NFS share. The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach. We set up a signal handler with sigaction::sa_flags set to zero to ensure that SA_RESTART was not set.
The signal handler was called, but open() was not interrupted. (We observed the same behaviour with SIGINT and SIGTERM.)
I suppose the system treats open() as a "fast" operation even on "slow" infrastructure such as NFS.
Is there any way to change this behaviour and allow open() to be interrupted by a signal?
The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach.
open(2) is a slow syscall (slow syscalls are those that can sleep forever and can be awakened when, and if, a signal is caught in the meantime) only for some file types. In general, opens that block the caller until some condition occurs are usually interruptible. Known examples include opening a FIFO (named pipe), or (back in the old days) opening a physical terminal device (it sleeps until the modem is dialed).
NFS-mounted filesystems probably don't cause open(2) to sleep in an interruptible state. After all, you are most likely opening a regular file, and in that case open(2) will not be interruptible.
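A quick way to convince yourself of the difference (my own test sketch): opening a FIFO for reading blocks until a writer shows up, and since the handler is installed without SA_RESTART, the blocked open(2) fails with EINTR when SIGALRM fires:

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

static void on_alarm(int signo) { /* empty: just interrupt the syscall */ }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;   /* sa_flags is 0, so no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    mkfifo("/tmp/demo_fifo", 0600);
    alarm(2);
    int fd = open("/tmp/demo_fifo", O_RDONLY); /* sleeps: no writer yet */
    if (fd < 0 && errno == EINTR)
        printf("open(2) was interrupted by SIGALRM\n");
    unlink("/tmp/demo_fifo");
    return 0;
}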
Is there any way to change this behaviour and allow open() to be interrupted by a signal?
I don't think so, not without doing some (non-trivial) changes to the kernel.
I would explore the possibility of using setjmp(3) / longjmp(3) (see the manpage if you're not familiar; it's basically non-local gotos). You can initialize the environment buffer before calling open(2), and issue a longjmp(3) in the signal handler. Here's an example:
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <unistd.h>
#include <signal.h>

static jmp_buf jmp_env;

void sighandler(int signo) {
    longjmp(jmp_env, 1);
}

int main(void) {
    struct sigaction sigact;

    sigact.sa_handler = sighandler;
    sigact.sa_flags = 0;
    sigemptyset(&sigact.sa_mask);

    if (sigaction(SIGALRM, &sigact, NULL) < 0) {
        perror("sigaction(2) error");
        exit(EXIT_FAILURE);
    }

    if (setjmp(jmp_env) == 0) {
        /* First time through.
         * This is where we would open the file.
         */
        alarm(5);

        /* Simulate a blocked open() */
        while (1)
            ; /* Intentionally left blank */

        /* If open(2) is successful here, don't forget to unset
         * the alarm.
         */
        alarm(0);
    } else {
        /* SIGALRM caught, open(2) canceled */
        printf("open(2) timed out\n");
    }

    return 0;
}
It works by saving the context environment with the help of setjmp(3) before calling open(2). setjmp(3) returns 0 the first time through, and returns whatever value was passed to longjmp(3) otherwise.
Please be aware that this solution is not perfect. Here are some points to keep in mind:
There is a window of time between the call to alarm(2) and the call to open(2) (simulated here with while (1) { ... }) where the process may be preempted for a long time, so there is a chance the alarm expires before we actually attempt to open the file. Sure, with a large timeout such as 2 or 3 seconds this will most likely not happen, but it's still a race condition.
Similarly, there is a window of time between successfully opening the file and canceling the alarm where, again, the process may be preempted for a long time and the alarm may expire before we get the chance to cancel it. This is slightly worse because we have already opened the file so we will "leak" the file descriptor. Again, in practice, with a large timeout this will likely never happen, but it's a race condition nevertheless.
If the code catches other signals, there may be another signal handler in the midst of execution when SIGALRM is caught. Using longjmp(3) inside the signal handler will destroy the execution context of these other signal handlers, and depending on what they were doing, very nasty things may happen (inconsistent state if the signal handlers were manipulating other data structures in the program, etc.). It's as if it started executing, and suddenly crashed somewhere in the middle. You can fix it by: a) carefully setting up all signal handlers such that SIGALRM is blocked before they are invoked (this ensures that the SIGALRM handler does not begin execution until other handlers are done) and b) blocking these other signals before catching SIGALRM. Both actions can be accomplished by setting the sa_mask field of struct sigaction with the necessary mask (the operating system atomically sets the process's signal mask to that value before beginning execution of the handler and unsets it before returning from the handler). OTOH, if the rest of the code doesn't catch signals, then this is not a problem.
sleep(3) may be implemented with alarm(2), and alarm(2) and setitimer(2) share the same timer; if other portions in the code make use of any of these functions, they will interfere and the result will be a huge mess.
Just make sure you weigh these disadvantages before blindly using this approach. The use of setjmp(3) / longjmp(3) is usually discouraged and makes programs considerably harder to read, understand, and maintain. It's not elegant, but right now I don't think you have a choice, unless you're willing to do some core refactoring in the project.
If you do end up using setjmp(3), then at the very least document these limitations.
Maybe there is a strategy of using a separate thread to do the open(), so the main thread is not held up longer than desired.
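Sketching that idea (untested; open_with_timeout is a made-up helper name): do the open(2) in a worker thread and stop waiting for it after a deadline with pthread_timedjoin_np(), a GNU extension. Note that on timeout the worker thread (and the descriptor, if open() ever completes) is leaked, so this is only a starting point:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <time.h>

struct open_req { const char *path; int fd; };

static void *opener(void *arg)
{
    struct open_req *req = arg;
    req->fd = open(req->path, O_RDONLY); /* may block on a dead NFS server */
    return NULL;
}

/* returns an fd, or -1 if open() did not finish within 'seconds' */
int open_with_timeout(const char *path, int seconds)
{
    struct open_req req = { path, -1 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, opener, &req) != 0)
        return -1;

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;

    if (pthread_timedjoin_np(tid, NULL, &deadline) != 0) {
        pthread_detach(tid);  /* still blocked: abandon the thread */
        return -1;
    }
    return req.fd;
}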

Linux kernel - wait queues

I'm reading "Linux kernel development 3rd edition by Robert Love" to get a general idea about how the Linux kernel works..(2.6.2.3)
I'm confused about how wait queues work for example this code:
/* 'q' is the wait queue we wish to sleep on */
DEFINE_WAIT(wait);

add_wait_queue(q, &wait);
while (!condition) { /* condition is the event that we are waiting for */
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current))
        /* handle signal */
    schedule();
}
finish_wait(&q, &wait);
I want to know which process runs this code. Is it a kernel thread? Whose process time is this?
Also, in the loop, while the condition is still not met, we keep sleeping and call schedule() to run another process. The question is: when do we return to this loop?
The book says that when a process sleeps, it's removed from the run queue; otherwise it would be woken up and have to enter a busy loop...
Also says: "sleeping should always be handled in a loop that ensures that the condition for which the task is waiting has indeed occurred."
I just want to know in what context this loop runs.
Sorry if this is a stupid question. I'm just having trouble seeing the big picture.
Which process is running the code? The process that called it. I don't mean to make fun of the question, but the gist is that kernel code can run in different contexts: either because a system call led to this place, because it is in an interrupt handler, or because it is a callback function called from another context (such as workqueues or timer functions).
Since this example is sleeping, it must be in a context where sleeping is allowed, meaning it is executed in response to a system call or at least in a kernel thread. So the answer is the process time is taken from the process (or kernel thread) that called into this kernel code that needs to sleep. That is the only place where sleeping is allowed in the first place.
A certain special case are workqueues, these are explicitly for functions that need to sleep. Typical use would be to queue a function that needs to sleep from a context where sleeping is forbidden. In that case, the process context is that of one of the kernel worker threads designated to process workqueue items.
You will return to this loop when the wait queue is woken up, which sets either one task waiting on the queue or all of them to runnable, depending on which wake_up function is called.
The most important thing is: forget about this unless you are interested in the implementation details. Since many people got this wrong, and it's basically the same procedure everywhere it's needed, there have long been macros encapsulating the whole thing. Look up wait_event(); this is what your example should really look like:
wait_event(q, condition);
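
For illustration, here is a minimal sketch of how the wait_event() family is typically used in a driver (my_queue, data_ready, my_read and my_notify are made-up names for this example):

#include <linux/wait.h>
#include <linux/errno.h>

/* producer/consumer sketch using the wait_event() family */
static DECLARE_WAIT_QUEUE_HEAD(my_queue);
static int data_ready;

/* runs in process context, e.g. a driver's read() method */
static int my_read(void)
{
    /* sleeps in TASK_INTERRUPTIBLE until data_ready is nonzero;
     * returns nonzero if a signal interrupted the sleep */
    if (wait_event_interruptible(my_queue, data_ready))
        return -ERESTARTSYS;
    data_ready = 0;
    return 0;
}

/* producer side, may even run from an interrupt handler */
static void my_notify(void)
{
    data_ready = 1;
    wake_up_interruptible(&my_queue); /* wakes the sleeping reader */
}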
As per your example, I added comments.
Note: creating the wait queue entry does not by itself put the task to sleep; the state change happens in prepare_to_wait() and the actual sleep in schedule().

DEFINE_WAIT(wait);        /* create a wait queue entry for the current task */
add_wait_queue(q, &wait); /* add the entry to the wait queue 'q' (like appending to a linked list) */
while (!condition) {
    /* 'condition' is the event we are waiting for, e.g. data arriving
     * from user space in a write method (via __get_user()) */
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    /* mark the task TASK_INTERRUPTIBLE: it will sleep until a wake_up()
     * call on this queue or a signal arrives */
    if (signal_pending(current))
        /* a signal (e.g. SIGINT or SIGKILL) is pending for the current
         * task; typically you return -ERESTARTSYS or break out of the
         * loop here */
    schedule(); /* give up the CPU; the task sleeps here */
}
finish_wait(&q, &wait); /* set the task back to TASK_RUNNING and remove the entry from the queue */

pthread condition variables on Linux, odd behaviour

I'm synchronizing reader and writer processes on Linux.
I have 0 or more processes (the readers) that need to sleep until they are woken up, read a resource, go back to sleep, and so on. Please note I don't know how many reader processes are up at any moment.
I have one process (the writer) that writes to a resource, wakes up the readers, and does its business until another resource is ready (in detail, I developed a no-starve readers-writers solution, but that's not important).
To implement the sleep / wake-up mechanism I use a POSIX condition variable, pthread_cond_t. The clients call pthread_cond_wait() on the variable to sleep, while the server calls pthread_cond_broadcast() to wake them all up. As the manual says, I surround these two calls with a lock/unlock of the associated pthread mutex.
The condition variable and the mutex are initialized in the server and shared between processes through a shared memory area (because I'm not working with threads, but with separate processes), and I'm sure my kernel / C library supports this (because I checked _POSIX_THREAD_PROCESS_SHARED).
What happens is that the first client process sleeps and wakes up perfectly. When I start a second process, it blocks on its pthread_cond_wait() and never wakes up, even though I'm sure (from the logs) that pthread_cond_broadcast() is called.
If I kill the first process and launch another one, it works perfectly. In other words, pthread_cond_broadcast() seems to wake up only one process at a time. If more than one process waits on the very same shared condition variable, only the first one manages to wake up correctly, while the others just seem to ignore the broadcast.
Why this behaviour? If I send a pthread_cond_broadcast(), every waiting process should wake up, not just one (and not always the same one, at that).
Have you set the PTHREAD_PROCESS_SHARED attribute on both your condvar and mutex?
For Linux consult the following man pages:
pthread_mutexattr_init (with sample)
pthread_mutexattr_setpshared
pthread_condattr_init
pthread_condattr_setpshared
Methods, types, constants etc. are normally defined in /usr/include/pthread.h, /usr/include/nptl/pthread.h.
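For example, here is a sketch of setting up a process-shared mutex and condvar in anonymous shared memory (the struct and function names are mine, for illustration):

#include <pthread.h>
#include <sys/mman.h>

struct shared_sync {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
};

/* map a region that survives fork() and is visible to all children */
struct shared_sync *create_shared_sync(void)
{
    struct shared_sync *s = mmap(NULL, sizeof *s,
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->mutex, &ma);
    pthread_mutexattr_destroy(&ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->cond, &ca);
    pthread_condattr_destroy(&ca);

    return s;
}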
Do you test for some condition before calling pthread_cond_wait()? I am asking because it's a very common mistake: your process must not call wait() unless you know some other process will call signal() (or broadcast()) later.
Consider this code (from the pthread_cond_wait man page):
pthread_mutex_lock(&mut);
while (x <= y) {
    pthread_cond_wait(&cond, &mut);
}
/* operate on x and y */
pthread_mutex_unlock(&mut);
If you omit the while test and just signal from another process whenever your (x <= y) condition is true, it won't work, since the signal only wakes up processes that are already waiting. If signal() was called before the other process called wait(), the signal is lost and the waiting process will wait forever.
EDIT: About the while loop.
When you signal one process from another, the woken process is put on the "ready list" but not necessarily scheduled right away, and your condition (x <= y) may change again since no one holds the lock. That's why you need to re-check the condition each time you wake up. It should always be: wake up -> check whether the condition is still true -> do work.
Hope it's clear.
The documentation says that it should work... are you sure it's the same condition variable that the rest of the processes are looking at?
This is the example code from opengroup.org:
pthread_cond_wait(mutex, cond):
    value = cond->value;                   /* 1 */
    pthread_mutex_unlock(mutex);           /* 2 */
    pthread_mutex_lock(cond->mutex);       /* 10 */
    if (value == cond->value) {            /* 11 */
        me->next_cond = cond->waiter;
        cond->waiter = me;
        pthread_mutex_unlock(cond->mutex);
        unable_to_run(me);
    } else
        pthread_mutex_unlock(cond->mutex); /* 12 */
    pthread_mutex_lock(mutex);             /* 13 */

pthread_cond_signal(cond):
    pthread_mutex_lock(cond->mutex);       /* 3 */
    cond->value++;                         /* 4 */
    if (cond->waiter) {                    /* 5 */
        sleeper = cond->waiter;            /* 6 */
        cond->waiter = sleeper->next_cond; /* 7 */
        able_to_run(sleeper);              /* 8 */
    }
    pthread_mutex_unlock(cond->mutex);     /* 9 */
What the last poster said is correct. The KEY to the whole cond-variable situation working correctly is that the cond-var is NOT signalled prior to being waited on. It's strictly a signal to be used when others (one or many) are already waiting. When no one is waiting, it's effectively a no-op. Which, by the way, is NOT how I believe it SHOULD work, but it is how it DOES work.
larry
