In Linux, what do all the values in the "top" command mean?

Whenever I run top and see all the running processes, I've always wanted to know what everything actually means, e.g. all the various single-letter state codes for a running process (R = Running, S = Sleeping, etc.)
Where can I find this?

The man page says what the state codes are mapped to, but not what they actually mean. From the top man page:
'D' = uninterruptible sleep
'R' = running
'S' = sleeping
'T' = traced or stopped
'Z' = zombie
'R' is the easiest; the process is ready to run, and will run whenever its turn to use the CPU comes.
'S' and 'D' are two sleep states, where the process is waiting for something to happen. The difference is that 'S' can be interrupted by a signal, while 'D' cannot (it is usually seen when the process is waiting for the disk).
'T' is a state where the process is stopped, usually via SIGSTOP or SIGTSTP. It can also be stopped by a debugger (ptrace). When you see that state, it is usually because you used Ctrl+Z to put a command in the background.
'Z' is a state where the process is dead (it has finished its execution), and the only thing left is the structure describing it in the kernel. It is waiting for its parent process to retrieve its exit code, and not much more. Once its parent process has collected its exit code, it will disappear.
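If you want to see a 'Z' for yourself, you can fork a child that exits immediately while the parent deliberately never calls wait(). A minimal sketch (plain POSIX, nothing specific to top):
```
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);   /* child finishes immediately */

    /* The parent never calls wait(), so the dead child stays a zombie
     * (state 'Z' in top) until the parent itself exits and the child
     * is reaped by init. */
    printf("child %d should now show as Z in top\n", (int)pid);
    sleep(60);      /* time to look at top in another terminal */
    return 0;
}
```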

You can use the command man top to look up the states:
D = uninterruptible sleep
I = idle
R = running
S = sleeping
T = stopped by job control signal
t = stopped by debugger during trace
Z = zombie

Programs like top and ps take these values from the kernel itself. You can find their definitions in the source code here:
https://github.com/torvalds/linux/blob/3950e975431bc914f7e81b8f2a2dbdf2064acb0f/fs/proc/array.c#L129-L143
static const char * const task_state_array[] = {
        /* states in TASK_REPORT: */
        "R (running)",          /* 0x00 */
        "S (sleeping)",         /* 0x01 */
        "D (disk sleep)",       /* 0x02 */
        "T (stopped)",          /* 0x04 */
        "t (tracing stop)",     /* 0x08 */
        "X (dead)",             /* 0x10 */
        "Z (zombie)",           /* 0x20 */
        "P (parked)",           /* 0x40 */

        /* states beyond TASK_REPORT: */
        "I (idle)",             /* 0x80 */
};
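The letter itself is exported through /proc: it is the third field of /proc/<pid>/stat, which is where top and ps read it from. A minimal sketch that prints a process's own state (the naive fscanf pattern assumes the comm field contains no ')' character, and of course a process reading its own stat will always see R):
```
#include <stdio.h>

int main(void)
{
    /* /proc/<pid>/stat starts with: pid (comm) state ... */
    FILE *f = fopen("/proc/self/stat", "r");
    int pid;
    char comm[64], state;

    if (f && fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3)
        printf("pid=%d comm=%s state=%c\n", pid, comm, state);
    if (f)
        fclose(f);
    return 0;
}
```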
For more info see this question: https://unix.stackexchange.com/q/462098/79648

Related

How to resolve this mistake in Peterson's algorithm for process synchronization

I was reading A. Tanenbaum's book Modern Operating Systems, and there was a code snippet introducing Peterson's algorithm for process synchronization, which is implemented in software.
Here's the snippet.
```
#define FALSE 0
#define TRUE  1
#define N     2                          /* number of processes */

int turn;                                /* whose turn is it? */
int interested[N];                       /* all values initially 0 (FALSE) */

void enter_region(int process)           /* process is 0 or 1 */
{
    int other;                           /* number of the other process */

    other = 1 - process;                 /* the opposite of process */
    interested[process] = TRUE;          /* show that you are interested */
    turn = process;                      /* set flag */
    while (turn == process && interested[other] == TRUE); /* null statement */
}

void leave_region(int process)           /* process: who is leaving */
{
    interested[process] = FALSE;         /* indicate departure from critical region */
}
```
The question is:
Isn't there a mistake? [Edit] Mustn't it be turn = other, or maybe there is another mistake?
This version of the algorithm violates the rules of mutual exclusion.
[Edit]
I think this version violates the rules of mutual exclusion: if the first process sets its interested flag and then stops, and the other process runs, the second process can end up busy-waiting after setting its interested and turn variables even though there is no process in the critical section at all.
Any answer and help is appreciated. Thanks!
If process 0 sets interested[0], and then process 1 runs enter_region up until the loop, then process 0 will be able to exit the loop because turn == process is no longer true for it. turn, in this case, really means "turn to wait", and protects against exactly the situation you described. In contrast, if the code did turn = other then process 0 would not be able to exit the loop until process 1 started waiting.
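You can convince yourself by running the two functions from two threads and counting how many threads are ever inside the critical section at once. One caveat: on modern hardware the plain-int textbook version is not actually safe, because the compiler and CPU may reorder the stores and loads, so the sketch below uses C11 seq_cst atomics to get the sequential consistency the textbook assumes. The in_critical counter and the worker harness are my additions, not Tanenbaum's:
```
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define N 2

atomic_int turn;
atomic_int interested[N];
atomic_int in_critical;          /* how many threads are inside right now */

void enter_region(int process)
{
    int other = 1 - process;
    atomic_store(&interested[process], 1);
    atomic_store(&turn, process);
    while (atomic_load(&turn) == process && atomic_load(&interested[other]))
        ;                        /* busy wait */
}

void leave_region(int process)
{
    atomic_store(&interested[process], 0);
}

void *worker(void *arg)
{
    int me = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        enter_region(me);
        /* fetch_add returns the previous value; nonzero means another
         * thread was already inside the critical section */
        if (atomic_fetch_add(&in_critical, 1) != 0)
            printf("mutual exclusion violated!\n");
        atomic_fetch_sub(&in_critical, 1);
        leave_region(me);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    int id[N] = {0, 1};
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    puts("done (no violation printed means mutual exclusion held)");
    return 0;
}
```
If you substitute turn = other in enter_region, the same harness should report violations immediately: each thread then sets turn away from itself and sails straight through the loop.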

Where is the context switching finally happening in the Linux kernel source?

In Linux, process scheduling occurs after all interrupts (the timer interrupt and other interrupts) or when a process relinquishes the CPU (by explicitly calling the schedule() function). Today I was trying to see where context switching occurs in the Linux source (kernel version 2.6.23).
(I think I checked this several years ago, but I'm not sure now... I was looking at the sparc arch then.)
I looked for it starting from main_timer_handler (in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in ./arch/x86_64/kernel/entry.S.
ENTRY(common_interrupt)
        XCPT_FRAME
        interrupt do_IRQ
        /* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
        cli
        TRACE_IRQS_OFF
        decl %gs:pda_irqcount
        leaveq
        CFI_DEF_CFA_REGISTER rsp
        CFI_ADJUST_CFA_OFFSET -8
exit_intr:
        GET_THREAD_INFO(%rcx)
        testl $3,CS-ARGOFFSET(%rsp)
        je retint_kernel
        ...(omit)
        GET_THREAD_INFO(%rcx)
        jmp retint_check

#ifdef CONFIG_PREEMPT
        /* Returning to kernel space. Check if we need preemption */
        /* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
        cmpl $0,threadinfo_preempt_count(%rcx)
        jnz  retint_restore_args
        bt   $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
        jnc  retint_restore_args
        bt   $9,EFLAGS-ARGOFFSET(%rsp)  /* interrupts off? */
        jnc  retint_restore_args
        call preempt_schedule_irq
        jmp exit_intr
#endif

        CFI_ENDPROC
END(common_interrupt)
At the end of the ISR there is a call to preempt_schedule_irq! And preempt_schedule_irq is defined in kernel/sched.c as below (it calls schedule() in the middle):
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage void __sched preempt_schedule_irq(void)
{
        struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
        struct task_struct *task = current;
        int saved_lock_depth;
#endif
        /* Catch callers which need to be fixed */
        BUG_ON(ti->preempt_count || !irqs_disabled());

need_resched:
        add_preempt_count(PREEMPT_ACTIVE);
        /*
         * We keep the big kernel semaphore locked, but we
         * clear ->lock_depth so that schedule() doesnt
         * auto-release the semaphore:
         */
#ifdef CONFIG_PREEMPT_BKL
        saved_lock_depth = task->lock_depth;
        task->lock_depth = -1;
#endif
        local_irq_enable();
        schedule();
        local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
        task->lock_depth = saved_lock_depth;
#endif
        sub_preempt_count(PREEMPT_ACTIVE);

        /* we could miss a preemption opportunity between schedule and now */
        barrier();
        if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
                goto need_resched;
}
So I found where the scheduling occurs, but my question is: where in the source code does the actual context switching happen? For a context switch, the stack, the mm settings, and the registers should be switched, and the PC (program counter) should be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch function, which calls the switch_to() function (kernel/sched.c):
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
        struct mm_struct *mm, *oldmm;

        prepare_task_switch(rq, prev, next);
        mm = next->mm;
        oldmm = prev->active_mm;
        /*
         * For paravirt, this is coupled with an exit in switch_to to
         * combine the page table reload and the switch backend into
         * one hypercall.
         */
        arch_enter_lazy_cpu_mode();

        if (unlikely(!mm)) {
                next->active_mm = oldmm;
                atomic_inc(&oldmm->mm_count);
                enter_lazy_tlb(oldmm, next);
        } else
                switch_mm(oldmm, mm, next);

        if (unlikely(!prev->mm)) {
                prev->active_mm = NULL;
                rq->prev_mm = oldmm;
        }
        /*
         * Since the runqueue lock will be released by the next
         * task (which is an invalid locking op but in the case
         * of the scheduler it's an obvious special-case), so we
         * do an early lockdep release here:
         */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
        spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

        /* Here we just switch the register state and the stack. */
        switch_to(prev, next, prev);    // <---- this line

        barrier();
        /*
         * this_rq must be evaluated again because prev may have moved
         * CPUs since it called schedule(), thus the 'rq' on its stack
         * frame will be invalid.
         */
        finish_task_switch(this_rq(), prev);
}
switch_to is assembly code under include/asm-x86_64/system.h.
My question is: is the processor switched to the new task inside the switch_to() function? Then, is the code 'barrier(); finish_task_switch(this_rq(), prev);' run at some other time later? By the way, this was in interrupt context, so if switch_to() is just the end of this ISR, who finishes this interrupt? Or, if finish_task_switch runs, how is the CPU occupied by the new task?
I would really appreciate it if someone could explain and clarify these things to me.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state onto the stack of current, the currently running process. Calling do_sched_yield just changes the value of current, so the return simply restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all of the prev task's state and sets up the next task's state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code; it is just setting up the hardware for userspace.
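To get an intuition for what switch_to does (save the current task's registers and stack pointer, load the next task's, and resume wherever that task last stopped), you can play with a userspace analogue. This is not kernel code, just an illustrative sketch built on the POSIX ucontext API (obsolescent, but still available in glibc); the names main_ctx, task_ctx and task_stack are mine:
```
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];   /* the other "task's" private stack */

static void task(void)
{
    printf("task: running on its own stack\n");
    /* Save our registers in task_ctx and restore main_ctx, roughly what
     * switch_to(prev, next, prev) does when it resumes the next task. */
    swapcontext(&task_ctx, &main_ctx);
    printf("task: resumed exactly where it stopped\n");
}

int main(void)
{
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof(task_stack);
    task_ctx.uc_link = &main_ctx;        /* where to go when task() returns */
    makecontext(&task_ctx, task, 0);

    printf("main: switching to task\n");
    swapcontext(&main_ctx, &task_ctx);   /* save main, run task */
    printf("main: back, switching once more\n");
    swapcontext(&main_ctx, &task_ctx);   /* resume task where it stopped */
    printf("main: done\n");
    return 0;
}
```
Each swapcontext call is morally a switch_to: execution continues inside the other context at the exact point where that context last called swapcontext itself.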

Confusion about init_task (pid 0 or pid 1?)

I'm playing with the Linux kernel, and one thing that I don't understand is the pid of the init_task task.
As far as I know, there are two special pids: pid 0 for the idle/swapper task, and pid 1 for the init task.
Every online resource (e.g. one, two) I could find says that the init_task task represents the swapper task, i.e. it should have pid 0.
But when I print all the pids using the for_each_process macro, which starts from init_task, I get pid 1 as the first process, and I don't get pid 0 at all. That would mean init_task has pid 1 and that it's the init task (?!).
Please help me resolve this confusion.
P.S. the kernel version is 2.4.
The reason for my confusion was the tricky definition of the for_each_task macro:
#define for_each_task(p) \
        for (p = &init_task ; (p = p->next_task) != &init_task ; )
Even though it seems that p starts from init_task, it actually starts from init_task.next_task because of the assignment in the condition.
So for_each_task(p) { /* ... */ } could be rewritten as:
p = init_task.next_task;
while (p != &init_task)
{
        /* ... */
        p = p->next_task;
}
As can be seen, the swapper process (init_task itself, pid 0) is not part of the iteration.
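If you do want pid 0 in the listing, you can start from init_task itself and use a do-while so the head of the list is visited too. A minimal 2.4-style sketch (assuming that era's next_task, pid and comm fields of struct task_struct):
```
struct task_struct *p = &init_task;   /* pid 0: the swapper itself */
do {
        printk("pid %d (%s)\n", p->pid, p->comm);
        p = p->next_task;
} while (p != &init_task);
```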

How does Linux know when to allocate more pages to a call stack?

Given the program below, segfault() will (as the name suggests) segfault the program by accessing memory 256k below the stack. nofault(), however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps, I see the allocated stack space grow between the first and second call. This explains why segfault() doesn't crash afterwards: there's plenty of memory.
But the disassembly shows there's no change to %rsp. That makes sense, since changing it would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (in retrospect, that would be very hard for a compiler to do), or that the kernel would just periodically check %rsp and add a buffer below it.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>

void segfault() {
    char *x;
    int a;
    for (x = (char *)&x - 1024 * 256; x < (char *)(&x + 1); x++) {
        a = *x & 0xFF;
        printf("%p = 0x%02x\n", x, a);
    }
}

void nofault() {
    char *x;
    int a;
    sleep(20);
    for (x = (char *)(&x); x > (char *)&x - 1024 * 1024; x--) {
        a = *x & 0xFF;
        printf("%p = 0x%02x\n", x, a);
    }
    sleep(20);
}

int main() {
    nofault();
    segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from the page fault handler in arch/x86/mm/fault.c:
/*
 * Accessing the stack below %sp is always a bug.
 * The large cushion allows instructions like enter
 * and pusha to work. ("enter $65535, $31" pushes
 * 32 pointers and then decrements %sp by 65535.)
 */
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
        bad_area(regs, error_code, address);
        return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.
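Instead of cat'ing /proc/$pid/maps by hand during the sleep()s, the program can also dump its own stack mapping before and after each loop. A minimal sketch (the [stack] annotation in maps is present on reasonably recent kernels):
```
#include <stdio.h>
#include <string.h>

/* Print the /proc/self/maps line describing the main thread's stack,
 * so you can watch its lower bound move down as the kernel grows it. */
static void show_stack_mapping(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[256];

    if (!f)
        return;
    while (fgets(line, sizeof line, f))
        if (strstr(line, "[stack]"))
            fputs(line, stdout);
    fclose(f);
}
```
Calling show_stack_mapping() before and after nofault() shows the stack VMA growing even though %rsp never changes.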

ptrace: STEPINTO returns breakpoint signal

Yet another ptrace question:
The process is paused for the initial SIGTRAP, then I call ptrace with STEPINTO to step the next instruction. After the ptrace call, I call waitpid(...) and sure enough it does receive a SIGTRAP signal! Great!
However, when I examine the siginfo_t structure by calling
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
the fields are as follows:
si_signo = SIGTRAP
si_code = 1
However, 1 is the si_code value for a breakpoint, and no breakpoint was hit in this case:
/* `si_code' values for SIGTRAP signal. */
enum
{
        TRAP_BRKPT = 1,         /* Process breakpoint. */
# define TRAP_BRKPT     TRAP_BRKPT
        TRAP_TRACE              /* Process trace trap. */
# define TRAP_TRACE     TRAP_TRACE
};
What's going on?
All calls involved return success (0), and errno is 0 too.
edit: so what happens when I hit a real 0xCC breakpoint?
A SIGTRAP is fired and the si_code field is 0x80! 0x80 isn't even defined for SIGTRAP! :(
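For reference, here is a minimal sketch of the step-and-inspect cycle described above, assuming the STEPINTO wrapper corresponds to the raw PTRACE_SINGLESTEP request (error handling mostly omitted):
```
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be traced */
        execl("/bin/true", "true", (char *)NULL);
        _exit(1);
    }

    int status;
    waitpid(pid, &status, 0);                   /* initial SIGTRAP after exec */

    for (int i = 0; i < 5 && WIFSTOPPED(status); i++) {
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid(pid, &status, 0);               /* SIGTRAP after one instruction */

        siginfo_t si;
        if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == 0)
            printf("si_signo=%d si_code=%d\n", si.si_signo, si.si_code);
    }

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);                      /* reap the child */
    return 0;
}
```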
