I found a function called task_running in the kernel. Its logic is as follows:
static inline int task_current(struct rq *rq, struct task_struct *p)
{
return rq->curr == p;
}
static inline int task_running(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
return p->on_cpu;
#else
return task_current(rq, p);
#endif
}
Is there any difference between rq->curr and p->on_cpu? I think they both mean that the process is being run by the current CPU. Why is a separate check required under SMP?
On a multiprocessor system (CONFIG_SMP) the scheduler usually needs to continuously acquire and release the spinlocks p->pi_lock (for tasks, i.e. struct task_struct) and rq->lock (for runqueues, i.e. struct rq). Since lock contention is one of the main causes of slowdown on multiprocessor systems, the on_cpu field of task_struct was added to avoid acquiring rq->lock just to look at rq->curr.
On a uniprocessor system (no CONFIG_SMP), there is only one CPU with one runqueue, and there is no need to acquire runqueue locks when checking the running task. Since the scheduler will not do any locking anyway, we can also avoid having the on_cpu field in task_struct, saving some memory and dropping all the code that deals with it. In fact, if you take a look at the code for struct task_struct, you can see:
struct task_struct {
/* ... */
#ifdef CONFIG_SMP
int on_cpu;
/* ... */
#endif
/* ... */
}
Here's the relevant commit that implemented this optimization. It was part of this patchwork by Peter Zijlstra to reduce lock contention in the scheduler (expand the "related" field to list everything).
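To make the difference concrete, here is a minimal userspace sketch of the two ways of asking "is this task running?". It is not kernel code: struct task, struct runqueue, the toy spinlock and every function name below are hypothetical stand-ins, and it ignores the fact that the real kernel clears on_cpu only after the context switch has finished, so the two views can briefly disagree there.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct task {
    atomic_int on_cpu;      /* nonzero while some CPU is executing this task */
};

struct runqueue {
    atomic_flag lock;       /* toy spinlock standing in for rq->lock */
    struct task *curr;      /* task currently running on this CPU */
};

static void rq_lock(struct runqueue *rq)
{
    while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
        ;                   /* spin until the lock is free */
}

static void rq_unlock(struct runqueue *rq)
{
    atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Scheduler side: switch to next and keep both views in sync under the lock. */
static void rq_switch_to(struct runqueue *rq, struct task *next)
{
    rq_lock(rq);
    if (rq->curr)
        atomic_store_explicit(&rq->curr->on_cpu, 0, memory_order_release);
    rq->curr = next;
    if (next)
        atomic_store_explicit(&next->on_cpu, 1, memory_order_release);
    rq_unlock(rq);
}

/* Locked check: has to take the runqueue lock to read rq->curr safely. */
static bool task_running_locked(struct runqueue *rq, struct task *p)
{
    rq_lock(rq);
    bool running = (rq->curr == p);
    rq_unlock(rq);
    return running;
}

/* Lockless check: reads only the task's own flag; no runqueue lock needed. */
static bool task_running_lockless(struct task *p)
{
    return atomic_load_explicit(&p->on_cpu, memory_order_acquire) != 0;
}
```

The lockless variant never touches the runqueue, which is the point of the optimization: another CPU can ask about a task without contending on that runqueue's lock.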
I want to retrieve the sessionid from a task struct in an eBPF program. I have the following code in my eBPF program:
struct task_struct *task;
u32 sessionid;
task = (struct task_struct *)bpf_get_current_task();
sessionid = task->sessionid;
This runs, but the sessionid always ends up being -1. I read in this answer that I can use task_session to retrieve it, but I get an error about invalid memory access. I believe I need to use bpf_probe_read to move the task_struct that task points to onto the stack, but I can't get it to work. Is there anything I'm missing?
After a bit more digging through the task_struct struct I realised you could do this:
struct task_struct *task;
struct pid_link pid_link;
struct pid pid;
unsigned int sessionid;
task = (struct task_struct *)bpf_get_current_task();
bpf_probe_read(&pid_link, sizeof(pid_link), (void *)&task->group_leader->pids[PIDTYPE_SID]);
bpf_probe_read(&pid, sizeof(pid), (void *)pid_link.pid);
sessionid = pid.numbers[0].nr;
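The pattern in that snippet is "copy each pointed-to object onto the stack before reading it, one pointer hop at a time". A rough userspace model of the same walk is sketched below. Everything in it is an assumption made for illustration: the structs are heavily cut-down mirrors of the ones the answer reads, and mock_probe_read is just memcpy standing in for bpf_probe_read, so it shows the order of the hops and nothing about the verifier.

```c
#include <string.h>

/* Hypothetical, heavily simplified mirrors of the structs used above. */
enum { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct upid { int nr; };
struct pid { struct upid numbers[1]; };
struct pid_link { struct pid *pid; };
struct task {
    struct task *group_leader;
    struct pid_link pids[PIDTYPE_MAX];
};

/* Stand-in for bpf_probe_read(dst, size, src): a plain memcpy here. */
static int mock_probe_read(void *dst, unsigned int size, const void *src)
{
    memcpy(dst, src, size);
    return 0;
}

/* Follow task -> group_leader -> pids[PIDTYPE_SID].pid -> numbers[0].nr,
 * copying each pointed-to object to the stack before using it. */
static int session_id_of(const struct task *task)
{
    struct task *leader;
    struct pid_link link;
    struct pid pid;

    mock_probe_read(&leader, sizeof(leader), &task->group_leader);
    mock_probe_read(&link, sizeof(link), &leader->pids[PIDTYPE_SID]);
    mock_probe_read(&pid, sizeof(pid), link.pid);
    return pid.numbers[0].nr;
}
```

Each hop reads from an address obtained by the previous copy, which is why a single direct dereference such as task->sessionid does not give the session's pid.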
I am creating a kernel module to find the resident pages of every process. I am using get_mm_rss() and for_each_process, but it only works for the init process; after the first iteration it fails.
int __init schedp(void){
struct task_struct *p;
for_each_process(p) {
int pid = task_pid_nr(p);
printk("Process: %s (pid = %d) , (rpages : %lu)\n",
p->comm, pid, get_mm_rss(p->mm));
}
return 0;
}
Results:
BUG: unable to handle kernel NULL pointer dereference at 00000160,
You're probably getting NULL in p->mm: kernel threads have no mm at all, and a task that is exiting may already have dropped its mm.
When you are unsure how to use a kernel API, always look for examples inside the kernel itself. A quick search with a cross-reference tool gave me kernel/cpu.c:
for_each_process(p) {
struct task_struct *t;
/*
* Main thread might exit, but other threads may still have
* a valid mm. Find one.
*/
t = find_lock_task_mm(p);
if (!t)
continue;
cpumask_clear_cpu(cpu, mm_cpumask(t->mm));
task_unlock(t);
}
Note that you need to call find_lock_task_mm() and task_unlock() and explicitly check for NULL.
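It finally works after creating a function that, for each process visited by for_each_process, looks for a thread with a valid mm_struct. Like find_lock_task_mm(), it returns with the task still locked, so the caller must call task_unlock() on a non-NULL result: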
struct task_struct *task_mm(struct task_struct *p){
struct task_struct *t;
rcu_read_lock();
for_each_thread(p, t) {
task_lock(t);
if (likely(t->mm))
goto found;
task_unlock(t);
}
t = NULL;
found:
rcu_read_unlock();
return t;
}
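The search itself can be modelled in plain userspace C, which may make the control flow easier to see. This is only a sketch under stated assumptions: struct mm, struct task, next_thread, find_task_mm and process_rss are hypothetical names, and the locking and RCU protection that the kernel version needs are left out entirely.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for mm_struct and task_struct. */
struct mm { unsigned long rss_pages; };

struct task {
    const char *comm;
    struct mm *mm;            /* NULL for kernel threads and exiting tasks */
    struct task *next_thread; /* next thread in the same group, or NULL */
};

/* Return the first thread in p's group that still has an mm, or NULL. */
static struct task *find_task_mm(struct task *p)
{
    for (struct task *t = p; t != NULL; t = t->next_thread)
        if (t->mm != NULL)
            return t;
    return NULL;
}

/* Resident pages for a process, or 0 when no thread has an mm. */
static unsigned long process_rss(struct task *p)
{
    struct task *t = find_task_mm(p);
    return t ? t->mm->rss_pages : 0;
}
```

In the module, the equivalent of process_rss() would call get_mm_rss() only on the task returned by the search, skip the process when the result is NULL, and unlock the task afterwards.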
As I was going through the Linux char driver code below, I found the pointer current used in printk.
I want to know what structure current points to and what all of its members are.
What purpose does this structure serve?
ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos)
{
printk(KERN_DEBUG "process %i (%s) going to sleep\n",
current->pid, current->comm);
wait_event_interruptible(wq, flag != 0);
flag = 0;
printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
return 0;
}
It is a pointer to the current process, i.e., the process that issued the system call.
From the docs:
The Current Process
Although kernel modules don't execute sequentially as applications do,
most actions performed by the kernel are related to a specific
process. Kernel code can know the current process driving it by
accessing the global item current, a pointer to struct task_struct,
which as of version 2.4 of the kernel is declared in
<asm/current.h>, included by <linux/sched.h>. The current pointer
refers to the user process currently executing. During the execution
of a system call, such as open or read, the current process is the one
that invoked the call. Kernel code can use process-specific
information by using current, if it needs to do so. An example of this
technique is presented in "Access Control on a Device File", in
Chapter 5, "Enhanced Char Driver Operations".
Actually, current is not properly a global variable any more, like it
was in the first Linux kernels. The developers optimized access to the
structure describing the current process by hiding it in the stack
page. You can look at the details of current in <asm/current.h>. While
the code you'll look at might seem hairy, we must keep in mind that
Linux is an SMP-compliant system, and a global variable simply won't
work when you are dealing with multiple CPUs. The details of the
implementation remain hidden to other kernel subsystems though, and a
device driver can just include <linux/sched.h> and refer to the
current process.
From a module's point of view, current is just like the external
reference printk. A module can refer to current wherever it sees fit.
For example, the following statement prints the process ID and the
command name of the current process by accessing certain fields in
struct task_struct:
printk("The process is \"%s\" (pid %i)\n",
current->comm, current->pid);
The command name stored in current->comm is the base name of the
program file that is being executed by the current process.
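The "hiding it in the stack page" remark can be illustrated in userspace. On 2.4-era x86 the task descriptor sat at the base of the same aligned block as the kernel stack, so current could be found by masking the low bits off the stack pointer. The sketch below models that idea only; THREAD_SIZE, struct task and both helper names are assumptions for illustration, and newer kernels generally use other mechanisms (such as per-CPU variables) instead.

```c
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u   /* assumed size of the combined task/stack block */

/* Hypothetical miniature task descriptor stored at the base of the block. */
struct task {
    int pid;
    char comm[16];
};

/* Allocate a THREAD_SIZE-aligned block with the descriptor at its start;
 * the rest of the block plays the role of the kernel stack. */
static struct task *alloc_task_block(int pid)
{
    struct task *t = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    if (t)
        t->pid = pid;
    return t;
}

/* Recover the descriptor from any address inside the block by masking
 * off the low bits, as old kernels did with the stack pointer. */
static struct task *task_from_stack_addr(const void *sp)
{
    return (struct task *)((uintptr_t)sp & ~(uintptr_t)(THREAD_SIZE - 1));
}
```

Because every CPU has its own stack pointer, each CPU finds its own task this way without any shared global variable, which is the SMP point the quoted text makes.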
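Here is the complete structure that current points to, as it looked in an early kernel version (the task_struct in current kernels is much larger and laid out differently):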
task_struct
Each task_struct data structure describes a process or task in the system.
struct task_struct {
/* these are hardcoded - don't touch */
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
long counter;
long priority;
unsigned long signal;
unsigned long blocked; /* bitmap of masked signals */
unsigned long flags; /* per process flags, defined below */
int errno;
long debugreg[8]; /* Hardware debugging registers */
struct exec_domain *exec_domain;
/* various fields */
struct linux_binfmt *binfmt;
struct task_struct *next_task, *prev_task;
struct task_struct *next_run, *prev_run;
unsigned long saved_kernel_stack;
unsigned long kernel_stack_page;
int exit_code, exit_signal;
/* ??? */
unsigned long personality;
int dumpable:1;
int did_exec:1;
int pid;
int pgrp;
int tty_old_pgrp;
int session;
/* boolean value for session group leader */
int leader;
int groups[NGROUPS];
/*
* pointers to (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
* p->p_pptr->pid)
*/
struct task_struct *p_opptr, *p_pptr, *p_cptr,
*p_ysptr, *p_osptr;
struct wait_queue *wait_chldexit;
unsigned short uid,euid,suid,fsuid;
unsigned short gid,egid,sgid,fsgid;
unsigned long timeout, policy, rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
long utime, stime, cutime, cstime, start_time;
/* mm fault and swap info: this can arguably be seen as either
mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
unsigned long swap_address;
unsigned long old_maj_flt; /* old value of maj_flt */
unsigned long dec_flt; /* page fault count of the last time */
unsigned long swap_cnt; /* number of pages to swap on next pass */
/* limits */
struct rlimit rlim[RLIM_NLIMITS];
unsigned short used_math;
char comm[16];
/* file system info */
int link_count;
struct tty_struct *tty; /* NULL if no tty */
/* ipc stuff */
struct sem_undo *semundo;
struct sem_queue *semsleeping;
/* ldt for this task - used by Wine. If NULL, default_ldt is used */
struct desc_struct *ldt;
/* tss for this task */
struct thread_struct tss;
/* filesystem information */
struct fs_struct *fs;
/* open file information */
struct files_struct *files;
/* memory management info */
struct mm_struct *mm;
/* signal handlers */
struct signal_struct *sig;
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth; /* Lock depth.
We can context switch in and out
of holding a syscall kernel lock... */
#endif
};
From what I have read, rather than attaching a scheduling policy directly to a task, it is better to schedule an entity, the advantage being that many kinds of things can be scheduled under the same policy. Two entities are defined for the two scheduling policies (CFS and RT): sched_entity and sched_rt_entity. The code for the CFS entity is (from v3.5.4):
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 nr_migrations;
#ifdef CONFIG_SCHEDSTATS
struct sched_statistics statistics;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
};
and for the RT (real-time) entity it is:
struct sched_rt_entity {
struct list_head run_list;
unsigned long timeout;
unsigned int time_slice;
struct sched_rt_entity *back;
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity *parent;
/* rq on which this entity is (to be) queued: */
struct rt_rq *rt_rq;
/* rq "owned" by this entity/group: */
struct rt_rq *my_q;
#endif
};
Both of these use the list_head structure defined in ./include/linux/types.h:
struct list_head {
struct list_head *next, *prev;
};
I honestly do not understand how any such thing is going to be scheduled. Can anyone explain how this works?
P.S.:
Moreover, I am having a hard time understanding what the names of the data members mean. Can anyone suggest good reading on kernel structures so that I can figure these things out more easily? Most of my time is spent searching for what a data member could mean.
Scheduling entities were introduced to implement group scheduling, so that CFS (or the RT scheduler) can give fair CPU time not only to individual tasks but also to groups of tasks. A scheduling entity may be either a task or a group of tasks.
struct list_head is simply the Linux way of implementing a linked list. In the code you posted, the group_node and run_list fields make it possible to build lists of struct sched_entity and struct sched_rt_entity. More information can be found here.
Using these list_heads, scheduling entities are stored in scheduler-related data structures, for example cfs_rq.cfs_tasks if the entity is a task enqueued using account_entity_enqueue().
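As a rough illustration of how an embedded list_head links objects together, here is a small userspace sketch. Only struct list_head matches the kernel definition; struct entity, list_init, list_add_front, entity_of and sum_vruntime are hypothetical names invented for this example and are much simpler than the kernel's list API.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical cut-down entity; only the list link and one payload field. */
struct entity {
    unsigned long vruntime;
    struct list_head group_node;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head, similar in spirit to the kernel's list_add(). */
static void list_add_front(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Map a list link back to the entity that embeds it. */
static struct entity *entity_of(struct list_head *node)
{
    return (struct entity *)((char *)node - offsetof(struct entity, group_node));
}

/* Walk the list and sum vruntime over every queued entity. */
static unsigned long sum_vruntime(struct list_head *head)
{
    unsigned long total = 0;
    for (struct list_head *pos = head->next; pos != head; pos = pos->next)
        total += entity_of(pos)->vruntime;
    return total;
}
```

The list code knows nothing about struct entity; the link lives inside the object, and the object is recovered from the link by subtracting the member's offset.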
Always up-to-date documentation of the Linux kernel can be found within its sources. In this case you should check this directory, and especially this file, which describes CFS. There is also an explanation of task groups.
EDIT: task_struct contains a field se of type struct sched_entity. Given the address of a sched_entity object, the container_of macro can retrieve the address of the enclosing task_struct object; see task_of(). (address of sched_entity object - offset of se in task_struct = address of task_struct object) This is a common trick, also used in the implementation of the lists I mentioned earlier in this answer.
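The address arithmetic in the parentheses can be tried out in userspace. The sketch below assumes cut-down versions of both structures (the real ones have many more fields) and a simplified container_of without the kernel macro's type checking; it only demonstrates the offset subtraction.

```c
#include <stddef.h>

/* Same idea as the kernel macro, minus its type checking. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical cut-down versions of the real structures. */
struct sched_entity {
    unsigned long long vruntime;
};

struct task_struct {
    int pid;
    char comm[16];
    struct sched_entity se;   /* embedded in the task, not a pointer */
};

/* Recover the owning task from a pointer to its embedded entity. */
static struct task_struct *task_of(struct sched_entity *se)
{
    return container_of(se, struct task_struct, se);
}
```

Note that this only makes sense when the entity really is embedded in a task; an entity that represents a group of tasks has no enclosing task_struct to recover.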