What is the Linux process kernel stack state at process creation? - linux

I can't find this information anywhere. Everywhere I look, I find things referring to how the stack looks once you hit "main" (whatever your entry point is), which would be the program arguments, and environment, but what I'm looking for is how the system sets up the stack to cooperate with the switch_to macro. The first time the task gets switched to, it would need to have EFLAGS, EBP, the registers that GCC saves, and the return address from the schedule() function on the stack pointed to by "tsk->thread->esp", but what I can't figure out is how the kernel sets up this stack, since it lets GCC save the general purpose registers (using the output parameters for inline assembly).
I am referring to x86 PCs only. I am researching the Linux scheduler/process system for my own small kernel I am (attempting) to write, and I can't get my head around what I'm missing. I know I'm missing something since the fact that Slackware is running on my computer is a testament to the fact that the scheduler works :P
EDIT: I seem to have worded this badly. I am looking for information on how the task's kernel stack is set up, not how the task's user stack is set up. More specifically, the stack which tsk->thread->esp points to, and that "switch_to" switches to.

The initial kernel stack for a new process is set in copy_thread(), which is an arch-specific function. The x86 version, for example, starts out like this:
int copy_thread(unsigned long clone_flags, unsigned long sp,
                unsigned long unused,
                struct task_struct *p, struct pt_regs *regs)
{
    struct pt_regs *childregs;
    struct task_struct *tsk;
    int err;

    childregs = task_pt_regs(p);
    *childregs = *regs;
    childregs->ax = 0;
    childregs->sp = sp;

    p->thread.sp = (unsigned long) childregs;
    p->thread.sp0 = (unsigned long) (childregs+1);

    p->thread.ip = (unsigned long) ret_from_fork;
p->thread.sp and p->thread.ip are the new thread's kernel stack pointer and instruction pointer respectively.
Note that it does not place a saved %eflags, %ebp etc there, because when a newly-created thread of execution is first switched to, it starts out executing at ret_from_fork (this is where __switch_to() returns to for a new thread), which means that it doesn't execute the second-half of the switch_to() routine.
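To make the layout concrete, here is a minimal user-space sketch of what the excerpt above does. It is not kernel code: the struct definitions, the THREAD_SIZE value, the stub for ret_from_fork, and the helper name setup_child_stack are simplified stand-ins (assumptions) for the kernel's pt_regs, thread_struct, and copy_thread(), and the real task_pt_regs() may also leave some padding at the top of the stack.

```c
#include <stdint.h>

#define THREAD_SIZE 8192  /* assumed kernel stack size for this sketch */

/* Simplified stand-in for the kernel's struct pt_regs (user register frame). */
struct pt_regs {
    unsigned long bx, cx, dx, si, di, bp, ax;
    unsigned long ip, flags, sp;
};

/* Simplified stand-in for the thread_struct fields used by copy_thread(). */
struct thread_struct {
    unsigned long sp;   /* kernel stack pointer loaded by switch_to() */
    unsigned long sp0;  /* top of the kernel stack */
    unsigned long ip;   /* where the task resumes on its first switch */
};

/* Placeholder for the real ret_from_fork entry point. */
static void ret_from_fork_stub(void) {}

/* Hypothetical helper mirroring the excerpt: the saved user register
 * frame is placed at the very top of the new task's kernel stack. */
static struct pt_regs *setup_child_stack(unsigned char *stack,
                                         const struct pt_regs *parent_regs,
                                         unsigned long user_sp,
                                         struct thread_struct *thread)
{
    struct pt_regs *childregs =
        (struct pt_regs *)(stack + THREAD_SIZE) - 1;

    *childregs = *parent_regs;   /* child starts with the parent's user state */
    childregs->ax = 0;           /* fork() returns 0 in the child */
    childregs->sp = user_sp;

    thread->sp  = (unsigned long)childregs;        /* nothing pushed below it */
    thread->sp0 = (unsigned long)(childregs + 1);  /* one frame above sp */
    thread->ip  = (unsigned long)(uintptr_t)ret_from_fork_stub;
    return childregs;
}
```

The point of the sketch is that the only thing on the fresh kernel stack is the pt_regs frame: sp points at it, sp0 points just past it, and no saved %ebp or %eflags sits underneath.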

The state of the stack at process creation is described in the x86-64 SVR4 ABI supplement (for AMD64, i.e. 64-bit x86 machines). The equivalent for 32-bit Intel processors is probably the i386 ABI. I also strongly recommend reading the Assembly HOWTO. And of course, you should read the relevant Linux kernel source files.

Google for "linux stack layout process startup" gives this link: "Startup state of a Linux/i386 ELF binary", which describes the set up that the kernel performs just before transferring control to the libc startup code.

Related

How 'task_struct' is accessed via 'thread_info' in linux latest kernel?

Background :
I am a beginner in the area of linux kernel. I just started to understand Linux kernel by reading a book 'Linux kernel Development - Third Edition' by Robert Love. Most of the explanations in this book are based on Linux kernel 2.6.34.
Hence, I am sorry, if this is repetitive question, but I could not find any info related to this in stack overflow.
Question:
What I understood from the book is that each thread in Linux has a structure called 'thread_info', which has a pointer to its process/task.
This 'thread_info' is stored at the end of the kernel stack of each live thread,
and the 'thread_info' has a pointer to the task it belongs to, as below.
struct thread_info {
    struct task_struct *task;
    ...
};
But when I checked the same structure in the latest linux code, I see a very different thread_info structure as below. (https://elixir.bootlin.com/linux/v5.16-rc1/source/arch/x86/include/asm/thread_info.h). It does not have 'task_struct' in it.
struct thread_info {
    unsigned long flags;        /* low level flags */
    unsigned long syscall_work; /* SYSCALL_WORK_ flags */
    u32 status;                 /* thread synchronous flags */
#ifdef CONFIG_SMP
    u32 cpu;                    /* current CPU */
#endif
};
My Question is, that if 'thread_info' structure does not have its related task structure here, then how does it find the information about its address space?
Also, If you know any good book on the latest linux kernel, please provide links to me.
The pointer to the current task_struct object is stored in an architecture-dependent way. On x86 it is stored in a per-CPU variable:
DECLARE_PER_CPU(struct task_struct *, current_task);
(In arch/x86/include/asm/current.h).
To find out how the current task_struct is stored on a particular architecture and/or in a particular kernel version, just search for the implementation of the current macro: that macro is responsible for returning a pointer to the task_struct of the current process.
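As a rough user-space illustration of the two schemes, the sketch below contrasts the older approach from the book (thread_info at the base of an aligned kernel stack, found by masking the stack pointer) with a per-CPU pointer. Everything here is an assumption made for illustration: THREAD_SIZE, NR_CPUS, the struct contents, and the helper names are hypothetical, and a real per-CPU variable on x86 is reached through a segment register rather than an array.

```c
#include <stdint.h>

#define THREAD_SIZE 8192u  /* assumed: kernel stack size, also its alignment */
#define NR_CPUS 4          /* assumed CPU count for the sketch */

struct task_struct { int pid; };

/* Old layout: thread_info lives at the lowest address of the stack. */
struct old_thread_info { struct task_struct *task; };

/* Old scheme: round any address inside the stack down to the stack base. */
static struct old_thread_info *thread_info_from_sp(uintptr_t sp)
{
    return (struct old_thread_info *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Newer scheme, modeled here as one pointer slot per CPU. */
static struct task_struct *current_task[NR_CPUS];

static struct task_struct *get_current_on(int cpu)
{
    return current_task[cpu];
}
```

With the per-CPU scheme the task pointer no longer has to live in thread_info at all, which is consistent with the field being absent from the v5.16 structure quoted in the question.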

Does asmlinkage mean stack or registers?

In most languages, C included, the stack is used for function calls. That's why you get a "Stack Overflow" error if you are not careful in recursion. (Pun not intended).
If that is true, then what is so special about the asmlinkage GCC directive?
It says, from #kernelnewbies
The asmlinkage tag is one other thing that we should observe about
this simple function. This is a #define for some gcc magic that tells
the compiler that the function should not expect to find any of its
arguments in registers (a common optimization), but only on the CPU's
stack.
I mean I don't think the registers are used in normal function calls.
What is even more strange is when you learn it is implemented using the GCC regparm function attribute on x86.
The documentation of regparm is as follows:
On x86-32 targets, the regparm attribute causes the compiler to pass
arguments number one to number if they are of integral type in
registers EAX, EDX, and ECX instead of on the stack.
This is basically saying the opposite of what asmlinkage is trying to do.
So what happens? Are they on the stack or in the registers?
Where am I going wrong?
The information isn't very clear.
On 32-bit x86, the asmlinkage macro expands to __attribute__((regparm(0))), which basically tells GCC that no parameters should be passed through registers (the 0 is the important part). As of Linux 5.17, x86-32 and Itanium (ia64) seem to be the only two architectures redefining this macro, which by default expands to no attribute at all.
So asmlinkage does not by itself mean "parameters are passed on the stack". By default, the normal calling convention is used. This includes x86 64bit, which follows the System V AMD64 ABI calling convention, passing function parameters through RDI, RSI, RDX, RCX, R8, R9, [XYZ]MM0–7.
HOWEVER there is an important clarification to make: even with no special __attribute__ to force the compiler to use the stack for parameters, syscalls in recent kernel versions still take parameters from the stack indirectly through a pointer to a pt_regs structure (holding all the user-space registers saved on the stack on syscall entry). This is achieved through a moderately complex set of macros (SYSCALL_DEFINEx) that does everything transparently.
So technically, although asmlinkage does not change the calling convention, parameters are not passed inside registers as one would think by simply looking at the syscall function signature.
For example, the following syscall:
SYSCALL_DEFINE3(something, int, one, int, two, int, three)
{
    // ...
    do_something(one, two, three);
    // ...
}
Actually becomes (roughly):
asmlinkage long __x64_sys_something(const struct pt_regs *regs)
{
    // ...
    do_something(regs->di, regs->si, regs->dx);
    // ...
}
Which compiles to something like:
/* ... */
mov rdx,QWORD PTR [rdi+0x60]
mov rsi,QWORD PTR [rdi+0x68]
mov rdi,QWORD PTR [rdi+0x70]
call do_something
/* ... */
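A small user-space sketch of that expansion, under stated assumptions: the struct below mirrors the field order of the x86-64 pt_regs (which is what produces the 0x60, 0x68, and 0x70 offsets in the disassembly), while do_something and x64_sys_something are hypothetical names written by hand rather than generated by the real SYSCALL_DEFINEx macros.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the x86-64 pt_regs field order (an assumption
 * for this sketch; the authoritative definition is in the kernel tree). */
struct pt_regs {
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8;
    uint64_t ax, cx, dx, si, di;
    uint64_t orig_ax, ip, cs, flags, sp, ss;
};

/* Hypothetical implementation taking ordinary C arguments. */
static long do_something(int one, int two, int three)
{
    return one + two + three;
}

/* Hand-written equivalent of the wrapper: the only real parameter is
 * the pointer to the saved user registers, and the syscall arguments
 * are loaded from that structure in memory. */
static long x64_sys_something(const struct pt_regs *regs)
{
    return do_something((int)regs->di, (int)regs->si, (int)regs->dx);
}
```

Checking offsetof(struct pt_regs, dx), si, and di against the three mov instructions is a quick way to convince yourself that the arguments are read from the saved register frame rather than arriving in registers.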
On i386 and x86-64 at least, asmlinkage means to use the standard calling convention you'd get with no GCC options and no __attribute__. (Like what user-space programs normally use for that target.)
For i386, that means stack args only. For x86-64, it's still the same registers as usual.
For x86-64, there's no difference; the kernel already uses the standard calling convention from the AMD64 System V ABI doc everywhere, because it's well-designed for efficiency, passing the first 6 integer args in registers.
But i386 has more historical baggage, with the standard calling convention (i386 SysV ABI) inefficiently passing all args on the stack. Presumably at some point in ancient history, Linux was compiled by GCC using this convention, and the hand-written asm entry points that called C functions were already using that convention.
So (I'm guessing here), when Linux wanted to switch from gcc -m32 to gcc -m32 -mregparm=3 to build the kernel with a more efficient calling convention, they had a choice to either modify the hand-written asm at the same time to use the new convention, or to force a few specific C functions to still use the traditional calling convention so the hand-written asm could stay the same.
If they'd made the former choice, asmlinkage for i386 would be __attribute__((regparm(3))) to force that convention even if the kernel is compiled a different way.
But instead, they chose to keep the asm the same, and #define asmlinkage __attribute__((regparm(0))) for i386, which indeed is zero register args, using the stack right away.
I don't know if that maybe had any debugging benefit, like in terms of being able to see what args got passed into a C function from asm without the only copy likely getting modified right away.
If -mregparm=3 and the corresponding attribute were new GCC features, Linus probably wanted to keep it possible to build the kernel with older GCC. That would rule out changing the asm to require __attribute__((regparm(3))). The asmlinkage = regparm(0) choice they actually made also has the advantage of not having to modify any asm, which means no correctness concerns, and that can be disentangled from any possible GCC bugs with using the new(?)-at-the-time calling convention.
At this point I think it would be totally possible to modify the asm code that calls asmlinkage functions, and swap it to being regparm(3). But that's a pretty minor thing. And not worth doing now since 32-bit x86 kernels are long since obsolete for almost all use cases. You almost always want a 64-bit kernel even if using a 32-bit user-space.
There might even be an efficiency benefit to stack args if saving the registers at a system-call entry point involved saving them with EBX at the lowest address, where they're already in place to be used as function args. You'd be all set to call *ia32_sys_call_table(,%eax,4). But that isn't actually safe because callees own their stack args and are allowed to write them, even though GCC usually doesn't use the incoming stack arg locations as scratch space. So I doubt Linux would have done this.
Other ISAs cope just fine with asmlinkage passing args in registers, so there's nothing fundamental about stack args that's important for how Linux works. (Except possibly for i386-specific code, but I doubt even that.)
The whole "asmlinkage means to pass args on the stack" is purely an i386 thing.
Most other ISAs that Linux runs on are more recent than 32-bit x86 (and/or are RISC-like with more registers), and have a standard calling convention that's more efficient with modern CPUs, using registers for the first few args. That includes x86-64.
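For experimenting with this outside the kernel, a minimal sketch along the following lines should work. The macro definition imitates the kernel's choice as described above but is my own user-space approximation, and add_three is a hypothetical function name.

```c
/* Imitates the kernel's choice in user space: only 32-bit x86 gets an
 * attribute, every other target keeps its default calling convention. */
#if defined(__i386__)
#define asmlinkage __attribute__((regparm(0)))
#else
#define asmlinkage
#endif

/* Hypothetical function that hand-written assembly would call. */
asmlinkage long add_three(long a, long b, long c)
{
    return a + b + c;
}
```

Building this with gcc -m32 -mregparm=3 -S, once with and once without the asmlinkage tag, and comparing where the function reads its arguments from should show the difference; on x86-64 the tag expands to nothing and the generated code is unchanged.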

Code sequences for TLS on ARM

The ELF Handling For Thread-Local Storage document gives assembly sequences for the various models (local exec/initial exec/general dynamic) for various architectures. But not ARM -- is there anywhere I can see such code sequences for ARM? I'm working on a compiler and want to generate code that will operate properly with the platform linkers (both program and dynamic).
For clarity, let's assume an ARMv7 CPU and a pretty new kernel and glibc (say 3.13+ / 2.19+), but I'd also be interested in what has to change for older hw/sw if that's easy to explain.
I don't exactly understand what you want. However, the assembler sequences (for ARMv6+ and a capable kernel) are,
mrc p15, 0, rX, c13, c0, 3 # read the user read-only thread ID register
This register is called TPIDRURO in ARM manuals, and it is where Linux keeps the TLS pointer (opc2 = 2 selects TPIDRURW, the separate user read/write register, which the kernel only saves and restores). Your TLS tables/structure must be parented from this value (probably a pointer). Using the mrc directly is faster, but if HWCAP_TLS is not available you can call the helper instead (see below), which works on all ARM CPUs supported by Linux.
The intent of address 0xffff0fe8 seems to be that you can use those 4-bytes instead of using the above assembler directly with (rX == r0) as maybe it is different for some machine somewhere.
It is dependent on the CPU type. There is a helper in the vector page at 0xffff0fe0 (see entry-armv.S); the value is kept in the process/thread structure if the hardware doesn't support the register. Documentation is in kernel_user_helpers.txt.
Usage example:
#include <stdio.h>

typedef void * (__kuser_get_tls_t)(void);
#define __kuser_get_tls (*(__kuser_get_tls_t *)0xffff0fe0)

void foo(void)
{
    /* Call the kernel-provided helper mapped in the vector page. */
    void *tls = __kuser_get_tls();
    printf("TLS = %p\n", tls);
}
You do a syscall to set the TLS stuff. clone is a way to set up a thread context. The thread_info holds all registers for a thread; it may share an mm (memory management or process memory view) with other task_structs. I.e., the thread_info has a tp_value for each created thread.
Here is a discussion of the ARM implementation. ELF/nptl/glibc and the Linux kernel are all involved (and/or are search terms to investigate more). The syscall for get_tls() was probably too expensive, and the current mainline has a vector page helper (mapped by all threads/processes).
Some glibc source, tls-macros.h, tlsdesc.c, etc. Most likely a full/concise answer will depend on the version of,
Your ARM CPU.
Your Linux kernel.
Your glibc.
Your compiler (and flags!).
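Given those dependencies, one practical way to see the sequences for your exact setup is to let an existing compiler emit them and read the assembly. The file below is a minimal sketch for that purpose: the variable and function names are hypothetical, and tls_model is a GCC/Clang variable attribute (the same choice can be made globally with -ftls-model=).

```c
/* Two thread-local variables with explicitly requested TLS models. */
__thread int counter_ie __attribute__((tls_model("initial-exec")));
__thread int counter_gd __attribute__((tls_model("global-dynamic")));

/* Access through the initial-exec model. */
int bump_initial_exec(void)
{
    return ++counter_ie;
}

/* Access through the global-dynamic model. */
int bump_global_dynamic(void)
{
    return ++counter_gd;
}
```

Compiling this with an ARM cross compiler and -O2 -S (for example arm-linux-gnueabihf-gcc, assuming that toolchain is installed) and comparing the two functions should show the instruction sequences and relocations each model uses. Keep in mind that the linker may relax a more general model to a cheaper one in the final binary.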

System calls : difference between sys_exit(), SYS_exit and exit()

What is the difference between SYS_exit, sys_exit() and exit()?
What I understand :
The linux kernel provides system calls, which are listed in man 2 syscalls.
There are wrapper functions of those syscalls provided by glibc which have mostly similar names as the syscalls.
My question : In man 2 syscalls, there is no mention of SYS_exit and sys_exit(), for example. What are they?
Note : The syscall exit here is only an example. My question really is : What are SYS_xxx and sys_xxx()?
I'll use exit() as in your example although this applies to all system calls.
The functions of the form sys_exit() are the actual entry points to the kernel routine that implements the function you think of as exit(). These symbols are not even available to user-mode programmers. That is, unless you are hacking the kernel, you cannot link to these functions because their symbols are not available outside the kernel. If I wrote libmsw.a which had a file scope function like
static int msw_func() {}
defined in it, you would have no success trying to link to it because it is not exported in the libmsw symbol table; that is:
cc your_program.c libmsw.a
would yield an error like:
ld: cannot resolve symbol msw_func
because it isn't exported; the same applies for sys_exit() as contained in the kernel.
In order for a user program to get to kernel routines, the syscall(2) interface needs to be used to effect a switch from user mode to kernel mode. When that mode switch (sometimes called a trap) occurs, a small integer is used to look up the proper kernel routine in a kernel table that maps integers to kernel functions. An entry in the table has the form
{SYS_exit, sys_exit},
where SYS_exit is a preprocessor macro which is
#define SYS_exit (1)
and has been 1 on i386 since before you were born because there hasn't been a reason to change it (other architectures use different numbers; on x86-64 it is 60). It sits near the start of the table of system calls, and the number is used directly as an array index, which makes the lookup trivial.
As you note in your question, the proper way for a regular user-mode program to reach sys_exit is through the thin wrapper in glibc (or a similar core library). You would only deal with sys_exit directly if you were writing kernel code; SYS_exit is occasionally useful in user space when invoking syscall(2) by number.
This is now addressed in man 2 syscalls itself:
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). (The dispatch table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
At least now /usr/include/asm/unistd.h is a preprocessor hack that links to either,
/usr/include/asm/unistd_32.h
/usr/include/asm/unistd_x32.h
/usr/include/asm/unistd_64.h
The C function exit() is defined in stdlib.h. Think of this as a high level event driven interface that allows you to register a callback with atexit()
/* Call all functions registered with `atexit' and `on_exit',
in the reverse of the order in which they were registered,
perform stdio cleanup, and terminate program execution with STATUS. */
extern void exit (int __status) __THROW __attribute__ ((__noreturn__));
So essentially the kernel exposes the system call numbers to user space as preprocessor constants named __NR_xxx, and glibc's <sys/syscall.h> defines SYS_xxx as aliases for them (SYS_exit is __NR_exit). sys_exit() is the kernel-side function that implements the call; it is not created by the SYS_exit macro and cannot be called from user space. The exit() function is part of the standard C library (declared in stdlib.h) and is ported to other operating systems that lack the Linux kernel ABI entirely (there may not be any __NR_xxx constants) and potentially don't have sys_* functions either (you could write exit() to issue the interrupt or use the vDSO in assembly).
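To see the user-space side of this, here is a short sketch that invokes system calls by number through syscall(2). The helper names raw_getpid and raw_exit are hypothetical; SYS_getpid, SYS_exit, and syscall() come from <sys/syscall.h> and <unistd.h> on Linux with glibc.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_xxx constants (aliases for __NR_xxx) */
#include <sys/types.h>
#include <unistd.h>       /* syscall(), getpid() */

/* Raw system call by number, bypassing the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Ends the calling thread via the raw exit system call. Unlike the C
 * library's exit(), this runs no atexit() handlers and does not flush
 * stdio buffers. */
void raw_exit(int status)
{
    syscall(SYS_exit, status);
}
```

The number passed to syscall() selects the kernel's sys_xxx routine; the sys_xxx symbol itself never appears in user code.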

ARM and Linux spin_lock_irqsave concern

This is my first query on Stack Exchange, so please bear with me. Almost all the questions that come to my mind have already been resolved on the forum, but I was not able to find this one.
I have made a simple device driver in Linux where, in the my_init() function, I have written the following code:
spinlock_t mylock = SPIN_LOCK_UNLOCKED;

static int __init my_init(void)
{
    unsigned long flags;

    printk("Testing spinlock\n");
    spin_lock_irqsave(&mylock, flags);
    printk("Grabbing spinlock and return\n");
    return 0;   /* deliberately returns without spin_unlock_irqrestore() */
}
Thus I simply return without releasing the spinlock. According to the theory and the Linux source code, interrupts are then disabled on ARM. I checked the ARM CPSR register with the debugger and the 'I' bit is indeed set, so IRQs are masked. However, to my surprise the Linux prompt and even the schedule() function keep working as usual.
So my query is: in Linux, do we use IRQ mode only for some of the peripherals? If this is the case, how can we guarantee perfect synchronization between thread context and interrupt context?
A bit detail about my Target : TI81xx Soc, Linux 3.2, Lauterbach Debugger.
Thanks

Resources