Linux kernel loading process - linux

I'm reading about the Linux kernel loading process (just to understand the whole sequence) and I have several doubts, especially about the transfer of control between:
The boot-loader and the kernel
The kernel and the init process
For example, on Wikipedia I found the following:
The kernel as loaded is typically an image file, compressed into either zImage or bzImage formats with zlib. A routine at the head of it does a minimal amount of hardware setup, decompresses the image fully into high memory, and takes note of any RAM disk if configured.[3] It then executes kernel startup via ./arch/i386/boot/head and the startup_32 ()
Here I have several questions:
What does this routine stand for?
In which part of memory is it loaded?
Does it already include the code to decompress the zImage, or is that code loaded separately at another memory location?
I continued reading on the same page and found the following:
... start_kernel executes a wide range of initialization functions. It sets up interrupt handling (IRQs), further configures memory, starts the Init process (the first user-space process), ...
I know that the init is the first user-space process created. The answer to the following question:
How the init process is started in linux kernel?
states that the kernel uses a do_execve() call. However, the semantics of the normal execv system call are to overwrite the bss, data, text and stack segments of the calling process (the kernel in this case?) with those of the new program, and it doesn't return.
Why does it return in this case? (Otherwise, if it doesn't return, the kernel won't continue its startup process.)
Thanks in advance,
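For reference, the user-space execv semantics described above can be seen in a minimal sketch. It only illustrates the ordinary system call, not what the kernel does internally when it starts init. The function name run_and_replace and the exit code 7 are made up for the illustration.

```python
import os
import sys

def run_and_replace() -> int:
    """Fork, exec a tiny program in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: a successful execv replaces this process image, so nothing
        # after the call runs. os.execv raises OSError only if the exec fails.
        try:
            os.execv(sys.executable, [sys.executable, "-c", "raise SystemExit(7)"])
        finally:
            os._exit(127)  # reached only when the exec itself failed
    # Parent: unaffected by the exec in the child, it just waits.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

if __name__ == "__main__":
    print(run_and_replace())
```

The exit code seen by the parent comes from the new program, which shows that the child's original code was replaced instead of resumed.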

Related

Calling kexec_load from kernel code to reload the same kernel itself once

kexec_load is used to load another kernel from the current running kernel without reboot.
So my question is: how can I use it to load the current kernel again from the currently running kernel (i.e. have it reload itself) by calling kexec_load in kernel code? I know kexec_load can be changed into a function that can be called from inside the kernel, not just from userspace. Also, I don't want to load the kernel file again, if possible.
I will have conditional code to make it reload just once, so you don't need to worry about it reloading forever.

Linux memory/disk behavior during ELF execution

I have a series of executables running on a cluster that read in some small config files at execution start, and then do a lot of processing for several hours, and then write out some data and exit. Our sysadmin is trying to tell me that our fileserver is really slow because his analysis is showing that the cluster nodes are spending all their time using NFS disk I/O reading in ELF executables for execution, long after they've been spawned (note: our executables are only a few MB in size). This doesn't sound right to me, as I was under the impression that the dynamic linker loaded the entire executable into memory at runtime and then operated out of memory. I know that the kernel leaves an open file descriptor to the executable while it's running, but I didn't think it was continually reading from it.
My question is, is my understanding of how executables are loaded flawed? I find it hard to believe that the kernel is constantly doing file reads on the executable to fetch instructions, as this would be terribly slow (even with caching) because branch predictions are hardly reliable, so you'd be spending forever reading the executable from disk if your binary performed frequent jumps.
I was under the impression that the dynamic linker loaded the entire executable into memory at runtime and then operated out of memory.
Your impression is incorrect.
First, a minor inaccuracy: while the dynamic linker is responsible for loading shared libraries, the main executable itself is loaded by the kernel before the dynamic loader is started.
Second, most current systems use demand paging. The files are mmaped, but a page of code isn't actually loaded into memory until it is accessed (i.e. until the program tries to execute it). If you never execute some parts of the program, those parts are never loaded into memory at all.
I find it hard to believe that the kernel is constantly doing file reads on the executable to fetch instructions
It doesn't constantly do that. It typically loads the code into memory and the code stays there.
It is possible for the code to be discarded from memory (which requires reloading it if it executes again) on a system that doesn't have enough memory. When this happens continually, it is called thrashing.
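A rough sketch of the same idea from user space, assuming a throwaway temporary file stands in for the executable: mapping the file only sets up address space, and data is brought in page by page as it is touched. The helper names are made up, and the kernel's readahead and page cache mean that in practice more than one page may already be resident.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def make_demo_file(pages: int = 64) -> str:
    """Create a temp file in which every byte of page i holds the value i % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(pages):
            f.write(bytes([i % 256]) * PAGE)
    return path

def read_one_byte_via_mmap(path: str, offset: int) -> int:
    """Map the whole file, then touch a single byte of it."""
    with open(path, "rb") as f:
        # Creating the mapping does not copy the file contents into the process.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only this access needs the page that contains `offset`.
            return m[offset]

if __name__ == "__main__":
    demo = make_demo_file()
    try:
        print(read_one_byte_via_mmap(demo, 10 * PAGE))
    finally:
        os.unlink(demo)
```

The kernel maps executables in a comparable way, which is why a program that never runs some of its code never pays for loading it.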
because branch predictions are hardly reliable,
Branch prediction has approximately nothing to do with your problem, and is exceedingly good on modern CPUs.

What is a memory image in *nix systems?

In the book Advanced Programming in the Unix Environment 3rd Edition, Chapter 10 -- Signals, Page 315, when talking about the actions taken by the processes that receive a signal, the author says
When the default action is labeled "terminate+core", it means that a memory image of the process is left in the file named core of the current working directory of the process.
What is a memory image? When is this created, what's the content of it, and what is it used for?
A memory image is simply a copy of the process's virtual memory, saved in a file. It's used when debugging the program, as you can examine the values of the program's variables and determine which functions were being called at the time of the failure.
As the documentation you quoted says, this file is created when the process is terminated due to a signal that has the "terminate+core" default action.
A memory image is often called a core image. See core(5) and the core dump wikipage.
Grossly speaking, a core image describes the process virtual address space (and content) at time of crash (including call stacks of each active thread and writable data segments for global data and heaps, but often excluding text or code segments which are read-only and given in the executable ELF file or in shared libraries). It also contains the register state (for each thread).
The name core is understandable only by old guys like me (having seen computers built in the 1960s and 1970s, like the IBM/360, PDP-10 and early PDP-11, both used for developing the primordial Unix), since long ago (1950-1970) random access memory was made of magnetic cores.
If you have compiled all your source code with debug information (e.g. using gcc -g -Wall) you can do some post-mortem debugging (after yourprogram crashed and dumped a core file!) using gdb as
gdb yourprogram core
and the first gdb command you'll try is probably bt to get the backtrace.
Don't forget to enable core dumps, with the setrlimit(2) syscall, generally done in your shell with e.g. ulimit -c unlimited
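If the limit has to be raised from inside a program instead of from the shell, a minimal sketch using Python's resource module (a thin wrapper over getrlimit/setrlimit) could look like this. The function name enable_core_dumps is made up, and the sketch only raises the soft limit as far as the existing hard limit allows.

```python
import resource
from typing import Tuple

def enable_core_dumps() -> Tuple[int, int]:
    """Raise the soft core-file size limit of this process to its hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may set its soft limit anywhere up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core limit (soft, hard):", soft, hard)
```

The limit is inherited by child processes, so a launcher can call this before starting the program that is expected to crash.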
Several signals can dump core, see signal(7). A common cause is a segmentation violation, like when you dereference a NULL or bad pointer, which gives a SIGSEGV signal which (often) dumps a core file in the current directory.
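As a small sketch of how a parent process can tell that a child died from such a signal, the following forks a child that sends SIGSEGV to itself. The function name is made up. The child sets its soft core limit to 0 so the demo normally leaves no core file behind, although a piped core_pattern handler on the system may still be invoked.

```python
import os
import resource
import signal

def crash_child_with_sigsegv():
    """Fork a child that dies from SIGSEGV; return (terminating signal, core flag)."""
    pid = os.fork()
    if pid == 0:
        # Child: ask for no core file, restore the default SIGSEGV action,
        # then deliver the signal to ourselves.
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
        signal.signal(signal.SIGSEGV, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGSEGV)
        os._exit(1)  # only reached if the signal did not terminate the process
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return os.WTERMSIG(status), os.WCOREDUMP(status)
    return None, False

if __name__ == "__main__":
    sig, dumped = crash_child_with_sigsegv()
    print("terminated by signal", sig, "core dumped:", dumped)
```

With the core limit raised instead of zeroed, the second value is how the parent learns that a core file was written.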
See also gcore(1).

Why do we need a bootloader in an embedded device?

I'm working with ELinux kernel on ARM cortex-A8.
I know how the bootloader works and what job it's doing. But I've got a question: why do we need a bootloader, and why was the bootloader born?
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we load it, what will happen? In fact, the processor will not support it, but why are we following the procedure?
In the context of Linux, the boot loader is responsible for some predefined tasks. As this question is arm tagged, I think that ARM booting might be a useful resource. Specifically, the boot loader was/is responsible for setting up an ATAG list that describes the amount of RAM, a kernel command line, and other parameters. One of the most important parameters is the machine type. With device trees, an entire description of the board is passed. This makes a stock ARM Linux impossible to boot without some code to set up the parameters as described.
The parameters allow one generic Linux to support multiple devices. For instance, an ARM Debian kernel can support hundreds of different board types. Uboot or another boot loader can dynamically determine this information, or it can be hard coded for the board.
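To make the ATAG list less abstract, here is a rough sketch that builds one as raw bytes. The tag identifiers and the size-in-words header follow the ARM Linux ATAG layout; the little-endian byte order, the RAM base and size, the command line, and the helper names are assumptions for illustration. The machine type is not part of this list: the boot loader passes it in a register alongside the pointer to the list.

```python
import struct

ATAG_NONE = 0x00000000
ATAG_CORE = 0x54410001
ATAG_MEM = 0x54410002
ATAG_CMDLINE = 0x54410009

def tag(tag_id: int, payload: bytes = b"") -> bytes:
    """One ATAG: header (size in 32-bit words, tag id) plus a word-aligned payload."""
    if len(payload) % 4:
        raise ValueError("payload must be a multiple of 4 bytes")
    size_words = 2 + len(payload) // 4
    return struct.pack("<II", size_words, tag_id) + payload

def build_atags(ram_start: int, ram_size: int, cmdline: str) -> bytes:
    """Build a CORE, MEM, CMDLINE, NONE list like the one a boot loader hands over."""
    text = cmdline.encode("ascii") + b"\0"
    text += b"\0" * (-len(text) % 4)  # pad the string to a word boundary
    return b"".join([
        tag(ATAG_CORE, struct.pack("<III", 0, 4096, 0)),  # flags, page size, root dev
        tag(ATAG_MEM, struct.pack("<II", ram_size, ram_start)),
        tag(ATAG_CMDLINE, text),
        struct.pack("<II", 0, ATAG_NONE),  # terminator: size 0, tag 0
    ])

if __name__ == "__main__":
    # Hypothetical board: 256 MiB of RAM at 0x80000000.
    blob = build_atags(0x80000000, 256 << 20, "console=ttyS0,115200")
    print(len(blob), "bytes")
```

A real boot loader writes this structure into RAM at a known address before jumping to the kernel; the sketch only shows the layout.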
You might also like to look at bootloader info page here at stack overflow.
A basic system might be able to set up ATAGS and copy NOR flash to SRAM. However, it is usually a little more complex than this. Linux needs RAM set up, so you may have to initialize an SDRAM controller. If you use NAND flash, you have to handle bad blocks, and the copy may be a little more complex than memcpy().
Linux often has some latent driver bugs where a driver will assume that a clock is initialized. For instance, if Uboot always initializes an Ethernet clock for a particular machine, the Linux Ethernet driver may have neglected to set up this clock. This can be especially true with clock trees.
Some systems require boot image formats that are not supported by Linux, for example a special header that can initialize hardware immediately, such as configuring the devices from which the initial code is read. Additionally, there is often hardware that should be configured immediately; a boot loader can do this quickly, whereas the normal structure of Linux may delay this significantly, resulting in I/O conflicts, etc.
From a pragmatic perspective, it is simpler to use a boot loader. However, there is nothing to prevent you from altering Linux's source to boot directly from it, although it may be like pasting the boot loader code directly onto the start of Linux.
See Also: Coreboot, Uboot, and Wikipedia's comparison. Barebox is a lesser known, but well structured and modern boot loader for the ARM. RedBoot is also used in some ARM systems; RedBoot partitions are supported in the kernel tree.
A boot loader is a computer program that loads the main operating system or runtime environment for the computer after completion of the self-tests.
^ From Wikipedia Article
So basically the bootloader is doing just what you wanted: copying data from flash into operating memory. It's really that simple.
If you want to know more about bootstrapping the OS, I highly recommend you read the linked article. The boot phase consists, apart from tests, also of checking peripherals and some other things. Skipping them makes sense only on very simple embedded devices, and that's why their bootloaders are even simpler:
Some embedded systems do not require a noticeable boot sequence to begin functioning and when turned on may simply run operational programs that are stored in ROM.
The same source
The primary bootloader is usually built into the silicon and loads the first USER code that will be run in the system.
The bootloader exists because there is no standardized protocol for loading the first code, since it is chip dependent. Sometimes the code can be loaded through a serial port, from flash memory, or even from a hard drive. It is the bootloader's job to locate it.
Once the user code is loaded and running, the bootloader is no longer used, and the correctness of the system's execution is the user's responsibility.
In the embedded Linux chain, the primary bootloader will set up and run Uboot. Then Uboot will find the Linux kernel and load it.
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we load it, what will happen? In fact, the processor will not support it, but why are we following the procedure?
Bartek, Artless, and Felipe all give parts of the picture.
Every embedded processor type (e.g. 386EX, Cortex-A53, EM5200) will do something automatically when it is reset or powered on. Sometimes that something is different depending on whether the power is cycled or the device is reset. Some embedded processors allow you to change that something based on voltages applied to different pins when the device is powered or reset.
Regardless, there is a limited amount of something that a processor can do, because of the physical space on-processor required to define that something, whether it is on-chip FLASH, instruction micro-code, or some other mechanism.
This limit means that the something is
fixed purpose, does one thing as quickly as possible.
limited in scope and capability, typically loading a small block of code (often a few kilobytes or less) into a fixed memory location and executing from the start of the loaded code.
unmodifiable.
So what a processor does in response to reset or power-cycle cannot be changed, and cannot do very much, and we don't want it to automatically copy hundreds of megabytes or gigabytes into memory which may not exist or may not be initialized, and which could take a looooong time.
So....
We set up a small program which is smaller than the smallest size permitted across all of the devices we are going to use. That program is stored wherever the something needs it to be.
Sometimes the small program is U-Boot. Sometimes even U-Boot is too big for initial load, so the small program then in turn loads U-Boot.
The point is that whatever gets loaded by the something, is modifiable as needed for a particular system. If it is U-Boot, great, if not, it knows where to load the main operating system or where to load U-Boot (or some other bootloader).
U-Boot (speaking of bootloaders in general) then configures a minimal set of devices, memory, chip settings, etc., to enable the main OS to be loaded and started. The main OS init takes care of any additional configuration or initialization.
So the sequence is:
Processor power-on or reset
Something loads initial boot code (or U-Boot style embedded bootloader)
Initial boot code (may not be needed)
U-Boot (or other general embedded bootloader)
Linux init
The kernel requires the hardware on which you are working to be in a particular state. All the hardware you used needs to be checked for its state and initialized for its further operation. This is one of the main reasons to use a boot loader in an embedded (or any other environment), apart from its use to load a kernel image into the RAM.
When you turn on a system, the RAM is also not in a useful state (fully initialized for use) for us to load the kernel into it. Therefore, we cannot load a kernel directly (to answer your question), and thus arises the need for a construct to initialize it.
Apart from what is stated in all the other answers, which is correct, in some cases the system has to go through different execution modes; take as an example TrustZone for secure ARM chips. It is still possible to consider it a sort of HW initialization, but what makes it peculiar is the fact that there are additional limitations (e.g. memory available) that make it impractical, if not impossible, to do everything in a single binary, thus multiple bootloader stages are used.
Furthermore, for security reasons, each of them is signed and can perform its job only if it meets the security requirements.

What happens when you run a program?

I would like to collect here what happens when you run an executable on Windows, Linux and OSX. In particular, I would like to understand exactly the order of the operations: my guess is that the executable file format (PE, ELF or Mach-O) is loaded by the kernel (but I don't know the various sections of the ELF (Executable and Linkable Format) and their meaning), then the dynamic linker resolves the references, then the __init part of the executable is run, then main, then __fini, and then the program is completed. I am sure this is very rough, and probably wrong.
Edit: the question is now CW. I am filling up for linux. If anyone wants to do the same for Win and OSX it would be great.
This is just at a very high and abstract level of course!
Executable - No Shared Library:
Client request to run application
->Shell informs kernel to run binary
->Kernel allocates memory from the pool to fit the binary image into
->Kernel loads binary into memory
->Kernel jumps to specific memory address
->Kernel starts processing the machine code located at this location
->If machine code has stop
->Kernel releases memory back to pool
Executable - Shared Library
Client request to run application
->Shell informs kernel to run binary
->Kernel allocates memory from the pool to fit the binary image into
->Kernel loads binary into memory
->Kernel jumps to specific memory address
->Kernel starts processing the machine code located at this location
->Kernel pushes current location into an execution stack
->Kernel jumps out of current memory to a shared memory location
->Kernel executes code from this shared memory location
->Kernel pops back the last memory location and jumps to that address
->If machine code has stop
->Kernel releases memory back to pool
JavaScript/.NET/Perl/Python/PHP/Ruby (Interpreted Languages)
Client request to run application
->Shell informs kernel to run binary
->Kernel has a hook that recognises that the binary image needs a JIT
->Kernel calls JIT
->JIT loads the code and jumps to a specific address
->JIT reads the code and compiles the instruction into machine code for the machine that the interpreter is running on
->Interpreter passes machine code to the kernel
->kernel executes the required instruction
->JIT then increments the program counter
->If code has a stop
->Jit releases application from its memory pool
As routeNpingme says, registers are set inside the CPU and the magic happens!
Update: Yeah, I cant speell properly today!
Ok, Answering my own question. This will be done progressively, and only for Linux (and maybe Mach-O). Feel free to add more stuff to your personal answers, so that they get upvoted (and you can get badges, since it's now CW).
I'll start halfway, and build the rest as I find out. This document has been made on x86_64 with gcc (GCC) 4.1.2.
Opening the file, initialization
In this section, we describe what happens when the program is invoked, from the kernel point of view, until the program is ready to be executed.
The ELF is opened.
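the kernel reads the program headers and maps the loadable segment that contains the .text section into memory, marking it read-only and executable
the kernel maps the writable loadable segment that contains the .data section
the kernel sets up the .bss area, which takes no space in the file, and initializes all of its content to zero.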
the kernel transfers the control to the dynamic linker (whose name is inside the ELF file, in the .interp section). The dynamic linker resolves all the shared library calls.
the control is transferred to the application
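The information the kernel relies on in the steps above lives in the ELF file header and the program headers. As a rough sketch, the following parser pulls out the entry point, the loadable segments and the dynamic linker path from a little-endian ELF64 image. The function name describe_elf and the returned dictionary layout are made up for the illustration, and only the fields needed here are decoded.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, little-endian
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header
PT_LOAD, PT_INTERP = 1, 3

def describe_elf(data: bytes) -> dict:
    """Return entry point, PT_LOAD segments and interpreter path of an ELF64 image."""
    (ident, _type, _machine, _version, entry, phoff, _shoff, _flags,
     _ehsize, phentsize, phnum, _shentsize, _shnum, _shstrndx) = EHDR.unpack_from(data, 0)
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    info = {"entry": entry, "load": [], "interp": None}
    for i in range(phnum):
        (p_type, p_flags, offset, vaddr, _paddr,
         filesz, memsz, _align) = PHDR.unpack_from(data, phoff + i * phentsize)
        if p_type == PT_LOAD:
            # memsz larger than filesz means a zero-filled tail, where .bss lives.
            info["load"].append((vaddr, filesz, memsz, p_flags))
        elif p_type == PT_INTERP:
            # The path of the dynamic linker, i.e. the content of .interp.
            info["interp"] = data[offset:offset + filesz].rstrip(b"\0").decode()
    return info
```

On a Linux x86_64 machine, calling describe_elf on the bytes of a dynamically linked binary should report the dynamic linker path as its "interp" value and the address of _start as its "entry" value.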
Execution of the program
the function _start gets invoked, as the ELF header specifies it as the entry point for the executable
_start calls __libc_start_main in glibc (through the PLT) passing the following information to it
the address of the actual main function
argc (the argument count, passed by value)
the address of argv
the address of the _init routine
the address of the _fini routine
a function pointer for the atexit() registration
the highest stack address available
_init gets called
calls call_gmon_start to initialize gmon profiling. not really related to execution.
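calls frame_dummy, which wraps __register_frame_info(eh_frame section address, bss section address). This registers the .eh_frame unwind tables used for exception handling and stack unwinding; the second argument points to bookkeeping storage in .bss, and no global variables are initialized here.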
calls __do_global_ctors_aux, whose role is to call all the global constructors listed in the .ctors section.
main gets called
main ends
_fini gets called, which in turn calls __do_global_dtors_aux to run all destructors as specified in the .dtors section.
the program exits.
On Windows, first the image is loaded into memory. The kernel analyzes which libraries (read "DLL") it is going to require and loads them up too.
It then edits the program image to insert the memory addresses of each of the library functions it requires. These addresses already have a slot in the .EXE binary, but the slots are just filled with zeros.
Each DLL's DllMain() procedure then gets executed, one by one, from the most required DLL to the last, following the order of dependencies.
Once all libraries are loaded and ready, the image is finally started, and whatever happens next depends on the language used, the compiler used, and the program routine itself.
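The "dependencies first" ordering described above is essentially a topological sort of the import graph. A minimal sketch, with a made-up import table and a made-up function name, and without any of the special cases a real loader handles:

```python
from graphlib import TopologicalSorter

def init_order(imports: dict) -> list:
    """Order modules so that each one comes after everything it imports."""
    # TopologicalSorter takes {node: things it depends on} and yields the
    # dependencies before the nodes that need them.
    return list(TopologicalSorter(imports).static_order())

if __name__ == "__main__":
    # Hypothetical import table: the executable and the DLLs each one needs.
    table = {
        "app.exe": ["user32.dll", "kernel32.dll"],
        "user32.dll": ["kernel32.dll"],
        "kernel32.dll": [],
    }
    print(init_order(table))
```

For this table the most required library comes out first and the executable last, which matches the order in which the initialization routines are described as running.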
As soon as the image is loaded into memory, magic takes over.
Well, depending on your exact definition, you have to account for JIT compilers for languages like .NET and Java. When you run a .NET "exe", which isn't technically "executable", the JIT compiler steps in and compiles it.

Resources