What happens when you run a program? - Linux

I would like to collect here what happens when you run an executable on Windows, Linux and OSX. In particular, I would like to understand the exact order of the operations. My guess is that the executable file format (PE, ELF or Mach-O) is loaded by the kernel (but I don't know the various sections of the ELF (Executable and Linkable Format) file and their meaning), then the dynamic linker resolves the references, then the __init part of the executable is run, then main, then __fini, and then the program is complete; but I am sure this is very rough, and probably wrong.
Edit: the question is now CW. I am filling it in for Linux. If anyone wants to do the same for Windows and OSX, it would be great.

This is just at a very high and abstract level of course!
Executable - No Shared Library:
Client request to run application
->Shell informs kernel to run binary
->Kernel allocates memory from the pool to fit the binary image into
->Kernel loads binary into memory
->Kernel jumps to specific memory address
->Kernel starts processing the machine code located at this location
->When the machine code stops
->Kernel releases memory back to pool
Executable - Shared Library
Client request to run application
->Shell informs kernel to run binary
->Kernel allocates memory from the pool to fit the binary image into
->Kernel loads binary into memory
->Kernel jumps to specific memory address
->Kernel starts processing the machine code located at this location
->Kernel pushes current location into an execution stack
->Kernel jumps out of current memory to a shared memory location
->Kernel executes code from this shared memory location
->Kernel pops back the last memory location and jumps to that address
->When the machine code stops
->Kernel releases memory back to pool
JavaScript/.NET/Perl/Python/PHP/Ruby (Interpreted Languages)
Client request to run application
->Shell informs kernel to run binary
->Kernel has a hook that recognises that the binary image needs a JIT
->Kernel calls JIT
->JIT loads the code and jumps to a specific address
->JIT reads the code and compiles the instructions into machine code for the processor the interpreter is running on
->Interpreter passes the machine code to the kernel
->Kernel executes the required instruction
->JIT then increments the program counter
->If code has a stop
->JIT releases the application from its memory pool
As routeNpingme says, registers are set inside the CPU and the magic happens!

OK, answering my own question. This will be done progressively, and only for Linux (and maybe Mach-O). Feel free to add more stuff to your personal answers, so that they get upvoted (and you can earn badges, since it's now CW).
I'll start halfway through and build up the rest as I find out more. This document was made on an x86_64 machine with gcc (GCC) 4.1.2.
Opening the file, initialization
In this section, we describe what happens when the program is invoked, from the kernel point of view, until the program is ready to be executed.
The ELF file is opened.
the kernel looks for the .text section and loads it into memory, marking it as read-only
the kernel loads the .data section
the kernel loads the .bss section, and initializes all of its content to zero
the kernel transfers control to the dynamic linker (whose name is stored inside the ELF file, in the .interp section). The dynamic linker resolves all the shared library references.
control is transferred to the application
Execution of the program
the function _start gets invoked, as the ELF header specifies it as the entry point for the executable
_start calls __libc_start_main in glibc (through the PLT) passing the following information to it
the address of the actual main function
the value of argc
the address of argv
the address of the _init routine
the address of the _fini routine
a function pointer for the atexit() registration
the highest stack address available
_init gets called
calls call_gmon_start to initialize gmon profiling. Not really related to execution.
calls frame_dummy, which wraps __register_frame_info(eh_frame section address, bss section address) (FIXME: what does this function do? initializes global vars from the BSS section apparently)
calls __do_global_ctors_aux, whose role is to call all the global constructors listed in the .ctors section.
main gets called
main ends
_fini gets called, which in turn calls __do_global_dtors_aux to run all the destructors listed in the .dtors section.
the program exits.

On Windows, first the image is loaded into memory. The kernel analyzes which libraries (read "DLLs") it is going to require and loads them too.
It then patches the program image to insert the memory addresses of each of the library functions it requires. These addresses already have a slot in the .EXE binary, but they are just filled with zeros.
Each DLL's DllMain() procedure then gets executed, one by one, from the most depended-upon DLL to the last, following the order of dependencies.
Once all libraries are loaded and ready, the image is finally started, and whatever happens next depends on the language used, the compiler used, and the program itself.

As soon as the image is loaded into memory, magic takes over.

Well, depending on your exact definition, you have to account for JIT compilers for languages like .NET and Java. When you run a .NET "exe", which isn't technically an "executable", the JIT compiler steps in and compiles it.

Related

Linux kernel loading process

I'm reading about the Linux kernel loading process (just to understand the whole sequence) and I have several doubts, especially about the control transitions between:
The boot-loader and the kernel
The kernel and the init process
For example, on Wikipedia I found the following:
The kernel as loaded is typically an image file, compressed into either zImage or bzImage formats with zlib. A routine at the head of it does a minimal amount of hardware setup, decompresses the image fully into high memory, and takes note of any RAM disk if configured.[3] It then executes kernel startup via ./arch/i386/boot/head and the startup_32 ()
Here I have several questions:
What does this routine stand for?
In which part of memory is it loaded?
Does it already include the code to decompress the zImage, or is this code loaded separately at another memory location?
I continued reading on the same page and found the following:
... start_kernel executes a wide range of initialization functions. It sets up interrupt handling (IRQs), further configures memory, starts the Init process (the first user-space process), ...
I know that init is the first user-space process created. The answer to the following question:
How the init process is started in linux kernel?
states that the kernel uses a do_execve() call. However, the semantics of the normal execve system call are to overwrite the calling process's (the kernel's, in this case?) bss, data, text and stack segments with those of the new program, and it doesn't return.
Why does it return in this case? (Otherwise, if it didn't return, the kernel couldn't continue its startup process.)

What is a memory image in *nix systems?

In the book Advanced Programming in the Unix Environment 3rd Edition, Chapter 10 -- Signals, Page 315, when talking about the actions taken by the processes that receive a signal, the author says
When the default action is labeled "terminate+core", it means that a memory image of the process is left in the file named core of the current working directory of the process.
What is a memory image? When is this created, what's the content of it, and what is it used for?
A memory image is simply a copy of the process's virtual memory, saved in a file. It's used when debugging the program, as you can examine the values of the program's variables and determine which functions were being called at the time of the failure.
As the documentation you quoted says, this file is created when the process is terminated due to a signal that has the "terminate+core" default action.
A memory image is often called a core image. See core(5) and the core dump wikipage.
Roughly speaking, a core image describes the process's virtual address space (and its content) at the time of the crash (including the call stacks of each active thread and the writable data segments for global data and heaps, but often excluding the text or code segments, which are read-only and available in the executable ELF file or in shared libraries). It also contains the register state (for each thread).
The name core is understandable only to old guys like me (having seen computers built in the 1960s and 1970s, like the IBM/360, the PDP-10 and the early PDP-11, the latter two used for developing primordial Unix), since long ago (1950-1970) random access memory was made of magnetic core memory.
If you have compiled all your source code with debug information (e.g. using gcc -g -Wall) you can do some post-mortem debugging (after yourprogram crashed and dumped a core file!) using gdb as
gdb yourprogram core
and the first gdb command you'll try is probably bt to get the backtrace.
Don't forget to enable core dumps, with the setrlimit(2) syscall, generally done in your shell with e.g. ulimit -c unlimited
Several signals can dump core, see signal(7). A common cause is a segmentation violation, like when you dereference a NULL or bad pointer, which gives a SIGSEGV signal which (often) dumps a core file in the current directory.
See also gcore(1).

Is it possible to force a range of virtual addresses?

I have an Ada program that was written for a specific (embedded, multi-processor, 32-bit) architecture. I'm attempting to use this same code in a simulation on 64-bit RHEL as a shared object (since there are multiple versions and I have a requirement to choose a version at runtime).
The problem I'm having is that there are several places in the code where the people who wrote it (not me...) have used Unchecked_Conversions to convert System.Addresses to 32-bit integers. Not only that, but there are multiple routines with hard-coded memory addresses. I can make minor changes to this code, but completely porting it to x86_64 isn't really an option. There are routines that handle interrupts, CPU task scheduling, etc.
This code has run fine in the past when it was statically-linked into a previous version of the simulation (consisting of Fortran/C/C++). Now, however, the main executable starts, then loads a shared object based on some inputs. This shared object then checks some other inputs and loads the appropriate Ada shared object.
Looking through the code, it's apparent that it should work fine if I can keep the logical memory addresses between 0 and 2,147,483,647 (32-bit signed int). Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code, or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code
The good news is that the loader will leave the lower ranges untouched.
The bad news is that it will not load any shared object there. There is no interface you could use to influence placement of shared objects.
That said, dlopen from memory (which we implemented in our private fork of glibc) would allow you to do that. But that's not available publicly.
Your other possible options are:
if you can fit the entire process into a 32-bit address space, then your solution is trivial: just build everything with -m32.
use prelink to relocate the library to the desired address. Since that address should almost always be available, the loader is very likely to load the library exactly there.
link the loader with a custom mmap implementation, which detects the library of interest through some kind of side channel and performs the mmap syscall with MAP_32BIT set, or
run the program in a ptrace sandbox. Such a sandbox can again intercept the mmap syscall and OR in MAP_32BIT when desirable.
or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
I don't see how that's possible. If the library stores an address of a function or a global in a 32-bit memory location, then loads that address and dereferences it ... it's going to get a 32-bit truncated address and a SIGSEGV on dereference.

Is a linux application loaded in the loader's address space?

When ld-linux (Linux's loader) loads an application, it loads its ELF data structures to memory, builds some structures (e.g., GOT), and passes the execution to the entry point of the loaded application.
Is the loading of this application's code and data done into the loader's address space? Does the execution of the application's code occur in the loader's address space?
If not, what is the mechanism ld-linux uses to pass the execution to the loaded instructions?
Answer (EDIT): The application's code is loaded into the loader's address space. The application code and the loader run in the same address space.
http://grahamwideman.wordpress.com/2009/02/09/the-linux-loader-and-how-it-finds-libraries/
http://www.tenouk.com/ModuleW.html
Basically, there are assemblers and linkers too. The hierarchy of ld-linux (Linux's loader) is very well explained in the second URL.

shared object file path that is executed by the current thread

Is there a way to get the file path/file name of the .so that is currently under execution by a thread? The program is written in c++ and run on a 64-bit Linux 3.0 machine.
You can sequentially read (from inside your process) the file /proc/self/maps to get the memory mappings (including those for shared objects) of the current process.
Then you could get your program counter (or the caller's) and find which segment it lies in. Perhaps backtrace or GCC's __builtin_return_address is relevant.
You might also use the dladdr function.
See proc(5), backtrace(3), dladdr(3) man pages, and also this answer.
addenda
From a signal handler, you could get the program counter at the time the signal was sent by using sigaction(2) with SA_SIGINFO. The sa_sigaction function pointer gets a ucontext_t from which you can retrieve the program counter register (using machine-dependent C code). Then you could handle it.
I suggest looking in detail at what GCC is doing.
I think the closest thing is to get a list of all shared libraries loaded by your process. You can do that either by using pmap or lsof.
