Xen binary rewriting method - linux

In full virtualization, what is the CPL of the guest OS?
In paravirtualization, the CPL of the guest OS is 1 (ring 1).
Is it the same in full virtualization?
I have also heard that some x86 privileged instructions are
not easily handled, so a "binary rewriting" method is required.
How does this "binary rewriting" happen?
I understand that in virtualization the CPU is not emulated,
so how can the hypervisor change the binary instruction codes before
the CPU executes them? Does it predict the next instruction in memory and
update the memory contents before the CPU gets there?
If so, I think the hypervisor code (performing the binary rewriting)
would need to intercept the CPU every time before a guest OS instruction is
executed, which seems absurd.
A specific explanation would be appreciated.
Thank you in advance!

If by full virtualization you mean hardware-supported virtualization, then the CPL of the guest is identical to what it would be on bare metal: the guest kernel runs at CPL 0.
Xen never rewrites the binary.
This is something that VMware does (as far as I understand; I have never seen the VMware source code). The method consists of patching, at run time, code that needs to run differently. Typically this involves replacing an existing op-code with something else: either an instruction that causes a trap to the hypervisor, or a replacement set of code that "does the right thing". As I understand it, the hypervisor "learns" the code by single-stepping through a block, and either applies binary patches or marks the section as "clear" (doesn't need changing). The next time this code gets executed, it has already been patched or is clear, so it can run at "full speed".
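The "learn a block once, then patch it or mark it clear" idea can be illustrated with a toy model. This is only a sketch under loud assumptions: the "instructions" are plain strings rather than x86 op-codes, and the names SENSITIVE, TRAP, scan_block and run_block are hypothetical. It is not based on VMware's actual implementation.

```python
# Toy model of "scan once, patch or mark clear, then run at full speed".
# Instructions are plain strings, not real x86 op-codes (assumption).

SENSITIVE = {"popf", "sgdt", "smsw"}   # illustrative set of problem instructions
TRAP = "trap_to_hypervisor"            # hypothetical replacement op-code


def scan_block(memory, start, length, state):
    """Inspect a block once; patch sensitive instructions or mark it clear."""
    patched = False
    for addr in range(start, start + length):
        if memory[addr] in SENSITIVE:
            state["saved"][addr] = memory[addr]  # keep the original for emulation
            memory[addr] = TRAP
            patched = True
    state["blocks"][start] = "patched" if patched else "clear"


def run_block(memory, start, length, state):
    """Run a block, scanning it only the first time it is seen."""
    if start not in state["blocks"]:
        state["scans"] += 1
        scan_block(memory, start, length, state)
    log = []
    for addr in range(start, start + length):
        if memory[addr] == TRAP:
            log.append("emulated " + state["saved"][addr])
        else:
            log.append("native " + memory[addr])
    return log


memory = ["mov", "popf", "add", "ret"]
state = {"blocks": {}, "saved": {}, "scans": 0}
first = run_block(memory, 0, 4, state)
second = run_block(memory, 0, 4, state)   # no rescan: the block is already known
print(first, state["scans"])
```

The point of the sketch is that the cost of inspection is paid once per block, which is why the hypervisor does not have to intercept every single guest instruction.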
In Xen, with paravirtualization (ring compression), the code in the OS has been modified to be aware of the virtualized environment, and as such is "trusted" to understand certain things. But the hypervisor will still trap, for example, writes to the page table (otherwise someone could write a malicious kernel module that modifies the page table to map in another guest's memory, or some such).
The HVM method does intercept CERTAIN instructions - but the rest of the code runs at normal full speed, thanks to the hardware support in modern processors, such as SVM in AMD and VMX in Intel processors. ARM has a similar technology in the latest models of their processors, but I'm not sure what the name of it is.
I'm not sure if I've answered quite all of your questions, if I've missed something, or it's not clear enough, feel free to ask...

Related

how do VM's virtualize HW

Suppose I have a machine running Mac OS X, which is running VMware, which is running Ubuntu, which is running the canonical helloworld.c in a shell. What is the high-level sequence of events that occurs between me pressing enter and Hello World! popping up on my screen?
I can understand that everything sitting above Ubuntu is oblivious to the virtualization occurring. Additionally, I can somewhat understand that, from the point of view of Mac OS X, VMware is just another program - nothing special there. However, I don't understand how Ubuntu thinks it's interacting with HW, especially if it's not actually running in kernel mode.
I'm just learning about OS's - so may not understand the full picture. Is there an additional sw/fw layer underneath the OS which the hypervisor emulates?
What 'Ubuntu' (or any other application) is, is a set of bytes that represent either instructions (opcodes along with their arguments) or data.
The instructions are decoded and executed by the CPU. The data is mostly read into memory (let's say a group of constants, static variables, etc.).
VMware is basically a virtual computer hardware platform (here it's a virtualization of the x86 platform). This means that it reads all the bytes of an application (a raw binary, a PE or ELF executable, whatever) and tries to act as an x86 CPU. If done properly, this is indistinguishable from real hardware to anything running on it.
This isn't abstraction: it doesn't hide the communication method with the hardware by abstracting it to some higher-level access method (like the Linux filesystem, for example). It just tries to act like an x86 CPU as best it can; an abstraction would be a clearly visible layer.
As an example - C is an abstraction over ASM/machine language, you can tell the difference between them quite clearly.

Why do we need a bootloader in an embedded device?

I'm working with an embedded Linux kernel on an ARM Cortex-A8.
I know how the bootloader works and what job it does. But I've got a question: why do we need a bootloader, and why was the bootloader born?
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we do, what will happen? In fact, the processor will not support it, but why are we following the procedure?
In the context of Linux, the boot loader is responsible for some predefined tasks. As this question is tagged arm, I think that ARM booting might be a useful resource. Specifically, the boot loader was/is responsible for setting up an ATAG list that describes the amount of RAM, a kernel command line, and other parameters. One of the most important parameters is the machine type. With device trees, an entire description of the board is passed. This makes a stock ARM Linux impossible to boot without some code to set up the parameters as described.
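To make the ATAG list concrete, here is a minimal sketch that builds one as a byte string. Each tag starts with a two-word header (size in 32-bit words, then the tag identifier), and the list ends with an ATAG_NONE tag. The helper names tag and build_atags are hypothetical, and the RAM base, RAM size and command line are illustrative values, not taken from any particular board. A little-endian layout is assumed.

```python
import struct

# Tag identifiers from the ARM Linux boot protocol.
ATAG_NONE = 0x00000000
ATAG_CORE = 0x54410001
ATAG_MEM = 0x54410002
ATAG_CMDLINE = 0x54410009


def tag(tag_id, payload):
    """Prefix a payload with the header: size in 32-bit words, then the tag id."""
    assert len(payload) % 4 == 0
    words = 2 + len(payload) // 4
    return struct.pack("<II", words, tag_id) + payload


def build_atags(ram_start, ram_size, cmdline):
    """Build a minimal list: CORE, one MEM bank, CMDLINE, then the terminator."""
    core = tag(ATAG_CORE, struct.pack("<III", 0, 4096, 0))  # flags, page size, root dev
    mem = tag(ATAG_MEM, struct.pack("<II", ram_size, ram_start))
    text = cmdline.encode("ascii") + b"\0"
    text += b"\0" * (-len(text) % 4)          # pad to a word boundary
    cmd = tag(ATAG_CMDLINE, text)
    none = struct.pack("<II", 0, ATAG_NONE)   # terminator has size 0 by convention
    return core + mem + cmd + none


# Illustrative values only (assumption): 64 MiB of RAM at 0x80000000.
blob = build_atags(0x80000000, 64 * 1024 * 1024, "console=ttyS0,115200")
print(len(blob), "bytes")
```

A real boot loader would place this list in RAM and hand its address to the kernel together with the machine type; that hand-off is not modelled here.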
The parameters allow one generic Linux to support multiple devices. For instance, an ARM Debian kernel can support hundreds of different board types. U-Boot or another boot loader can dynamically determine this information, or it can be hard-coded for the board.
You might also like to look at the bootloader info page here at Stack Overflow.
A basic system might be able to set up ATAGs and copy NOR flash to SRAM. However, it is usually a little more complex than this. Linux needs RAM to be set up, so you may have to initialize an SDRAM controller. If you use NAND flash, you have to handle bad blocks, and the copy may be a little more complex than memcpy().
Linux often has some latent driver bugs where a driver will assume that a clock is initialized. For instance, if U-Boot always initializes an Ethernet clock for a particular machine, the Linux Ethernet driver may have neglected to set up this clock. This can be especially true with clock trees.
Some systems require boot image formats that are not supported by Linux, for example a special header that can initialize hardware immediately, such as configuring the devices to read initial code from. Additionally, there is often hardware that should be configured immediately; a boot loader can do this quickly, whereas the normal structure of Linux may delay this significantly, resulting in I/O conflicts, etc.
From a pragmatic perspective, it is simpler to use a boot loader. However, there is nothing to prevent you from altering Linux's source to boot directly from it, although it may be like pasting the boot loader code directly to the start of Linux.
See Also: Coreboot, U-Boot, and Wikipedia's comparison. Barebox is a lesser-known, but well-structured and modern boot loader for ARM. RedBoot is also used in some ARM systems; RedBoot partitions are supported in the kernel tree.
A boot loader is a computer program that loads the main operating system or runtime environment for the computer after completion of the self-tests.
^ From Wikipedia Article
So basically the bootloader is doing just what you wanted: copying data from flash into operating memory. It's really that simple.
If you want to know more about bootstrapping the OS, I highly recommend you read the linked article. Apart from tests, the boot phase also consists of checking peripherals and some other things. Skipping them makes sense only on very simple embedded devices, and that's why their bootloaders are even simpler:
Some embedded systems do not require a noticeable boot sequence to begin functioning and when turned on may simply run operational programs that are stored in ROM.
The same source
The primary bootloader is usually built into the silicon and loads the first USER code that will be run in the system.
The bootloader exists because there is no standardized protocol for loading the first code, since it is chip dependent. Sometimes the code can be loaded through a serial port, a flash memory, or even a hard drive. It is the bootloader's function to locate it.
Once the user code is loaded and running, the bootloader is no longer used, and the correctness of the system's execution is the user's responsibility.
In the embedded Linux chain, the primary bootloader will set up and run U-Boot. Then U-Boot will find the Linux kernel and load it.
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we do, what will happen? In fact, the processor will not support it, but why are we following the procedure?
Bartek, Artless, and Felipe all give parts of the picture.
Every embedded processor type (e.g. 386EX, Cortex-A53, EM5200) will do something automatically when it is reset or powered on. Sometimes that something is different depending on whether the power is cycled or the device is reset. Some embedded processors allow you to change that something based on voltages applied to different pins when the device is powered or reset.
Regardless, there is a limited amount of something that a processor can do, because of the physical space on-processor required to define that something, whether it is on-chip FLASH, instruction micro-code, or some other mechanism.
This limit means that the something is:
1) fixed purpose: it does one thing as quickly as possible.
2) limited in scope and capability, typically loading a small block of code (often a few kilobytes or less) into a fixed memory location and executing from the start of the loaded code.
3) unmodifiable.
So what a processor does in response to reset or power-cycle cannot be changed, and cannot do very much, and we don't want it to automatically copy hundreds of megabytes or gigabytes into memory which may not exist or may not be initialized, and which could take a looooong time.
So....
We set up a small program which is smaller than the smallest size permitted across all of the devices we are going to use. That program is stored wherever the something needs it to be.
Sometimes the small program is U-Boot. Sometimes even U-Boot is too big for initial load, so the small program then in turn loads U-Boot.
The point is that whatever gets loaded by the something, is modifiable as needed for a particular system. If it is U-Boot, great, if not, it knows where to load the main operating system or where to load U-Boot (or some other bootloader).
U-Boot (speaking of bootloaders in general) then configures a minimal set of devices, memory, chip settings, etc., to enable the main OS to be loaded and started. The main OS init takes care of any additional configuration or initialization.
So the sequence is:
1) Processor power-on or reset
2) Something loads initial boot code (or U-Boot style embedded bootloader)
3) Initial boot code (may not be needed)
4) U-Boot (or other general embedded bootloader)
5) Linux init
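The sequence above can be modelled as a chain in which only the first stage has to fit the fixed, unmodifiable on-chip loader, and every later stage is loaded by the one before it. This is a toy sketch: boot_chain is a hypothetical name, and the stage sizes and the 32 KiB load limit are made-up numbers chosen for illustration.

```python
def boot_chain(stages, rom_load_limit):
    """Return the hand-off trace for a list of (name, size_in_bytes) stages."""
    first_name, first_size = stages[0]
    if first_size > rom_load_limit:
        # The fixed on-chip code can only fetch a small first stage.
        raise ValueError(first_name + " is too big for the on-chip loader")
    trace = ["reset", "on-chip code loads " + first_name]
    for (name, _), (next_name, _) in zip(stages, stages[1:]):
        trace.append(name + " loads " + next_name)
    return trace


# Hypothetical sizes: a small first stage, then U-Boot, then the kernel.
stages = [("initial boot code", 4 * 1024),
          ("U-Boot", 400 * 1024),
          ("Linux", 8 * 1024 * 1024)]
for step in boot_chain(stages, rom_load_limit=32 * 1024):
    print(step)
```

Passing a chain that starts directly with U-Boot or Linux raises an error in this model, which mirrors why an intermediate stage is sometimes needed.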
The kernel requires the hardware on which you are working to be in a particular state. All the hardware you use needs to be checked for its state and initialized for further operation. This is one of the main reasons to use a boot loader in an embedded (or any other) environment, apart from its use to load a kernel image into RAM.
When you turn on a system, the RAM is also not in a useful state (fully initialized for use) for us to load the kernel into it. Therefore, to answer your question, we cannot load a kernel directly, and thus arises the need for a construct to initialize it.
Apart from what is stated in all the other answers, which is correct, in some cases the system has to go through different execution modes; take TrustZone for secure ARM chips as an example. It is still possible to consider it a sort of HW initialization, but what makes it peculiar is that there are additional limitations (e.g. available memory) that make it impractical, if not impossible, to do everything in a single binary; thus multiple bootloader stages are used.
Furthermore, for security reasons, each of them is signed and can perform its job only if it meets the security requirements.

Monitoring the instructions of a running program in ubuntu?

I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure that if I'm able to obtain each instruction that is executed, in the order it is executed on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken on floating-point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible)
Is this possible / how could I go about doing it? I know with debuggers, you can use them to step through everything, but I don't have the source available.
I think you can use qemu (or even bochs) or valgrind to monitor every executed instruction. They are x86 binary translation tools (except bochs, which is an interpreter of x86 code). There is a valgrind tool called cachegrind (plus the kcachegrind GUI), which emulates a cache by instrumenting every memory access and simulating an L1/L2 cache model (sizes may be configured via command-line options).
To get deeper (into the pipeline), you may want to look at the free ptlsim (http://www.ptlsim.org/)
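To give a feel for what simulating a cache model over a stream of memory accesses means, here is a deliberately tiny sketch. It models a single direct-mapped cache, which is a simplification chosen for brevity and not a description of how cachegrind is implemented. The function name, the default geometry and the sample trace are all hypothetical.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=64):
    """Count hits and misses for a direct-mapped cache over an address trace."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # which memory block the address falls in
        index = block % num_lines      # the only line this block may occupy
        tag = block // num_lines       # distinguishes blocks sharing that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # evict whatever was there before
    return hits, misses


trace = [0, 8, 64, 0, 256, 0]          # made-up byte addresses
hits, misses = simulate_direct_mapped(trace)
print("hits:", hits, "misses:", misses)
```

Feeding the same trace through models with the cache sizes of two different machines is the basic idea behind predicting how the miss counts would differ between them.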

Address space identifiers using qemu for i386 linux kernel

Friends, I am working on an in-house architectural simulator which is used to simulate the timing-effect of a code running on different architectural parameters like core, memory hierarchy and interconnects.
I am working on a module that takes the actual trace of a running program from an emulator like "PinTool" or "qemu-linux-user" and feeds this trace to the simulator.
Till now my approach was like this :
1) take objdump of a binary executable and parse this information.
2) Now the emulator has to just feed me an instruction-pointer and other info like load-address/store-address.
Such approaches work only if the program content is known.
But now I have been trying to take traces of an executable running on top of a standard linux-kernel. The problem now is that the base kernel image does not contain the code for LKM(Loadable Kernel Modules). Also the daemons are not known when starting a kernel.
So, my approach to this solution is :
1) use qemu to emulate a machine.
2) When an instruction is encountered for the first time, I will parse it and save this info. for later.
3) create a helper function which sends the ip, load/store address when an instruction is executed.
I am stuck in step 2. How do I differentiate between different processes from qemu, which is just an emulator and does not know anything about the guest OS?
I can modify the scheduler of the guest OS but I am really not able to figure out the way forward.
Sorry if the question is very lengthy. I know I could have abstracted some part but felt that some part of it gives an explanation of the context of the problem.
In the first case, using qemu-linux-user to perform user mode emulation of a single program, the task is quite easy because the memory is linear and there is no virtual memory involved in the emulator. The second case of whole system emulation is a lot more complex, because you basically have to parse the addresses out of the kernel structures.
If you can get the virtual addresses directly out of QEMU, your job is a bit easier; then you just need to identify the process, and everything else functions just like in the single-process case. You might be able to get the PID by faking a getpid() system call.
Otherwise, this all seems quite a bit similar to debugging a system from a physical memory dump. There are some tools for this task. They are probably too slow to run for every instruction, though, but you can look for hints there.
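Once some per-process identifier is available, the "parse on first encounter, then reuse" step from the question reduces to a lookup keyed by that identifier plus the instruction pointer. The sketch below assumes the emulator hook can supply such a key; on x86 the page-table base register (CR3) is one candidate, but that is a suggestion and not something the answer above states. TraceCollector, on_execute and the decode callback are hypothetical names, and no real QEMU API is used.

```python
class TraceCollector:
    """Decode each (address space, instruction pointer) pair once, then reuse it."""

    def __init__(self, decode):
        self.decode = decode      # hypothetical callback: (space_id, ip) -> text
        self.decoded = {}
        self.records = []

    def on_execute(self, space_id, ip, mem_addr=None):
        key = (space_id, ip)      # the same ip in another process is a different key
        if key not in self.decoded:
            self.decoded[key] = self.decode(space_id, ip)
        self.records.append((space_id, ip, self.decoded[key], mem_addr))


calls = []


def fake_decode(space_id, ip):
    """Stand-in decoder that only records how often it is invoked."""
    calls.append((space_id, ip))
    return "insn@%#x" % ip


collector = TraceCollector(fake_decode)
collector.on_execute(0x1000, 0x400000)
collector.on_execute(0x1000, 0x400000, mem_addr=0x7FFF0000)
collector.on_execute(0x2000, 0x400000)   # other address space, same ip
print(len(calls), "decodes for", len(collector.records), "executed instructions")
```

Keying on the address space avoids mixing up two processes that happen to execute code at the same virtual address.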

Would executable files be Machine Code - made for the hardware?

Here is a quote from the Wiki:
"In computing, an executable file causes a computer 'to perform indicated tasks according to encoded instructions,'" (machine code?)
"Modern operating systems retain control over the computer's resources, requiring that individual programs make system calls to access privileged resources. Since each operating system family features its own system call architecture, executable files are generally tied to specific operating systems."
Well, this is my perspective.
Executables cannot be machine code, as they need to talk to the OS for hardware services (system calls). Hence an executable is just not yet "machine code"... Perhaps some part of the code is actual machine code and some parts are just meant to call the machine code embedded in the operating system? Overall it contains some chunks of machine code, and some chunks of code to call the operating system.
Edited after Damon's answer:
In the end, the OS is a set of machine code. Basically the OS would be doing the job of copying the user's machine code (created by the C compiler), and then if an instruction is a system call, control transfers to the OS memory region for handling it. Now the question is: what machine code generated from C can do this part, like asking to transfer control to the OS? I suppose it's system calls at a higher abstraction, but how does it work under the hood?
I get a feeling it's similar to the chicken-and-egg problem: C creates the OS and C uses the OS. I can't find exactly how the process goes.
Can anyone solve the puzzle for me?
One thing does not exclude the other. Executables are (unless they are some form of bytecode running in a virtual machine) machine code. However, there are different kinds of instructions, some of which are not usable at certain privilege levels.
That is where the operating system comes in: it is "machine code" that runs at the highest privilege level, working as arbiter for the "important" parts and tasks, such as deciding who gets CPU time and what value goes into some hardware register.
(originally comment, made an answer by request)
EDIT: About your extended question, this works approximately as follows. When the computer is turned on, the processor runs at its highest privilege level. In this "mode", the BIOS, the boot loader, and the operating system can do just what they want. This sounds great, but you don't want any kind of code being able to do just whatever it wants.
For example, the code can tell the MMU which memory pages are allowed to be read or written to, and which ones are not. Or, it can define what address is called if "something special" such as a trap or interrupt happens. Or, it can directly write to some special memory addresses that map ports of some devices (disk, network, whatever).
Eventually, the OS switches to "unprivileged" mode and calls some non-OS code. When a trap or interrupt happens, execution is interrupted and continues elsewhere (as specified by the OS previously), and the privilege level is upped again. Once the interrupt has been dealt with, privilege is taken away, and user code is called again.
If a user program needs the OS to do something "OS like", it sets up parameters according to an agreed scheme (for example in some particular registers) and executes a trap instruction.
This is, for example, how things like multithreading or virtual memory are implemented. At regular intervals, a timer fires off an interrupt, which stops execution of "normal" code and calls some code in the kernel (in privileged mode). That code then decides which user process control should be returned to, according to some kind of priority scheme. Those are the "CPU time slices" that are handed out.
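The time slicing just described can be sketched with a simple simulation. Note the substitution: instead of a priority scheme, this uses plain round-robin, the simplest possible policy, and it counts abstract "ticks" rather than handling real interrupts. The function name and the workload are hypothetical.

```python
from collections import deque


def run_round_robin(work, slice_ticks):
    """Simulate timer-driven switching; work maps process name -> ticks needed."""
    ready = deque(work.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(slice_ticks, remaining)      # the timer cuts the slice short
        timeline.append((name, ran))
        if remaining > ran:
            ready.append((name, remaining - ran))  # back of the queue
    return timeline


timeline = run_round_robin({"A": 3, "B": 1, "C": 2}, slice_ticks=2)
for name, ticks in timeline:
    print(name, "ran for", ticks, "tick(s)")
```

Each entry in the timeline corresponds to one hand-out of the CPU; a process that needs more than one slice simply shows up again later.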
If some process reads from or writes to a page that it isn't allowed, a trap is generated by the MMU. The OS then looks at what happened and where, and decides whether to load some data from disk into some memory region (and possibly purge something else) and change the process' mappings, or whether to kill the process with a "segmentation fault" error.
Of course in reality, it is a million times more complicated, but in principle that's about as it works.
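The system-call path (set up parameters in registers, execute a trap, let privileged code handle it, return with privilege removed) can be mimicked in a few lines. This is a toy model and not how any real CPU or kernel is programmed: ToyCPU, the register names r0 and r1, and the call number SYS_WRITE are all invented for illustration.

```python
class ToyCPU:
    """Minimal model of a CPU with a privileged and an unprivileged mode."""

    def __init__(self):
        self.privileged = True             # power-on state: highest privilege
        self.regs = {"r0": 0, "r1": 0}
        self.trap_handler = None
        self.output = []                   # stands in for a device

    def set_trap_handler(self, handler):
        if not self.privileged:
            raise PermissionError("only privileged code may install handlers")
        self.trap_handler = handler

    def drop_privilege(self):
        self.privileged = False

    def trap(self):
        """The 'trap instruction': raise privilege, run the handler, drop it again."""
        self.privileged = True
        try:
            self.trap_handler(self)
        finally:
            self.privileged = False


SYS_WRITE = 1                              # hypothetical system call number


def kernel_trap_handler(cpu):
    """'OS' code: r0 selects the service, r1 carries the argument."""
    if cpu.regs["r0"] == SYS_WRITE:
        cpu.output.append(cpu.regs["r1"])
        cpu.regs["r0"] = 0                 # success
    else:
        cpu.regs["r0"] = -1                # unknown request


cpu = ToyCPU()
cpu.set_trap_handler(kernel_trap_handler)  # done by the OS while privileged
cpu.drop_privilege()                       # now "user code" runs
cpu.regs["r0"] = SYS_WRITE
cpu.regs["r1"] = "Hello World!"
cpu.trap()
print(cpu.output)
```

The user code never touches the device itself; it can only ask, and the handler installed earlier by privileged code decides what happens.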
It does not really matter whether the OS or the programs were originally written in C or with an assembler. To the processor, it's just a sequence of machine instructions. Even a Python or Perl script is "just machine instructions" in the end, only with a detour via the interpreter.

Resources