Memory Segmentation: RPL and DPL in x86 - Linux

Understanding the Linux Kernel (https://www.amazon.in/Understanding-Linux-Kernel-Process-Management-ebook/dp/B0043D2E54) mentions the following:
As stated earlier, the Current Privilege Level of the CPU indicates whether the processor is in User or Kernel Mode and is specified by the RPL field of the Segment Selector stored in the cs register. Whenever the CPL is changed, some segmentation registers must be correspondingly updated. For instance, when the CPL is equal to 3 (User Mode), the ds register must contain the Segment Selector of the user data segment, but when the CPL is equal to 0, the ds register must contain the Segment Selector of the kernel data segment.
A similar situation occurs for the ss register. It must refer to a User Mode stack inside the user data segment when the CPL is 3, and it must refer to a Kernel Mode stack inside the kernel data segment when the CPL is 0. When switching from User Mode to Kernel Mode, Linux always makes sure that the ss register contains the Segment Selector of the kernel data segment.
Based on the above, I have a few questions:
1) What are the RPL fields in the segment selectors stored in the other segment registers used for?
2) When a system call is executing on behalf of a user process, the RPL in cs will be set to 3 (Difference between DPL and RPL in x86). In this case, will the data segment register (ds) contain __USER_DS instead of __KERNEL_DS, and if so, how can the implementation of the system call access kernel data structures, etc.?

Related

Linux Segmentation

Recently, I read a book called Understanding the Linux Kernel. There is a passage that confused me a lot. Can anybody explain it to me?
As stated earlier, the Current Privilege Level of the CPU indicates
whether the processor is in User or Kernel Mode and is specified by
the RPL field of the Segment Selector stored in the cs register.
Whenever the CPL is changed, some segmentation registers must be
correspondingly updated. For instance, when the CPL is equal to 3
(User Mode), the ds register must contain the Segment Selector of the
user data segment, but when the CPL is equal to 0, the ds register must contain the Segment Selector of the kernel data segment.
A similar situation occurs for the ss register. It must refer to a
User Mode stack inside the user data segment when the CPL is 3, and it
must refer to a Kernel Mode stack inside the kernel data segment when
the CPL is 0. When switching from User Mode to Kernel Mode, Linux
always makes sure that the ss register contains the Segment Selector
of the kernel data segment.
When saving a pointer to an instruction or to a data structure, the
kernel does not need to store the Segment selector component of the
logical address, because the ss register contains the current Segment
Selector.
As an example, when the kernel invokes a function, it executes a call
assembly language instruction specifying just the Offset component of
its logical address; the Segment Selector is implicitly selected as
the one referred to by the cs register. Because there is just one
segment of type “executable in Kernel Mode,” namely the code segment
identified by __KERNEL_CS, it is sufficient to load __KERNEL_CS into
cs whenever the CPU switches to Kernel Mode. The same argument goes
for pointers to kernel data structures (implicitly using the ds
register), as well as for pointers to user data structures (the kernel
explicitly uses the es register).
My understanding is that the ss register contains the Segment Selector pointing to the base of the stack. Does the ss register have anything to do with a pointer to an instruction or to a data structure? If it doesn't, why mention it here?
Finally, I have worked out the meaning of that paragraph. This piece of description demonstrates how segmentation works in Linux, and it has an implicit object of comparison: systems that exploit segmentation but not paging. How do those systems work? Each process has different segment selectors in its logical addresses, pointing to different entries in the global descriptor table, and the segments do not necessarily share the same base. In that case, when you save a pointer to an instruction or data structure, you really have to take care of its segment base. Notice that each logical address has a 16-bit segment selector and a 32-bit offset. If you only save the offset, it is not possible to find that pointer again, because there are a lot of different segments in the GDT.
Things are different in Linux: all segment selectors refer to segments with the same base, 0. That means a pointer's offset alone is enough to pick it up from memory. You might ask, does this work when there are lots of processes running? It does! Remember that each process has its own page table, which has the magical power of mapping the same virtual addresses to different physical addresses. Thanks to all of you who care about this question!
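To make the "same base 0" point concrete, here is a minimal C sketch of the flat layout described above. The selector values shown for __KERNEL_CS, __KERNEL_DS, __USER_CS and __USER_DS are the commonly cited 32-bit x86 Linux ones, but treat the exact numbers as assumptions for illustration rather than something read from a running kernel; the point is only that every segment has base 0 and a 4 GB limit, so the offset alone identifies a pointer.

    #include <stdio.h>

    struct seg { const char *name; unsigned int selector, base, limit; };

    int main(void)
    {
        /* __KERNEL_CS/__KERNEL_DS carry RPL 0, __USER_CS/__USER_DS carry RPL 3 */
        struct seg gdt_view[] = {
            { "__KERNEL_CS", 0x60, 0x0, 0xffffffff },
            { "__KERNEL_DS", 0x68, 0x0, 0xffffffff },
            { "__USER_CS",   0x73, 0x0, 0xffffffff },
            { "__USER_DS",   0x7b, 0x0, 0xffffffff },
        };
        unsigned int offset = 0x08048000;   /* an arbitrary example offset */
        int i;

        /* with every base equal to 0, the linear address is just the offset */
        for (i = 0; i < 4; i++)
            printf("%-12s sel=0x%02x base=0x%08x limit=0x%08x -> linear 0x%08x\n",
                   gdt_view[i].name, gdt_view[i].selector,
                   gdt_view[i].base, gdt_view[i].limit,
                   gdt_view[i].base + offset);
        return 0;
    }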
My understanding is that the ss register contains the Segment Selector pointing to the base of the stack.
Right
Does the ss register have anything to do with a pointer to an instruction or to a data structure?
No, the ss register does not have anything to do with instructions that access the data segment.
If it doesn't, why mention it here?
Because the ss register influences the result of instructions that affect the stack (e.g., push, pop, etc.).
They are just explaining that Linux also maintains two stack segments (one for user mode and one for kernel mode), as well as two data segments (one for user mode and one for kernel mode).
As with the data segment, if the ss register were not updated when switching from user mode to kernel mode, it would still point to the user stack and the kernel would work with the user stack (which would be very bad, right?). So the kernel takes care of updating the ss register as well as the ds register.
NB:
Let's recall that an instruction may access/modify data in the data segment (e.g., mov to a memory address) as well as in the stack segment (push, pop, etc.).
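A tiny illustration of that NB, as a hedged sketch using GCC inline assembly (AT&T syntax): push/pop implicitly address memory through ss:esp/rsp, while an ordinary mov through a memory operand implicitly goes through ds. The code just round-trips a value both ways; no segment register is named in the source because the CPU applies the implicit one on its own.

    #include <stdio.h>

    int main(void)
    {
        unsigned long v = 0x12345678, via_stack = 0, via_data = 0;

        /* push/pop implicitly use the stack segment (ss:esp/rsp) */
        __asm__ volatile ("push %1\n\t"
                          "pop  %0"
                          : "=r" (via_stack)
                          : "r" (v));

        /* an ordinary mov to memory implicitly uses the data segment (ds) */
        __asm__ volatile ("mov %1, %0"
                          : "=m" (via_data)
                          : "r" (v));

        printf("via stack: %#lx, via data segment: %#lx\n", via_stack, via_data);
        return 0;
    }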

How does the Linux kernel switch between the user-mode and kernel-mode stacks?

How does the Linux kernel switch between the user-mode and kernel-mode stacks when a system call or an interrupt occurs? I mean, what is the exact mechanism: what happens to the user-mode stack pointer, and where does the kernel-mode stack pointer come from? What is done by hardware, and what must be done by software?
Everything below is about x86.
I will just describe the entire syscall path; it contains the requested information.
First of all, you need to understand what the interrupt descriptor table (IDT) is. This table stores the addresses of exception/interrupt vectors. A system call is an exception. To raise an exception, user code performs the
int x
assembly instruction. Each exception, including the system call, has its own number. On x86 Linux this looks like
int 0x80
The int instruction is a complex, multi-step instruction. Here is an explanation of what it does:
1.) The processor extracts the descriptor from the IDT (the IDT address is stored in a special register) and checks that CPL <= DPL. CPL is the current privilege level, which can be read from the CS register. DPL is stored in the IDT descriptor.
As a consequence, you can't generate some exceptions (e.g., a page fault) from user space directly via the int instruction. If you try to do this, you will get a general protection exception.
2.) The processor switches to the stack defined in the TSS.
The TSS was initialized earlier and already contains the values of ESP and SS that describe the kernel stack. So now ESP points to the kernel stack.
3.) The processor pushes the user-space registers ss, esp, eflags, cs, eip onto the newly switched-to kernel stack. We need to return after the syscall is served, right?
4.) Next, the processor sets CS and EIP from the IDT descriptor. This address defines the exception vector's entry point.
5.) Here we are, in the syscall exception vector in the kernel.
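To tie the steps together, here is a hedged user-space sketch that triggers this whole path by issuing int 0x80 directly. It assumes a 32-bit x86 Linux environment (compile with gcc -m32), where the syscall number goes in eax (__NR_write is 4 there) and the arguments go in ebx, ecx and edx.

    int main(void)
    {
        const char msg[] = "hello via int 0x80\n";
        long ret;

        /* eax = syscall number (__NR_write == 4 on 32-bit x86),
           ebx = fd, ecx = buffer, edx = length */
        __asm__ volatile ("int $0x80"
                          : "=a" (ret)
                          : "a" (4L), "b" (1L), "c" (msg), "d" (sizeof msg - 1)
                          : "memory");

        return ret < 0 ? 1 : 0;
    }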
And a few words about ARM. ARM doesn't have a TSS; it has banked per-mode registers, so the SVC and USR modes have separate stack pointers. If you are interested, you can take a look at the trap entry code.
Interesting links:
MIT JOS lab 3,
XV6 manual

Global Descriptor Table and Local Descriptor Table relationship?

Protected-Mode Memory Management
I was going through the segmentation section of this link.
Are the LDT and GDT independent of each other, or dependent on each other?
(The TI bit, which is part of the selector, decides which descriptor table should be used (the GDT or the currently active LDT), so I think they are independent.)
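A quick way to see those selector fields is to decode a selector value in user space. The sketch below uses GCC inline assembly and the layout documented in the Intel manuals: bits 1..0 are the RPL, bit 2 is the TI bit choosing GDT vs LDT, and the remaining bits are the descriptor index. It reads the current cs and prints the three fields.

    #include <stdio.h>

    int main(void)
    {
        unsigned short sel;

        /* read the current code segment selector */
        __asm__ ("mov %%cs, %0" : "=r" (sel));

        printf("selector = 0x%04x\n", sel);
        printf("  index = %u\n", (unsigned)(sel >> 3));        /* bits 15..3: descriptor index */
        printf("  TI    = %u (%s)\n", (unsigned)((sel >> 2) & 1),
               ((sel >> 2) & 1) ? "LDT" : "GDT");               /* bit 2: table indicator */
        printf("  RPL   = %u\n", (unsigned)(sel & 3));          /* bits 1..0: requested privilege level */
        return 0;
    }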
Also, from the figure:
The GDT (Global Descriptor Table) is used mainly for holding descriptor entries of operating system segments.
For example, the kernel stack -- code section/data section?
LDT:
The second type is known as the LDT (Local Descriptor Table) and contains entries of normal application segments (although not necessarily).
For example, the user stack -- code section/data section?
It says the LDTR register contains the size and position of the currently active LDT in memory.
Does that mean that on a context switch we save the LDTR value of each process in the PCB of that process?

ARM Linux kernel page table

Ref. Linux kernel ARM Translation table base (TTB0 and TTB1)
I have a further doubt/query on the topic discussed in the previous link:
0 to 0xbfffffff is the lower part of memory (for user processes) and is managed by the page table in TTBR0; it contains the page table of the current process.
Ref. arm/include/asm/pgtable-2level.h: PTRS_PER_PGD = 2048, PTRS_PER_PMD = 1, PTRS_PER_PTE = 512
0xc0000000 to 0xffffffff is the upper part of the address space (OS and memory-mapped I/O), managed/translated by the page table in TTBR1.
The TTBR1 table is fixed in size and alignment (to 16 KB). Each level-1 entry is 32 bits in size and represents a 1 MB page/section. This is the swapper_pg_dir page table (see System.map), which is placed 16 KB below the actual text address.
Is it the case that the first 768 entries in swapper_pg_dir are 0 (0x0 to 0xbfffffff, for user processes) and the valid entries run from 768 to 1024 (0xc0000000 to 0xffffffff, for the OS and memory-mapped I/O)?
Would anyone like to share some sample code in kernel space (a kernel module) to browse this swapper_pg_dir PGD?
Because of how the ARM MMU was designed, both translation tables (TTBR0 and TTBR1) can only be used with a 1:1 kernel/user mapping split.
Most Linux kernels have a 3:1 mapping (3 GB user space : 1 GB kernel space on ARM).
This means that 0-0xBFFFFFFF is user space while 0xC0000000 - 0xFFFFFFFF is kernel space.
Now, for the HW memory translations, only TTBR0 is used. TTBR1 only holds the address of the initial swapper page table (which contains all the kernel mappings) and isn't really used for virtual address translations. TTBR0 holds the address of the currently used page directory (the page table that the HW is using for translations). Each user process has its own page tables, and on each process switch, TTBR0 is changed to point to the current user process's page table (they are all located in kernel space).
For example, for each new user process the kernel creates a new page directory, copies all the kernel mappings from the swapper page table (page frames from 3-4 GB) into the new page table, and clears the user entries (page frames from 0-3 GB). It then sets TTBR0 to the base address of this page directory and flushes the caches to install the new address space. The swapper page table is also always kept up to date with changes to the mappings.
For your question:
Simplified, hardware-wise the first-level page table has 4096 entries. Each entry represents 1 MB of virtual address space, totalling 4 GB. Entries 0-3071 represent user space and entries 3072-4095 represent kernel space.
The swapper page table is usually located at addresses 0xc0004000 - 0xc0008000 (4096 entries * 4 bytes per entry = 16384 = 16 KB, or 0x4000 in hex). By examining the memory at 0xc0004000-0xc0007000 you can find the entries for user space (empty), and from 0xc0007000-0xc0008000 you can find the kernel entries. I use gdb with the command x /100x 0xc0007000 to examine the first 100 kernel entries. You can then consult the technical reference manual for your platform to decipher the page table attributes.
If you want to learn more about the Linux kernel, I recommend using QEMU to simulate the BeagleBoard, together with gdb, to examine and debug the source code. I did this to learn how the kernel builds the page tables during initialization.
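Regarding the request above for sample code to browse swapper_pg_dir: here is a hypothetical, untested kernel-module sketch. It assumes a 32-bit ARM kernel with the classic 2-level tables and simply walks the 4096 raw 32-bit first-level hardware entries described in this answer. It also assumes init_mm (whose pgd field points at swapper_pg_dir) is reachable from a module; on kernel versions where it is not exported, a kallsyms lookup of swapper_pg_dir would be needed instead.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/mm.h>

    static int __init pgd_dump_init(void)
    {
        /* Treat the first-level table as 4096 raw 32-bit hardware entries,
           each covering 1 MB, as described in the answer above. */
        u32 *fld = (u32 *)init_mm.pgd;   /* init_mm.pgd points at swapper_pg_dir */
        int i;

        for (i = 0; i < 4096; i++)
            if (fld[i])
                pr_info("entry %4d: va 0x%08x -> descriptor 0x%08x\n",
                        i, (u32)i << 20, fld[i]);
        return 0;
    }

    static void __exit pgd_dump_exit(void) { }

    module_init(pgd_dump_init);
    module_exit(pgd_dump_exit);
    MODULE_LICENSE("GPL");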

program life in terms of paged segmentation memory

I have a confused notion of the process of segmentation & paging on x86 Linux machines. I will be glad if someone clarifies all the steps involved from start to end.
x86 uses a paged segmentation technique for memory management.
Can anyone please explain what happens from the moment an executable ELF file is loaded from the hard disk into main memory to the time it dies? When compiled, the executable has different sections in it (text, data, stack, heap, bss). How will these be loaded? How will they be set up under the paged segmentation memory technique?
I want to know how the page tables get set up for the loaded program, how the GDT gets set up, and how the registers are loaded. And why is it said that logical addresses (the ones processed by the segmentation unit of the MMU) are 48 bits (a 16-bit segment selector + a 32-bit offset) when it is a 32-bit machine? How are the other 16 bits stored? If anything accessed from RAM must be 32 bits or 4 bytes, how are the remaining 16 bits accessed (to be loaded into the segment registers)?
Thanks in advance. The question covers a lot of things, but I wanted to get clarification about the entire life cycle of an executable. I will be glad if someone answers and opens up a discussion on this.
Unix has traditionally implemented protection via paging. The 286+ provides segmentation, and the 386+ provides paging. Everyone uses paging; few make any real use of segmentation.
In x86, every memory operand has an implicit segment (so the address is really a 16-bit selector + a 32-bit offset), depending on the register used. So if you access [ESP + 8] the implied segment register is SS, if you access [ESI] the implied segment register is DS, if you access [EDI+4] the implied segment register is ES, ... You can override this via segment override prefixes.
Linux, and virtually every modern x86 OS, uses a flat memory model (or something similar). Under a flat memory model each segment provides access to the whole memory, with a base of 0 and a limit of 4Gb, so you don't have to worry about the complications segmentation brings about. Basically there are 4 segments: kernelspace code (RX), kernelspace data (RW), userspace code (RX), userspace data (RW).
An ELF file consists of some headers that point to "program segments" and "sections". Sections are used for linking; program segments are used for loading. Program segments are mapped into memory via mmap(); this sets up page-table entries with the appropriate permissions.
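If you want to see those program segments for yourself, readelf -l shows them, or you can parse the program headers directly. Below is a small hedged sketch using the standard <elf.h> types; it assumes a 32-bit ELF file (use the Elf64_* structures for 64-bit binaries, and e_phentsize rather than sizeof for a robust loader) and prints, for each program header, the fields the loader cares about when it mmap()s the file.

    #include <stdio.h>
    #include <elf.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        Elf32_Ehdr eh;
        Elf32_Phdr ph;
        int i;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        if (fread(&eh, sizeof eh, 1, f) != 1)           /* ELF file header */
            return 1;

        for (i = 0; i < eh.e_phnum; i++) {
            fseek(f, eh.e_phoff + i * sizeof ph, SEEK_SET);
            if (fread(&ph, sizeof ph, 1, f) != 1)       /* one program header */
                break;
            printf("phdr %d: type=%u vaddr=0x%08x filesz=0x%x memsz=0x%x flags=%c%c%c\n",
                   i, ph.p_type, ph.p_vaddr, ph.p_filesz, ph.p_memsz,
                   (ph.p_flags & PF_R) ? 'R' : '-',
                   (ph.p_flags & PF_W) ? 'W' : '-',
                   (ph.p_flags & PF_X) ? 'X' : '-');
        }
        fclose(f);
        return 0;
    }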
Now, older x86 CPUs' paging mechanism only provided RW access control (read permission implies execute permission), while segmentation provided RWX access control. The end permission takes into account both segmentation and paging (e.g: RW (data segment) + R (read only page) = R (read only), while RX (code segment) + R (read only page) = RX (read and execute)).
So there are some patches that provide execution prevention via segmentation: e.g., OpenWall provided a non-executable stack by shrinking the code segment (the one with execute permission), and having special emulation in the page fault handler for anything that needed execution from a high memory address (e.g., GCC trampolines, self-modifying code created on the stack to efficiently implement nested functions).
There's no such thing as paged segmentation, not in the official documentation at least. There are two different mechanisms working together and more or less independently of each other:
Translation of a logical address of the form 16-bit segment selector : 16/32/64-bit segment offset (that is, a pair of two numbers) into a 32/64-bit virtual address.
Translation of the virtual address into a 32/64-bit physical address.
Logical addresses are what your applications operate with directly. Then follows the above two-step translation of them into what the RAM will understand, physical addresses.
In the first step the GDT (or possibly the LDT, depending on the selector value) is indexed by the selector to find the relevant segment's base address and size. The virtual address is the sum of the segment base address and the offset. The segment size and the other fields of the segment descriptor are needed to provide protection.
In the second step the page tables are indexed by different parts of the virtual address and the last indexed table in the hierarchy gives the final, physical address that goes out on the address bus for the RAM to see. Just like with segment descriptors, page table entries contain not only addresses but also protection control bits.
That's about it on the mechanisms.
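As a concrete illustration of those two steps, here is a small worked sketch for 32-bit, non-PAE x86, where the paging unit splits the linear address into a 10+10+12 bit pattern. The segment base and the example offset are made-up values for illustration only; the arithmetic is the part that matters.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t seg_base = 0x00000000;      /* flat segment, as Linux uses */
        uint32_t offset   = 0x0804a123;      /* offset part of the logical address */

        /* step 1: segmentation unit, logical -> linear/virtual */
        uint32_t linear = seg_base + offset;

        /* step 2: paging unit splits the linear address into indices */
        uint32_t pde_index   = linear >> 22;           /* top 10 bits: page directory */
        uint32_t pte_index   = (linear >> 12) & 0x3ff; /* next 10 bits: page table    */
        uint32_t page_offset = linear & 0xfff;         /* low 12 bits: byte in page   */

        printf("linear = 0x%08x\n", linear);
        printf("pde=%u pte=%u offset=0x%03x\n", pde_index, pte_index, page_offset);
        /* the physical address is (page frame taken from the PTE) | page_offset */
        return 0;
    }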
Now, in many x86 OSes the segment selectors that are used for applications are fixed: they are the same in all of them, they never change, and they point to segment descriptors that have base addresses equal to 0 and sizes equal to the possible maximum (e.g. 4 GB in non-64-bit modes). Such a GDT setup effectively means that the first step does no useful work and the offset part of the logical address translates into a numerically equal virtual address.
This makes the segment selector values practically useless. They still have to be loaded into the CPU's segment registers (in non-64-bit modes into at least CS, SS, DS and ES), but beyond that point they can be forgotten about.
This all (except Linux-related details and the ELF format) is explained in or directly follows from Intel's and AMD's x86 CPU manuals. You'll find many more details there.
Perhaps read the Assembly HOWTO. When a Linux process starts to execute an ELF executable via the execve system call, it is essentially (sort of) mmap-ing some segments (and initializing registers and a tiny part of the stack). Read also the SVR4 x86 ABI supplement and its x86-64 variant. Don't forget that a Linux process only sees the memory mappings of its address space and only cares about virtual memory.
There are many good books on operating system (O.S.) kernels, notably by A. Tanenbaum and M. Bach, and some on the Linux kernel.
NB: segment registers are nearly (almost) unused on Linux.

Resources