Why are vacant locations in the address space never accessible in the RISC-V ISA?

I am reading The RISC-V Instruction Set Manual Volume I: Unprivileged ISA and I encountered a statement that seems very strange to me. At the end of the fifth paragraph of subsection 1.4 Memory it is stated: "Vacant locations in the address space are never accessible."
I am a bit confused; I searched for an explanation but couldn't come to any conclusion. I would appreciate it if you could share your thoughts.

Why are vacant locations in the address space never accessible in the RISC-V ISA?
This is by definition.  From the second paragraph in section 1.4:
Different address ranges of a hart’s address space may (1) be vacant, or (2) contain main memory, or (3) contain one or more I/O devices.
So this is saying that there may be address ranges that contain neither memory nor I/O devices; these are called vacant locations.
Ordinarily, if an instruction attempts to access memory at an inaccessible address, an exception is raised for the instruction.  Vacant locations in the address space are never accessible.
And further, such vacant locations are inaccessible, meaning that attempting to access them will fault (cause an exception).
The hart is (programmable by the operating system to be) aware of three kinds of address ranges: main memory, I/O devices, and vacant (i.e. neither).  Attempts to execute a load or a store at a vacant location will fault because there's nothing there to access, and the hart knows this (i.e. it has been told what ranges are valid/invalid).
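As a purely illustrative sketch (none of this is from the spec; the ranges, names and the classify() helper are made up for the example), you can think of the hart's view of the physical address space as a table of ranges, where anything not listed is vacant and an access to it raises an access-fault exception:
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical platform memory map -- an example, not a real SoC. */
enum kind { VACANT, MAIN_MEMORY, IO_DEVICE };

struct range { uint64_t base, size; enum kind kind; };

static const struct range memory_map[] = {
    { 0x00001000, 0x00001000, IO_DEVICE   },  /* e.g. a UART          */
    { 0x80000000, 0x08000000, MAIN_MEMORY },  /* e.g. 128 MiB of DRAM */
};

/* Anything not covered above is vacant: the hart raises an access-fault
 * exception instead of letting the load/store reach memory or a device. */
static enum kind classify(uint64_t addr)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++)
        if (addr >= memory_map[i].base &&
            addr - memory_map[i].base < memory_map[i].size)
            return memory_map[i].kind;
    return VACANT;
}

int main(void)
{
    printf("%d\n", classify(0x80001000));  /* MAIN_MEMORY */
    printf("%d\n", classify(0x40000000));  /* VACANT -> access would fault */
    return 0;
}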

Related

Are Qemu guest memory accesses trapped somewhere in the code?

According to https://qemu.readthedocs.io/en/latest/devel/memory.html#visibility :
The memory core uses the following rules to select a memory region when the guest accesses an address:
all direct subregions of the root region are matched against the address, in descending priority order
if the address lies outside the region offset/size, the subregion is discarded
if the subregion is a leaf (RAM or MMIO), the search terminates, returning this leaf region
if the subregion is a container, the same algorithm is used within the subregion (after the address is adjusted by the subregion offset)
if the subregion is an alias, the search is continued at the alias target (after the address is adjusted by the subregion offset and alias offset)
if a recursive search within a container or alias subregion does not find a match (because of a “hole” in the container’s coverage of its address range), then if this is a container with its own MMIO or RAM backing the search terminates, returning the container itself. Otherwise we continue with the next subregion in priority order
if none of the subregions match the address then the search terminates with no match found
Does this process happen on every memory access by the guest? If so, where is this logic in the Qemu codebase, roughly?
No, it doesn't happen for every access; the documented rules above describe the observed behaviour rather than the implementation. Their assumed audience is a developer writing a board or SoC model, who doesn't need to know the internal implementation details of how exactly the memory subsystem uses the tree of MemoryRegions that the board and SoC code creates.
Firstly, once the tree of MemoryRegions has been built, it is analysed to produce a data structure called a FlatView. We identify (using the rules above) what leaf MemoryRegion would be hit for each part of the address space, and create the FlatView, which is basically a list of ranges, so it might say "for addresses 0 to 0x8000, MemoryRegion 1; for addresses 0x8000 to 0x10000, MemoryRegion 2", and so on. (The details are a little more complicated.) Once the FlatView has been created, memory accesses can be done quickly because looking up the address in the FlatView to get the relevant MemoryRegion is fast. This code path gets used for memory accesses when using KVM, and for when devices do DMA to/from memory.
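As a rough sketch of the idea (this is not QEMU's actual FlatView code or data structures, just an illustration under that assumption), a flattened view boils down to a sorted list of non-overlapping ranges that can be binary-searched:
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real structures. */
struct flat_range {
    uint64_t start, size;
    int      region_id;   /* which leaf MemoryRegion this range resolved to */
};

/* Ranges are sorted by start address and do not overlap, so a lookup is a
 * binary search rather than a recursive walk of the MemoryRegion tree. */
static int flatview_lookup(const struct flat_range *fv, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < fv[mid].start)
            hi = mid;
        else if (addr - fv[mid].start >= fv[mid].size)
            lo = mid + 1;
        else
            return fv[mid].region_id;
    }
    return -1;   /* hole: no region is mapped at this address */
}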
Secondly, when TCG emulation does a memory access for a guest address, on the first time around it has to take a slow path, but it will cache the resulting MemoryRegion or host RAM address in the QEMU TLB[*]. Then subsequent accesses to that page of the guest address space will be fast. In particular, for accesses to RAM which is backed by host RAM, the access is done entirely in code generated by the TCG JIT, and never has to come out to a C function.
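Again only as a sketch (the real QEMU TLB layout and refill path are more involved), the fast path amounts to a small per-page cache: a hit goes straight to the host RAM backing the guest page, while a miss calls a slow-path helper that consults the flattened memory map and refills the entry:
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS   12
#define TLB_ENTRIES 256                 /* hypothetical size */

struct tlb_entry {
    uint64_t guest_page;                /* guest page number cached here      */
    uint8_t *host_page;                 /* host RAM backing it (NULL = empty) */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Placeholder slow path: a real one resolves the address via the flattened
 * memory map (or dispatches to an MMIO handler); here everything is backed
 * by one fake page so the sketch stays self-contained. */
static uint8_t fake_ram[1 << PAGE_BITS];
static uint8_t *tlb_fill_slow(uint64_t guest_addr)
{
    (void)guest_addr;
    return fake_ram;
}

static uint8_t load_byte(uint64_t guest_addr)
{
    uint64_t page = guest_addr >> PAGE_BITS;
    struct tlb_entry *e = &tlb[page % TLB_ENTRIES];

    if (e->host_page == NULL || e->guest_page != page) {  /* miss */
        e->host_page  = tlb_fill_slow(guest_addr);
        e->guest_page = page;
    }
    /* hit: a plain host memory access, the kind the TCG JIT emits inline */
    return e->host_page[guest_addr & ((1u << PAGE_BITS) - 1)];
}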
Most of the code that creates and works with the FlatView is in softmmu/physmem.c and softmmu/memory.c. The code that works with the TLB is in accel/tcg/cputlb.c. The code that generates the inline sequences for the JIT fastpath is under tcg/.
[*] The QEMU TLB is similar in purpose to a hardware TLB, in that it speeds up lookups that start with a guest address, but it is not modelling the guest CPU's TLB. This question and answer have more details.

How to test if address is virtual or logical in linux kernel?

I am aware that the Linux kernel memory is usually 1:1 mapped (up to a certain limit of the zone). From what I understand, to make this 1:1 mapping more efficient, the array of struct page is virtually mapped.
I wanted to check whether that is the case. Is there a way to test, given an address (let's say that of a struct page), whether it is 1:1 mapped or virtually mapped?
The notion of address space for a 64-bit machine encompasses 2^64 addresses. This is far larger than any modern amount of physical memory in one machine. Therefore, it is possible to have the entire physical memory mapped into the address space with plenty of room to spare. As discussed in this post and shown here, Linux leaves 64 TB of the address space for the physical mapping. Therefore, if the kernel needed to iterate through all bytes in physical memory, it could just iterate through addresses 0 + offset to total_bytes_of_RAM + offset, where offset is the address where the direct mapping starts (ffff888000000000 in the 64-bit memory layout linked above). Also, this direct mapping region is within the kernel address range that is "shared between all processes", so addresses in this range should always be logical.
Your post has two questions: one is how to test if an address is logical or virtual. As I mentioned, the answer is if the address falls within the direct mapping range, then it is logical. Otherwise it is virtual. If it is a virtual address, then obtaining the physical address through the page tables should allow you to access the address logically by following the physical_addr + offset math as mentioned above.
Additionally, kmalloc allocates/reserves memory directly using this logical mapping, so you immediately know that if the address you're using came from kmalloc, it is a logical address. However, vmalloc and any user-space memory allocations use virtual addresses that must be translated to get the logical equivalent.
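As a rough sketch of how you might test this from inside a kernel module (check_addr() is a made-up helper for this example, but is_vmalloc_addr(), virt_addr_valid() and virt_to_phys() are existing kernel helpers):
#include <linux/kernel.h>    /* pr_info() */
#include <linux/mm.h>        /* is_vmalloc_addr() */
#include <asm/io.h>          /* virt_to_phys() */
#include <asm/page.h>        /* virt_addr_valid() */

static void check_addr(const void *p)
{
    if (is_vmalloc_addr(p)) {
        pr_info("%px is in the vmalloc/ioremap range (virtually mapped)\n", p);
    } else if (virt_addr_valid(p)) {
        /* Direct map: logical, so translation is just the offset arithmetic. */
        pr_info("%px is logical, phys = 0x%llx\n",
                p, (unsigned long long)virt_to_phys((void *)p));
    } else {
        pr_info("%px is neither (e.g. module text or a user-space pointer)\n", p);
    }
}
A kmalloc() return value should take the virt_addr_valid() branch, while a vmalloc() return value should take the is_vmalloc_addr() branch, matching the rule above that an address inside the direct-mapping range is logical.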
Your second question is whether "logically mapped pages" can be swapped out. The question should be rephrased, because technically all pages that are in RAM are logically mapped in that direct-mapping region. And yes, certain pages in main memory can be swapped out, or evicted so that the page frame can be reused by another page. Now, if you're asking whether pages that are only mapped logically and not virtually (like with kmalloc, which gets memory from the slab allocator) can be swapped out, I think the answer is that they can be reclaimed if not being used, but aren't generally swapped out. Kernel pages are generally not swapped out, except for hibernation.

Reading a value in a physical address via the kernel

I'm working on an old Linux operating system which has one kernel for all processes (it is basically an exo-kernel type).
While implementing debugging features from user space, I would like to disassemble other processes' instructions. Therefore, I have created a system call which takes a virtual address in the target process and prints the value stored there (so I can disassemble the bytes).
My idea was to switch to the target's pgdir, do a page walk, and then access the data through the resulting physical address. I get a kernel panic while trying to access the latter.
If I switch to the target process and then access the virtual address (without the page walk), the bytes of the instruction are printed without any problem (with printf("%04x", *va) for example).
My question is: why does the virtual address contain the actual instruction, but the physical address doesn't (and why does it panic)?
Thank you!
Note: This is an XY-answer ... I'm aware I'm not answering your question ('how to twiddle with hardware MMU setup to read ... memory somewhere') but I'm suggesting a solution to your stated problem (how to read from another process' address space).
Linux provides a facility to do what you ask for - read memory from another process' address space - via the use of ptrace():
PTRACE_PEEKTEXT, PTRACE_PEEKDATA
Read a word at the address addr in the tracee's memory, returning
the word as the result of the ptrace() call. Linux does not have
separate text and data address spaces, so these two requests are
currently equivalent. (data is ignored; but see NOTES.)
See https://stackoverflow.com/search?q=ptrace+PTRACE_PEEKDATA for some references.
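For example, a minimal sketch of reading one word from another process with PTRACE_PEEKDATA (the pid and address come from the command line; error handling is kept to the essentials):
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
        return 1;
    }
    pid_t pid  = (pid_t)strtol(argv[1], NULL, 0);
    void *addr = (void *)strtoul(argv[2], NULL, 16);

    /* Attaching stops the tracee; wait until that stop is reported. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        perror("attach");
        return 1;
    }
    waitpid(pid, NULL, 0);

    /* PEEKDATA returns the word itself, so check errno to detect errors. */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (errno != 0)
        perror("peek");
    else
        printf("word at %p: 0x%lx\n", addr, word);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}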

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience please.. :-)
I am structuring this question into two parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; especially on 64-bit systems this works well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that as Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware - the TLB/MMU - at runtime and get translated via the kernel paging tables. Again, easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed set up from PAGE_OFFSET onwards)? But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated, right? Is this actually the case? Or is kernel virtual address translation just an offset calculation? (But how can that be, as we must go via the hardware TLB/MMU?) As a simple example, let's consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, let's say this is the case, and we do use paging in the kernel to do this; then we must of course set up kernel paging tables. I understand they are rooted at swapper_pg_dir.
(I also understand that vmalloc(), unlike kmalloc(), is a special case: it's a pure virtual region that gets backed by physical frames only on page fault.)
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process? This should happen in the context-switch code? How? Where?
E.g.:
On x86_64, two processes A and B are alive, 1 CPU.
A is running, so its higher-canonical addresses 0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF "map" to the kernel segment, and its lower-canonical addresses 0x0 through 0x00007FFF FFFFFFFF map to its private userspace.
Now, if we context-switch A->B, process B's lower-canonical region is unique, but it must "map" to the same kernel of course!
How exactly does this happen? How do we "auto" refer to the kernel paging table when in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!
First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keeps things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the
current user process. Typically on 32bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is identity mapped to RAM
(kernel logical addresses) will be mapped using big pages when possible,
which may allow the page table to be smaller but more importantly
reduces the number of TLB misses.
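To make "identity mapped" concrete, here is a purely illustrative sketch of the arithmetic behind kernel logical addresses; the real __pa()/__va() macros are architecture-specific, and the base value below is just an example (a common x86_64 direct-map base):
#include <stdint.h>
#include <stdio.h>

/* Example direct-map base only; other architectures/configs differ. */
#define EXAMPLE_PAGE_OFFSET 0xffff888000000000ULL

/* Logical (direct-mapped) kernel address <-> physical address is plain
 * arithmetic. The MMU still translates every access at runtime; what the
 * fixed offset buys you is that *constructing* the translation needs no
 * page-table walk in software. */
static uint64_t example_pa(uint64_t logical_va) { return logical_va - EXAMPLE_PAGE_OFFSET; }  /* like __pa() */
static uint64_t example_va(uint64_t phys)       { return phys + EXAMPLE_PAGE_OFFSET; }        /* like __va() */

int main(void)
{
    printf("0x%llx\n", (unsigned long long)example_pa(0xffff888000100000ULL));  /* prints 0x100000 */
    printf("0x%llx\n", (unsigned long long)example_va(0x100000ULL));
    return 0;
}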
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
    ...
    init_pgd = pgd_offset_k(0);
    memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
           (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    ...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
    /* If the pgd points to a shared pagetable level (either the
       ptes in non-PAE, or shared PMD in PAE), then just copy the
       references from swapper_pg_dir. */
    ...
    clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                    swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                    KERNEL_PGD_PTRS);
    ...
}
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user mode process.
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.
The kernel address space is mapped into a section of each process; for example, with a 3:1 split it lies above address 0xC0000000. If user code tries to access this address space, it will generate a page fault; it is guarded by the kernel.
The kernel address space is divided into two parts, the logical address space and the virtual address space, with the boundary defined by the constant VMALLOC_START. The CPU uses the MMU all the time, in both user space and kernel space (it cannot be switched on and off).
The kernel virtual address space is mapped the same way as a user-space mapping. The logical address space is contiguous, and translating it to physical addresses is simple, so it can be done on demand using the MMU fault exception. That is, the kernel tries to access an address, the MMU generates a fault, the fault handler maps the page using the __pa and __va macros and sets the CPU's pc register back to the instruction that faulted, and now everything is OK. This process is actually platform dependent, and on some hardware architectures it is mapped the same way as user space (because the kernel doesn't use a lot of memory).

Why is the ELF execution entry point virtual address of the form 0x80xxxxx and not zero 0x0?

When executed, the program will start running from virtual address 0x80482c0. This address doesn't point to our main() procedure, but to a procedure named _start which is created by the linker.
My Google research so far just led me to some (vague) historical speculations like this:
There is folklore that 0x08048000 once was STACK_TOP (that is, the stack grew downwards from near 0x08048000 towards 0) on a port of *NIX to i386 that was promulgated by a group from Santa Cruz, California. This was when 128MB of RAM was expensive, and 4GB of RAM was unthinkable.
Can anyone confirm/deny this?
As Mads pointed out, in order to catch most accesses through null pointers, Unix-like systems tend to make the page at address zero "unmapped". Thus, accesses immediately trigger a CPU exception, in other words a segfault. This is far better than letting the application go rogue. The exception vector table, however, can be at any address, at least on x86 processors (there is a special register for that, loaded with the lidt opcode).
The starting point address is part of a set of conventions which describe how memory is laid out. The linker, when it produces an executable binary, must know these conventions, so they are not likely to change. Basically, for Linux, the memory layout conventions are inherited from the very first versions of Linux, in the early 90's. A process must have access to several areas:
The code must be in a range which includes the starting point.
There must be a stack.
There must be a heap, with a limit which is increased with the brk() and sbrk() system calls.
There must be some room for mmap() system calls, including shared library loading.
Nowadays, the heap, where malloc() goes, is backed by mmap() calls which obtain chunks of memory at whatever address the kernel sees fit. But in older times, Linux was like previous Unix-like systems, and its heap required a big area in one uninterrupted chunk, which could grow towards increasing addresses. So whatever was the convention, it had to stuff code and stack towards low addresses, and give every chunk of the address space after a given point to the heap.
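For instance (a small stand-alone illustration, not how any particular malloc is implemented), asking the kernel for an anonymous chunk at whatever address it sees fit looks like this:
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* First argument NULL: let the kernel choose where to place the mapping. */
    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("kernel placed the 1 MiB mapping at %p\n", p);
    munmap(p, 1 << 20);
    return 0;
}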
But there is also the stack, which is usually quite small but can grow quite dramatically on some occasions. The stack grows down, and when the stack is full, we really want the process to crash predictably rather than overwrite some data. So there had to be a wide area for the stack, with, at the low end of that area, an unmapped page. And lo! There is an unmapped page at address zero, to catch null pointer dereferences. Hence it was defined that the stack would get the first 128 MB of address space, except for the first page. This means that the code had to go after those 128 MB, at an address similar to 0x080xxxxx.
As Michael points out, "losing" 128 MB of address space was no big deal because the address space was very large with regards to what could be actually used. At that time, the Linux kernel was limiting the address space for a single process to 1 GB, over a maximum of 4 GB allowed by the hardware, and that was not considered to be a big issue.
Why not start at address 0x0? There's at least two reasons for this:
Because address zero is famously known as the NULL pointer, and used by programming languages to sanity-check pointers. You can't use that address value for this purpose if you're also going to execute code there.
The actual contents at address 0 are often (but not always) the exception vector table, which is hence not accessible in non-privileged modes. Consult the documentation of your specific architecture.
As for the entrypoint _start vs main:
If you link against the C runtime (the C standard libraries), the library wraps the function named main, so it can initialize the environment before main is called. On Linux, these are the argc and argv parameters to the application, the env variables, and probably some synchronization primitives and locks. It also makes sure that returning from main passes on the status code, and calls the _exit function, which terminates the process.
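As a rough sketch of that relationship (assuming x86-64 Linux and building with gcc -nostartfiles demo.c -o demo so the C runtime's own _start is omitted; a real CRT also sets up argc/argv/envp, TLS and more):
#include <unistd.h>   /* _exit() */

int main(void)
{
    return 42;
}

/* With -nostartfiles the linker makes this the ELF entry point instead of
 * the CRT's _start. There is no caller to return to, so terminate via _exit(). */
void _start(void)
{
    int status = main();   /* argc/argv handling is skipped in this sketch */
    _exit(status);
}
Running ./demo; echo $? should print 42, and readelf -h ./demo shows the entry point address, which points at _start rather than main.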
