nopage() method implementation - Linux

Does anyone know how a virtual address is translated to a physical address in the nopage method?
With reference to the Linux Device Drivers book, the nopage method is given as:
struct page *simple_vma_nopage(struct vm_area_struct *vma,
                               unsigned long address, int *type)
{
    struct page *pageptr;
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;        /* mmap offset in bytes */
    unsigned long physaddr = address - vma->vm_start + offset; /* faulting physical address */
    unsigned long pageframe = physaddr >> PAGE_SHIFT;          /* page frame number */

    if (!pfn_valid(pageframe))
        return NOPAGE_SIGBUS;
    pageptr = pfn_to_page(pageframe);
    get_page(pageptr);
    if (type)
        *type = VM_FAULT_MINOR;
    return pageptr;
}
PAGE_SHIFT is the number of bits used to represent the offset within a page, for both virtual and physical addresses.
But what is the offset variable?
How can a physical address be calculated from arithmetic on virtual address variables like address and vm_start?

I feel the documentation of vm_pgoff is not very clear.
It is the offset, in pages, of the first page of the memory region in RAM.
So if our RAM begins at 0x00000000 and our memory region begins
at 0x0000A000, then vm_pgoff = 10 (0xA). If you revisit the mmap
system call, you can see that the "offset" we pass is the offset
of the starting byte in the file from which "length" bytes will be mapped
onto the memory region. This offset can be converted to an address by left
shifting it by PAGE_SHIFT, which is 12 (i.e. a 4KB page size).
Now, irrespective of whether the cr3 register is used in linear-to-physical
address translation or not, "address - vm_start" gives the size of the
portion between the two addresses.
example:
vm_start = 0xc0080000
address = 0xc0090000
address - vm_start = 0x00010000
physaddr = (address - vma->vm_start) + offset;
= 0x00010000 + (10 << PAGE_SHIFT)
= offset_to_page_that_fault + start_addr_of_memoryRegion_in_RAM
= 0x00010000 + 0x0000A000
= 0x0001A000
Now since this is the physical address, we need to convert it to a page frame
number by right shifting by PAGE_SHIFT, i.e. 0x0001A000 >> 12 = 0x1A = 26 (decimal).
Therefore the 26th page frame must be loaded with the data from the file which is being mapped.
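A tiny user-space recreation of the arithmetic above, for anyone who wants to sanity-check it (the constants are the example's, with PAGE_SHIFT taken as 12):

#include <stdio.h>

int main(void)
{
    /* Values from the example above; PAGE_SHIFT assumed to be 12. */
    unsigned long vm_start = 0xc0080000UL;
    unsigned long address  = 0xc0090000UL;
    unsigned long vm_pgoff = 10;
    unsigned long offset   = vm_pgoff << 12;

    unsigned long physaddr  = (address - vm_start) + offset;
    unsigned long pageframe = physaddr >> 12;

    printf("physaddr = 0x%lx, page frame = %lu\n", physaddr, pageframe);
    /* Prints: physaddr = 0x1a000, page frame = 26 */
    return 0;
}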
Hence the data is retrieved from the disk by using the inode's struct address_space,
which contains the information about where the page lives on disk (or in swap space).
Once the data is brought in, we return the struct page which represents this data in the
page frame for which the page fault occurred.
This is my understanding so far, but I haven't tested it.

No, the statement in the book is correct, because, as aforementioned,
"physical" is just the starting address of the region/portion that you
want to map out of the physical memory, which runs from the "off"
physical address up to "simple_region_size".
The "simple_region_size" value is decided by the user, and similarly
"simple_region_start" is decided by the user, with
simple_region_start >= off.
So the maximum amount of physical memory that the user can map
is given by: psize = simple_region_size - off,
i.e. from the start of the portion to its end.
But how much will actually be mapped into this memory region is given
by "vma->vm_end - vma->vm_start" and is represented by vsize. Hence the
need to perform the sanity check, since otherwise the user could get
more than intended.
Kind regards,
Sanjeev Ranot

"simple_region_start" is the offset from the starting of
physical memory out of which our sub-region needs to be mapped
Example:
off = start of the physical memory (page aligned)= 0xd000 8000
simple_region_start = 0x1000
therefore the physical address of the start of the sub region
we want to map is = 0xd000 8000 + 0x1000
= 0xd000 9000
now virtual size is the portion that needs to be mapped from the
physical memory available. This must be defined properly by the user.
simple_region_size = physical address pointing to last of the
portion that we need to map.
So if we wanted 8KBs to be mapped from the physical memory available
then following is how the calculation goes
simple_region_size = physical address just beyond the last of our portion
simple_region_size = 0xd000 9000 + 0x2000 (for the 8KBs)
simple_region_size = 0xd000 B000
So our 8KBs of portion will range from physical addresses [0xd000 B000 to 0xd000 9000]
Hence physical size i.e. psize = 0x2000
We perform the sanity check i.e
If the size of our portion of physical memory is smaller
than what the user tries to map using the full length
virtual address range of this memory region, then we
raise an exception. i.e say for ex. vsize = 0x3000
Otherwise we use the API "remap_pfn_range" to map the
portion of the physical memory passing in the the
physical address and not the page frame number as
was done previously since this is the IO memory.
I feel it should have been the API "io_remap_page_range"
here the aforementioned.
So it will map the portion of physical memory starting
from the physical address 0xd000 9000 on the the user
linear address starting from vma->vm_start of vsize.
N.B As before I have yet to test this out !
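For reference, here is a minimal sketch (untested) of the mmap method the above discussion is about, modelled on LDD3's simple_region example and adapted to the pfn-based API; simple_region_start and simple_region_size are assumed to be module parameters chosen by the user:

static int simple_region_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
    unsigned long physical = simple_region_start + off;
    unsigned long vsize = vma->vm_end - vma->vm_start;
    unsigned long psize = simple_region_size - off;

    /* Sanity check: refuse a mapping that spans past the region. */
    if (vsize > psize)
        return -EINVAL;

    if (remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT,
                        vsize, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}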

Related

How does an address in the kernel direct mapping area map to a physical address

My first thought was that converting a kernel virtual address in the direct mapping zone to a physical address should just be a matter of subtracting page_offset_base (0xffff888000000000 with 4-level paging).
But I did a simple test: I wrote a kernel module, requested some memory with kmalloc(), and printed the pointer in both %lx and %p format. The two outputs differ in the lower bits as well, not just by page_offset_base. I'm wondering why?
Here is my code:
buf_high = (char*)kmalloc(140, __GFP_HIGHMEM);
buf = (char*)kmalloc(140, GFP_KERNEL);
printk("kmalloc highmem : %lx, %p\n", buf_high, buf_high);
printk("kmalloc : %lx, %p\n", buf, buf);
Here is the output:
[ 1157.671135] kmalloc highmem : ffff888118d2b000, 000000009412a3d4
[ 1157.671142] kmalloc : ffff888118d2b3c0, 000000009ce9058c
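A minimal sketch, assuming a kernel-module context, of printing the physical address directly with virt_to_phys() instead of trying to infer it from printed pointers (note that on recent kernels plain %p prints a hashed identifier rather than the raw address, which is why the %p output above looks nothing like the %lx output):

#include <linux/module.h>
#include <linux/slab.h>
#include <asm/io.h>   /* virt_to_phys() */

static int __init dmap_test_init(void)
{
    char *buf = kmalloc(140, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    /* For directly-mapped kernel memory, the physical address should be
     * the virtual address minus the linear-map base. */
    pr_info("virt = 0x%px, phys = 0x%llx\n",
            buf, (unsigned long long)virt_to_phys(buf));

    kfree(buf);
    return 0;
}

static void __exit dmap_test_exit(void)
{
}

module_init(dmap_test_init);
module_exit(dmap_test_exit);
MODULE_LICENSE("GPL");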

Retrieving the physical addresses of all the pages of a virtual memory buffer

I am currently trying to find the physical addresses of all the pages backing a virtual buffer.
For example, if I allocate a buffer of 8kb in virtual memory and in terms of physical memory it allocates me 2 pages (in the case where pages are 4kb and everything goes well), I would like to find the physical addresses of these 2 pages, as if I had allocated them with get_free_pages.
For that I wrote this little piece of code by looking at the functions at my disposal in the kernel:
void *vbuffer;
struct page *page_list;
unsigned long current_page_addr;

/* Allocate virtual memory buffer */
vbuffer = vmalloc(size);

/* Get pages list from the previously allocated virtual buffer */
page_list = vmalloc_to_page(vbuffer);
do {
    current_page_addr = page_to_phys(page_list);
    page_list = page_list->next;
} while (page_list != NULL);
My question is therefore whether my code is correct and actually lets me retrieve the physical addresses of the pages in my current_page_addr variable.
Also, am I using the right approach, or is there a better way? Thank you.
I don't think the pages for the vmalloced region are in a list. vmalloc_to_page returns the page corresponding to a virtual address mapped in the vmalloc region, if the page exists. So to find the pages corresponding to the memory allocated by vmalloc or vzalloc, start with the returned address for the first page, and increment the address by PAGE_SIZE for each subsequent page.
void *vbuffer;
struct page *page;
unsigned long current_page_addr;

/* Allocate virtual memory buffer */
vbuffer = vmalloc(size);
if (vbuffer == NULL)
    goto error;

/* Walk the buffer a page at a time; vmalloc memory is page aligned. */
for (size_t offset = 0; offset < size; offset += PAGE_SIZE) {
    page = vmalloc_to_page(vbuffer + offset);
    current_page_addr = page_to_phys(page);
    /* In practice you would store each address; this overwrites it. */
}

Mapping of dmam_alloc_coherent allocated memory to the user space via remap_pfn_range gives pointer to wrong area of memory

I am preparing an application running on an ARM Intel Cyclone V SoC.
I need to map a DMA coherent memory buffer into user space.
The buffer is allocated in the device driver with:
buf_addr = dmam_alloc_coherent(&pdev->dev, size, &dma_addr, GFP_KERNEL);
The allocation works correctly, and I have verified that the buffer accessed by the hardware via the dma_addr HW address is visible to the kernel via the buf_addr pointer.
Then in the mmap function of the device driver I do:
unsigned long physical = virt_to_phys(buf_addr);
unsigned long vsize = vma->vm_end - vma->vm_start;

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, vma->vm_page_prot);
The application mmaps the buffer with:
buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED, dev_file, 0);
I do not get any error from remap_pfn_range, and the application is able to access the mmapped memory, but it is not the buffer allocated with dmam_alloc_coherent.
I have found the macro dma_mmap_coherent, which seems to be dedicated to exactly this purpose.
I have verified that the following modification in the mmap function ensures proper operation:
unsigned long vsize = vma->vm_end - vma->vm_start;

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap = dma_mmap_coherent(&my_pdev->dev, vma, fdata, dma_addr, vsize);
Because the pdev pointer is not directly available in the mmap function, it is passed from the probe function via the global variable my_pdev. In a driver supporting multiple devices, it should be stored in the device context instead.
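A minimal sketch (untested) of what that might look like; struct my_ctx and my_mmap are hypothetical names, and the context is assumed to be stored in filp->private_data by the driver's open method:

/* Saved in probe(): device pointer plus the dmam_alloc_coherent() results. */
struct my_ctx {
    struct device *dev;   /* &pdev->dev */
    void *buf_addr;       /* CPU address from dmam_alloc_coherent() */
    dma_addr_t dma_addr;  /* device address from dmam_alloc_coherent() */
};

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct my_ctx *ctx = filp->private_data;
    unsigned long vsize = vma->vm_end - vma->vm_start;

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return dma_mmap_coherent(ctx->dev, vma, ctx->buf_addr,
                             ctx->dma_addr, vsize);
}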

Understanding /proc/sys/vm/lowmem_reserve_ratio

I am not able to understand the meaning of the variable "lowmem_reserve_ratio" from the explanation in Documentation/sysctl/vm.txt.
I have also tried to google it, but all the explanations I found are exactly the same as the one in vm.txt.
It would be really helpful if somebody could explain it or point to some link about it.
Here goes the original explanation:
The lowmem_reserve_ratio is an array. You can see it by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
Note: the number of elements is one fewer than the number of zones, because the
highest zone's value is not needed for the following calculation.
These values are not used directly, though. The kernel calculates the number of
protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo, like the following (this is an example from
an x86-64 box). Each zone has an array of protection pages like this:
-
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
    :
    :
    numa_other   0
        protection: (0, 2004, 2004, 2004)
                    ^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
    :
-
These protections are added to the watermark to judge whether this zone should be used
for page allocation or should be reclaimed.
In this example, if normal pages (index=2) are requested from this DMA zone and
watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone should
not be used because pages_free (1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value were 0, this zone could be used for a
normal page request. If the request is for the DMA zone (index=0), protection[0]
(=0) is used.
zone[i]'s protection[j] is calculated by the following expression:

(i < j):
    zone[i]->protection[j]
        = (total sum of present_pages from zone[i+1] to zone[j] on the node)
          / lowmem_reserve_ratio[i];
(i = j):
    0  (the zone does not need to protect pages from itself)
(i > j):
    not necessary, but reads as 0
The default values of lowmem_reserve_ratio[i] are:
    256 (if zone[i] is the DMA or DMA32 zone)
    32  (otherwise)
As the expression above shows, they are the reciprocal of the ratio: 256 means 1/256,
so the number of protection pages comes to about 0.39% of the total present pages of
the higher zones on the node.
If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
Having the same problem as you, I googled (a lot) and stumbled upon this page, which might (or might not) be more understandable than the kernel doc.
(I do not quote it here because it would be unreadable.)
I found the wording in that document really confusing too. Looking at the source in mm/page_alloc.c helped to clear it up, so let me try my hand at a more straightforward explanation:
As is said in the page you quoted, these numbers "are reciprocal number of ratio". Worded differently: these numbers are divisors. So when calculating the reserve pages for a given zone in a node, you take the sum of pages in that node in zones higher than that one, divide it by the provided divisor, and that's how many pages you're reserving for that zone.
Example: let's assume a 1 GiB node with 768 MiB in zone Normal and 256 MiB in zone HighMem (assume no zone DMA). Let's assume the default highmem reserve "ratio" (divisor) of 32, and the typical 4 KiB page size. Now we can calculate the reserve area for zone Normal:
Sum of "higher" zones than zone Normal (just HighMem): 256 MiB = 262144 KiB, divided by 4 KiB per page = 65536 pages
Area reserved in zone Normal for this node: 65536 pages / 32 = 2048 pages = 8 MiB
The concept stays the same when you add more zones and nodes. Just remember that the reserved size is in pages; you never reserve a fraction of a page.
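A tiny user-space sketch of the same arithmetic (the sizes and divisor are from the example above, not values read from a real system):

#include <stdio.h>

int main(void)
{
    unsigned long highmem_bytes = 256UL * 1024 * 1024; /* 256 MiB of HighMem */
    unsigned long page_size     = 4096;                /* 4 KiB pages */
    unsigned long divisor       = 32;                  /* default HighMem "ratio" */

    unsigned long higher_pages  = highmem_bytes / page_size; /* 65536 */
    unsigned long reserve_pages = higher_pages / divisor;    /* 2048 */

    printf("reserve in zone Normal: %lu pages (%lu MiB)\n",
           reserve_pages, reserve_pages * page_size / (1024 * 1024));
    return 0;
}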
I find that the kernel source code explains it very well and clearly.
/*
 * setup_per_zone_lowmem_reserve - called whenever
 *    sysctl_lowmem_reserve_ratio changes. Ensures that each zone
 *    has a correct pages reserved value, so an adequate number of
 *    pages are left in the zone after a successful __alloc_pages().
 */
static void setup_per_zone_lowmem_reserve(void)
{
    struct pglist_data *pgdat;
    enum zone_type j, idx;

    for_each_online_pgdat(pgdat) {
        for (j = 0; j < MAX_NR_ZONES; j++) {
            struct zone *zone = pgdat->node_zones + j;
            unsigned long managed_pages = zone->managed_pages;

            zone->lowmem_reserve[j] = 0;

            idx = j;
            while (idx) {
                struct zone *lower_zone;

                idx--;

                if (sysctl_lowmem_reserve_ratio[idx] < 1)
                    sysctl_lowmem_reserve_ratio[idx] = 1;

                lower_zone = pgdat->node_zones + idx;
                lower_zone->lowmem_reserve[j] = managed_pages /
                    sysctl_lowmem_reserve_ratio[idx];
                managed_pages += lower_zone->managed_pages;
            }
        }
    }

    /* update totalreserve_pages */
    calculate_totalreserve_pages();
}
And it even includes a demo:
/*
 * results with 256, 32 in the lowmem_reserve sysctl:
 *    1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 *    1G machine -> (16M dma, 784M normal, 224M high)
 *    NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 *    HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 *    HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
    256,
#endif
#ifdef CONFIG_ZONE_DMA32
    256,
#endif
#ifdef CONFIG_HIGHMEM
    32,
#endif
    32,
};
In short, the expressions work out to:

zone[1]->lowmem_reserve[2] = zone[2]->managed_pages / sysctl_lowmem_reserve_ratio[1]
zone[0]->lowmem_reserve[2] = (zone[1]->managed_pages + zone[2]->managed_pages)
                             / sysctl_lowmem_reserve_ratio[0]
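And a condensed sketch of how these reserves are consumed at allocation time, loosely modelled on zone_watermark_ok() in mm/page_alloc.c (simplified, not the exact kernel source):

/* A zone may serve an allocation whose highest usable zone is
 * 'classzone_idx' only if its free pages stay above the watermark
 * plus the reserve protecting it from higher-zone allocations. */
static bool zone_is_usable(struct zone *z, unsigned long mark,
                           enum zone_type classzone_idx)
{
    unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

    return free_pages > mark + z->lowmem_reserve[classzone_idx];
}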

How to correctly calculate address spaces?

Below is an example of a question given on my last test in a Computer Engineering course. Anyone mind explaining to me how to get the start/end addresses of each? I have listed the correct answers at the bottom...
The MSP430F2410 device has an address space of 64 KB (the basic MSP430 architecture). Fill in the table below if we know the following. The first 16 bytes of the address space (starting at the address 0x0000) is reserved for special function registers (IE1, IE2, IFG1, IFG2, etc.), the next 240 bytes is reserved for 8-bit peripheral devices, and the next 256 bytes is reserved for 16-bit peripheral devices. The RAM memory capacity is 2 Kbytes and it starts at the address 0x1100. At the top of the address space is 56KB of flash memory reserved for code and interrupt vector table.
What                                     Start Address   End Address
Special Function Registers (16 bytes)    0x0000          0x000F
8-bit peripheral devices (240 bytes)     0x0010          0x00FF
16-bit peripheral devices (256 bytes)    0x0100          0x01FF
RAM memory (2 Kbytes)                    0x1100          0x18FF
Flash Memory (56 Kbytes)                 0x2000          0xFFFF
For starters, don't get thrown off by what's stored in each segment - that's only going to confuse you. The problem is just asking you to figure out the hex numbering, and that's not too difficult. Here are the requirements:
64 KB total memory
The first 16 bytes of the address space (starting at the address 0x0000) is reserved for special function registers (IE1, IE2, IFG1, IFG2, etc.)
The next 240 bytes is reserved for 8-bit peripheral devices
The next 256 bytes is reserved for 16-bit peripheral devices
The RAM memory capacity is 2 Kbytes and it starts at the address 0x1100
At the top of the address space is 56KB of flash memory reserved for code and interrupt vector table.
Since each hex digit in your memory address can handle 16 values (0-F), you'll need 4 digits to display 64KB of memory (16 ^ 4 = 65536, or 64K).
You start with 16 bytes, and that covers 0x0000 - 0x000F (one full digit of your address). That means that the next segment, which starts immediately after it (8-bit devices), begins at 0x0010 (the next byte), and since it's 240 bytes long, it ends at byte 256 (240 + 16), or 0x00FF.
The next segment (16-bit devices) starts at the next byte, which is 0x0100, and is 256 bytes long - that puts the end at 0x01FF.
Then comes 2KB (2048 bytes) of RAM, but it starts at 0x1100, as the description states, instead of immediately after the previous segment, so that's your starting address. Add 2048 (0x800) to that and subtract one for the inclusive end address, and you get 0x18FF.
The last segment covers the upper section of the memory, so you'll have to work backwards. You know it ends at 0xFFFF (the end of the available memory), and it's 56KB long. 56KB in hex is 0xE000 bytes, so for the segment to end at the top of the address space it has to start at 0xFFFF - 0xE000 + 1 = 0x2000. (That also leaves the range 0x1900-0x1FFF, between the end of RAM and the start of flash, unused.)
I hope that's more clear, though when I read over it, I don't know that it's any help at all :( Maybe that's why I'll leave teaching that concept to somebody more qualified...
#include <stdio.h>

int main(void)
{
    /* Explicit start addresses: RAM (0x1100) and flash (0x2000) are not
       contiguous with the preceding segment. */
    unsigned int starts[5] = {0x0000, 0x0010, 0x0100, 0x1100, 0x2000};
    unsigned int sizes[5]  = {16, 240, 256, 2 * 1024, 56 * 1024};

    printf("Start   End\n");
    for (int i = 0; i < 5; i++)
        printf("0x%04X  0x%04X\n", starts[i], starts[i] + sizes[i] - 1);
    return 0;
}
