How to choose static IO memory map for virtual memory on ARM - linux

I am investigating how to port the Linux kernel to a new ARM platform.
I noticed that some platform implementations have static mapping from a physical IO address to a virtual address in map_io function.
My question is: how should I decide the "virtual" address in the map_desc structure? Can I map a physical IO address to an arbitrary virtual address, or are there rules or good practices about it? I checked http://lxr.free-electrons.com/source/Documentation/arm/memory.txt , but did not find an answer.
Here are some examples of map_desc and map_io:
http://lxr.free-electrons.com/source/arch/arm/mach-versatile/versatile_dt.c#L45
DT_MACHINE_START(VERSATILE_PB, "ARM-Versatile (Device Tree Support)")
	.map_io		= versatile_map_io,
	.init_early	= versatile_init_early,
	.init_machine	= versatile_dt_init,
	.dt_compat	= versatile_dt_match,
	.restart	= versatile_restart,
MACHINE_END
http://lxr.free-electrons.com/source/arch/arm/mach-versatile/core.c#L189
void __init versatile_map_io(void)
{
	iotable_init(versatile_io_desc, ARRAY_SIZE(versatile_io_desc));
}
static struct map_desc versatile_io_desc[] __initdata __maybe_unused = {
	{
		.virtual	= IO_ADDRESS(VERSATILE_SYS_BASE),
		.pfn		= __phys_to_pfn(VERSATILE_SYS_BASE),
		.length		= SZ_4K,
		.type		= MT_DEVICE
	}, {
		/* ... more entries ... */

too long for a comment...
Not an expert, but since map_desc is for static mappings, the addresses should come from the system manuals. virtual is the address at which the peripheral can be accessed from kernel virtual space; pfn (page frame number) is the physical address in page units.
The thing is, when you are in kernel space you go through the kernel's virtual mappings, so even to access a given physical address you need a mapping for it. That mapping can be a fixed, one-to-one-style mapping, which I believe is what you get out of map_desc.
Static mapping is map_desc; dynamic mapping is ioremap. So if you want to play with physical IO, try ioremap first; if that doesn't work, fall back to map_desc for the special cases.
DMA-API-HOWTO provides a good entry point to the different kinds of address mappings in Linux.

Related

Where and When Linux Kernel Setup GDT?

I have some doubts regarding the GDT in Linux. I tried to get GDT info in kernel space (Ring0); my test code is called in system-call context. In the test code I print the ss register (segment selector) and fetch the ss segment descriptor via GDTR and the ss segment selector.
77 void printGDTInfo(void) {
78 struct desc_ptr pgdt, *pss_desc;
79 unsigned long ssr;
80 struct desc_struct *ss_desc;
81
82 // Get GDTR
83 native_store_gdt(&pgdt);
84 unsigned long gdt_addr = pgdt.address;
85 unsigned long gdt_size = pgdt.size;
86 printk("[GDT] Addr:%lu |Size:%lu\n", gdt_addr, gdt_size);
87
88 // Get SS Register
89 asm("mov %%ss, %%eax"
90 :"=a"(ssr));
91 printk("SSR In Kernel:%lu\n", ssr);
92 unsigned long desc_index = ssr >> 3; // SHIFT for Descriptor Index
93 printk("SSR Shift:%lu\n", desc_index);
94 ss_desc = (struct desc_struct*)(gdt_addr + desc_index * sizeof(struct desc_struct));
95 printk("SSR:Base0:%lu, Base1:%lu,Base2:%lu\n", ss_desc->base0, ss_desc->base1, ss_desc->base2);
96 }
What confuses me most is that the "base" fields in the ss descriptor are all zero (the line 95 print). I tried to print the __USER_DS segment descriptor; its "base" fields are also zero.
Is that true? Do all segments in Linux use the same base address (zero)?
I want to check the GDT initialization in the Linux source code, but I am not sure where and when Linux sets up the GDT.
I found code like this in "arch/x86/kernel/cpu/common.c"; the second parameter (the base) of GDT_ENTRY_INIT is zero, which means the base0/base1/base2 fields in the segment descriptor are all zero.
[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS]   = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
If that is true, all the segments have the same base address (zero). As a result, will the same virtual address in Ring0 and Ring3 map to the same linear address?
I appreciate your help :)

Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back"

I am mapping DMA-coherent memory from the kernel to user space. At user level I use mmap(), and in the kernel driver I use dma_alloc_coherent() and afterwards remap_pfn_range() to remap the pages. This basically works, as I can write data to the mapped area in my app and verify it in my kernel driver.
However, despite using dma_alloc_coherent() (which should allocate uncached memory) and pgprot_noncached(), the kernel informs me with this dmesg output:
map pfn ram range req uncached-minus for [mem 0xABC-0xCBA], got write-back
In my understanding, write-back is cached memory, but I need uncached memory for the DMA operation.
The Code (only showing the important parts):
User App
fd = open(dev_fn, O_RDWR | O_SYNC);
if (fd > 0)
{
mem = mmap ( NULL
, mmap_len
, PROT_READ | PROT_WRITE
, MAP_SHARED
, fd
, 0
);
}
For testing purposes I used mmap_len = getpagesize(); Which is 4096.
Kernel Driver
typedef struct
{
size_t mem_size;
dma_addr_t dma_addr;
void *cpu_addr;
} Dma_Priv;
fops_mmap()
{
dma_priv->mem_size = vma->vm_end - vma->vm_start;
dma_priv->cpu_addr = dma_alloc_coherent ( &gen_dev
, dma_priv->mem_size
, &dma_priv->dma_addr
, GFP_KERNEL
);
if (dma_priv->cpu_addr != NULL)
{
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range ( vma
, vma->vm_start
, virt_to_phys(dma_priv->cpu_addr)>>PAGE_SHIFT
, dma_priv->mem_size
, vma->vm_page_prot
);
}
}
Useful information I've found
PATting Linux:
Page 7 --> mmap with O_SYNC (uncached):
Applications can open /dev/mem with the O_SYNC flag and then do mmap
on it. With that, applications will be accessing that address with an
uncached memory type. mmap will succeed only if there is no other
conflicting mappings to the same region.
I used the flag; it doesn't help.
Page 7 --> mmap without O_SYNC (uncached-minus):
mmap without O_SYNC, no existing mapping, and not a write-back region:
For an mmap that comes under this category, we use uncached-minus type
mapping. In the absence of any MTRR for this region, the effective
type will be uncached. But in cases where there is an MTRR, making
this region write-combine, then the effective type will be
write-combine.
pgprot_noncached()
In /arch/x86/include/asm/pgtable.h I found this:
#define pgprot_noncached(prot) \
((boot_cpu_data.x86 > 3) \
? (__pgprot(pgprot_val(prot) | \
cachemode2protval(_PAGE_CACHE_MODE_UC_MINUS))) \
: (prot))
Is it possible that x86 always turns a noncached request into UC_MINUS, which, in combination with the MTRRs, results in cached write-back?
I am using Ubuntu 16.04.1, Kernel: 4.10.0-40-generic.
https://www.kernel.org/doc/Documentation/x86/pat.txt
Drivers wanting to export some pages to userspace do it by using mmap
interface and a combination of 1) pgprot_noncached() 2)
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
With PAT support, a new API pgprot_writecombine is being added. So,
drivers can continue to use the above sequence, with either
pgprot_noncached() or pgprot_writecombine() in step 1, followed by
step 2.
In addition, step 2 internally tracks the region as UC or WC in
memtype list in order to ensure no conflicting mapping.
Note that this set of APIs only works with IO (non RAM) regions. If
driver wants to export a RAM region, it has to do set_memory_uc() or
set_memory_wc() as step 0 above and also track the usage of those
pages and use set_memory_wb() before the page is freed to free pool.
I added set_memory_uc() before pgprot_noncached() and it did the trick.
if (dma_priv->cpu_addr != NULL)
{
set_memory_uc((unsigned long)dma_priv->cpu_addr, (dma_priv->mem_size/PAGE_SIZE));
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range ( vma
, vma->vm_start
, virt_to_phys(dma_priv->cpu_addr)>>PAGE_SHIFT
, dma_priv->mem_size
, vma->vm_page_prot
);
}
This answer was posted as an edit to the question Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back" by the OP Gbo under CC BY-SA 4.0.

How does the Linux kernel get info about the processors and the cores?

Assume we have a blank computer without any OS and we are installing a Linux. Where in the kernel is the code that identifies the processors and the cores and get information about/from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure: cpuinfo_x86 for x86.
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which uses a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we find the show_cpuinfo function, in the same file at line 57.
Here you can see
seq_printf(m, "processor\t: %u\n"
	"vendor_id\t: %s\n"
	"cpu family\t: %d\n"
	"model\t\t: %u\n"
	"model name\t: %s\n",
	cpu,
	c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
	c->x86,
	c->x86_model,
	c->x86_model_id[0] ? c->x86_model_id : "unknown");
The structure c is declared on the first line as struct cpuinfo_x86, which is defined in arch/x86/include/asm/processor.h. If you search for references to that structure you will find the function cpu_detect, and that function calls cpuid, which is finally resolved to native_cpuid, which looks like this:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
				unsigned int *ecx, unsigned int *edx)
{
	/* ecx is often an input as well as an output. */
	asm volatile("cpuid"
	    : "=a" (*eax),
	      "=b" (*ebx),
	      "=c" (*ecx),
	      "=d" (*edx)
	    : "0" (*eax), "2" (*ecx)
	    : "memory");
}
And here you see the assembler instruction cpuid. This little instruction does the real work.
This information comes from the BIOS plus the hardware DB. You can also get it directly with dmidecode, for example (if you need more info, check the dmidecode source code):
sudo dmidecode -t processor

Why ISA doesn't need request_mem_region

I'm reading the source code of LDD3 Chapter 9. And there's an example for ISA driver named silly.
The following is the initialization for the module. What I don't understand is why there is no call to request_mem_region() before the invocation of ioremap() on line 282.
268 int silly_init(void)
269 {
270 int result = register_chrdev(silly_major, "silly", &silly_fops);
271 if (result < 0) {
272 printk(KERN_INFO "silly: can't get major number\n");
273 return result;
274 }
275 if (silly_major == 0)
276 silly_major = result; /* dynamic */
277 /*
278 * Set up our I/O range.
279 */
280
281 /* this line appears in silly_init */
282 io_base = ioremap(ISA_BASE, ISA_MAX - ISA_BASE);
283 return 0;
284 }
This particular driver allows accesses to all the memory in the range 0xA0000..0x100000.
If there actually are any devices in this range, then it is likely that some other driver has already reserved some of that memory, so if silly were to call request_mem_region, it would fail, or it would be necessary to unload that other driver before loading silly.
On a PC, this range contains memory of the graphics card, and the system BIOS:
$ cat /proc/iomem
...
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
000f0000-000fffff : System ROM
...
Unloading the graphics driver often is not possible (because it's not a module), and would prevent you from seeing what the silly driver does, and the ROM memory ranges are reserved by the kernel itself and cannot be freed.
TL;DR: Not calling request_mem_region is a particular quirk of the silly driver.
Any 'real' driver would be required to call it.

nopage () method implementation

Does anyone know how a virtual address is translated to a physical address in the nopage method?
With reference to the Device Drivers book, the nopage method is given as:
struct page *simple_vma_nopage(struct vm_area_struct *vma,
unsigned long address, int *type)
{
struct page *pageptr;
unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physaddr = address - vma->vm_start + offset;
unsigned long pageframe = physaddr >> PAGE_SHIFT;
if (!pfn_valid(pageframe))
return NOPAGE_SIGBUS;
pageptr = pfn_to_page(pageframe);
get_page(pageptr);
if (type)
*type = VM_FAULT_MINOR;
return pageptr;
}
PAGE_SHIFT is the number of bits used to represent the offset within a page, for both virtual and physical addresses.
But what is the offset variable?
How is a physical address calculated from arithmetic operations on virtual address variables like address and vm_start?
I feel the documentation of vm_pgoff is not very clear.
This is the offset of the first page of the memory region in RAM. So if our RAM begins at 0x00000000 and our memory region begins at 0x0000A000, then vm_pgoff = 10. If you revisit the mmap system call, you can see that the "offset" we pass is the offset of the starting byte in the file from which "length" bytes will be mapped onto the memory region. This offset can be converted to an address by left-shifting it by PAGE_SHIFT, which is 12 (i.e. a 4 KB page size).
Now, irrespective of whether the cr3 register is used in the linear-to-physical address translation or not, "address - vm_start" gives the size of the portion between the two addresses.
example:
vm_start = 0xc0080000
address = 0xc0090000
address - vm_start = 0x00010000
physaddr = (address - vma->vm_start) + offset
         = 0x00010000 + (10 << PAGE_SHIFT)
         = offset_into_region_that_faulted + start_addr_of_memory_region_in_RAM
         = 0x00010000 + 0x0000A000
         = 0x0001A000
Now, since this is the physical address, we convert it to a page frame number by right-shifting by PAGE_SHIFT: 0x0001A000 >> 12 = 0x1A = 26 (decimal).
Therefore the 26th page frame must be loaded with the data from the file that is being mapped.
Hence the data is retrieved from disk using the inode's struct address_space, which contains the information about the location of the page on disk (swap space).
Once the data is brought in, we return the struct page that represents this data in the page frame for which the page fault occurred. We return this to the user.
This is my understanding so far, but I haven't tested it.
No, the statement in the book is correct because, as aforementioned, "physical" is just the address of the start of the region/portion that you want to map out of the physical memory, which runs from the "off" physical address up to "simple_region_size". The "simple_region_size" value is decided by the user, and similarly "simple_region_start" is decided by the user, with simple_region_start >= off. So the maximum physical memory the user can map is decided by psize = simple_region_size - off, i.e. from the start of the physical memory to the end of the portion.
But how much will actually be mapped into this memory region is given by "vma->vm_end - vma->vm_start" and is represented by vsize. Hence the need to perform the sanity check, since the user could otherwise get more than intended.
Kind regards,
Sanjeev Ranot
"simple_region_start" is the offset from the starting of
physical memory out of which our sub-region needs to be mapped
Example:
off = start of the physical memory (page aligned)= 0xd000 8000
simple_region_start = 0x1000
therefore the physical address of the start of the sub region
we want to map is = 0xd000 8000 + 0x1000
= 0xd000 9000
now virtual size is the portion that needs to be mapped from the
physical memory available. This must be defined properly by the user.
simple_region_size = physical address pointing to last of the
portion that we need to map.
So if we wanted 8KBs to be mapped from the physical memory available
then following is how the calculation goes
simple_region_size = physical address just beyond the last of our portion
simple_region_size = 0xd000 9000 + 0x2000 (for the 8KBs)
simple_region_size = 0xd000 B000
So our 8KBs of portion will range from physical addresses [0xd000 B000 to 0xd000 9000]
Hence physical size i.e. psize = 0x2000
We perform the sanity check: if the size of our portion of physical memory is smaller than what the user tries to map using the full virtual address range of this memory region, then we raise an exception (say, for example, vsize = 0x3000). Otherwise we use the API remap_pfn_range to map the portion of physical memory, passing in the physical address and not the page frame number as was done previously, since this is IO memory. I feel it should have been the API io_remap_page_range here, as aforementioned.
So it will map the portion of physical memory starting at physical address 0xd0009000 onto the user linear addresses starting at vma->vm_start, of size vsize.
N.B. As before, I have yet to test this!