I am reading the book "Linux Kernel Development" and found some functions that confuse me. They are listed below:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
void __free_pages(struct page *page, unsigned int order)
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
The problem is the use of the double underscore in the function names, and how the functions are paired.
1. When does the Linux kernel use a double underscore in a function name?
2. Why is alloc_pages paired with __free_pages, and not with free_pages?
As you can notice:
alloc_pages() / __free_pages() work with a "struct page *" (page descriptor): alloc_pages() returns one and __free_pages() takes one as argument.
They are usually used internally by infrastructure kernel code, such as the page fault handler, that wants to manipulate the page descriptor rather than the contents of the memory block.
__get_free_pages() / free_pages() work with an "unsigned long" (virtual address of the memory block): __get_free_pages() returns one and free_pages() takes one as argument.
They can be used by code that wants to use the memory block itself: after allocation, you can read from and write to this memory block.
As for the names and the double underscore "__", you don't need to worry about them too much. By convention a leading "__" often marks a lower-level or internal variant of a function, but the convention is not applied consistently. Some kernel functions were named casually when they were first written, and by the time people realized the names were not ideal, the functions were already used so widely in the kernel that nobody bothered to rename them.
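I recently wrote a module that uses the simple_read_from_buffer()/simple_write_to_buffer() functions and the copy_to_user()/copy_from_user() functions.
What is the difference between the two families? From my understanding, the copy_..._user functions are more secure. Please correct me if I'm mistaken.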
Furthermore, is it a bad idea to mix the two functions in one program? For example, I used simple_read_from_buffer in my misc dev read function, and copy_from_user in my write function.
Edit: I believe I've found the answer to my question from reading fs/libfs.c (I wasn't aware that this was where the source code was located for these functions); from my understanding the simple_...() functions are essentially a wrapper around the copy_...() functions. I think it was appropriate in my case to use copy_from_user for the misc device write function as I needed to validate that the input matched a specific string before returning it to the user buffer.
I will still leave this question open though in case someone has a better explanation or wants to correct me!
simple_read_from_buffer and simple_write_to_buffer are just convenience wrappers around copy_{to,from}_user for when all you need to do is service a read from userspace from a kernel buffer, or service a write from userspace to a kernel buffer.
From my understanding, the copy_..._user functions are more secure.
Neither version is "more secure" than the other. Whether or not one might be more secure depends on the specific use case.
I would say that simple_{read,write}_... could in general be more secure since they do all the appropriate checks for you before copying. If all you need to do is service a read/write to/from a kernel buffer, then using simple_{read,write}_... is surely faster and less error-prone than manually checking and calling copy_{from,to}_user.
Here's a good example where those functions would be useful:
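#include <linux/fs.h>   /* simple_read_from_buffer(), simple_write_to_buffer() */

#define SZ 1024
static char kernel_buf[SZ];

/* read(): copy from kernel_buf to userspace, honoring and advancing *off */
static ssize_t dummy_read(struct file *filp, char __user *user_buf, size_t n, loff_t *off)
{
    return simple_read_from_buffer(user_buf, n, off, kernel_buf, SZ);
}

/* write(): copy from userspace into kernel_buf; the user pointer is const here */
static ssize_t dummy_write(struct file *filp, const char __user *user_buf, size_t n, loff_t *off)
{
    return simple_write_to_buffer(kernel_buf, SZ, off, user_buf, n);
}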
It's hard to tell what exactly you need without seeing your module's code, but I would say that you can either:
Use copy_{from,to}_user if you want to control the exact behavior of your function.
Use simple_{read,write}_... if you don't need such fine-grained control and you are OK with just returning the standard values produced by those wrappers.
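To make "all the appropriate checks" concrete, here is a userspace model of the bounds handling that simple_read_from_buffer() performs, as far as I understand fs/libfs.c. It is only a sketch: memcpy() stands in for copy_to_user(), so the -EFAULT path for a bad user pointer is not modelled, and model_read_from_buffer is a made-up name.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t, off_t */

/*
 * Userspace model of the checks in simple_read_from_buffer().
 * Assumption: memcpy() stands in for copy_to_user(), so the -EFAULT
 * path for a bad user pointer is not modelled here.
 */
static ssize_t model_read_from_buffer(void *to, size_t count, off_t *ppos,
                                      const void *from, size_t available)
{
    off_t pos = *ppos;

    if (pos < 0)
        return -EINVAL;                  /* negative offsets are rejected */
    if ((size_t)pos >= available || count == 0)
        return 0;                        /* at or past the end: report EOF */
    if (count > available - (size_t)pos)
        count = available - (size_t)pos; /* clamp to what is left */

    memcpy(to, (const char *)from + pos, count);
    *ppos = pos + (off_t)count;          /* advance the file position */
    return (ssize_t)count;
}
```

If you call copy_to_user() directly, these are the checks you have to write yourself, which is where the "less error-prone" argument comes from.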
I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
    uint32_t n;
    memcpy(&n, ptr, sizeof(n));
    return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.3.2.3 Pointers, paragraph 7:
A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned for the
pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not correctly aligned? And if it is aligned, there is no point in having the function, as you could write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may expect an int* to be aligned correctly and any unaligned int* would mean undefined behaviour as well, so the generated code for this function might take some shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching I come to the conclusion that you need a mutex and do not need volatile and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
passes the pointer as a uint32_t *. If the address does not actually hold a correctly aligned uint32_t, the caller has already invoked UB by creating that pointer. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare
equal to the original pointer. When a pointer to an object is
converted to a pointer to a character type, the result points to the
lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining
bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of "memcpy" as an escape valve for aliasing issues even when pointers are known to be aligned, the only way compilers can efficiently process code which needs to make type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for such type. As a result, your instinct that the approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is either passed a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.
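For reference, a minimal sketch of the memcpy approach from the question, taking a const void * so that no misaligned uint32_t * is ever created. The names read_u32 and read_u32_le are mine, not from the project. The first function returns the value in host byte order; the second assumes the data is stored little-endian and assembles it byte by byte. Neither one deals with the volatile question, which is outside what standard C can answer.

```c
#include <stdint.h>
#include <string.h>

/* Read a uint32_t in host byte order from a possibly unaligned address. */
static uint32_t read_u32(const void *ptr)
{
    uint32_t n;
    memcpy(&n, ptr, sizeof n);   /* memcpy has no alignment requirement */
    return n;
}

/* Read a little-endian uint32_t byte by byte, whatever the host endianness. */
static uint32_t read_u32_le(const void *ptr)
{
    const unsigned char *p = ptr;
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

Compilers commonly turn the fixed-size memcpy into a plain load on targets that allow unaligned access, so this need not be slower than the cast.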
I have gone through the following topic and I still have some questions.
ioread32 followed by iowrite32 not giving same value
In the link, where can I get my base which is defined as 0xfed00000 in the post ?
what should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
what should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
By having the Makefile and generating the kernel module, I should use the insmod and then dmesg to check if the code works as I expect, is this correct ?
In the case, should I add iounmap(virtual_base); before return 0; in the source ?
Thanks
In the link, where can I get my base which is defined as 0xfed00000 in the post ?
It's the base (physical) address of the peripheral's registers.
If the peripheral is a discrete chip on the board, then consult the board documentation.
If the peripheral is embedded in a SoC, then consult the memory map in the SoC datasheet.
what should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
what should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
These two routines should be called with the same first and second parameters.
The length/size is the number of bytes the peripheral's register set occupies.
Sometimes the entire memory region to the next peripheral is specified.
By having the Makefile and generating the kernel module, I should use the insmod and then dmesg to check if the code works as I expect, is this correct ?
A judicious sprinkling of printk() statements is the tried & true method of testing a Linux kernel driver/module.
Linux also has the kdb and kgdb kernel debuggers.
In the case, should I add iounmap(virtual_base); before return 0; in the source ?
Do not copy that poorly written example of init code.
If ioremap() is performed in a driver's probe() (or other initialization) routine, then the iounmap() should be in the probe's error exit sequence and in the driver's remove() routine (or whatever routine complements the init).
There are numerous examples to study in the Linux kernel source. Use an online Linux cross reference such as http://lxr.free-electrons.com/source/
Note that almost all Linux drivers use iounmap() two or more times.
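The pairing can be sketched in userspace. This is not driver code: the stub_* functions below are hypothetical stand-ins for request_mem_region()/release_mem_region() and ioremap()/iounmap() that only count what is held, and demo_probe/demo_remove are made-up names. The point is the shape: the unmap appears once in the error exit of the init path and once in the teardown path.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for request_mem_region()/release_mem_region()
 * and ioremap()/iounmap(); they only count acquisitions so the pairing
 * can be checked in userspace.
 */
static int regions_held;
static int mappings_held;
static bool later_step_fails;            /* simulates a failure after mapping */
static char fake_regs[16];
static void *mapped_base;

static bool stub_request_region(void) { regions_held++; return true; }
static void stub_release_region(void) { regions_held--; }
static void *stub_ioremap(void)       { mappings_held++; return fake_regs; }
static void stub_iounmap(void *p)     { (void)p; mappings_held--; }

/* Init/probe: on any failure, undo the earlier steps in reverse order. */
static int demo_probe(void)
{
    if (!stub_request_region())
        return -1;

    mapped_base = stub_ioremap();
    if (!mapped_base)
        goto err_release;

    if (later_step_fails)
        goto err_unmap;                  /* error exit: unmap here */

    return 0;                            /* success: keep the mapping */

err_unmap:
    stub_iounmap(mapped_base);
    mapped_base = NULL;
err_release:
    stub_release_region();
    return -1;
}

/* Exit/remove: unmap again here, mirroring a successful probe. */
static void demo_remove(void)
{
    stub_iounmap(mapped_base);
    mapped_base = NULL;
    stub_release_region();
}
```

A real driver follows the same order with the real calls, which is why iounmap() tends to show up at least twice in a driver's source.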
I have seen that __iomem is used to store the return type of ioremap(), but I have used u32 in ARM architecture for it and it works well.
So what difference does __iomem make here? And in which circumstances should I use it exactly?
Lots of type casts are going to just "work well". However, this is not very strict. Nothing stops you from casting a u32 to a u32 * and dereferencing it, but this does not follow the kernel API and is error-prone.
__iomem is a cookie used by Sparse, a tool used to find possible coding faults in the kernel. If you don't compile your kernel code with Sparse enabled, __iomem will be ignored anyway.
Use Sparse by first installing it, and then adding C=1 to your make call. For example, when building a module, use:
make -C $KPATH M=$PWD C=1 modules
When Sparse is running, __iomem is defined like this (otherwise it expands to nothing):
# define __iomem __attribute__((noderef, address_space(2)))
Adding (and requiring) a cookie like __iomem for all I/O accesses is a way to be stricter and avoid programming errors. You don't want to read/write from/to I/O memory regions with absolute addresses because you're usually using virtual memory. Thus,
void __iomem *ioremap(phys_addr_t offset, unsigned long size);
is usually called to get the virtual address of an I/O physical address offset, for a specified length size in bytes. ioremap() returns a pointer with an __iomem cookie, so this may now be used with inline functions like readl()/writel() (although it's now preferable to use the more explicit macros ioread32()/iowrite32(), for example), which accept __iomem addresses.
Also, the noderef attribute is used by Sparse to make sure you don't dereference an __iomem pointer. Dereferencing should work on architectures where the I/O is really memory-mapped, but other architectures use special instructions for accessing I/O, and in that case dereferencing won't work.
Let's look at an example:
void *io = ioremap(42, 4);
Sparse is not happy:
warning: incorrect type in initializer (different address spaces)
expected void *io
got void [noderef] <asn:2>*
Or:
u32 __iomem* io = ioremap(42, 4);
pr_info("%x\n", *io);
Sparse is not happy either:
warning: dereference of noderef expression
In the last example, the first line is correct, because ioremap() returns its value to an __iomem variable. But then we dereference it, and we're not supposed to.
This makes Sparse happy:
void __iomem* io = ioremap(42, 4);
pr_info("%x\n", ioread32(io));
Bottom line: always use __iomem where it's required (as a return type or as a parameter type), and use Sparse to make sure you did so. Also: do not dereference an __iomem pointer.
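To see why the annotation costs nothing in a normal build, here is a small userspace sketch modelled on the kernel's pattern: the attribute exists only when Sparse defines __CHECKER__, and is empty for the real compiler. DEMO_IOMEM, DEMO_FORCE, demo_ioread32 and fake_reg are illustrative names, not kernel API, and I have only reasoned about the Sparse side of this, not run it through Sparse.

```c
#include <stdint.h>

/*
 * Modelled on how the kernel defines __iomem and __force: attributes
 * under Sparse (which defines __CHECKER__), nothing for the compiler.
 * DEMO_IOMEM / DEMO_FORCE / demo_ioread32 are illustrative names only.
 */
#ifdef __CHECKER__
# define DEMO_IOMEM __attribute__((noderef, address_space(2)))
# define DEMO_FORCE __attribute__((force))
#else
# define DEMO_IOMEM
# define DEMO_FORCE
#endif

/* Accessor in the style of ioread32(): the only place that dereferences. */
static uint32_t demo_ioread32(const volatile void DEMO_IOMEM *addr)
{
    return *(const volatile uint32_t DEMO_FORCE *)addr;
}

static uint32_t fake_reg = 0xdeadbeef;   /* ordinary memory posing as a register */

static uint32_t demo_read_fake_reg(void)
{
    /* The cast plays the role of ioremap(): it hands out an "I/O" pointer. */
    void DEMO_IOMEM *io = (void DEMO_FORCE DEMO_IOMEM *)&fake_reg;
    return demo_ioread32(io);
}
```

With a plain compiler both macros vanish and this is ordinary pointer code; the annotations only matter to the checker, which is the whole idea behind the cookie.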
Edit: Here's a great LWN article about the inception of __iomem and functions using it.
Simple, Straight and Short (S3) Explanation.
There is an article https://lwn.net/Articles/653585/ for more details.
I'm trying to write a function that write-protects every pte in a given vm_area_struct. What is the function that gives me the ptep for a given address? I have:
pte_t *ptep;
for (addr = start; addr < end; addr += PAGE_SIZE) {
ptep = WHATS_THIS_FUNCTION_CALLED(addr);
ptep_set_wrprotect(mm, addr, ptep);
}
What's the WHATS_THIS_FUNCTION_CALLED called?
The short answer to your question is to use __get_locked_pte. However, I would advise against it, since there are much better ways (more efficient, and fairer in terms of resource contention) to accomplish your goal.
In Linux, the typical idiom for traversing page tables is a set of nested loops four levels deep (four was the number of page table levels Linux supported when this was written; newer kernels add a fifth level, p4d). For examples, see copy_page_range and apply_to_page_range in mm/memory.c. In fact, copy_page_range is called from dup_mmap in kernel/fork.c when forking, and it essentially operates on an entire vm_area_struct.
You can replicate the idiom used in either of those functions. There are some caveats, however. For example, copy_page_range fully supports transparent hugepages (added in 2.6.38) by using a completely separate copy_huge_pmd inside copy_pmd_range. Unless you want to write two separate functions (one for normal pages and one for transparent huge pages), see "Graceful fallback" in Documentation/vm/transhuge.txt.
The point is that virtual memory in Linux is very complicated, so be sure to completely understand every possible use case. follow_page in mm/memory.c should demonstrate how to cover all of your bases.
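To picture what the nested loops iterate over: each level of the walk is indexed by its own slice of the virtual address. The sketch below is plain userspace arithmetic, not kernel API, and it assumes the x86-64 layout with four levels and 4 KiB pages (9 index bits per level, 12 offset bits). The DEMO_* constants and demo_split are names I made up; in the kernel the corresponding helpers are pgd_offset, pud_offset, pmd_offset and the pte_offset family.

```c
#include <stdint.h>

/*
 * Assumed layout: x86-64 with 4-level paging and 4 KiB pages,
 * i.e. 9 index bits per level and a 12-bit page offset.
 */
enum {
    DEMO_PAGE_SHIFT     = 12,
    DEMO_PMD_SHIFT      = 21,
    DEMO_PUD_SHIFT      = 30,
    DEMO_PGDIR_SHIFT    = 39,
    DEMO_PTRS_PER_TABLE = 512
};

struct demo_indices {
    unsigned pgd, pud, pmd, pte;   /* index into the table at each level */
    unsigned offset;               /* byte offset inside the page */
};

/* Split a virtual address into the per-level table indices. */
static struct demo_indices demo_split(uint64_t addr)
{
    struct demo_indices ix;

    ix.pgd    = (unsigned)((addr >> DEMO_PGDIR_SHIFT) & (DEMO_PTRS_PER_TABLE - 1));
    ix.pud    = (unsigned)((addr >> DEMO_PUD_SHIFT)   & (DEMO_PTRS_PER_TABLE - 1));
    ix.pmd    = (unsigned)((addr >> DEMO_PMD_SHIFT)   & (DEMO_PTRS_PER_TABLE - 1));
    ix.pte    = (unsigned)((addr >> DEMO_PAGE_SHIFT)  & (DEMO_PTRS_PER_TABLE - 1));
    ix.offset = (unsigned)(addr & ((1u << DEMO_PAGE_SHIFT) - 1));
    return ix;
}
```

Stepping addr by PAGE_SIZE, as in the question's loop, advances the pte index and only occasionally the upper ones, which is why the kernel walks level by level instead of looking up every address from the top.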
I believe you are looking for either the function virt_to_pte as defined here or your own re-implementation of it.
This function uses pte_offset_kernel(pmd_t *dir, unsigned long address), which takes a pmd (Page Middle Directory) entry and an address, though one could also use pte_offset(pmd_t *dir, unsigned long address).
See Linux Device Drivers 3rd Edition Chapter 15 or Linux Device Drivers, 2nd Edition Chapter 13 mmap and DMA for further references.