How do I find the ptep for a given address? - linux

I'm trying to write a function that write-protects every pte in a given vm_area_struct. What is the function that gives me the ptep for a given address? I have:
pte_t *ptep;
for (addr = start; addr < end; addr += PAGE_SIZE) {
    ptep = WHATS_THIS_FUNCTION_CALLED(addr);
    ptep_set_wrprotect(mm, addr, ptep);
}
What's the WHATS_THIS_FUNCTION_CALLED called?

The short answer to your question is to use __get_locked_pte. However, I would advise against it, since there are much better ways (more efficient, and fairer in terms of resource contention) to accomplish your goal.
In Linux, the typical idiom for traversing page tables is a nested for loop four levels deep (four is the number of page table levels Linux supports). For examples, see copy_page_range and apply_to_page_range in mm/memory.c. In fact, if you look closely at copy_page_range, it is called when forking from dup_mmap in kernel/fork.c, and it essentially operates on an entire vm_area_struct.
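For illustration, here is a hedged sketch of that idiom collapsed into a simple per-address walk (pre-p4d page-table API, roughly the era this answer targets; the function name is made up, mmap_sem must be held by the caller, TLB flushing is omitted, and transparent huge pages are not handled - see the caveats below):

#include <linux/mm.h>
#include <asm/pgtable.h>

static void wrprotect_vma(struct vm_area_struct *vma)
{
    struct mm_struct *mm = vma->vm_mm;
    unsigned long addr;

    for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
        spinlock_t *ptl;

        pgd = pgd_offset(mm, addr);
        if (pgd_none_or_clear_bad(pgd))
            continue;
        pud = pud_offset(pgd, addr);
        if (pud_none_or_clear_bad(pud))
            continue;
        pmd = pmd_offset(pud, addr);
        if (pmd_none_or_clear_bad(pmd))   /* would mis-handle a huge pmd */
            continue;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (pte_present(*pte))
            ptep_set_wrprotect(mm, addr, pte);
        pte_unmap_unlock(pte, ptl);
    }
}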
You can replicate the idiom used in either of those functions. There are some caveats, however. For example, copy_page_range fully supports transparent hugepages (since 2.6.38) by using a completely separate copy_huge_pmd inside copy_pmd_range. You will have to deal with that case as well, unless you want to write two separate functions (one for normal pages and one for transparent huge pages; see "Graceful fallback" in Documentation/vm/transhuge.txt).
The point is that virtual memory in Linux is very complicated, so be sure to completely understand every possible use case. follow_page in mm/memory.c should demonstrate how to cover all of your bases.

I believe you are looking for either the function virt_to_pte as defined here or your own re-implementation of it.
This function uses pte_offset_kernel(pmd_t *dir, unsigned long address), which takes a pmd (Page Middle Directory) entry and an address, though one could also use pte_offset(pmd_t *dir, unsigned long address).
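If you only need the pte for a single address, a hedged sketch along those lines (my_virt_to_pte is a made-up name; pre-p4d page-table layout assumed):

static pte_t *my_virt_to_pte(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);   /* use pgd_offset_k(addr) for kernel addresses */
    pud_t *pud;
    pmd_t *pmd;

    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return NULL;
    pud = pud_offset(pgd, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return NULL;
    /* with CONFIG_HIGHPTE, user pte pages may need pte_offset_map() instead */
    return pte_offset_kernel(pmd, addr);
}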
See Linux Device Drivers 3rd Edition Chapter 15 or Linux Device Drivers, 2nd Edition Chapter 13 mmap and DMA for further references.

Related

What does the IS_ALIGNED macro in the linux kernel do?

I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
    DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
    BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 (nine ones in binary). Then 111111111 is bitwise-ANDed with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, !IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary (the last 9 bits are zero). This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this that I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (the destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?", if you subtract 1 from a number that ends with all zeroes, you will get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero-value. After we AND the address with the mask containing ones in the lowest-order bits we get a 0 if any only if the lowest 9 (in this case) bits are zero.
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process then reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html

How to find TLS segments of the current thread on linux amd64?

I'm looking for a way to find out the memory addresses of TLS segments for the current thread on linux, amd64. Bonus point for a solution that works on OSX.
I've looked into various language runtimes and GCs (like Boehm), but so far I couldn't get through the multiple layers of abstraction meant to support all kinds of systems. Any help appreciated.
Did you have a look at the solution Martin and I came up with in druntime?
What we do there boils down to scanning the segments in the corresponding dl_phdr_info (obtained by looking for the correct one using dl_iterate_phdr) for the segment with type PT_TLS, and storing its module id and size.
You can then get the start of the address range on the current thread by calling __tls_get_addr for offset 0 and the module id (there is an offset on some archs), and the end by simply adding the size you determined to that. If you do not need to support shared libraries, you can also simply use fs/gs on x86 for that (might be required if you want to link a static executable).
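A rough sketch of that scan (the struct and callback names are mine; dlpi_tls_modid needs a reasonably recent glibc, and the final __tls_get_addr step is arch-specific, as noted above):

#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

struct tls_seg { size_t modid; size_t size; };

/* Remember the module id and size of the first PT_TLS segment found
 * (typically the main executable; libraries with their own TLS need more care). */
static int find_tls_cb(struct dl_phdr_info *info, size_t sz, void *data)
{
    struct tls_seg *out = data;
    int i;

    for (i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type == PT_TLS) {
            out->modid = info->dlpi_tls_modid;      /* glibc >= 2.4 field */
            out->size  = info->dlpi_phdr[i].p_memsz;
            return 1;                               /* non-zero stops iteration */
        }
    }
    return 0;
}

/* Usage: struct tls_seg seg = {0}; dl_iterate_phdr(find_tls_cb, &seg);
 * the range then starts at __tls_get_addr for (seg.modid, offset 0), plus the
 * arch-specific offset, and ends seg.size bytes later. */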
This works for Linux and FreeBSD (and probably other ELF platforms), but not OS X. There, the best I could come up with so far is this:
void _d_dyld_getTLSRange(void* arbitraryTLSSymbol, void** start, size_t* size) {
    dyld_enumerate_tlv_storage(
        ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
            assert(state == dyld_tlv_state_allocated);
            if (info->tlv_addr <= arbitraryTLSSymbol &&
                arbitraryTLSSymbol < (info->tlv_addr + info->tlv_size)
            ) {
                // Found the range we are looking for.
                *start = info->tlv_addr;
                *size = info->tlv_size;
            }
        }
    );
}
The naive implementation currently used in LDC's druntime does not quite handle shared libraries, though, and dyld_enumerate_tlv_storage is from dyld_priv.h, which might or might not be a problem for App Store publishing.
On Linux, the thread-specific segment is set up via an arch_prctl(ARCH_SET_FS, <addr>) call. You can find out what it was set to in the current thread via arch_prctl(ARCH_GET_FS, ...).
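A quick x86_64-only sketch of reading it from userspace (glibc traditionally has no arch_prctl wrapper, so this goes through syscall(); the FS base points at the thread control block, with the TLS data living just below it in the usual x86_64 layout):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>

int main(void)
{
    unsigned long fs_base = 0;

    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fs_base) == 0)
        printf("FS base of this thread: %#lx\n", fs_base);
    return 0;
}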
Bonus point for a solution that works on OSX.
OSX is a completely different OS, and uses a completely different mechanism for its TLS support.

Accessing memory pointers in hardware registers

I'm working on enhancing the stock ahci driver provided in Linux in order to perform some needed tasks. I'm at the point of attempting to issue commands to the AHCI HBA for the hard drive to process. However, whenever I do so, my system locks up and reboots. Explaining the whole process of issuing a command to an AHCI drive is far too much for this question. If needed, reference this link for the full discussion (the process is spread across several pieces of the spec, but ch 4 has the necessary data structures).
Essentially, one writes the appropriate structures into memory regions defined by either the BIOS or the OS. The first memory region I should write to is the Command List Base Address contained in the register PxCLB (and PxCLBU if 64-bit addressing applies). My system is 64-bit, so I'm reading both 32-bit registers. My code is essentially this:
void __iomem * pbase = ahci_port_base(ap);
u32 __iomem *temp = (u32*)(pbase + PORT_LST_ADDR);
struct ahci_cmd_hdr *cmd_hdr = NULL;
cmd_hdr = (struct ahci_cmd_hdr *)(u64)
          ((u64)(*(temp + PORT_LST_ADDR_HI)) << 32 | *temp);
pr_info("%s:%d cmd_list is %p\n", __func__, __LINE__, cmd_hdr);
// problems with this next line, makes the system reboot
//pr_info("%s:%d cl[0]:0x%08x\n", __func__, __LINE__, cmd_hdr->opts);
The function ahci_port_base() is found in the ahci driver (at least it is for CentOS 6.x). Basically, it returns the proper address for that port in the AHCI memory region. PORT_LST_ADDR and PORT_LST_ADDR_HI are both macros defined in that driver. The address that I get after getting both the high and low addresses is usually something like 0x0000000037900000. Is this memory address in a space that I cannot simply dereference it?
I'm hitting my head against the wall at this point because this link shows that accessing it in this manner is essentially how it's done.
The address that I get after getting both the high and low addresses
is usually something like 0x0000000037900000. Is this memory address
in a space that I cannot simply dereference it?
Yes, you are correct - that's a bus address, and you can't just dereference it because paging is enabled. (You shouldn't be just dereferencing the iomapped addresses either - you should be using readl() / writel() for those, but the breakage here is more subtle).
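As a hedged sketch (reusing pbase and the PORT_LST_ADDR/PORT_LST_ADDR_HI macros from the question), the register reads themselves would look like this, and the result must be treated as a DMA/bus address rather than something to dereference:

u32 clb_lo = readl(pbase + PORT_LST_ADDR);
u32 clb_hi = readl(pbase + PORT_LST_ADDR_HI);
u64 clb_bus = ((u64)clb_hi << 32) | clb_lo;
/* clb_bus is the address the HBA uses for DMA; the CPU-visible mapping
 * is kept by the driver in pp->cmd_slot, as shown below. */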
It looks like the right way to access the ahci_cmd_hdr in that driver is:
struct ahci_port_priv *pp = ap->private_data;
cmd_hdr = pp->cmd_slot;

In general, on uClinux, is ioctl faster than writing to the /sys filesystem?

I have an embedded system I'm working with, and it currently uses sysfs to control certain features.
However, there is a function that we would like to speed up, if possible.
I discovered that this subsystem also supports an ioctl interface, but before rewriting the code, I decided to search for which is the faster interface (on uClinux) in general: sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state",O_WRONLY);
write( fd, "standby", 7 );
close( fd );
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
                           const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
    suspend_state_t state = PM_SUSPEND_STANDBY;
    const char * const *s;
#endif
    char *p;
    int len;
    int error = -EINVAL;

    p = memchr(buf, '\n', n);
    len = p ? p - buf : n;

    /* First, check if we are requested to hibernate */
    if (len == 7 && !strncmp(buf, "standby", len)) {
        error = enter_standby();
        goto Exit;

((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
    if (!data->frozen) {
        error = -EPERM;
        break;
    }
    error = enter_standby();
    break;
(so the ioctl() calls the same low-level function that the sysfs function did).
If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
    This System-V derived system call is obsolete; don't use it. On systems with /proc, the same information can be obtained via /proc/filesystems; use that interface instead.
I can't recall noticing stuff that had an ioctl() and a sysfs interface, but probably they exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck because they are not real files -- they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not uncommon misconception in Linux systems programming, I think. Programmers can be squeamish about system calls that aren't, well, system calls, and paranoid that opening a file will somehow be slower. Of course, file I/O in the ABI is just system calls anyway. What makes a normal (disk) file read slow is not the calls to open, read, write, whatever; it's the hardware bottleneck.
I always use low-level descriptor-based functions (open(), read()) instead of high-level streams when doing this, because at some point some experience led me to believe they were more reliable for this specifically (reading from /proc). I can't say whether that's definitively true.
The question was interesting, so I built a couple of modules, one for ioctl and one for sysfs; the ioctl one implements only a 4-byte copy_from_user and nothing more, and the sysfs one has nothing in its write interface.
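For reference, a minimal sketch of the two kernel-side write paths being compared (the names and the omitted kobject/miscdevice registration boilerplate are mine, not necessarily the exact modules used):

#include <linux/types.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

/* sysfs side: a store callback on a kobject attribute that does nothing */
static ssize_t bar_store(struct kobject *kobj, struct kobj_attribute *attr,
                         const char *buf, size_t count)
{
    return count;                         /* accept the write, do no work */
}
static struct kobj_attribute bar_attr = __ATTR(bar, 0200, NULL, bar_store);

/* ioctl side: copy 4 bytes from userspace and nothing more */
static long test_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    u32 val;

    if (copy_from_user(&val, (void __user *)arg, sizeof(val)))
        return -EFAULT;
    return 0;
}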
Then I ran a couple of userspace tests of up to 1 million iterations each; here are the results:
time ./sysfs /sys/kernel/kobject_example/bar
real 0m0.427s
user 0m0.056s
sys 0m0.368s
time ./ioctl /run/temp
real 0m0.236s
user 0m0.060s
sys 0m0.172s
Edit
I agree with @goldilocks's answer: the hardware is the real bottleneck, and in a Linux environment with a well written driver choosing ioctl or sysfs doesn't make a big difference. But if you are using uClinux, even a few CPU cycles can probably make a difference on your hardware.
The test I've done is for Linux, not uClinux, and it was never meant to be an absolute reference for profiling the two interfaces. My point is that you can write a book about how fast one or the other is, but only testing will let you know; it took me only a few minutes to set the thing up.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
    int val;
    int *pval;

    do
    {
        // fetch the pointer to the value
        pval = *ptr;

        // if it's NULL, then just return NULL, the smart pointer
        // will then become NULL as well
        if(pval == NULL)
            return NULL;

        // Grab the reference count
        val = lwarx(pval);

        // make sure the pointer we grabbed the value from
        // is still the same one referred to by 'ptr'
        if(pval != *ptr)
            continue;

        // Increment the reference count via 'stwcx'; if any other thread
        // has done anything that could potentially break this, it should
        // fail and try again
    } while(!stwcx(pval, val + 1));
    return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using x86's double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest, and the other one contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it's a reasonable substitute for most purposes.
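As a rough illustration of that pointer-plus-counter idea on x86_64 (assumes GCC/Clang; with -mcx16 the 16-byte compare-exchange can be emitted as lock cmpxchg16b, otherwise it may go through libatomic; the names are mine):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void     *ptr;   /* the value of interest */
    uint64_t  tag;   /* always-incrementing mutation count */
} __attribute__((aligned(16))) tagged_ptr;

/* Swap in new_ptr only if both the pointer and the tag are unchanged,
 * bumping the tag on success. */
static bool tagged_cas(tagged_ptr *loc, tagged_ptr expected, void *new_ptr)
{
    tagged_ptr desired = { new_ptr, expected.tag + 1 };
    return __atomic_compare_exchange(loc, &expected, &desired,
                                     false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}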
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with the lock prefix to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
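In C you rarely need to write that assembly yourself; compilers expose the same lock cmpxchg through builtins. A sketch with the GCC/Clang __sync builtin (MSVC's equivalent is _InterlockedCompareExchange):

#include <stdint.h>

/* Atomically stores newval into *dest only if *dest still equals expected;
 * returns the value that was previously in *dest. On x86 this compiles to
 * lock cmpxchg. */
static int32_t compare_and_swap(volatile int32_t *dest,
                                int32_t expected, int32_t newval)
{
    return __sync_val_compare_and_swap(dest, expected, newval);
}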
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If you have word-aligned pointers, the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
    intptr_t val;
    int *unpacked;

    do
    {
        val = (intptr_t)*ptr;     // packed pointer + counter
        unpacked = unpack(val);   // strip the counter bits
        if(unpacked == NULL)
            return NULL;
        // pointer is on the bottom
    } while(!cas(unpacked, val, val + 1));
    return unpacked;
}
I don't know whether LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing stuff that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue than just lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all of its memory in bulk at the end) or detailed allocation management so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and the stwcx without causing the stwcx to fail (the reservation is on pval, not on *ptr).
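For completeness, the InterlockedIncrement / XADD equivalent mentioned above, sketched with the GCC/Clang builtin:

#include <stdint.h>

/* Atomically increments *refcount and returns the previous value;
 * on x86 this compiles to lock xadd, the same primitive behind
 * InterlockedIncrement. */
static int32_t ref_inc(volatile int32_t *refcount)
{
    return __sync_fetch_and_add(refcount, 1);
}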
