How do I use performance counters inside of the kernel? - linux

I want to access performance counters inside the kernel. I found many ways to use performance counters in user space, but can you tell me some way to use those in kernel space.
Please don't specify tool name, I want to write my own code, preferably a kernel module. I am using Ubuntu with kernel 3.18.1.

http://www.cise.ufl.edu/~sb3/files/pmc.pdf
http://www.cs.inf.ethz.ch/stricker/lab/doc/intel-part4.pdf
The first pdf contains description on how to use pmc.
The second contains the address of perfeventsel0 and perfeventsel1.
Ive shown an example below.U'll need to set the event number and umask as per ur requirement.
void SetUpEvent(void){
int reg_addr=0x186;
int event_no=0x0024;
int umask=0x3F00;
int enable_bits=0x430000;
int event=event_no | umask | enable_bits;
__asm__ ("wrmsr" : : "c"(reg_addr), "a"(event), "d"(0x00));
}
/* Read the performance monitor counter */
long int ReadCounter(void){
long int count;
long int eax_low, edx_high;
int reg_addr=0xC1;
__asm__("rdmsr" : "=a"(eax_low), "=d"(edx_high) : "c"(reg_addr));
count = ((long int)eax_low | (long int)edx_high<<32);
return count;
}

You should check if you CPU and other HW support you needs. Try look into oprofile source code. It have kernel module and userspace api. You can for example cut part of interesting code from oprofile kernel module part and use it into you module. I gues you module should have several reader or listeners with circle buffers for events keeping. You can also look inside linux/drivers/oprofile and to correspond linux/arch/.../oprofile. Inside make menuconfig you can config it like module or build-in and add additional timers. Available events and counters you can find under oprofile/events/ of oprofile tool (TLB_MISS, CPU_CYCLES, CYCLES_DATA_STALL, ...).
ARM Performance monitoring register
Under linux/arch/arm64/kernel/perf_regs.c you can find arm specific details.

Related

Linux device driver - memory mapped I/O example discussion

I have gone through the following topic and I still have some questions.
ioread32 followed by iowrite32 not giving same value
In the link, where can I get my base which is defined as 0xfed00000
in the post ?
what should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
what should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
By having the Makefile and generating the kernel module, I should use the insmod and then dmesg to check if the code works as I expect, is this correct ?
In the case, should I add iounmap(virtual_base); before return 0; in the source ?
Thanks
In the link, where can I get my base which is defined as 0xfed00000 in the post ?
It's the base (physical) address of the peripheral's registers.
If the peripheral is a discrete chip on the board, then consult the board documentation.
If the peripheral is embedded in a SoC, then consult the memory map in the SoC datasheet.
what should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
what should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
These two routines should be called with the same first and second parameters.
The length/size is the number of bytes the peripheral's register set occupies.
Sometimes the entire memory region to the next peripheral is specified.
By having the Makefile and generating the kernel module, I should use the insmod and then dmesg to check if the code works as I expect, is this correct ?
A judicious sprinkling of printk() statements is the tried & true method of testing a Linux kernel driver/module.
Unix has kdb.
In the case, should I add iounmap(virtual_base); before return 0; in the source ?
Do not copy that poorly written example of init code.
If ioremap() is performed in a driver's probe() (or other initialization) routine, then the iounmap() should be in the probe's error exit sequence and in the driver's remove() (or the complementary to init) routine.
There are numerous examples to study in the Linux kernel source. Use an online Linux cross reference such as http://lxr.free-electrons.com/source/
Note that almost all Linux drivers use iounmap() two or more times.

How to find TLS segments of the current thread on linux amd64?

I'm looking for a way to find out the memory addresses of TLS segments for the current thread on linux, amd64. Bonus point for a solution that works on OSX.
Looked into various language runtime or GC (like boehm), but couldn't go through the multiple layer of abstractions to support all kind of systems so far. Any help appreciated.
Did you have a look at the solution Martin and I came up with in druntime?
What we do there boils down to scanning the segments in the corresponding dl_phdr_info (obtained by looking for the correct one using dl_iterate_phdr) for the segment with type PT_TLS, and storing its module id and size.
You can then get the start of the address range on the current thread by calling __tls_get_addr for offset 0 and the module id (there is an offset on some archs), and the end by simply adding the size you determined to that. If you do not need to support shared libraries, you can also simply use fs/gs on x86 for that (might be required if you want to link a static executable).
This works for Linux and FreeBSD (and probably other ELF platforms), but not OS X. There, the best I could come up with so far is this:
void _d_dyld_getTLSRange(void* arbitraryTLSSymbol, void** start, size_t* size) {
dyld_enumerate_tlv_storage(
^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
assert(state == dyld_tlv_state_allocated);
if (info->tlv_addr <= arbitraryTLSSymbol &&
arbitraryTLSSymbol < (info->tlv_addr + info->tlv_size)
) {
// Found the range we are looking for.
*start = info->tlv_addr;
*size = info->tlv_size;
}
}
);
}
The naive implementation currently used in LDC's druntime does not quite handle shared libraries, though, and dyld_enumerate_tlv_storage is from dyld_priv.h, which might or might not be a problem for App Store publishing.
On Linux, the thread-specific segment is set up via arch_prtcl(ARCH_SET_FS, <addr>) call. You can find out what it was set to in the current thread via arch_prctl(ARCH_GET_FS, ...).
Bonus point for a solution that works on OSX.
OSX is a completely different OS, and uses completely different mechanism for its TLS support.

mprotect() like functionality within Linux kernel

I am in a Linux kernel module, and I allocate some memory with, say, vmalloc(). I want to make the memory have read, write, and execute permission. What is the clean and appropriate way of doing that? Basically, this is generally the equivalent of calling mprotect(), but in kernel space.
If I do the page walk, pgd_offset(), pud_offset(), pmd_offset(), pte_offset_map(), and then pte_mkwrite(), I run into linking errors when I tried it on 2.6.39. Also, it seems that if I am doing the page walk, it is a hack, and there ought to be a cleaner and more appropriate method.
My kernel module will be a loadable module, so internal symbols are not available to me.
Thanks, in advance, for your guidance.
There is a good answer to this question here: https://unix.stackexchange.com/questions/450557/is-there-any-function-analogous-to-mprotect-in-the-linux-kernel.
asm-generic/set_memory.h:int set_memory_ro(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_rw(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_x(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_nx(unsigned long addr, int numpages);
they are defined here: https://elixir.bootlin.com/linux/v4.3/source/arch/x86/include/asm/cacheflush.h#L47
Have you tried by invoking do_mprotect() [kernel function corresponding to mprotect()] directly ?

linux gpio c api

I have an powerpc board with 3.2 kernel running on it. Accessing gpio with sysfs works as expected e.g.
> echo 242 > /sys/class/gpio/export
> cat /sys/class/gpio/gpio242/value
> 1
Is there no API to direct access gpio pins from user space? Must I deal with the text based sysfs interface?
I seach for something like:
gpio_set(int no, int val);
Thanks
Klaus
GPIO access through sysfs has been deprecated since Linux 4.8.
The new way for user space access is through libgpiod, which includes a library to link with (obviously), as well as some tools which can be run from the command line (for scripting convenience). Notably, GPIO lines are referenced with the line name string rather than an integer identifier, like with sysfs. E.g.
gpioset $(gpiofind "USR-LED-2")=1
https://git.kernel.org/pub/scm/libs/libgpiod/libgpiod.git/tree/README
Edit: sysfs direct access for GPIOs is deprecated, new way is programmatic through libgpiod
sysfs is the lowest level at which you will be able to manipulate GPIO in recent kernels. It can be a bit tedious but it offers several advantages over the old style API:
No ugly ioctl
Can be scripted very easily (think startup scripts)
For inputs, the "value" file can easily be poll-ed for rising/falling/both edges and it will be very reactive to hardware interrupts
I have no example code at the moment but when accessing them through C code, I often implemented a very simple wrapper manipulating file descriptors and having variations of the following interface:
int gpio_open(int number, int out); /* returns handle (fd) */
int gpio_close(int gpio);
int gpio_set(int gpio, int up);
int gpio_get(int gpio, int *up);
int gpio_poll(int gpio, int rising_edge, int timeout);
From then, the implementation is pretty straightforward.
Once you have the devices created in the vfs tree, you can open them like typical files assuming you have a driver written and have the correct major and minor numbers assigned in the makedev file that creates the gpio pins on the vfs tree.
Every GPIO is memory mapped as a register, so you can access to it through /dev/mem. See here. If you want to access directly to a GPIO you have to work at kernel space level

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole proccesor(with wbinvd()). Program runs as root but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("rdtsc" : "=a" (a), "=d" (d));
return a | ((uint64_t)d << 32);
}
volatile int i;
inline void
test()
{
uint64_t start, end;
volatile int j;
start = rdtsc();
j = i;
end = rdtsc();
printf("took %lu ticks\n", end - start);
}
int
main(int ac, char **av)
{
test();
test();
printf("flush: ");
clflush(&i);
test();
test();
return 0;
}
I dont know of any generic command to get the the cache state, but there are ways:
I guess this is the easiest: If you got your kernel module, just disassemble it and look for cache invalidation / flushing commands (atm. just 3 came to my mind: WBINDVD, CLFLUSH, INVD).
You just said it is for i386, but I guess you dont mean a 80386. The problem is that there are many different with different extension and features. E.g. the newest Intel series has some performance/profiling registers for the cache system included, which you can use to evalute cache misses/hits/number of transfers and similar.
Similar to 2, very depending on the system you got. But when you have a multiprocessor configuration you could watch the first cache coherence protocol (MESI) with the 2nd.
You mentioned WBINVD - afaik that will always flush complete, i.e. all, cache lines
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Resources