How to record the CR3 register value using the Linux perf tool?

perf is able to record multiple fields such as addr, ip, and timestamp. It can also record general-purpose registers, as seen at https://github.com/torvalds/linux/blob/master/tools/perf/arch/x86/util/perf_regs.c. But I can't find any documentation about recording control registers with perf. So how can I achieve that using perf? Are there any other tools available?

You cannot record control register values using the perf tools. The set of registers you can sample with the --intr-regs option is limited to the general-purpose registers enumerated in each architecture's perf_regs definitions; the perf_regs.c file linked in the question shows the x86 list.
The registers that the perf events subsystem can access are architecture dependent, as those per-architecture definitions show. Support for including selected register state in the perf record/script output was added by the kernel commit that introduced register sampling. This means all of perf is limited to the registers that have been specified there and nothing more.
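For the registers that are supported, sampling looks something like this (a hedged example; the cycles event and your_program are placeholders, and the available register names come from perf record --intr-regs=\? on your machine):

perf record -e cycles --intr-regs=ax,bx,ip -- ./your_program
perf script -F ip,sym,iregs

Listing the registers with --intr-regs=\? is also a quick way to confirm that no control register is among them.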
There are other questions and answers here that describe ways of writing a program or kernel module to access the control registers. On top of this, you can use QEMU (in TCG mode) and run your program inside the VM; you can then print the register state periodically (at the end of each TB, where you'll see all register values). A debugger such as GDB might also help.
Edit -
There is one way in which CR3 register values can be recorded: you can use Intel PT to record control-flow information for a program during its execution. Intel PT tracks changes to the CR3 register with the PIP (Paging Information Packet), so you can use the traces generated by Intel PT to track and determine the CR3 values.
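For example, with a perf build that has Intel PT support, something along these lines should work (a hedged sketch; the event syntax follows the perf-intel-pt documentation, and your_program is a placeholder):

perf record -e intel_pt// -- ./your_program
perf script -D | grep -i PIP

The raw dump shows the decoded packet stream, and PIP packets carry the CR3 payload. Note that paging changes are a kernel-side event, so you may need to trace with kernel privileges to see them.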

Related

Record dynamic instruction trace or histogram in QEMU?

I've written and compiled a RISC-V Linux application.
I want to dump all the instructions that get executed at run-time (which cannot be achieved by static analysis).
Is it possible to get a dynamic assembly-instruction execution histogram from QEMU (or other tools)?
For instruction tracing, I go with -singlestep -d nochain,cpu, combined with some awk; a sketch follows below. This can become painfully slow, and the logs painfully large, depending on the code you run.
Regarding the statistics you'd like to obtain, delegate them to R/numpy/pandas/whatever after extracting the program counter.
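A hedged sketch of that pipeline for a RISC-V user-mode binary (the CPU-state log format differs across QEMU versions and targets, so treat the awk pattern over the pc line as an assumption to adapt to your log):

qemu-riscv64 -singlestep -d nochain,cpu -D trace.log ./app
awk '$1 == "pc" { count[$2]++ } END { for (pc in count) print count[pc], pc }' trace.log | sort -rn

The first command logs the CPU state after every instruction; the awk line folds the program-counter values into a histogram.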
The presentation or video by user "yvr18" on that topic might cover some aspects of QEMU tracing at various levels (as well as some interesting heatmap visualization).
QEMU doesn't currently support that sort of trace of all instructions executed.
The closest we have today is the various bits of debug logging under the -d switch: you can combine the log of instructions translated from guest to native with the log of which blocks of translated code were executed to work out what was executed, but this is pretty awkward.
Alternatively you could try scripting the gdbstub interface to do something like "disassemble instruction at PC; singlestep" which will (slowly!) give you all the instructions executed.
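A sketch of that loop in plain gdb commands (assuming user-mode QEMU was started with -g 1234 to expose its gdb stub; expect this to be very slow):

(gdb) target remote :1234
(gdb) while 1
 >x/i $pc
 >stepi
 >end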
Note: there is ongoing work to improve QEMU's ability to introspect guest execution, so that you can write a simple 'plugin' with functions that are called back on events like guest instruction execution; with that it would be fairly easy to dump the guest instructions executed (or do more interesting processing), but this is still work in progress, so not available yet.
It seems you can do something similar with rv8 (https://github.com/rv8-io/rv8), using the command:
rv-jit -l
The "spike" RISC-V emulator allows tracing instructions executed, new values stored into registers, or just simply a histogram of PC values (from which you can extract what instruction was at each PC location).
It's not as fast as qemu, but runs at 100 to 200 MIPS on current x86 hardware (at least without tracing enabled)
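For example (hedged: the flag spellings follow spike's help output, histogram support may require spike to be built with --enable-histogram, and pk is the RISC-V proxy kernel used to run user binaries):

spike -l pk ./app    (log every instruction as it executes)
spike -g pk ./app    (print a histogram of PC values at exit)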

How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
uint64_t results[num_iterations];
for (int iteration = 0; iteration < num_iterations; iteration++) {
    uint64_t pctr_start = sample_pctr();  /* read the counter */
    the_benchmark();                      /* code under test */
    uint64_t pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independently of clock frequency changes. (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what that register measures at all.)
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
I stumbled at the first hurdle: writing MSRs requires kernel privileges. It seems you can in fact read the counters from user space (if they are configured correctly), but configuring a counter (with an MSR write) has to be done by the kernel.
Does anyone know a lightweight way to ask the kernel to configure a performance counter from user space, so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. These seem to be geared towards sampling over the whole lifetime of a process, not at specific points within a process (before and after each iteration).
Use the perf syscalls directly (i.e. perf_event_open). It looks like the counter value only updates periodically (at a given sample rate) or after the counter exceeds a threshold, whereas I need the counter value precisely at the moment I ask; this is why RDPMC seemed so attractive. I imagine that sampling frequently would itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open the device node, seek to the address of the required MSR, and read or write 8 bytes.
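A minimal sketch of that in C (assumptions: the msr module is loaded, you have root, and 0x186, i.e. IA32_PERFEVTSEL0, stands in for whatever MSR you actually need):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t value;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);  /* MSRs of CPU 0 */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    /* The MSR address is simply the file offset. */
    if (pread(fd, &value, sizeof(value), 0x186) != (ssize_t)sizeof(value)) {
        perror("pread");
        return 1;
    }
    printf("MSR 0x186 = %#llx\n", (unsigned long long)value);
    close(fd);
    return 0;
}

Writing works the same way with O_WRONLY and pwrite.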
OpenBSD is harder, since (at the time of writing) there is no user-space proxy for the MSRs, so you would need to write a kernel module or implement a sysctl by hand.

FreeBSD: What is a global flag

Does anybody know what a global flag means in the context of an operating system, how to check their status, and how to add a custom one, if that is possible?
I couldn't find any comprehensive information about this.
For clarification: I am looking at FreeBSD.
Possibly it's referring to the sysctl interface, which allows you to set and query certain system-wide settings (not all of them are changeable).
Among these are the maximum number of processes or files, maximum files per process, clock rate, hostname and so on.
Refer to sysctl(8) for further details.
Note that sysctl(8) describes the command-line tool used for changing system parameters. There is also an API, sysctl(3), if you wish to do it from C code.
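A minimal sketch of the C API on FreeBSD, querying the system-wide process limit mentioned above:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int main(void)
{
    int maxproc;
    size_t len = sizeof(maxproc);
    /* Look the value up by its dotted name; sysctl(3) also offers
       a numeric-MIB variant. */
    if (sysctlbyname("kern.maxproc", &maxproc, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        return 1;
    }
    printf("kern.maxproc = %d\n", maxproc);
    return 0;
}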
Alternatively, it may simply be suggesting you add a syscall which maintains some state. This could be as simple as an integer which could be set and/or retrieved via the syscall interface.
If that's the case, you will find plenty of information and tutorials on the web with a Google search of freebsd syscall add, including a few here on Stack Overflow itself.

Minimal core dump (stack trace + current frame only)

Can I configure what goes into a core dump on Linux? I want to obtain something like the Windows mini-dumps (minimal information about the stack frame when the app crashed). I know you can set a maximum size for core files using ulimit, but this does not allow me to control what goes inside the core (there is no guarantee, for example, that if I set the limit to 64 kB it will dump the last 16 pages of the stack).
Also, I would like to set it in a programmatic way (from code), if possible.
I have looked at the /proc/PID/coredump_filter file mentioned by man core, but it seems too coarse grained for my purposes.
To provide a little context: I need tiny core files, for multiple reasons. I need to collect them over the network from numerous (thousands of) clients; furthermore, these are embedded devices with small SD cards, and GPRS modems for the network connection. So anything above ~200 kB is out of the question.
EDIT: I am working on an embedded device which runs Linux 2.6.24. The processor is PowerPC. Unfortunately, powerpc-linux is not supported in Breakpad at the moment, so Google Breakpad is not an option.
I have "solved" this issue in two ways:
I installed a signal handler for SIGSEGV and used backtrace/backtrace_symbols to print out the stack trace (a sketch follows after these two points). I compiled my code with -rdynamic, so even after stripping the debug info I still get a backtrace with meaningful names (while keeping the executable compact enough).
I stripped the debug info with strip and put it in a separate file, which I will store somewhere safe; from there, I will use addr2line with the addresses saved from the backtrace to understand where the problem happened. This way I have to store only a few bytes.
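A sketch of the handler from the first point (one detail worth copying: backtrace_symbols_fd is used instead of backtrace_symbols because the latter calls malloc, which is not async-signal-safe; link with -rdynamic for meaningful names):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO); /* addresses + names */
    _exit(128 + sig);  /* don't return into the faulting instruction */
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    /* ... application code ... */
    return 0;
}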
Alternatively, I found I could use /proc/self/coredump_filter to dump no memory at all (by setting its content to "0"): only thread and process info, registers, the stack trace and so on are saved in the core. A snippet for setting it from code follows below.
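Setting the filter from code takes only a few lines (a sketch; the file accepts a hex bitmask of memory-mapping types, and "0" selects none of them):

#include <stdio.h>

static int clear_coredump_filter(void)
{
    FILE *f = fopen("/proc/self/coredump_filter", "w");
    if (!f)
        return -1;
    fputs("0", f);  /* dump no memory mappings at all */
    return fclose(f);
}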
I still lose information that could be precious (the contents of globals and locals, parameters...). I could easily figure out which page(s) to dump, but unfortunately there is no way to specify a "dump these pages" list for normal core dumps (unless you are willing to go and patch the maydump() function in the kernel).
For now, I'm quite happy with these two solutions (better than nothing...). My next moves will be:
see how difficult it would be to port Breakpad to powerpc-linux: there are already powerpc-darwin and i386-linux ports, so... how hard can it be? :)
try to use google-coredumper to dump only a few pages around the current stack pointer (which should give me locals and parameters) and around &some_global (which should give me globals).

Basic question about oprofile

I am trying to profile my software (on Linux) with oprofile. My software consists of both a userspace component and a kernel module. First, what does the --separate=kernel option do? What is the difference when running without that option? I tried it but couldn't see any difference. Could you please post an example?
Can't I profile a kernel module without the --separate=kernel option?
Thanks,
Bala
When oprofile is used with the option --separate=kernel, it separates the samples for the kernel and kernel modules on a per-application basis.
--separate=library separates the samples for dynamically linked objects on a per-application basis.
The kernel and dynamically linked objects are not specific to the application we want to profile, but at the same time our application spends a considerable amount of time in them.
So --separate lets one view the samples from the point of view of the application we are interested in profiling. It can also separate samples by individual thread.
The kernel can be profiled by providing the --vmlinux option to opcontrol.
Ex:- opcontrol --vmlinux=/boot/vmlinux-2.6.27.23-0.1-preempt
--separate is an additional option that allows us to see the samples at different resolutions; a full example session follows below.
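A hedged sketch of a complete session with the legacy opcontrol interface (the flags follow the opcontrol/opreport man pages; your_app is a placeholder):

opcontrol --vmlinux=/boot/vmlinux-2.6.27.23-0.1-preempt --separate=kernel
opcontrol --start
./your_app
opcontrol --dump
opreport --symbols ./your_app
opcontrol --shutdown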
