How to determine time taken by each function in C Program

I want to measure the time taken by each function, and the system calls made by each function, in my project. My code runs partly in user space and partly in kernel space, so I need timings for both. I am interested in performance in terms of CPU time and disk I/O. Should I use a profiler? If so, which one is preferable? What other options do I have?
Please help,
Thanks

For kernel-level profiling, the time taken by a few instructions or by a function can be measured in CPU clock ticks. To count how many clock ticks a given task used, read the timestamp counter before and after it:
#include <asm/msr.h>   /* rdtscl(): x86-only macro described in LDD; newer kernels may not provide it */
unsigned long ini, end;
rdtscl(ini);
/* ...your code... */
rdtscl(end);
printk("time lapse in CPU clock ticks: %lu\n", end - ini);
For more details see http://www.xml.com/ldd/chapter/book/ch06.html
If your code runs for longer, you can also use jiffies effectively.
For user-space profiling you can use the various timing functions that give the time with nanosecond resolution, or oprofile (http://oprofile.sourceforge.net/about/). Also refer to "Timer function to provide time in nano seconds using C++".
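As a rough illustration of the user-space timing functions mentioned above, here is a minimal sketch that measures the CPU time of one call with POSIX clock_gettime() and CLOCK_PROCESS_CPUTIME_ID. The helper names process_cpu_ns, time_call_ns and busy_work are made up for this example, and the choice of clock is an assumption; the answer does not say which timing function it had in mind.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* CPU time consumed so far by this process, in nanoseconds (-1 on error). */
int64_t process_cpu_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Run fn once and return the CPU time it used, in nanoseconds (-1 on error). */
int64_t time_call_ns(void (*fn)(void))
{
    int64_t start = process_cpu_ns();
    fn();
    int64_t end = process_cpu_ns();

    if (start < 0 || end < 0)
        return -1;
    return end - start;
}

/* Hypothetical workload, standing in for the function you want to time. */
void busy_work(void)
{
    volatile unsigned long sum = 0;

    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
}
```

Calling time_call_ns(busy_work) returns the CPU nanoseconds charged to the process during that call. Swapping in CLOCK_MONOTONIC would measure elapsed wall-clock time instead, which also includes time spent blocked on disk I/O.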

For kernel-space function tracing and profiling (which includes a call-graph format and the time taken by individual functions), consider using the Ftrace framework.
Specifically for function profiling (within the kernel), enable the CONFIG_FUNCTION_PROFILER kernel config: under Kernel Hacking / Tracing / Kernel function profiler.
Its help text:
CONFIG_FUNCTION_PROFILER:
This option enables the kernel function profiler. A file is created
in debugfs called function_profile_enabled which defaults to zero.
When a 1 is echoed into this file profiling begins, and when a
zero is entered, profiling stops. A "functions" file is created in
the trace_stats directory; this file shows the list of functions that
have been hit and their counters.
Some resources:
Documentation/trace/ftrace.txt
Secrets of the Ftrace function tracer
Using ftrace to Identify the Process Calling a Kernel Function

Well, I only develop in user space, so I don't know how much this will help you with disk I/O or kernel-space profiling, but I have profiled a lot with oprofile.
I haven't used it in a while, so I cannot give you a step-by-step guide, but you may find more information here:
http://oprofile.sourceforge.net/doc/results.html
Usually this helped me find my problems.
You may have to play a bit with the opreport output to get the results you want.

Related

How to implement sleep utility in RISC-V?

I want to implement a sleep utility that takes a number of seconds as input and pauses for that many seconds, on the educational xv6 operating system, which runs on RISC-V processors.
The OS already has a system call that takes a number of ticks and pauses: https://github.com/mit-pdos/xv6-riscv/blob/riscv/kernel/sysproc.c#L56
Timers are initialized using a timer vector: https://github.com/mit-pdos/xv6-riscv/blob/riscv/kernel/kernelvec.S#L93
The timer vector is set up with CLINT_MTIMECMP, which tells the timer controller when to raise the next interrupt.
What I do not understand is how to find the time between ticks and how many ticks occur in one second.
Edit: A quick Google search for "qemu timebase riscv mtime" found a Google Groups thread which states that RDTIME is nanoseconds since boot and mtime is an emulated 10 MHz clock.
I haven't done a search to find the information you need, but I think I have some contextual information that would help you find it. I would recommend searching the QEMU documentation and code (probably via GitHub search) for how mtime and mtimecmp work.
In section 10.1 (Counter - Base Counter and Timers) of the specification (1), it is explained that the RDTIME pseudo-instruction should have some fixed tick rate that can be determined based on the implementation (2). That tick rate would also be shared by mtimecmp and mtime, as defined in the privileged specification (3).
I would presume the ticks used by the sleep system call are the same as these ticks from the specifications. In that case, xv6 is just a kernel and wouldn't itself define how many ticks per second there are. It seems that xv6 is made to run on top of QEMU, so the number of ticks per second should be defined somewhere in the QEMU code and might be documented.
From the old wiki for QEMU-riscv it should be clear that the SiFive CLINT defines the features xv6 needs to work, but I doubt that it specifies how to determine the tick rate. Spike also supports the CLINT interface, so it may also be instructive to search for the code in Spike that handles it.
1 I used version 20191213 of the unprivileged specification as a reference.
2 "The RDTIME pseudoinstruction reads the low XLEN bits of the time CSR, which counts wall-clock real time that has passed from an arbitrary start time in the past. RDTIMEH is an RV32I-only instruction that reads bits 63–32 of the same real-time counter. The underlying 64-bit counter should never overflow in practice. The execution environment should provide a means of determining the period of the real-time counter (seconds/tick). The period must be constant. The real-time clocks of all harts in a single user application should be synchronized to within one tick of the real-time clock. The environment should provide a means to determine the accuracy of the clock."
3 Section 3.1.10, Machine Timer Registers (mtime and mtimecmp): "Platforms provide a real-time counter, exposed as a memory-mapped machine-mode read-write register, mtime. mtime must run at constant frequency, and the platform must provide a mechanism for determining the timebase of mtime."
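Once the timebase is known, converting seconds to sleep ticks is simple arithmetic. The sketch below assumes the 10 MHz mtime rate quoted in the question's edit and, purely as an illustration, a kernel that programs CLINT_MTIMECMP to fire every 1,000,000 mtime cycles. Both constants and both function names are assumptions for this example; check the values your own kernel and emulator actually use.

```c
#include <stdint.h>

/* Assumption: mtime runs at 10 MHz, as reported for QEMU in the question's edit. */
#define MTIME_HZ 10000000ULL

/* Assumption: the kernel schedules a timer interrupt every 1,000,000 mtime cycles. */
#define TIMER_INTERVAL_CYCLES 1000000ULL

/* Number of timer interrupts ("ticks") per second under the assumptions above. */
uint64_t ticks_per_second(void)
{
    return MTIME_HZ / TIMER_INTERVAL_CYCLES;
}

/* Convert a duration in seconds to the tick count a sleep system call would take. */
uint64_t seconds_to_ticks(uint64_t seconds)
{
    return seconds * ticks_per_second();
}
```

With these numbers one tick is 0.1 s, so a user-space sleep utility would pass seconds_to_ticks(n) to the existing tick-based system call.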

Record dynamic instruction trace or histogram in QEMU?

I've written and compiled a RISC-V Linux application.
I want to dump all the instructions that get executed at run-time (which cannot be achieved by static analysis).
Is it possible to get a dynamic assembly instruction execution histogram from QEMU (or other tools)?
For instruction tracing, I go with -singlestep -d nochain,cpu, combined with some awk. This can become painfully slow and large depending on the code you run.
Regarding the statistics you'd like to obtain, delegate it to R/numpy/pandas/whatever after extracting the program counter.
The presentation or video by user "yvr18" on that topic might cover some aspects of QEMU tracing at various levels (as well as some interesting heatmap visualization).
QEMU doesn't currently support that sort of trace of all instructions executed.
The closest we have today is that there are various bits of debug logging under the -d switch, and you can combine the tracing of "instructions translated from guest to native" with the "blocks of translated code executed" translation to work out what was executed, but this is pretty awkward.
Alternatively you could try scripting the gdbstub interface to do something like "disassemble instruction at PC; singlestep" which will (slowly!) give you all the instructions executed.
Note: There is ongoing work to improve QEMU's ability to introspect guest execution so that you can write a simple 'plugin' with functions that are called back on events like guest instruction execution; with that it would be fairly easy to write a dump of guest instructions executed (or do more interesting processing), but this is still work in progress, so not available yet.
It seems you can do something similar with rv8 (https://github.com/rv8-io/rv8), using the command:
rv-jit -l
The "spike" RISC-V emulator allows tracing instructions executed, new values stored into registers, or just simply a histogram of PC values (from which you can extract what instruction was at each PC location).
It's not as fast as QEMU, but runs at 100 to 200 MIPS on current x86 hardware (at least without tracing enabled).
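Whichever tool produces the trace, turning a list of program-counter values into a histogram is a small post-processing step. The following is a minimal sketch in C that assumes the PCs have already been extracted into an array; extracting them from QEMU or spike output is not shown, and the names pc_count and pc_histogram are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One histogram bucket: a program-counter value and how often it was seen. */
struct pc_count {
    uint64_t pc;
    size_t count;
};

/* qsort comparator for 64-bit PC values, ascending. */
static int compare_pc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Sorts pcs in place, then writes one entry per distinct PC to out,
 * which must have room for n entries. Returns the number of distinct PCs.
 */
size_t pc_histogram(uint64_t *pcs, size_t n, struct pc_count *out)
{
    size_t distinct = 0;

    qsort(pcs, n, sizeof pcs[0], compare_pc);
    for (size_t i = 0; i < n; i++) {
        if (distinct > 0 && out[distinct - 1].pc == pcs[i]) {
            out[distinct - 1].count++;      /* same PC as the previous bucket */
        } else {
            out[distinct].pc = pcs[i];      /* first time this PC appears */
            out[distinct].count = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Sorting first keeps the code short. For traces with many millions of entries a hash table, or streaming the counts while reading the trace, would use less memory.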

Using /proc/*/stat for profiling

On Linux, a process' (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems to be a really simple and easy way to do some sampled profiling without having to instrument a program in any way whatsoever.
I'm wondering if this has any caveats when it comes to the sampling quality, however. I'm assuming this value is updated whenever the process runs out of its timeslice, which should happen at completely random intervals in the program code, and that samples taken at more than time-slice length should be uniformly randomly distributed according to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
Does anyone know?
Why not try modern built-in Linux tools such as perf (https://perf.wiki.kernel.org/index.php/Main_Page)?
It has a record mode with adjustable frequency (-F100 for 100 Hz) and many events. For example, it can sample on the software event task-clock without using hardware performance counters (stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds):
perf record -p $PID -e task-clock -o perf.output.file
Perf works for all threads without any instrumentation (no recompilation or code editing) and will not interfere with program execution (only the timer interrupt is slightly modified). There is also some support for stack-trace sampling with the -g option.
The output can be parsed offline with perf report (only this command will try to parse the binary and shared libraries):
perf report -i perf.output.file
or converted to raw PC (EIP) samples with perf script -i perf.output.file.
PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page proc(5), http://man7.org/linux/man-pages/man5/proc.5.html, as kstkeip: "The current EIP (instruction pointer)." It is read in fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled. It may be written on a task switch (both involuntary, when the time slice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice as a sampling source.
If it works, which it could, it will have the shortcomings of prof, which gprof was supposed to remedy. Then gprof has its own shortcomings, which have led to numerous more modern profilers. Some of us consider this to be the most effective, and it can be accomplished with a tool as simple as pstack or lsstack.
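For anyone who still wants to experiment with the /proc/$PID/stat approach from the question, the sketch below extracts field 30 (kstkeip in proc(5)) from one line of that file. The function name parse_kstkeip is made up for this example. It only parses the text and says nothing about how representative the value is, and some kernels may report 0 in this field for other processes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extracts field 30 (kstkeip) from one line of /proc/<pid>/stat.
 * Returns 0 on success and stores the value in *eip, -1 on malformed input.
 */
int parse_kstkeip(const char *stat_line, unsigned long long *eip)
{
    /* Field 2 (comm) may contain spaces and parentheses, so resume after the last ')'. */
    const char *p = strrchr(stat_line, ')');
    char *end;

    if (p == NULL)
        return -1;
    p++;

    /* Skip fields 3 (state) through 29 (kstkesp). */
    for (int field = 3; field < 30; field++) {
        while (*p == ' ')
            p++;
        if (*p == '\0')
            return -1;
        while (*p != ' ' && *p != '\0')
            p++;
    }

    while (*p == ' ')
        p++;
    if (*p == '\0')
        return -1;

    /* Field 30 is printed as an unsigned decimal number. */
    *eip = strtoull(p, &end, 10);
    return end == p ? -1 : 0;
}
```

A sampler would read /proc/$PID/stat at some interval, pass each line to parse_kstkeip, and count the resulting addresses. perf record does the equivalent with far better control over when samples are taken.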

How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure a performance counter from user space, so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process at specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.
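The open, seek and read sequence described above can be sketched as follows. This is a minimal illustration, not a complete counter setup: read_msr is a made-up helper name, the device path /dev/cpu/0/msr assumes the Linux msr module is loaded, and opening it normally requires root.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reads one 8-byte MSR through an msr device node such as /dev/cpu/0/msr.
 * The MSR address is used as the file offset. Returns 0 on success, -1 on error.
 */
int read_msr(const char *dev_path, uint32_t reg, uint64_t *value)
{
    ssize_t n;
    int fd = open(dev_path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* pread() combines the seek to the MSR address with the 8-byte read. */
    n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);

    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

For example, read_msr("/dev/cpu/0/msr", 0x10, &v) would read the time-stamp counter MSR of CPU 0. Writing a configuration MSR works the same way with O_WRONLY and pwrite(). Each access is a system call, so this suits configuring a counter better than sampling it inside a tight benchmark loop.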

Is there a way to disable CPU cache (L1/L2) on a Linux system?

I am profiling some code on a Linux system (running on Intel Core i7 4500U) to obtain the time of ONLY the execution costs. The application is the demo mpeg2dec from libmpeg2. I am trying to obtain a probability distribution for the mpeg2 execution times. However we want to see the raw execution cost when cache is switched off.
Is there a way I can disable the CPU cache of my system via a Linux command or a gcc flag? Or set the CPU (L1/L2) cache size to 0 KB? Or add some code changes to disable the cache? Of course, without modifying or rebuilding the kernel.
See this 2012 thread, someone posted a tiny kernel module source to disable cache through asm.
http://www.linuxquestions.org/questions/linux-kernel-70/disabling-cpu-caches-936077/
If disabling the cache is really necessary, then so be it.
Otherwise, to know how much time a process takes in terms of user or system "cycles", then I would recommend the getrusage() function.
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
You can call it before and after your loop or test and subtract the values to get a good idea of how much time your process took, even if many other processes run in parallel on the same machine. The main problem you'd get is if your process starts swapping. In that case your timings will be off.
double user_usage = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1000000.0;
double system_usage = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1000000.0;
This is really precise from my own experience. To increase precision, you could be root when running your test and give it a negative priority (-1 or -2 is enough.) Then it won't be swapped out until you call a function that may require it.
Of course, you still get the effect of the cache... assuming you do not handle very large amount of data with code that goes on and on (opposed to having a loop).
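Putting the getrusage() fragments above together, a before/after measurement could look like the sketch below. The struct cpu_times and the function cpu_times_now are names invented for this example. RUSAGE_SELF covers all threads of the calling process.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* User and system CPU time of the calling process, in seconds. */
struct cpu_times {
    double user;
    double system;
};

/* Fills *out with the CPU time used so far. Returns 0 on success, -1 on error. */
int cpu_times_now(struct cpu_times *out)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;

    out->user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1000000.0;
    out->system = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1000000.0;
    return 0;
}
```

Call cpu_times_now() once before the code under test and once after it, then subtract the user and system fields separately to see how the time splits between user space and the kernel.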
