I would like to profile the cache behavior of a kernel module with SystemTap (number of cache references, number of cache misses, etc.). There is an example script online which shows how SystemTap can be used to read the perf events and counters, including cache-related ones:
https://sourceware.org/systemtap/examples/profiling/perf.stp
This sample script works by default for a process:
probe perf.hw.cache_references.process("/usr/bin/find").counter("find_insns") {}
I replaced the process keyword with module and the path to the executable with the name of my kernel module:
probe perf.hw.cache_references.module(MODULE_NAME).counter("find_insns") {}
I'm pretty sure that my module has the debug info, but running the script I get:
semantic error: while resolving probe point: identifier 'perf' at perf.stp:14:7
source: probe perf.hw.instructions.module(MODULE_NAME).counter("find_insns") {}
Any ideas what might be wrong?
Edit:
Okay, I realized that perf counters can be bound only to processes, not to modules (explained here: https://sourceware.org/systemtap/man/stapprobes.3stap.html). Therefore I changed it back to:
probe perf.hw.cache_references.process(PATH_TO_BINARY).counter("find_insns") {}
Now, as the sample script suggests, I have:
probe module(MODULE_NAME).function(FUNC_NAME) {
    # save the counter values on entry
    ...
}
But now running it, I get:
semantic error: perf counter 'find_insns' not defined
semantic error: while resolving probe point: identifier 'module' at perf.stp:26:7
source: probe module(MODULE_NAME).function(FUNC_NAME)
Edit2:
So here is my complete script:
#! /usr/bin/env stap
# Usage: stap perf.stp <path-to-binary> <module-name> <function-name>
global cycles_per_insn
global branch_per_insn
global cacheref_per_insn
global insns
global cycles
global branches
global cacherefs
global insn
global cachemisses
global miss_per_insn
probe perf.hw.instructions.process(@1).counter("find_insns") {}
probe perf.hw.cpu_cycles.process(@1).counter("find_cycles") {}
probe perf.hw.branch_instructions.process(@1).counter("find_branches") {}
probe perf.hw.cache_references.process(@1).counter("find_cache_refs") {}
probe perf.hw.cache_misses.process(@1).counter("find_cache_misses") {}
probe module(@2).function(@3)
{
    # save the counter values on entry
    insn["find_insns"] = @perf("find_insns")
    insns <<< insn["find_insns"]
    insn["find_cycles"] = @perf("find_cycles")
    cycles <<< insn["find_cycles"]
    insn["find_branches"] = @perf("find_branches")
    branches <<< insn["find_branches"]
    insn["find_cache_refs"] = @perf("find_cache_refs")
    cacherefs <<< insn["find_cache_refs"]
    insn["find_cache_misses"] = @perf("find_cache_misses")
    cachemisses <<< insn["find_cache_misses"]
}
probe module(@2).function(@3).return
{
    dividend = @perf("find_cycles") - insn["find_cycles"]
    divisor = @perf("find_insns") - insn["find_insns"]
    q = dividend / divisor
    if (q > 0)
        cycles_per_insn <<< q
    dividend = @perf("find_branches") - insn["find_branches"]
    q = dividend / divisor
    if (q > 0)
        branch_per_insn <<< q
    dividend = @perf("find_cache_refs") - insn["find_cache_refs"]
    q = dividend / divisor
    if (q > 0)
        cacheref_per_insn <<< q
    dividend = @perf("find_cache_misses") - insn["find_cache_misses"]
    q = dividend / divisor
    if (q > 0)
        miss_per_insn <<< q
}
probe end
{
    if (@count(cycles_per_insn)) {
        printf("Cycles per Insn\n\n")
        print(@hist_log(cycles_per_insn))
    }
    if (@count(branch_per_insn)) {
        printf("\nBranches per Insn\n\n")
        print(@hist_log(branch_per_insn))
    }
    if (@count(cacheref_per_insn)) {
        printf("Cache Refs per Insn\n\n")
        print(@hist_log(cacheref_per_insn))
    }
    if (@count(miss_per_insn)) {
        printf("Cache Misses per Insn\n\n")
        print(@hist_log(miss_per_insn))
    }
}
SystemTap can't read hardware perfctr values in kernel probes, because Linux doesn't provide a suitable (e.g., atomic) internal API for safely reading those values from all contexts. The perf...process probes work only because that context is not atomic: the SystemTap probe handler can block safely.
I cannot answer your detailed question about the two (?) scripts you last experimented with, because they're not complete.
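If aggregate counts for the module's code paths are all you need, one pragmatic workaround outside SystemTap is system-wide counting with the perf tool, which does cover kernel code. A minimal sketch, assuming perf is installed and something exercises the module during the counting window:

# count cache events on all CPUs (-a) while the module is being exercised;
# "sleep 10" is just a placeholder for the measurement window
perf stat -a -e cache-references,cache-misses -- sleep 10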
It's clear to me that perf always records one or more events, and the sampling can be counter-based or time-based. But when the -e and -F switches are not given, what is the default behavior of perf record? The manpage for perf-record doesn't tell you what it does in this case.
The default event is cycles, as can be seen by running perf script after perf record. There, you can also see that the default sampling behavior is time-based, since the number of cycles is not constant. The default frequency is 4000 Hz, which can be seen in the source code and checked by comparing the file size or number of samples to a recording where -F 4000 was specified.
The perf wiki says that the rate is 1000 Hz, but this is not true anymore for kernels newer than 3.4.
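A quick way to confirm both defaults empirically (a sketch; perf evlist -v prints the attributes of the events stored in an existing perf.data, including the sampling frequency):

perf record -- sleep 1     # no -e / -F given
perf script | head         # samples are attributed to the "cycles" event
perf evlist -v             # shows the cycles event and the sample_freq used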
Default event selection in perf record is done in the user-space perf tool, which is usually distributed as part of the Linux kernel sources. With make perf-src-tar-gz from the kernel source directory we can build a tar.gz for quick rebuilds, or download such a tar from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf. There are also several online "LXR" cross-reference viewers for the Linux kernel source, which can be used much like grep to learn about perf internals.
The function that selects the default event list (evlist) for perf record is __perf_evlist__add_default in tools/perf/util/evlist.c:
int __perf_evlist__add_default(struct evlist *evlist, bool precise)
{
    struct evsel *evsel = perf_evsel__new_cycles(precise);

    if (evsel == NULL)
        return -ENOMEM;

    evlist__add(evlist, evsel);
    return 0;
}
It is called from the perf record implementation in tools/perf/builtin-record.c, int cmd_record(), when zero events were parsed from the options:

rec->evlist->core.nr_entries == 0 &&
    __perf_evlist__add_default(rec->evlist, !record.opts.no_samples)
And perf_evsel__new_cycles asks for the hardware event cycles (PERF_TYPE_HARDWARE + PERF_COUNT_HW_CPU_CYCLES), with kernel sampling when permitted, and maximum precision (see the modifiers in man perf-list; precise sampling is an EIP skid workaround using PEBS or IBS):
struct evsel *perf_evsel__new_cycles(bool precise)
{
    struct perf_event_attr attr = {
        .type           = PERF_TYPE_HARDWARE,
        .config         = PERF_COUNT_HW_CPU_CYCLES,
        .exclude_kernel = !perf_event_can_profile_kernel(),
    };
    struct evsel *evsel;

    /*
     * Now let the usual logic to set up the perf_event_attr defaults
     * to kick in when we return and before perf_evsel__open() is called.
     */
    evsel = evsel__new(&attr);
    evsel->precise_max = true;

    /* use asprintf() because free(evsel) assumes name is allocated */
    if (asprintf(&evsel->name, "cycles%s%s%.*s",
                 (attr.precise_ip || attr.exclude_kernel) ? ":" : "",
                 attr.exclude_kernel ? "u" : "",
                 attr.precise_ip ? attr.precise_ip + 1 : 0, "ppp") < 0)
        goto error_free;

    return evsel;

error_free:
    evsel__delete(evsel);
    return NULL;
}
In case of a failed perf_event_open (no access to hardware cycles sampling, for example in a virtualized environment without a virtualized PMU) there is a fallback to software cpu-clock sampling in tools/perf/builtin-record.c: int record__open(), which calls perf_evsel__fallback() of tools/perf/util/evsel.c:
bool perf_evsel__fallback(struct evsel *evsel, int err,
                          char *msg, size_t msgsize)
{
    if ((err == ENOENT || err == ENXIO || err == ENODEV) &&
        evsel->core.attr.type == PERF_TYPE_HARDWARE &&
        evsel->core.attr.config == PERF_COUNT_HW_CPU_CYCLES) {
        /*
         * If it's cycles then fall back to hrtimer based
         * cpu-clock-tick sw counter, which is always available even if
         * no PMU support.
         */
        scnprintf(msg, msgsize, "%s",
                  "The cycles event is not supported, trying to fall back to cpu-clock-ticks");

        evsel->core.attr.type = PERF_TYPE_SOFTWARE;
        evsel->core.attr.config = PERF_COUNT_SW_CPU_CLOCK;
        return true;
    } ...
}
I would have assumed that access() was just a wrapper around stat(), but I've been googling around and have found some anecdotes about replacing stat() calls with 'cheaper' access() calls. Assuming you are only interested in checking whether a file exists, is access() faster? Does it vary completely by filesystem?
Theory
I doubt that.
In the lower layers of the kernel there is not much difference between the access() and stat() calls: both perform a lookup operation, mapping the file name to an entry in the dentry cache and to an inode (the actual kernel structure is called inode). Lookup is a slow operation because it has to be performed for each component of the path: for /usr/bin/cat you need to look up usr, bin and then cat, and each step may require reading from disk -- that is why inodes and dentries are cached in memory.
The major difference between these calls is that stat() converts the inode structure to the stat structure, while access() does only a simple check, but that time is small compared with the lookup time.
The real performance gain can be achieved with the at-operations such as faccessat() and fstatat(), which let you open() a directory once; just compare:
struct stat s;

stat("/usr/bin/cat", &s);             /* looks up usr, bin and cat = 3 lookups */
stat("/usr/bin/less", &s);            /* looks up usr, bin and less = 3 lookups */

int fd = open("/usr/bin", O_RDONLY);  /* looks up usr, bin = 2 lookups */
fstatat(fd, "cat", &s, 0);            /* looks up cat = 1 lookup */
fstatat(fd, "less", &s, 0);           /* looks up less = 1 lookup */
Experiments
I wrote a small Python script that calls stat() and access():

import os, random

files = ['gzexe', 'catchsegv', 'gtroff', 'gencat', 'neqn', 'gzip',
         'getent', 'sdiff', 'zcat', 'iconv', 'not_exists', 'ldd',
         'unxz', 'zcmp', 'locale', 'xz', 'zdiff', 'localedef', 'xzcat']
access = lambda fn: os.access(fn, os.R_OK)

for i in range(80000):
    try:
        # call either access() or stat() on a random file
        random.choice((access, os.stat))("/usr/bin/" + random.choice(files))
    except OSError:
        continue
I traced the system with SystemTap to measure the time spent in the different operations. Both the stat() and access() system calls use the kernel function user_path_at_empty(), which represents the lookup operation:
stap -ve ' global tm, times, path;
probe lookup = kernel.function("user_path_at_empty")
{ name = "lookup"; pathname = user_string_quoted($name); }
probe lookup.return = kernel.function("user_path_at_empty").return
{ name = "lookup"; }
probe stat = syscall.stat
{ pathname = filename; }
probe stat, syscall.access, lookup
{ if(pid() == target() && isinstr(pathname, "/usr/bin")) {
tm[name] = local_clock_ns(); } }
probe syscall.stat.return, syscall.access.return, lookup.return
{ if(pid() == target() && tm[name]) {
times[name] <<< local_clock_ns() - tm[name];
delete tm[name];
} }
' -c 'python stat-access.py'
Here are the results:
         COUNT   AVG
lookup   80018   1.67 us
stat     40106   3.92 us
access   39903   4.27 us
Note that I disabled SELinux for these experiments, as it significantly influences the results.
The SystemTap script:
# Array to hold the list of drop points we find
global locations

# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }

# Increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }

# Every 5 seconds report our drop locations
probe timer.sec(5)
{
    printf("\n")
    foreach (l in locations-) {
        printf("%d packets dropped at location %p\n",
               @count(locations[l]), l)
    }
    delete locations
}
and the source code of kfree_skb() is:
void kfree_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
I just want to know: where does $location come from? And what is the relationship between $location and kfree_skb()? Thank you.
As per the stap.1 man page:
Many types of probe points provide context variables, which are
run-time values, safely extracted from the kernel or userspace
program being probed. These are prefixed with the $ character.
The CONTEXT VARIABLES section in stapprobes(3stap) lists what
is available for each type of probe point.
As per the stapprobes.3stap man page:
KERNEL TRACEPOINTS
This family of probe points hooks up to static probing
tracepoints inserted into the kernel or modules. [...]
Tracepoint probes look like: kernel.trace("name"). The
tracepoint name string, which may contain the usual wildcard
characters, is matched against the names defined by the kernel
developers in the tracepoint header files.
The handler associated with a tracepoint-based probe may read
the optional parameters specified at the macro call site.
[...] For example, the tracepoint probe kernel.trace("sched_switch")
provides the parameters $rq, $prev, and $next. [...]
The name of the tracepoint is available in $$name, and a string
of name=value pairs for all parameters of the tracepoint is
available in $$vars or $$parms.
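For instance, a one-liner sketch that dumps every parameter of this tracepoint as name=value pairs, using the $$name and $$parms variables described above:

probe kernel.trace("kfree_skb") { printf("%s: %s\n", $$name, $$parms) }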
As per the linux kernel source code:
% cd net/core
% git grep trace_kfree_skb
dev.c: [...]
drop_monitor.c: [...]
skbuff.c: [...]
% cd ../../include/trace/events
% git grep -A5 'TRACE_EVENT.*kfree_skb'
skb.h:TRACE_EVENT(kfree_skb,
skb.h-
skb.h- TP_PROTO(struct sk_buff *skb, void *location),
skb.h-
skb.h- TP_ARGS(skb, location),
skb.h-
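Putting these together: the tracepoint declares a location parameter, and kfree_skb() passes __builtin_return_address(0) for it, so inside the probe handler $location is the return address captured at the kfree_skb() call site, i.e. the address of the code that dropped the packet. A minimal sketch that also resolves that address to a symbol name (assuming the symname() tapset function is available in your SystemTap version):

probe kernel.trace("kfree_skb")
{
    # $location is the tracepoint's "location" argument: the return
    # address that kfree_skb() captured at its call site
    printf("drop at %s (%p)\n", symname($location), $location)
}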
Just documenting this: (self-answer to follow)
I'm aware that Sun's dtrace is not packaged for Ubuntu due to licensing issues, so I downloaded it and built it from source on Ubuntu - but I'm having an issue pretty much like the one in Simple dtraces not working · Issue #17 · dtrace4linux/linux · GitHub; namely, loading the driver seems fine:
dtrace-20130712$ sudo make load
tools/load.pl
23:20:31 Syncing...
23:20:31 Loading: build-2.6.38-16-generic/driver/dtracedrv.ko
23:20:34 Preparing symbols...
23:20:34 Probes available: 364377
23:20:44 Time: 13s
... however, if I try to run a simple script, it fails:
$ sudo ./build/dtrace -n 'BEGIN { printf("Hello, world"); exit(0); }'
dtrace: invalid probe specifier BEGIN { printf("Hello, world"); exit(0); }: "/path/to/src/dtrace-20130712/etc/sched.d", line 60: no symbolic type information is available for kernel`dtrace_cpu_id: Invalid argument
As per the issue link above:
(ctf requires a private and working libdwarf lib - most older releases have broken versions).
... I then built libdwarf from source, and then dtrace based on it (not trivial; it requires manually finding the right placement of symlinks), and I still get the same failure.
Is it possible to fix this?
Well, after a trip into gdb, I figured out that the problem occurs in dtrace's function dt_module_getctf (called via dtrace_symbol_type and, I think, dt_module_lookup_by_name). In it, I noticed that most calls propagate the attribute/variable dm_name = "linux"; but when the failure occurs, I'd get dm_name = "kernel"!
Note that original line 60 from sched.d is:
cpu_id = `dtrace_cpu_id; /* C->cpu_id; */
Then I found thr3ads.net - dtrace discuss - accessing symbols without type info [Nov 2006], where this error message is mentioned:
dtrace: invalid probe specifier fbt::calcloadavg:entry {
printf("CMS_USER: %d, CMS_SYSTEM: %d, cpu_waitrq: %d\n",
`cpu0.cpu_acct[0], `cpu0.cpu_acct[1], `cpu0.cpu_waitrq);}: in action
list: no symbolic type information is available for unix`cpu0: No type
information available for symbol
So:
on that system, the request `cpu0.cpu_acct[0] got resolved to unix`cpu0;
and on my system, the request `dtrace_cpu_id got resolved to kernel`dtrace_cpu_id.
And since "The backtick operator is used to read the
value of kernel variables, which will be specific to the running kernel." (howto measure CPU load - DTrace General Discussion - ArchiveOrange), I thought maybe explicitly "casting" this "backtick variable" to linux would help.
And indeed it does - only a small section of sched.d needs to be changed to this:
translator cpuinfo_t < dtrace_cpu_t *C > {
cpu_id = linux`dtrace_cpu_id; /* C->cpu_id; */
cpu_pset = -1;
cpu_chip = linux`dtrace_cpu_id; /* C->cpu_id; */
cpu_lgrp = 0; /* XXX */
/* cpu_info = *((_processor_info_t *)`dtrace_zero); /* ` */ /* XXX */
};
inline cpuinfo_t *curcpu = xlate <cpuinfo_t *> (&linux`dtrace_curcpu);
... and suddenly, it starts working:
dtrace-20130712$ sudo ./build/dtrace -n 'BEGIN { printf("Hello, world"); exit(0); }'
dtrace: description 'BEGIN ' matched 1 probe
CPU ID FUNCTION:NAME
1 1 :BEGIN Hello, world
PS:
Protip 1: NEVER run dtrace -n '::: { printf("Hello"); }' - this means "do a printf on each and every kernel event"; it will completely freeze the kernel, and not even Ctrl-Alt-Del will work!
Protip 2: If you want to use DTRACE_DEBUG as in Debugging DTrace, use sudo -E:
dtrace-20130712$ DTRACE_DEBUG=1 sudo -E ./build/dtrace -n 'BEGIN { printf("Hello, world"); exit(0); }'
libdtrace DEBUG: reading kernel .ctf: /path/to/src/dtrace-20130712/build-2.6.38-16-generic/linux-2.6.38-16-generic.ctf
libdtrace DEBUG: opened 32-bit /proc/kallsyms (syms=75761)
...
I'm trying to reproduce the ALGOL 60 code written by Dijkstra in the paper titled "Cooperating sequential processes"; the code is his first attempt to solve the mutual exclusion problem. Here is the syntax:
begin integer turn; turn:= 1;
parbegin
process 1: begin L1: if turn = 2 then goto L1;
critical section 1;
turn:= 2;
remainder of cycle 1; goto L1
end;
process 2: begin L2: if turn = 1 then goto L2;
critical section 2;
turn:= 1;
remainder of cycle 2; goto L2
end
parend
end
So I tried to reproduce the above code in Promela and here is my code:
#define true 1
#define Aturn true
#define Bturn false
bool turn, status;
active proctype A()
{
L1: (turn == 1);
status = Aturn;
goto L1;
/* critical section */
turn = 1;
}
active proctype B()
{
L2: (turn == 2);
status = Bturn;
goto L2;
/* critical section */
turn = 2;
}
never{ /* ![]p */
if
:: (!status) -> skip
fi;
}
init
{ turn = 1;
run A(); run B();
}
What I'm trying to do is verify that the fairness property can never hold, because the loop at label L1 runs infinitely.
The issue is that my never claim block does not produce any error; the output simply says that my statement was never reached.
here is the actual output from iSpin
spin -a dekker.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Pid: 46025
(Spin Version 6.2.3 -- 24 October 2012)
+ Partial Order Reduction
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 44 byte, depth reached 8, errors: 0
11 states, stored
9 states, matched
20 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.001 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in proctype A
dekker.pml:13, state 4, "turn = 1"
dekker.pml:15, state 5, "-end-"
(2 of 5 states)
unreached in proctype B
dekker.pml:20, state 2, "status = 0"
dekker.pml:23, state 4, "turn = 2"
dekker.pml:24, state 5, "-end-"
(3 of 5 states)
unreached in claim never_0
dekker.pml:30, state 5, "-end-"
(1 of 5 states)
unreached in init
(0 of 4 states)
pan: elapsed time 0 seconds
No errors found -- did you verify all claims?
I've read all the Spin documentation on the never{..} block but couldn't find my answer (here is the link). I've also tried using ltl{..} blocks (link), but that just gave me a syntax error, even though the documentation explicitly says it can appear outside init and the proctypes. Can someone help me correct this code, please?
Thank you
You've redefined 'true', which can't possibly be good. I axed that redefinition and the never claim fails. But the failure is immaterial to your goal: the initial state of 'status' is 'false', so the never claim runs to completion immediately, and a completed never claim is a failure.
Also, it is slightly bad form to assign 1 or 0 to a bool; assign true or false instead, or use bit. Why not follow the Dijkstra code more closely and use an 'int' or 'byte'? It is not as if performance will be an issue in this problem.
You don't need 'active' if you are going to call 'run' - just one or the other.
My translation of 'process 1' would be:
proctype A()
{
L1: (turn != 2) ->
        /* critical section */
        status = Aturn;
        turn = 2;
        /* remainder of cycle 1 */
        goto L1
}
but I could be wrong on that.
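As for the ltl{..} syntax error you mention: inline ltl blocks require Spin version 6 or later, so an older Spin would reject them outright. Note also that your iSpin output above shows the verifier was compiled with -DSAFETY -DNOCLAIM ("never claim - (not selected)"), so the never claim was not even evaluated in that run; re-run with claim/acceptance checking enabled before drawing conclusions. A minimal sketch of the property as an ltl block, assuming Spin 6 and that you intend to check that status stays true forever (which the initial state already falsifies, just like the never claim):

/* Spin 6+ only: replaces the hand-written never claim */
ltl always_status { [] status }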