The SystemTap script:
# Array to hold the list of drop points we find
global locations
# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }
# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }
# Every 5 seconds report our drop locations
probe timer.sec(5)
{
printf("\n")
foreach (l in locations-) {
printf("%d packets dropped at location %p\n",
#count(locations[l]), l)
}
delete locations
}
and the source code of kfree_skb() is:
void kfree_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
I just want to know where $location comes from, and what the relationship is between $location and kfree_skb(). Thank you.
As per the stap.1 man page:
Many types of probe points provide context variables, which are
run-time values, safely extracted from the kernel or userspace
program being probed. These are prefixed with the $
character. The CONTEXT VARIABLES section in stapprobes(3stap)
lists what is available for each type of probe point.
As per the stapprobes.3stap man page:
KERNEL TRACEPOINTS
This family of probe points hooks up to static probing
tracepoints inserted into the kernel or modules. [...]
Tracepoint probes look like: kernel.trace("name"). The
tracepoint name string, which may contain the usual wildcard
characters, is matched against the names defined by the kernel
developers in the tracepoint header files.
The handler associated with a tracepoint-based probe may read
the optional parameters specified at the macro call site.
[...] For example, the tracepoint probe kernel.trace("sched_switch")
provides the parameters $rq, $prev, and $next. [...]
The name of the tracepoint is available in $$name, and a string
of name=value pairs for all parameters of the tracepoint is
available in $$vars or $$parms.
As per the linux kernel source code:
% cd net/core
% git grep trace_kfree_skb
dev.c: [...]
drop_monitor.c: [...]
skbuff.c: [...]
% cd ../../include/trace/events
% git grep -A5 'TRACE_EVENT.*kfree_skb'
skb.h:TRACE_EVENT(kfree_skb,
skb.h-
skb.h- TP_PROTO(struct sk_buff *skb, void *location),
skb.h-
skb.h- TP_ARGS(skb, location),
skb.h-
So the kfree_skb tracepoint is declared with two parameters, skb and location, and SystemTap exposes each TP_PROTO parameter as a context variable named after it: that is where $skb and $location come from. In kfree_skb() the tracepoint is invoked as trace_kfree_skb(skb, __builtin_return_address(0)), so $location is the return address of kfree_skb()'s caller, i.e. the code location that dropped the packet.
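If __builtin_return_address(0) is unfamiliar, here is a tiny userspace sketch (hypothetical code; the function names are made up for illustration) of what it evaluates to:

#include <stdio.h>

/* callee() prints the address it will return to, which is a code
   location inside caller() -- the same trick trace_kfree_skb() uses
   to record which function called kfree_skb(). */
__attribute__((noinline)) void callee(void)
{
    printf("called from %p\n", __builtin_return_address(0));
}

void caller(void)
{
    callee();
}

int main(void)
{
    caller();
    return 0;
}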
It's clear to me that perf always records one or more events, and the sampling can be counter-based or time-based. But when the -e and -F switches are not given, what is the default behavior of perf record? The manpage for perf-record doesn't tell you what it does in this case.
The default event is cycles, as can be seen by running perf script after perf record. There, you can also see that the default sampling behavior is time-based, since the number of cycles is not constant. The default frequency is 4000 Hz, which can be seen in the source code and checked by comparing the file size or number of samples to a recording where -F 4000 was specified.
The perf wiki says that the rate is 1000 Hz, but this is not true anymore for kernels newer than 3.4.
Default event selection in perf record is done in the user-space perf tool, which is distributed as part of the Linux kernel sources. With make perf-src-tar-gz in the kernel source directory we can build a tar.gz for a quick rebuild, or download such a tarball from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf. There are also several online "LXR" cross-reference viewers for the kernel source which can be used much like grep to learn about perf internals.
The function that selects the default event list (evlist) for perf record is __perf_evlist__add_default in tools/perf/util/evlist.c:
int __perf_evlist__add_default(struct evlist *evlist, bool precise)
{
    struct evsel *evsel = perf_evsel__new_cycles(precise);

    if (evsel == NULL)
        return -ENOMEM;

    evlist__add(evlist, evsel);
    return 0;
}
It is called from the perf record implementation when zero events were parsed from the options, in tools/perf/builtin-record.c: int cmd_record():
    if (rec->evlist->core.nr_entries == 0 &&
        __perf_evlist__add_default(rec->evlist, !record.opts.no_samples) < 0)
        ...
perf_evsel__new_cycles asks for the hardware event cycles (PERF_TYPE_HARDWARE + PERF_COUNT_HW_CPU_CYCLES), with kernel sampling when permitted and maximum precision (see the modifiers in man perf-list; precise sampling works around instruction-pointer skid using PEBS or IBS):
struct evsel *perf_evsel__new_cycles(bool precise)
{
    struct perf_event_attr attr = {
        .type           = PERF_TYPE_HARDWARE,
        .config         = PERF_COUNT_HW_CPU_CYCLES,
        .exclude_kernel = !perf_event_can_profile_kernel(),
    };
    struct evsel *evsel;

    /*
     * Now let the usual logic to set up the perf_event_attr defaults
     * to kick in when we return and before perf_evsel__open() is called.
     */
    evsel = evsel__new(&attr);
    evsel->precise_max = true;

    /* use asprintf() because free(evsel) assumes name is allocated */
    if (asprintf(&evsel->name, "cycles%s%s%.*s",
                 (attr.precise_ip || attr.exclude_kernel) ? ":" : "",
                 attr.exclude_kernel ? "u" : "",
                 attr.precise_ip ? attr.precise_ip + 1 : 0, "ppp") < 0)
        /* error handling trimmed in this excerpt */;

    return evsel;
}
If perf_event_open fails (no access to hardware cycles sampling, for example in a virtualized environment without a virtualized PMU), there is a fallback to software cpu-clock sampling in tools/perf/builtin-record.c: int record__open(), which calls perf_evsel__fallback() from tools/perf/util/evsel.c:
bool perf_evsel__fallback(struct evsel *evsel, int err,
                          char *msg, size_t msgsize)
{
    if ((err == ENOENT || err == ENXIO || err == ENODEV) &&
        evsel->core.attr.type == PERF_TYPE_HARDWARE &&
        evsel->core.attr.config == PERF_COUNT_HW_CPU_CYCLES) {
        /*
         * If it's cycles then fall back to hrtimer based
         * cpu-clock-tick sw counter, which is always available even if
         * no PMU support.
         */
        scnprintf(msg, msgsize, "%s",
                  "The cycles event is not supported, trying to fall back to cpu-clock-ticks");

        evsel->core.attr.type   = PERF_TYPE_SOFTWARE;
        evsel->core.attr.config = PERF_COUNT_SW_CPU_CLOCK;

        return true;
    } ...
}
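To make the defaults concrete, here is a minimal userspace sketch (this is not perf source code; it simply requests the same kind of event discussed above, using the 4000 Hz default mentioned earlier) of the attribute setup perf record effectively asks the kernel for:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* raw syscall wrapper: glibc provides no perf_event_open() wrapper */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_HARDWARE;        /* default event type    */
    attr.config      = PERF_COUNT_HW_CPU_CYCLES;  /* default event: cycles */
    attr.freq        = 1;                         /* time-based sampling   */
    attr.sample_freq = 4000;                      /* the -F 4000 default   */

    int fd = perf_event_open(&attr, 0 /* this thread */, -1, -1, 0);
    if (fd < 0)
        perror("perf_event_open");  /* e.g. no PMU: perf would fall back to cpu-clock */
    else
        close(fd);
    return 0;
}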
I would have assumed that access() was just a wrapper around stat(), but I've been googling around and have found some anecdotes about replacing stat calls with 'cheaper' access calls. Assuming you are only interested in checking if a file exists, is access faster? Does it completely vary by filesystem?
Theory
I doubt that.
In the lower layers of the kernel there is not much difference between the access() and stat() calls: both perform a lookup operation, mapping a file name to an entry in the dentry cache and to an inode (the actual kernel structure, struct inode). Lookup is a slow operation because it has to be done for each component of the path, i.e. for /usr/bin/cat you need to look up usr, bin and then cat, and it can require reading from disk -- that is why inodes and dentries are cached in memory.
The major difference between the two calls is that stat() converts the inode structure into a struct stat, while access() only does a simple permission check, but that time is small compared with the lookup time.
A real performance gain can be achieved with the *at operations like faccessat() and fstatat(), which let you open() the directory once. Just compare:
struct stat s;
stat("/usr/bin/cat", &s);       // lookups: usr, bin and cat = 3
stat("/usr/bin/less", &s);      // lookups: usr, bin and less = 3

int fd = open("/usr/bin", O_RDONLY | O_DIRECTORY);  // lookups: usr, bin = 2
fstatat(fd, "cat", &s, 0);      // lookups: cat = 1
fstatat(fd, "less", &s, 0);     // lookups: less = 1
Experiments
I wrote a small python script which calls stat() and access():
import os, time, random
files = ['gzexe', 'catchsegv', 'gtroff', 'gencat', 'neqn', 'gzip',
'getent', 'sdiff', 'zcat', 'iconv', 'not_exists', 'ldd',
'unxz', 'zcmp', 'locale', 'xz', 'zdiff', 'localedef', 'xzcat']
access = lambda fn: os.access(fn, os.R_OK)
for i in xrange(1, 80000):
    try:
        random.choice((access, os.stat))("/usr/bin/" + random.choice(files))
    except:
        continue
I traced the system with SystemTap to measure the time spent in the different operations. Both the stat() and access() system calls use the user_path_at_empty() kernel function, which represents the lookup operation:
stap -ve ' global tm, times, path;
probe lookup = kernel.function("user_path_at_empty")
{ name = "lookup"; pathname = user_string_quoted($name); }
probe lookup.return = kernel.function("user_path_at_empty").return
{ name = "lookup"; }
probe stat = syscall.stat
{ pathname = filename; }
probe stat, syscall.access, lookup
{ if(pid() == target() && isinstr(pathname, "/usr/bin")) {
tm[name] = local_clock_ns(); } }
probe syscall.stat.return, syscall.access.return, lookup.return
{ if(pid() == target() && tm[name]) {
times[name] <<< local_clock_ns() - tm[name];
delete tm[name];
} }
' -c 'python stat-access.py'
Here are the results:
         COUNT    AVG
lookup   80018    1.67 us
stat     40106    3.92 us
access   39903    4.27 us
Note that I disabled SELinux in my experiments, as it has a significant influence on the results.
I would like to profile the cache behavior of a kernel module with SystemTap (number of cache references, cache misses, etc.). There is an example script online which shows how SystemTap can be used to read the perf events and counters, including cache-related ones:
https://sourceware.org/systemtap/examples/profiling/perf.stp
This sample script works by default for a process:
probe perf.hw.cache_references.process("/usr/bin/find").counter("find_insns") {}
I replaced the process keyword with module and the path to the executable with the name of my kernel module:
probe perf.hw.cache_references.module(MODULE_NAME).counter("find_insns") {}
I'm pretty sure that my module has the debug info, but running the script I get:
semantic error: while resolving probe point: identifier 'perf' at perf.stp:14:7
source: probe perf.hw.instructions.module(MODULE_NAME).counter("find_insns") {}
Any ideas what might be wrong?
Edit:
Okay, I realized that the perf counters can be bound only to processes, not to modules (explained here: https://sourceware.org/systemtap/man/stapprobes.3stap.html). Therefore I changed it back to:
probe perf.hw.cache_references.process(PATH_TO_BINARY).counter("find_insns") {}
Now, as the sample script suggests, I have:
probe module(MODULE_NAME).function(FUNC_NAME) {
    # save counter values on entrance
    ...
}
But now running it, I get:
semantic error: perf counter 'find_insns' not defined
semantic error: while resolving probe point: identifier 'module' at perf.stp:26:7
        source: probe module(MODULE_NAME).function(FUNC_NAME)
Edit2:
So here is my complete script:
#! /usr/bin/env stap
# Usage: stap perf.stp <path-to-binary> <module-name> <function-name>
global cycles_per_insn
global branch_per_insn
global cacheref_per_insn
global insns
global cycles
global branches
global cacherefs
global insn
global cachemisses
global miss_per_insn
probe perf.hw.instructions.process(#1).counter("find_insns") {}
probe perf.hw.cpu_cycles.process(#1).counter("find_cycles") {}
probe perf.hw.branch_instructions.process(#1).counter("find_branches") {}
probe perf.hw.cache_references.process(#1).counter("find_cache_refs") {}
probe perf.hw.cache_misses.process(#1).counter("find_cache_misses") {}
probe module(#2).function(#3)
{
insn["find_insns"] = #perf("find_insns")
insns <<< (insn["find_insns"])
insn["find_cycles"] = #perf("find_cycles")
cycles <<< insn["find_cycles"]
insn["find_branches"] = #perf("find_branches")
branches <<< insn["find_branches"]
insn["find_cache_refs"] = #perf("find_cache_refs")
cacherefs <<< insn["find_cache_refs"]
insn["find_cache_misses"] = #perf("find_cache_misses")
cachemisses <<< insn["find_cache_misses"]
}
probe module(#2).function(#3).return
{
dividend = (#perf("find_cycles") - insn["find_cycles"])
divisor = (#perf("find_insns") - insn["find_insns"])
q = dividend / divisor
if (q > 0)
cycles_per_insn <<< q
dividend = (#perf("find_branches") - insn["find_branches"])
q = dividend / divisor
if (q > 0)
branch_per_insn <<< q
dividend = (#perf("find_cycles") - insn["find_cycles"])
q = dividend / divisor
if (q > 0)
cacheref_per_insn <<< q
dividend = (#perf("find_cache_misses") - insn["find_cache_misses"])
q = dividend / divisor
if (q > 0)
miss_per_insn <<< q
}
probe end
{
if (#count(cycles_per_insn)) {
printf ("Cycles per Insn\n\n")
print (#hist_log(cycles_per_insn))
}
if (#count(branch_per_insn)) {
printf ("\nBranches per Insn\n\n")
print (#hist_log(branch_per_insn))
}
if (#count(cacheref_per_insn)) {
printf ("Cache Refs per Insn\n\n")
print (#hist_log(cacheref_per_insn))
}
if (#count(miss_per_insn)) {
printf ("Cache Misses per Insn\n\n")
print (#hist_log(miss_per_insn))
}
}
SystemTap can't read hardware perf counter values for kernel probes, because Linux doesn't provide a suitable (e.g., atomic) internal API for safely reading those values from all contexts. The perf.*.process probes work only because that context is not atomic: the SystemTap probe handler can block safely.
I cannot answer your detailed question about the two (?) scripts you last experimented with, because they're not complete.
Currently I find entries like this in kern.log:
[6516247.445846] ex3.x[30901]: segfault at 0 ip 0000000000400564 sp 00007fff96ecb170 error 6 in ex3.x[400000+1000]
[6516254.095173] ex3.x[30907]: segfault at 0 ip 0000000000400564 sp 00007fff0001dcf0 error 6 in ex3.x[400000+1000]
[6516662.523395] ex3.x[31524]: segfault at 7fff80000000 ip 00007f2e11e4aa79 sp 00007fff807061a0 error 4 in libc-2.13.so[7f2e11dcf000+180000]
(As you can see, the apps causing the segfaults are named ex3.x, meaning the exercise 3 executable.)
Is there a way to ask kern.log to log the complete path? Something like:
[6...] /home/user/cclass/ex3.x[3...]: segfault at 0 ip 0564 sp 07f70 error 6 in ex3.x[4...]
so that I can easily figure out which user/student this ex3.x belongs to?
Thanks!
Beco
That log message comes from the kernel with a fixed format that only includes the first 16 characters of the executable name, excluding the path, as per show_signal_msg; see the equivalent code for segmentation faults on non-x86 architectures.
As mentioned by Makyen, without significant changes to the kernel and a recompile, the message given to klogd which is passed to syslog won't have the information you are requesting.
I am not aware of any log transformation or injection functionality in syslog or klogd which would allow you to take the name of the file and run either locate or file on the filesystem in order to find the full path.
The best way to get the information you are looking for is to use crash interception software like apport, abrt or corekeeper. These tools store the process metadata from the /proc filesystem, including the process's command line, which contains the directory it was run from, assuming the binary was invoked with a full path rather than simply found via PATH.
The other, more generic way would be to enable core dumps and then set /proc/sys/kernel/core_pattern to include %E, so that the core file name includes the path of the binary (%E expands to the pathname of the executable, with slashes replaced by '!').
The short answer is: No, it is not possible without making code changes and recompiling the kernel. The normal solution to this problem is to instruct your students to name their executable <student user name>_ex3.x so that you can easily have this information.
However, it is possible to get the information you desire from other methods. Appleman1234 has provided some alternatives in his answer to this question.
How do we know the answer is "not possible to get the full path in the kern.log segfault messages without recompiling the kernel"?
We look in the kernel source code to find out how the message is produced and if there are any configuration options.
The files in question are part of the kernel source. You can download the entire kernel source as an rpm package (or other type of package) for whatever version of linux/debian you are running from a variety of places.
Specifically, the output that you are seeing is produced from whichever of the following files is for your architecture:
linux/arch/sparc/mm/fault_32.c
linux/arch/sparc/mm/fault_64.c
linux/arch/um/kernel/trap.c
linux/arch/x86/mm/fault.c
An example of the relevant function from one of the files(linux/arch/x86/mm/fault.c):
/*
 * Print out info about fatal segfaults, if the show_unhandled_signals
 * sysctl is set:
 */
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
                unsigned long address, struct task_struct *tsk)
{
    if (!unhandled_signal(tsk, SIGSEGV))
        return;

    if (!printk_ratelimit())
        return;

    printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
           task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
           tsk->comm, task_pid_nr(tsk), address,
           (void *)regs->ip, (void *)regs->sp, error_code);

    print_vma_addr(KERN_CONT " in ", regs->ip);

    printk(KERN_CONT "\n");
}
From that we see that the variable used to print the process name is tsk->comm (where struct task_struct *tsk), and the instruction pointer passed to print_vma_addr() is regs->ip (where struct pt_regs *regs).
Then from linux/include/linux/sched.h
struct task_struct {
    ...
    char comm[TASK_COMM_LEN]; /* executable name excluding path
                                 - access with [gs]et_task_comm (which lock
                                   it with task_lock())
                                 - initialized normally by setup_new_exec */
The comment makes it clear that the path for the executable is not stored in the structure.
For regs->ip where struct pt_regs *regs, it is defined in whichever of the following are appropriate for your architecture:
arch/arc/include/asm/ptrace.h
arch/arm/include/asm/ptrace.h
arch/arm64/include/asm/ptrace.h
arch/cris/include/arch-v10/arch/ptrace.h
arch/cris/include/arch-v32/arch/ptrace.h
arch/metag/include/asm/ptrace.h
arch/mips/include/asm/ptrace.h
arch/openrisc/include/asm/ptrace.h
arch/um/include/asm/ptrace-generic.h
arch/x86/include/asm/ptrace.h
arch/xtensa/include/asm/ptrace.h
From there we see that struct pt_regs defines the registers for the architecture; ip is just: unsigned long ip;
Thus, we have to look at what print_vma_addr() does. It is defined in mm/memory.c
/*
 * Print the name of a VMA.
 */
void print_vma_addr(char *prefix, unsigned long ip)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;

    /*
     * Do not print if we are in atomic
     * contexts (in exception stacks, etc.):
     */
    if (preempt_count())
        return;

    down_read(&mm->mmap_sem);
    vma = find_vma(mm, ip);
    if (vma && vma->vm_file) {
        struct file *f = vma->vm_file;
        char *buf = (char *)__get_free_page(GFP_KERNEL);
        if (buf) {
            char *p;

            p = d_path(&f->f_path, buf, PAGE_SIZE);
            if (IS_ERR(p))
                p = "?";
            printk("%s%s[%lx+%lx]", prefix, kbasename(p),
                   vma->vm_start,
                   vma->vm_end - vma->vm_start);
            free_page((unsigned long)buf);
        }
    }
    up_read(&mm->mmap_sem);
}
This shows us that a path was available. We would need to check whether it was the full path, but looking a bit further in the code hints that it might not matter: we need to see what kbasename() does with the path passed to it. kbasename() is defined in include/linux/string.h as:
/**
 * kbasename - return the last part of a pathname.
 *
 * @path: path to extract the filename from.
 */
static inline const char *kbasename(const char *path)
{
    const char *tail = strrchr(path, '/');
    return tail ? tail + 1 : path;
}
This, even if the full path is available before it, chops off everything except the last component of the pathname, leaving just the filename.
Thus, no amount of runtime configuration options will permit printing out the full pathname of the file in the segment fault messages you are seeing.
NOTE: I've changed all of the links to kernel source to point to archives, rather than the original locations. Those links will get close to the code as it was at the time I wrote this, 2014-09. As should be no surprise, the code does evolve over time, so the code which is current when you're reading this may or may not be similar or behave in the way described here.
Hi,
I have used sys_getpid() from within the kernel to get the process id. How can I find out the process name from a kernel struct? Does it exist in the kernel?
Thanks very much
struct task_struct contains a member called comm; it holds the executable name, excluding the path.
The current macro gives you the task_struct of the process that is currently executing your code; for a module's init function, that is the process that loaded the module (insmod/modprobe).
Using the above you can get the name.
Not sure, but find_task_by_pid_ns might be useful.
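Putting the comm/current pieces above together, here is a minimal sketch of a module init function that prints who loaded it (the module name and message text are made up for illustration):

#include <linux/module.h>
#include <linux/sched.h>

static int __init whoami_init(void)
{
    char comm[TASK_COMM_LEN];

    /* get_task_comm() takes the task lock and copies current->comm safely */
    get_task_comm(comm, current);
    pr_info("whoami: loaded by %s[%d]\n", comm, task_pid_nr(current));
    return 0;
}

static void __exit whoami_exit(void)
{
}

module_init(whoami_init);
module_exit(whoami_exit);
MODULE_LICENSE("GPL");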
My kernel module loads with "modprobe -v my_module --allow-unsupported -o some-data" and I extract the "some-data" parameter. The following code gave me the entire command line, and here is how I parsed out the parameter of interest:
struct mm_struct *mm;
unsigned int x, cmdlen;

/* We run in the context of the loading process (modprobe), so its
 * command line is mapped in current->mm.  Note: get_task_mm() takes
 * a reference that should be dropped with mmput() when done, and
 * dereferencing the user-space addresses directly only works because
 * we are in that process's context; copy_from_user() would be the
 * more robust way. */
mm = get_task_mm(current);
down_read(&mm->mmap_sem);
cmdlen = mm->arg_end - mm->arg_start;
/* arguments are NUL-separated in [arg_start, arg_end); look for "-o" */
for (x = 0; x < cmdlen; x++) {
    if (*(unsigned char *)(mm->arg_start + x) == '-' &&
        *(unsigned char *)(mm->arg_start + (x + 1)) == 'o') {
        break;
    }
}
up_read(&mm->mmap_sem);

if (x == cmdlen) {
    printk(KERN_ERR "inject: ERROR - no target specified\n");
    return -EINVAL;
}
/* skip "-o" and its terminating NUL to reach the value */
strcpy(target, (unsigned char *)(mm->arg_start + (x + 3)));
"target" holds the string after the -o parameter. You can compress this somewhat - the caller (in this case, modprobe) will be the first string in mm->arg_start - to suit your needs.
You can look at the special files in /proc/<pid>/.
For example, /proc/<pid>/exe is a symlink pointing to the actual binary.
/proc/<pid>/cmdline is a null-delimited list of the command line, so the first word is the process name.
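For completeness, here is a small userspace sketch (a hypothetical example program; the file names follow the /proc layout described above and error handling is minimal) that resolves both of these for a given PID:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
    char path[64], exe[4096], cmdline[4096];
    ssize_t n;

    /* /proc/<pid>/exe is a symlink to the binary actually being executed */
    snprintf(path, sizeof(path), "/proc/%d/exe", (int)pid);
    n = readlink(path, exe, sizeof(exe) - 1);
    if (n >= 0) {
        exe[n] = '\0';                  /* readlink() does not NUL-terminate */
        printf("exe:     %s\n", exe);
    }

    /* /proc/<pid>/cmdline is NUL-delimited; the first string is argv[0] */
    snprintf(path, sizeof(path), "/proc/%d/cmdline", (int)pid);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        n = read(fd, cmdline, sizeof(cmdline) - 1);
        close(fd);
        if (n > 0) {
            cmdline[n] = '\0';
            printf("argv[0]: %s\n", cmdline);
        }
    }
    return 0;
}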