How does Linux Process Accounting (psacct) work? - linux

I found a lot of documents about psacct, but they cover usage, not how it works.
Question
I really want to know how process accounting works:
Which part of the system records information about processes?
How does it work?
Already done
I installed psacct on RHEL 6.5.
The service start script (/etc/init.d/psacct) actually calls this:
/sbin/accton $ACCTFILE
/sbin/accton in turn calls the acct() system call:
man acct
DESCRIPTION
The acct() system call enables or disables process accounting. If called with the name of an existing file as its argument, accounting is
turned on, and records for each terminating process are appended to filename as it terminates. An argument of NULL causes accounting to be
turned off.

The answer to your question is in the Linux source file kernel/acct.c, particularly in the fill_ac function:
/*
* Write an accounting entry for an exiting process
*
* The acct_process() call is the workhorse of the process
* accounting system. The struct acct is built here and then written
* into the accounting file. This function should only be called from
* do_exit() or when switching to a different output file.
*/
static void fill_ac(acct_t *ac)
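As a rough illustration of what ends up in the accounting file: according to acct(5), the time fields of the record that fill_ac builds are stored as comp_t, a 16-bit pseudo-float with a 13-bit mantissa and a 3-bit base-8 exponent, counted in clock ticks. The sketch below decodes such a value. The function names and the ahz (ticks per second) parameter are illustrative choices, not kernel or psacct API.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a comp_t: the low 13 bits are the mantissa,
 * the high 3 bits are a base-8 exponent. */
static uint64_t comp_t_to_ticks(uint16_t c)
{
    uint64_t mantissa = c & 0x1fffu;
    unsigned exponent = (c >> 13) & 0x7u;
    return mantissa << (3u * exponent);
}

/* Convert a comp_t time field to seconds.
 * ahz is the tick rate, assumed to be known by the caller. */
static double comp_t_to_seconds(uint16_t c, unsigned ahz)
{
    assert(ahz > 0);
    return (double)comp_t_to_ticks(c) / (double)ahz;
}
```

For example, comp_t_to_ticks(0x2001) has mantissa 1 and exponent 1, so the mantissa is shifted left by 3 bits.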

Related

Full form of ttwu in the scheduling code of the linux kernel

I know this is kind of silly, but I tried to find out online and couldn't.
What is the full form of ttwu in the scheduler code of the Linux kernel? It appears as a prefix on a number of functions, namely:
ttwu_do_wakeup
ttwu_do_activate
ttwu_queue_remote
ttwu_activate
.. and many more
I would assume it stands for try_to_wake_up. See for example the comment in kernel/sched/sched.h:
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
Yes. In true *nix philosophy, why waste time on extra characters? (Want to know the current working directory? Use pwd, for "print working directory".) TTWU is indeed "Try To Wake Up". It is implemented in the Linux scheduler code and eventually calls activate_task, which does nothing more than put the task on the run queue of one of the CPUs. At some point in the future the __schedule function will actually run it (via context_switch). Pretty cool stuff if you ask me.

Can eBPF modify the return value or parameters of a syscall?

To simulate some behavior I would like to attach a probe to a syscall and modify the return value when certain parameters are passed. Alternatively, it would also be enough to modify the parameters of the function before they are processed.
Is this possible with BPF?
Within kernel probes (kprobes), the eBPF virtual machine has read-only access to the syscall parameters and return value.
However, the eBPF program will have a return code of its own. It is possible to apply a seccomp profile that traps BPF (NOT eBPF; thanks @qeole) return codes and interrupts the system call during execution.
The allowed runtime modifications are:
SECCOMP_RET_KILL: Immediate kill with SIGSYS
SECCOMP_RET_TRAP: Send a catchable SIGSYS, giving a chance to emulate the syscall
SECCOMP_RET_ERRNO: Force errno value
SECCOMP_RET_TRACE: Yield decision to ptracer or set errno to -ENOSYS
SECCOMP_RET_ALLOW: Allow
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
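To make the list above concrete: a classic BPF filter returns a 32-bit value whose upper bits select one of these actions and whose low 16 bits (SECCOMP_RET_DATA) carry a payload such as the errno to force. Here is a minimal sketch using the masks from <linux/seccomp.h>. The helper names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Value a filter would return to fail the syscall with the given errno. */
static uint32_t seccomp_ret_errno(int err)
{
    assert(err > 0 && (uint32_t)err <= SECCOMP_RET_DATA);
    return SECCOMP_RET_ERRNO | (uint32_t)err;
}

/* Split a filter return value into its action and data parts. */
static uint32_t seccomp_action(uint32_t ret) { return ret & SECCOMP_RET_ACTION; }
static uint32_t seccomp_data(uint32_t ret)   { return ret & SECCOMP_RET_DATA; }
```

With these helpers, seccomp_ret_errno(EPERM) yields a value whose action part is SECCOMP_RET_ERRNO and whose data part is EPERM.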
The SECCOMP_RET_TRACE method enables modifying the system call performed, arguments, or return value. This is architecture dependent and modification of mandatory external references may cause an ENOSYS error.
It does so by passing execution up to a waiting userspace tracer (attached via ptrace), which has the ability to modify the traced process's memory, registers, and file descriptors.
The tracer needs to call ptrace and then waitpid. An example:
ptrace(PTRACE_SETOPTIONS, tracee_pid, 0, PTRACE_O_TRACESECCOMP);
waitpid(tracee_pid, &status, 0);
http://man7.org/linux/man-pages/man2/ptrace.2.html
When waitpid returns, depending on the contents of status, one can retrieve the seccomp return value using the PTRACE_GETEVENTMSG ptrace operation. This will retrieve the seccomp SECCOMP_RET_DATA value, which is a 16-bit field set by the BPF program. Example:
ptrace(PTRACE_GETEVENTMSG, tracee_pid, 0, &data);
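A small sketch of both sides of this hand-off, assuming the filter tags the syscall with a 16-bit value of its choosing: the filter side returns SECCOMP_RET_TRACE combined with the tag, and the tracer recognizes the resulting stop from the waitpid status using the check given in ptrace(2). The helper names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <linux/seccomp.h>

/* Filter side: hand the syscall to the tracer, tagged with a 16-bit value. */
static uint32_t seccomp_ret_trace(uint32_t tag)
{
    assert(tag <= SECCOMP_RET_DATA);
    return SECCOMP_RET_TRACE | tag;
}

/* Tracer side: does this waitpid() status report a PTRACE_EVENT_SECCOMP stop? */
static int is_seccomp_stop(int status)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}
```

Once is_seccomp_stop() is true for the status returned by waitpid, the PTRACE_GETEVENTMSG call shown above retrieves the tag.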
Syscall arguments can be modified in memory before continuing operation. You can perform a single syscall entry or exit with the PTRACE_SYSCALL step. Syscall return values can be modified in userspace before resuming execution; the underlying program won't be able to see that the syscall return values have been modified.
An example implementation:
Filter and Modify System Calls with seccomp and ptrace
I believe that attaching eBPF to kprobes/kretprobes gives you read access to function arguments and return values, but that you cannot tamper with them. I am NOT 100% sure; good places to ask for confirmation would be the IO Visor project mailing list or IRC channel (#iovisor at irc.oftc.net).
As an alternative solution, I know you can at least change the return value of a syscall with strace, with the -e option. Quoting the manual page:
-e inject=set[:error=errno|:retval=value][:signal=sig][:when=expr]
Perform syscall tampering for the specified set of syscalls.
Also, there was a presentation on this, and fault injection, at Fosdem 2017, if it is of any interest to you. Here is one example command from the slides:
strace -P precious.txt -efault=unlink:retval=0 unlink precious.txt
Edit: As stated by Ben, eBPF on kprobes and tracepoints is definitively read only, for tracing and monitoring use cases. I also got confirmation about this on IRC.
It is possible to modify some user space memory using eBPF. As stated in the bpf.h header file:
* int bpf_probe_write_user(void *dst, const void *src, u32 len)
* Description
* Attempt in a safe way to write *len* bytes from the buffer
* *src* to *dst* in memory. It only works for threads that are in
* user context, and *dst* must be a valid user space address.
*
* This helper should not be used to implement any kind of
* security mechanism because of TOC-TOU attacks, but rather to
* debug, divert, and manipulate execution of semi-cooperative
* processes.
*
* Keep in mind that this feature is meant for experiments, and it
* has a risk of crashing the system and running programs.
* Therefore, when an eBPF program using this helper is attached,
* a warning including PID and process name is printed to kernel
* logs.
* Return
* 0 on success, or a negative error in case of failure.
Also, quoting from the BPF design Q&A:
Tracing BPF programs can overwrite the user memory of the current
task with bpf_probe_write_user(). Every time such program is loaded
the kernel will print warning message, so this helper is only useful
for experiments and prototypes. Tracing BPF programs are root only.
Your eBPF program may write data into user space memory locations. Note that you still cannot modify kernel structures from within your eBPF program.
It is possible to inject errors into a system call invocation using eBPF: https://lwn.net/Articles/740146/
There is a bpf function called bpf_override_return(), which can override the return value of an invocation. This is an example using bcc as the front-end: https://github.com/iovisor/bcc/blob/master/tools/inject.py
According to the Linux manual page:
bpf_override_return() is only available if the kernel was compiled with the CONFIG_BPF_KPROBE_OVERRIDE configuration option, and in this case it only works on functions tagged with ALLOW_ERROR_INJECTION in the kernel code.
Also, the helper is only available for the architectures having the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing, x86 architecture is the only one to support this feature.
It is possible to add a function to the error injection framework. More information can be found here: https://github.com/iovisor/bcc/issues/2485

Is there any alternative to wait3 to get rusage structure in shell scripting?

I was trying to monitor the peak memory usage of a child process. time -v is an option, but it does not work on Solaris. So is there any way to get the details that are in the rusage structure from shell scripting?
You can use /usr/bin/timex
From the /usr/bin/timex man page:
The given command is executed; the elapsed time, user time and system
time spent in execution are reported in seconds. Optionally, process
accounting data for the command and all its children can be listed or
summarized, and total system activity during the execution interval
can be reported.
...
-p List process accounting records for command and all its children. This option works only if the process accounting software is installed. Suboptions f, h, k, m, r, and t modify the data items
reported. The options are as follows:
...
Start with the man page for acctadm to get process accounting enabled.
Note that on Solaris, getrusage() and wait3() do not return memory usage statistics. See the (somewhat dated) getrusage() source code at http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/rusagesys.c and the wait3() source code at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libbc/libc/sys/common/wait.c#158 (That's actually OpenSolaris source, which Oracle dropped support for, and it may not represent the current Solaris implementation, although a few tests on Solaris 11.2 show that the RSS data is in fact still zero.)
Also, from the Solaris getrusage() man page:
The ru_maxrss, ru_ixrss, ru_idrss, and ru_isrss members of the
rusage structure are set to 0 in this implementation.
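For comparison, here is a minimal C sketch of where these numbers come from on a system that does fill in ru_maxrss (Linux reports it in kilobytes). Per the man page excerpt above, the same code on Solaris would report 0. The helper name child_peak_rss is an assumption for illustration, and getrusage(RUSAGE_CHILDREN) stands in for wait3().

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the command in argv and return the peak RSS recorded for
 * waited-for children, or -1 on error. */
static long child_peak_rss(char *const argv[])
{
    assert(argv && argv[0]);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    struct rusage ru;
    if (getrusage(RUSAGE_CHILDREN, &ru) < 0)
        return -1;
    return ru.ru_maxrss;         /* maximum over all children waited for so far */
}
```

A small wrapper program built around this could print the value so that a shell script can capture it.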
There are almost certainly other ways to get the data, such as dtrace.
Edit:
dtrace doesn't look to be much help, unfortunately. Attempting to run this dtrace script with dtrace -s memuse.d -c bash
#!/usr/sbin/dtrace -s
#pragma D option quiet
profile:::profile-1001hz
/ pid == $target /
{
    @pct[ pid ] = max( curpsinfo->pr_pctmem );
}
dtrace:::END
{
    printa( "pct: %@u %a\n", @pct );
}
resulted in the following error message:
dtrace: failed to compile script memuse.d: line 8: translator does not define conversion for member: pr_pctmem
dtrace on Solaris doesn't appear to provide access to process memory usage. In fact, the Solaris 11.2 /usr/lib/dtrace/procfs.d translator for procfs data has this comment in it:
/*
* Translate from the kernel's proc_t structure to a proc(4) psinfo_t struct.
* We do not provide support for pr_size, pr_rssize, pr_pctcpu, and pr_pctmem.
* We also do not fill in pr_lwp (the lwpsinfo_t for the representative LWP)
* because we do not have the ability to select and stop any representative.
* Also, for the moment, pr_wstat, pr_time, and pr_ctime are not supported,
* but these could be supported by DTrace in the future using subroutines.
* Note that any member added to this translator should also be added to the
* kthread_t-to-psinfo_t translator, below.
*/
Browsing the Illumos.org source code, searching for pr_rssize, indicates that the procfs data is computed only when needed, and not updated continually as the process runs. (See http://src.illumos.org/source/search?q=pr_rssize&defs=&refs=&path=&hist=&project=illumos-gate)

Function graph (timestamped entry and exit) for both user, library and kernel space in Linux?

I'm writing this more-less in frustration - but who knows, maybe there's a way for this too...
I would like to analyze what happens with a function from ALSA, say snd_pcm_readi; for that purpose, let's say I have prepared a small testprogram.c, where I have this:
void doCapture() {
ret = snd_pcm_readi(handle, buffer, period_size);
}
The problem with this function is that it eventually (should) hook into snd_pcm_readi in the shared system library /usr/lib/libasound.so; from there, I believe via ioctl, it would somehow communicate to snd_pcm_read in the kernel module /lib/modules/$(uname -r)/kernel/sound/core/snd-pcm.ko -- and that should ultimately talk to whatever .ko kernel module which is a driver for a particular soundcard.
Now, with the organization like above, I can do something like:
valgrind --tool=callgrind --toggle-collect=doCapture ./testprogram
... and then kcachegrind callgrind.out.12406 does indeed reveal a relationship between snd_pcm_readi, libasound.so and an ioctl (I cannot get the same information to show with callgrind_annotate) - so that somewhat covers userspace; but that is as far as it goes. Furthermore, it produces a call graph, that is to say general caller/callee relationships between functions (possibly by a count of samples/ticks each function has spent working as scheduled).
However, what I would like to get instead, is something like the output of the Linux ftrace tracer called function_graph, which provides a timestamped entry and exit of traced kernel functions... example from ftrace: add documentation for function graph tracer [LWN.net]:
$ cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
2105.963678 | 0) | mutex_unlock() {
2105.963682 | 0) 5.715 us | __mutex_unlock_slowpath();
2105.963693 | 0) + 14.700 us | }
2105.963698 | 0) | dnotify_parent() {
(NB: newer ftrace documentation seems to not show a timestamp at first for the function_graph, only duration - but I think it's still possible to modify that)
With ftrace, one can filter so one can only trace functions in a given kernel module - so in my case, I could add the functions of snd-pcm.ko and whatever .ko module is the soundcard driver, and I'd have whatever I find interesting in kernel-space covered. But then, I lose the link to the user-space program (unless I explicitly printf to /sys/kernel/debug/tracing/trace_marker, or do a trace_printk from user-space .c files)
Ultimately, what I'd like, is to have the possibility to specify an executable, possibly also library files and kernel modules - and obtain a timestamped function graph (with indented/nested entry and exit per function) like ftrace provides. Are there any alternatives for something like this? (Note I can live without the function exits - but I'd really like to have timestamped function entries)
As a PS: it seems I actually found something that fits the description, which is the fulltrace application/script:
fulltrace [andreoli@GitHub]
fulltrace traces the execution of an ELF program, providing as output a full trace of its userspace, library and kernel function calls. ...
(prerequisites) the following kernel configuration options and their dependencies must be set as enabled (=y): FTRACE, TRACING_SUPPORT, UPROBES, UPROBE_EVENT, FUNCTION_GRAPH_TRACER.
Sounds perfect - but the problem is, I'm on Ubuntu 11.04, and while this 2.6.38 kernel luckily has CONFIG_FTRACE=y enabled -- its /boot/config-`uname -r` doesn't even mention UPROBES :/ And since I'd like to avoid doing kernel hacking, unfortunately I cannot use this script...
(Btw, if UPROBES were available, (as far as I understand) one sets a trace probe on a symbol address (as obtained from say objdump -d), and output goes again to /sys/kernel/debug/tracing/trace - so some custom solution would have been possible using UPROBES, even without the fulltrace script)
So, to narrow down my question a bit - is there a solution, that would allow simultaneous user-space (incl. shared libraries) and kernel-space "function graph" tracing, but where UPROBES are not available in the kernel?

"cat" command killed when reading from a Linux device driver

I have an assignment in my Operating Systems class to make a simple pseudo-stack Linux device driver. So for an example, if I was to write "Hello" to the device driver, it would return "olleH" when I read from it. We have to construct a tester program in C to just call upon the read/write functions of the device driver to just demonstrate that it functions in a FILO manner. I have done all of this, and my tester program, in my opinion, demonstrates the purpose of the assignment; however, out of curiosity, inside BASH I execute the following commands:
echo "Test" > /dev/driver
cat /dev/driver
where /dev/driver is the special file I created using "mknod". However, when I do this, I get a black screen full of errors. After I swap back to the GUI view using CTRL+ALT+F7, I see that BASH has returned "Killed".
Does anyone know what could be causing this to happen? I am confused since my tester program calls open(), read(), and write() with everything functioning as it should.
If I need to show some code, just ask.
The function in your device driver that writes to the buffer you are providing it is most likely causing this issue.
To debug, you can do the following:
First, make sure the read part is fine. You can printk your internal buffer after you read from input to ensure this.
Second, in your write function, printk some information instead of actually writing anything and make sure everything is fine.
Also, make sure that function makes it clear when the data has ended. I'm not particularly sure about device drivers, but you either need to return 0 as the number of bytes written to the caller's buffer when it is called a second time, or set an eof variable (if that is one of the arguments to your function).
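The end-of-data point can be illustrated with a user-space model of a read handler for such a stack device. cat keeps calling read() until it gets 0 back, so the handler has to track the file position and return 0 once everything has been handed out. This is only a sketch: stack_read and its parameters are hypothetical, and a real driver would use loff_t and copy_to_user() instead of a plain loop.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a read handler: copies up to count bytes of the stack contents
 * into buf in last-in-first-out order, starting at offset *ppos.
 * Returns the number of bytes copied, or 0 at end of data. */
static long stack_read(const char *stack, size_t depth,
                       char *buf, size_t count, long *ppos)
{
    assert(stack && buf && ppos && *ppos >= 0);

    size_t pos = (size_t)*ppos;
    if (pos >= depth)
        return 0;                /* nothing left: this is what stops cat */

    size_t n = depth - pos;
    if (n > count)
        n = count;
    for (size_t i = 0; i < n; i++)
        buf[i] = stack[depth - 1 - pos - i];   /* newest byte first */

    *ppos += (long)n;            /* remember how far the reader has got */
    return (long)n;
}
```

If the handler ignored *ppos and returned the full length on every call, a reader like cat would never see the end of the data and would loop forever.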
