Use the Linux perf utility to report counters every second, like vmstat

Linux has the perf command-line utility for accessing hardware performance-monitoring counters; it works via the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top to record a sampling profile (a sample is taken, for example, on every 100,000th CPU clock cycle or executed instruction), and perf stat mode to report the total count of cycles/instructions for an application (or for the whole system).
Is there a mode of perf that prints a system-wide or per-CPU summary of total counts every second (or every 3, 5, 10 seconds), the way vmstat and the sysstat-family tools (iostat, mpstat, sar -n DEV ... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html) print theirs? For example, with the cycles and instructions counters I would get the mean IPC for every second of the system (or of every CPU).
Is there any non-perf tool (among those in https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) that can get such statistics via the perf_events kernel subsystem? What about system-wide per-process IPC calculation with a resolution of seconds?

perf stat has an interval-print option, -I N, where N is the interval in milliseconds; it prints counter deltas every N milliseconds (N >= 10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
Print count deltas every N milliseconds (minimum: 10 ms). The
overhead percentage could be high in some cases, for instance
with small, sub-100ms intervals. Use with caution.
example: perf stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
perf stat can also output results in machine-readable form; with -I, the first field is the timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to perf stat with -I 1000 (every second); for example, system-wide (add -A to separate the per-CPU counters, and -e cycles,instructions to get per-second IPC data):
perf stat -a -I 1000
The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:
531 static int __run_perf_stat(int argc, const char **argv)
532 {
533 int interval = stat_config.interval;
For perf stat -I 1000 with a program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to a struct timespec):
639 enable_counters();
641 if (interval) {
642 while (!waitpid(child_pid, &status, WNOHANG)) {
643 nanosleep(&ts, NULL);
644 process_interval();
645 }
646 }
666 disable_counters();
For the variant of system-wide hardware performance-counter monitoring (forks=0) there is another interval loop:
658 enable_counters();
659 while (!done) {
660 nanosleep(&ts, NULL);
661 if (interval)
662 process_interval();
663 }
666 disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347) from the same file uses read_counters(), which loops over the event list and invokes read_counter(), which loops over all known threads and all CPUs and calls the actual reading function:
306 for (thread = 0; thread < nthreads; thread++) {
307 for (cpu = 0; cpu < ncpus; cpu++) {
...
310 count = perf_counts(counter->counts, cpu, thread);
311 if (perf_evsel__read(counter, cpu, thread, count))
312 return -1;
perf_evsel__read does the actual counter read while the program is still running:
1207 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
1208 struct perf_counts_values *count)
1209 {
1210 memset(count, 0, sizeof(*count));
1211
1212 if (FD(evsel, cpu, thread) < 0)
1213 return -EINVAL;
1214
1215 if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
1216 return -errno;
1217
1218 return 0;
1219 }
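For comparison, the same read(2)-on-a-counter-fd pattern is available to any program via the perf_event_open syscall. Below is a minimal sketch (my own illustration, not perf's code) that prints per-second instruction-count deltas for the calling process, roughly what perf stat -I 1000 -e instructions does for a single process; it assumes /proc/sys/kernel/perf_event_paranoid permits the access and omits most error handling:

/* Sketch: per-second instruction deltas via the perf_events subsystem.
 * There is no glibc wrapper for perf_event_open, hence the raw syscall. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;

    /* pid = 0, cpu = -1: count this process, on whatever CPU it runs */
    int fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    uint64_t prev = 0, now = 0;
    for (int i = 0; i < 5; i++) {
        sleep(1);                                   /* the "interval" */
        if (read(fd, &now, sizeof(now)) != sizeof(now))  /* cumulative count */
            break;
        printf("instructions: %llu\n",
               (unsigned long long)(now - prev));   /* delta, like -I */
        prev = now;
    }
    close(fd);
    return 0;
}

For the system-wide counting of perf stat -a, one such descriptor would be opened per CPU (pid = -1, cpu = 0..N-1) and the per-CPU values summed.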


Reliability of Linux kernel add_timer at resolution of one jiffy?

In the code given below, there is a simple Linux kernel module (driver) which calls a function repeatedly, 10 times, using add_timer at a resolution of 1 jiffy (that is, the timer is scheduled to fire at jiffies + 1). Using the bash script rerun.sh, I then obtain timestamps from the printouts in syslog and visualize them using gnuplot.
In most cases, I get a syslog output like this:
[ 7103.055787] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[ 7103.056044] testjiffy_timer_function: runcount 1
[ 7103.060045] testjiffy_timer_function: runcount 2
[ 7103.064052] testjiffy_timer_function: runcount 3
[ 7103.068050] testjiffy_timer_function: runcount 4
[ 7103.072053] testjiffy_timer_function: runcount 5
[ 7103.076036] testjiffy_timer_function: runcount 6
[ 7103.080044] testjiffy_timer_function: runcount 7
[ 7103.084044] testjiffy_timer_function: runcount 8
[ 7103.088060] testjiffy_timer_function: runcount 9
[ 7103.092059] testjiffy_timer_function: runcount 10
[ 7104.095429] Exit testjiffy
... which results in time-series and delta-histogram plots like these:
This is, essentially, the quality of timing that I'd expect from the code.
However - every once in a while, I get a capture like:
[ 7121.377507] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[ 7121.380049] testjiffy_timer_function: runcount 1
[ 7121.384062] testjiffy_timer_function: runcount 2
[ 7121.392053] testjiffy_timer_function: runcount 3
[ 7121.396055] testjiffy_timer_function: runcount 4
[ 7121.400068] testjiffy_timer_function: runcount 5
[ 7121.404085] testjiffy_timer_function: runcount 6
[ 7121.408084] testjiffy_timer_function: runcount 7
[ 7121.412072] testjiffy_timer_function: runcount 8
[ 7121.416083] testjiffy_timer_function: runcount 9
[ 7121.420066] testjiffy_timer_function: runcount 10
[ 7122.417325] Exit testjiffy
... which results in a rendering like:
... and I'm like: "WHOOOOOAAAAAA ... wait a second..." - isn't there a pulse dropped from the sequence? Meaning that add_timer missed a slot, and then fired the function in the next 4 ms slot?
The interesting thing is that, when running these tests, I have nothing but a terminal, web browser and a text editor started up - so I cannot really see anything running that might hog the OS/kernel; and thus I really cannot see a reason why the kernel would make such a big miss (of an entire jiffy period). When I read about Linux kernel timing, e.g. "The simplest and least accurate of all timers ... is the timer API", I read that "least accurate" as: "don't expect exactly a 4 ms period" (as per this example) - and I don't; I'm fine with the variance shown in the (first) histogram; but I don't expect that a whole period will be missed!?
So my question(s) are:
Is this expected behavior from add_timer at this resolution (that a period can occasionally be missed)?
If so, is there a way to "force" add_timer to fire the function at each 4ms slot, as specified by a jiffy on this platform?
Is it possible that I get a "wrong" timestamp - e.g. the timestamp reflecting when the actual "print" to syslog happened, rather than when the function actually fired?
Note that I'm not looking for a period resolution below what corresponds to a jiffy (in this case, 4ms); nor am I looking to decrease the delta variance when the code works properly. So as I see it, I don't have "high resolution timer" demands, nor "hard real-time" demands - I just want add_timer to fire reliably. Would that be possible on this platform, without resorting to special "real-time" configurations of the kernel?
Bonus question: in rerun.sh below, you'll note two sleeps marked with MUSTHAVE; if either of them is left out/commented, the OS/kernel freezes and requires a hard reboot. And I cannot see why - is it really possible that running rmmod after insmod from bash is so fast that it conflicts with the normal process of module loading/unloading?
Platform info:
$ cat /proc/cpuinfo | grep "processor\|model name\|MHz\|cores"
processor : 0 # (same for 1)
model name : Intel(R) Atom(TM) CPU N450 @ 1.66GHz
cpu MHz : 1000.000
cpu cores : 1
$ echo $(cat /etc/issue ; uname -a)
Ubuntu 11.04 \n \l Linux mypc 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 18:00:43 UTC 2012 i686 i686 i386 GNU/Linux
$ echo $(lsb_release -a 2>/dev/null | tr '\n' ' ')
Distributor ID: Ubuntu Description: Ubuntu 11.04 Release: 11.04 Codename: natty
Code:
$ cd /tmp/testjiffy
$ ls
Makefile rerun.sh testjiffy.c
Makefile:
obj-m += testjiffy.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
testjiffy.c:
/*
 * [http://www.tldp.org/LDP/lkmpg/2.6/html/lkmpg.html#AEN189 The Linux Kernel Module Programming Guide]
 */
#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_INFO */
#include <linux/init.h>		/* Needed for the macros */
#include <linux/jiffies.h>
#include <linux/time.h>

#define MAXRUNS 10

static volatile int runcount = 0;
static struct timer_list my_timer;

static void testjiffy_timer_function(unsigned long data)
{
	int tdelay = 100;

	runcount++;
	if (runcount == 5) {
		while (tdelay > 0) { tdelay--; } // small delay
	}
	printk(KERN_INFO
		" %s: runcount %d \n",
		__func__, runcount);
	if (runcount < MAXRUNS) {
		my_timer.expires = jiffies + 1;
		add_timer(&my_timer);
	}
}

static int __init testjiffy_init(void)
{
	printk(KERN_INFO
		"Init testjiffy: %d ; HZ: %d ; 1/HZ (ms): %d\n",
		runcount, HZ, 1000/HZ);

	init_timer(&my_timer);
	my_timer.function = testjiffy_timer_function;
	//my_timer.data = (unsigned long) runcount;
	my_timer.expires = jiffies + 1;
	add_timer(&my_timer);
	return 0;
}

static void __exit testjiffy_exit(void)
{
	printk(KERN_INFO "Exit testjiffy\n");
}

module_init(testjiffy_init);
module_exit(testjiffy_exit);
MODULE_LICENSE("GPL");
rerun.sh:
#!/usr/bin/env bash
set -x
make clean
make
# blank syslog first
sudo bash -c 'echo "0" > /var/log/syslog'
sleep 1 # MUSTHAVE 01!
# reload kernel module/driver
sudo insmod ./testjiffy.ko
sleep 1 # MUSTHAVE 02!
sudo rmmod testjiffy
set +x
# copy & process syslog
max=0;
for ix in _testjiffy_*.syslog; do
	aa=${ix#_testjiffy_};
	ab=${aa%.syslog} ;
	case $ab in
		*[!0-9]*) ab=0;;         # reset if non-digit obtained; else
		*) ab=$(echo $ab | bc);; # remove leading zeroes (else octal)
	esac
	if (( $ab > $max )) ; then
		max=$((ab));
	fi;
done;
newm=$( printf "%05d" $(($max+1)) );
PLPROC='chomp $_;
if (!$p) {$p=0;}; if (!$f) {$f=$_;} else {
$a=$_-$f; $d=$a-$p;
print "$a $d\n" ; $p=$a;
};'
set -x
grep "testjiffy" /var/log/syslog | cut -d' ' -f7- > _testjiffy_${newm}.syslog
grep "testjiffy_timer_function" _testjiffy_${newm}.syslog \
| sed 's/\[\(.*\)\].*/\1/' \
| perl -ne "$PLPROC" \
> _testjiffy_${newm}.dat
set +x
cat > _testjiffy_${newm}.gp <<EOF
set terminal pngcairo font 'Arial,10' size 900,500
set output '_testjiffy_${newm}.png'
set style line 1 linetype 1 linewidth 3 pointtype 3 linecolor rgb "red"
set multiplot layout 1,2 title "_testjiffy_${newm}.syslog"
set xtics rotate by -45
set title "Time positions"
set yrange [0:1.5]
set offsets graph 50e-3, 1e-3, 0, 0
plot '_testjiffy_${newm}.dat' using 1:(1.0):xtic(gprintf("%.3se%S",\$1)) notitle with points ls 1, '_testjiffy_${newm}.dat' using 1:(1.0) with impulses ls 1
binwidth=0.05e-3
set boxwidth binwidth
bin(x,width)=width*floor(x/width) + width/2.0
set title "Delta diff histogram"
set style fill solid 0.5
set autoscale xy
set offsets graph 0.1e-3, 0.1e-3, 0.1, 0.1
plot '_testjiffy_${newm}.dat' using (bin(\$2,binwidth)):(1.0) smooth freq with boxes ls 1
unset multiplot
EOF
set -x; gnuplot _testjiffy_${newm}.gp ; set +x
EDIT: Motivated by this comment by @granquet, I tried to obtain scheduler statistics from /proc/schedstat and /proc/sched_debug, by using dd through call_usermodehelper; note that this most of the time "skips" (that is, a file due to the 7th, or 6th, or Xth run of the function would be missing); but I managed to obtain two complete runs, and posted them in https://gist.github.com/anonymous/5709699 (as I noticed gist may be preferred to pastebin on SO), given the output is kinda massive; the *_11* files log a proper run, the *_17* files log a run with a "drop".
Note I also switched to mod_timer_pinned in the module, and it doesn't help much (the gist logs were obtained with the module using this function). These are the changes in testjiffy.c:
#include <linux/kmod.h>	// usermode-helper API
...
char fcmd[] = "of=/tmp/testjiffy_sched00";
char *dd1argv[] = { "/bin/dd", "if=/proc/schedstat", "oflag=append", "conv=notrunc", &fcmd[0], NULL };
char *dd2argv[] = { "/bin/dd", "if=/proc/sched_debug", "oflag=append", "conv=notrunc", &fcmd[0], NULL };
static char *envp[] = {
	"HOME=/",
	"TERM=linux",
	"PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

static void testjiffy_timer_function(unsigned long data)
{
	int tdelay = 100;
	unsigned long tjnow;

	runcount++;
	if (runcount == 5) {
		while (tdelay > 0) { tdelay--; } // small delay
	}
	printk(KERN_INFO
		" %s: runcount %d \n",
		__func__, runcount);
	if (runcount < MAXRUNS) {
		mod_timer_pinned(&my_timer, jiffies + 1);
		tjnow = jiffies;
		printk(KERN_INFO
			" testjiffy expires: %lu - jiffies %lu => %lu / %lu\n",
			my_timer.expires, tjnow, my_timer.expires-tjnow, jiffies);
		sprintf(fcmd, "of=/tmp/testjiffy_sched%02d", runcount);
		call_usermodehelper( dd1argv[0], dd1argv, envp, UMH_NO_WAIT );
		call_usermodehelper( dd2argv[0], dd2argv, envp, UMH_NO_WAIT );
	}
}
... and this in rerun.sh:
...
set +x
for ix in /tmp/testjiffy_sched*; do
	echo $ix | tee -a _testjiffy_${newm}.sched
	cat $ix >> _testjiffy_${newm}.sched
done
set -x ; sudo rm /tmp/testjiffy_sched* ; set +x
cat > _testjiffy_${newm}.gp <<EOF
...
I'll use this post for verbose replying.
@CL.: many thanks for the answer. Good to have it confirmed that it's "possible that your timer function gets called at a later jiffy"; by logging the jiffies, I too realized that the timer function gets called at a later time - and other than that, it doesn't do anything "wrong" per se.
Good to know about the timestamps; I wonder if it is possible that the timer function hits at the right time, but the kernel preempts the kernel logging service (I believe it's klogd), so I get a delayed timestamp? However, I'm trying to create a "looped" (or rather, periodic) timer function to write to hardware, and I first noted this "drop" by realizing that the PC doesn't write data at certain intervals on the USB bus; given that the timestamps confirm that behavior, it's probably not the problem here (I guess).
I have modified the timer function so it fires relative to the scheduled time of the last timer (my_timer.expires) - again via mod_timer_pinned instead of add_timer:
static void testjiffy_timer_function(unsigned long data)
{
	int tdelay = 100;
	unsigned long tjlast;
	unsigned long tjnow;

	runcount++;
	if (runcount == 5) {
		while (tdelay > 0) { tdelay--; } // small delay
	}
	printk(KERN_INFO
		" %s: runcount %d \n",
		__func__, runcount);
	if (runcount < MAXRUNS) {
		tjlast = my_timer.expires;
		mod_timer_pinned(&my_timer, tjlast + 1);
		tjnow = jiffies;
		printk(KERN_INFO
			" testjiffy expires: %lu - jiffies %lu => %lu / %lu last: %lu\n",
			my_timer.expires, tjnow, my_timer.expires-tjnow, jiffies, tjlast);
	}
}
... and for the first few tries it works impeccably - however, eventually, I get this:
[13389.775508] Init testjiffy: 0 ; HZ: 250 ; 1/HZ (ms): 4
[13389.776051] testjiffy_timer_function: runcount 1
[13389.776063] testjiffy expires: 3272445 - jiffies 3272444 => 1 / 3272444 last: 3272444
[13389.780053] testjiffy_timer_function: runcount 2
[13389.780068] testjiffy expires: 3272446 - jiffies 3272445 => 1 / 3272445 last: 3272445
[13389.788054] testjiffy_timer_function: runcount 3
[13389.788073] testjiffy expires: 3272447 - jiffies 3272447 => 0 / 3272447 last: 3272446
[13389.788090] testjiffy_timer_function: runcount 4
[13389.788096] testjiffy expires: 3272448 - jiffies 3272447 => 1 / 3272447 last: 3272447
[13389.792070] testjiffy_timer_function: runcount 5
[13389.792091] testjiffy expires: 3272449 - jiffies 3272448 => 1 / 3272448 last: 3272448
[13389.796044] testjiffy_timer_function: runcount 6
[13389.796062] testjiffy expires: 3272450 - jiffies 3272449 => 1 / 3272449 last: 3272449
[13389.800053] testjiffy_timer_function: runcount 7
[13389.800063] testjiffy expires: 3272451 - jiffies 3272450 => 1 / 3272450 last: 3272450
[13389.804056] testjiffy_timer_function: runcount 8
[13389.804072] testjiffy expires: 3272452 - jiffies 3272451 => 1 / 3272451 last: 3272451
[13389.808045] testjiffy_timer_function: runcount 9
[13389.808057] testjiffy expires: 3272453 - jiffies 3272452 => 1 / 3272452 last: 3272452
[13389.812054] testjiffy_timer_function: runcount 10
[13390.815415] Exit testjiffy
... which renders like so:
... so basically I have a delay/"drop" at the +8 ms slot (which should be jiffy #3272446), and then two functions run at the +12 ms slot (which is jiffy #3272447); you can even see the label on the plot as "more bolded" because of it. This is better, in the sense that the "drop" sequence is now synchronous with a proper, non-drop sequence (which is as you said: "to avoid that one late timer function shifts all following timer calls") - however, I still miss a beat; and since I have to write bytes to hardware at each beat, to keep a sustained, constant transfer rate, this unfortunately doesn't help me much.
As for the other suggestion, to "use ten timers": because of my ultimate goal (writing to hardware using a periodic lo-res timer function), I thought at first it doesn't apply - but if nothing else is possible (other than doing some special real-time kernel preparations), then I'll certainly try a scheme where I have 10 (or N) timers (maybe stored in an array) which are fired periodically one after another; a sketch of that scheme follows.
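For illustration, here is a hypothetical sketch of that N-timer scheme, using the same 2.6-era timer API as the module above (the names are mine; each timer fires exactly once, and del_timer cleanup on early module exit is omitted for brevity):

#define NTIMERS 10
static struct timer_list my_timers[NTIMERS];

static void tick_function(unsigned long data)
{
	/* data carries the slot index; each timer fires once at its slot */
	printk(KERN_INFO "tick %lu at jiffies %lu\n", data, jiffies);
}

static int __init multitimer_init(void)
{
	int i;
	for (i = 0; i < NTIMERS; i++) {
		init_timer(&my_timers[i]);
		my_timers[i].function = tick_function;
		my_timers[i].data = i;
		/* absolute slots jiffies+1 ... jiffies+N, armed up-front;
		 * a late callback cannot shift the slots of the timers
		 * that come after it */
		my_timers[i].expires = jiffies + 1 + i;
		add_timer(&my_timers[i]);
	}
	return 0;
}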
EDIT: just adding leftover relevant comments:
USB transfers are either scheduled in advance (isochronous) or have no timing guarantees (asynchronous). If your device doesn't use isochronous transfers, it's badly misdesigned. – CL. Jun 5 at 10:47
Thanks for the comment, @CL. - "... scheduled in advance (isochronous) ..." cleared up a confusion I had. I'm (eventually) targeting an FT232, which only has BULK mode - and as long as the number of bytes per timer hit is low, I can actually "cheat" my way through in "streaming" data with add_timer; however, when I transfer an amount of bytes close to consuming the bandwidth, then these "misfires" start getting noticeable as drops. So I was interested in testing the limits of that, for which I need a reliably repetitive "timer" function - is there anything else I could try to have a reliable "timer"? – sdaau Jun 5 at 12:27
@sdaau Bulk transfers are not suitable for streaming. You cannot fix shortcomings in the hardware protocol by using another kind of software timer. – CL. Jun 5 at 13:50
... and as my response to @CL.: I'm aware I wouldn't be able to fix the shortcomings; I was more interested in observing them - say, if a kernel function makes a periodic USB write, I could observe the signals on a scope/analyzer, and hopefully see in what sense bulk mode is unsuitable. But first, I'd have to trust that the function can (at least somewhat) reliably repeat at a periodic rate (i.e. "generate" a clock/tick) - and I wasn't aware, until now, that I cannot really trust add_timer at jiffy resolution (as it is capable of relatively easily skipping a whole period). However, it seems that moving to Linux's high-resolution timers (hrtimer) does give me a reliable periodic function in this sense - so I guess that solves my problem (posted in my answer below).
Many thanks for all the comments and answers; they all pointed to things that must be taken into account - but given I'm somewhat of a forever noob, I still needed to do some more reading before gaining some understanding (I hope a correct one). Also, I couldn't really find anything specific about periodically "ticking" functions - so I'll post a more verbose answer here.
In brief - for a reliable periodic Linux kernel function at a resolution of a jiffy, do not use add_timer (<linux/timer.h>), as it may "drop" an entire period; use high-resolution timers (<linux/hrtimer.h>) instead. In more detail:
Is it possible that I get a "wrong" timestamp - ...?
@CL.: The timestamp in the log is the time when that string was printed to the log.
So, maybe it's possible - but it turns out, that's not the problem here:
Is this expected behavior from add_timer at this resolution (that a period can occasionally be missed)?
I guess, it turns out - yes:
If so, is there a way to "force" add_timer to fire the function at each 4ms slot, as specified by a jiffy on this platform?
... and (I guess again), it turns out - no.
Now, the reasons for this are somewhat subtle - and I hope that if I didn't get them right, someone will correct me. First of all, the first misconception I had was that "a clock is just a clock" (in the sense of: even if it is implemented as computer code) - but that is not quite correct. The kernel basically has to "queue" an "event" somewhere each time something like add_timer is used; and this request may come from anything really: from any (and all) sort(s) of driver(s), or even possibly userspace.
The problem is that this "queuing" costs - since in addition to the kernel having to handle (the equivalent of) traversing and inserting (and removing) items in an array, it also has to handle timer delays spanning several orders of magnitude (from, say, milliseconds to maybe tens of seconds); and the fact that some drivers (like, apparently, those for network protocols) queue a lot of timer events which are usually cancelled before they run - while other uses may require completely different behavior (as in my case: in a periodic function, you expect that most of the time the event will not be cancelled, and you also queue the events one by one). On top of that, the kernel needs to handle this for uniprocessor vs. SMP vs. multiprocessor platforms. Thus, there is a cost-benefit tradeoff involved in implementing timer handling in the kernel.
It turns out that the architecture around jiffies/add_timer is designed to handle the most common devices - and for them, precision at a resolution of a jiffy is not an issue; but this also means that one cannot expect a reliable timer at a resolution of a single jiffy with this method. This is also compounded by the fact that the kernel handles these "event queues" by treating them (somewhat) like interrupt service requests (IRQs); and that there are several levels of priority in IRQ handling in the kernel, where a higher-priority routine can preempt a lower-priority one (that is: interrupt and suspend a lower-priority routine, even if it is being executed at the time, and allow the higher-priority routine to go about its business). Or, as previously noted:
@granquet: timers run in soft irq context, which means they have the highest priority and they preempt everything running/runnable on the CPU ... but hardware interrupts, which are not disabled when servicing a soft irq. So you might (most probable explanation) get a hardware interrupt here and there that preempts your timer ... and thus you get an interrupt that is not serviced at the right time.
@CL.: It is indeed possible that your timer function gets called at a later jiffy than what expires was set to. Possible reasons are scheduling delays, other drivers that disable interrupts for too long (graphics and WLAN drivers are the usual culprits), or some crappy BIOS executing SMI code.
I now think so, too - I think this could be an illustration of what happens:
- jiffies changes to, say, 10000 (== 40000 ms @ 250 Hz)
- Let's say the timer function (queued by add_timer) is about to start running - but hasn't started running yet
- Let's say here the network card generates (for whatever reason) a hardware interrupt
- The hardware interrupt, having a higher priority, triggers the kernel to preempt (stop and suspend) the timer function (possibly started by now, and just a few instructions in);
- That means the kernel now has to reschedule the timer function to run at a later point - and since one only works with integer operations in the kernel, and the time resolution for this kind of event is in jiffies, the best it can do is reschedule it for jiffies+1 (10001 == 40004 ms @ 250 Hz)
- Now the kernel switches context to the IRQ service routine of the network card driver, and it goes about its business
- Let's say the IRQ service routine completes in 200 μs - that means now we're (in "absolute" terms) at 40000.2 ms - but we are also still at 10000 jiffies
- If the kernel now switched the context back to the timer function, it would have completed - without me necessarily noticing the delay;
- ... however, that will not happen, because the timer function is scheduled for the next jiffy!
- So the kernel goes about its business (possibly sleeping) for the next approx. 3.8 ms
- jiffies changes to 10001 (== 40004 ms @ 250 Hz)
- (the previously rescheduled) timer function runs - and this time completes without interruption
I haven't really done a detailed analysis to see if the sequence of events is exactly as described above; but I'm quite persuaded that it is something close - in other words, a resolution problem - especially since the high-resolution timer approach seems not to show this behavior. It would be great indeed to obtain a scheduler log and know exactly what happened to cause a preempt - but I doubt the roundtrip to userspace, which I attempted in the OP edit in response to @granquet's comment, is the right thing to do.
In any case, going back to this:
Note that I'm not looking for a period resolution below what corresponds to a jiffy (in this case, 4ms); nor am I looking to decrease the delta variance when the code works properly. So as I see it, I don't have "high resolution timer" demands, nor "hard real-time" demands ...
... here was a bad mistake I made - as the analysis above shows, I did have "high resolution" demands! And had I realized that earlier, I might have found relevant reading sooner. Anyway, some relevant docs - even if they don't discuss periodic functions specifically - for me, were:
LDD3: 5.3. Semaphores and Mutexes - (in describing a driver with different demands from here): "no accesses will be made from interrupt handlers or other asynchronous contexts. There are no particular latency (response time) requirements; application programmers understand that I/O requests are not usually satisfied immediately"
Documentation/timers/hrtimers.txt - "The timers.c code is very "tightly coded" around jiffies and 32-bitness assumptions, and has been honed and micro-optimized for a relatively narrow use case (jiffies in a relatively narrow HZ range) for many years - and thus even small extensions to it easily break the wheel concept"
T. Gleixner, D. Niehaus Hrtimers and Beyond: Transforming the Linux Time Subsystems (pdf) - (most detailed, see also diagrams inside) "The Cascading Timer Wheel (CTW), which was implemented in 1997, replaced the original time ordered double linked list to resolve the scalability problem of the linked list's O(N) insertion time... The current approach to timer management in Linux does a good job of satisfying an extremely wide range of requirements, but it cannot provide the quality of service required in some cases precisely because it must satisfy such a wide range of requirements... The timeout related timers are kept in the existing timer wheel and a new subsystem optimized for (high resolution) timer requirements hrtimers was implemented. hrtimers are entirely based on human time (units: nanoseconds)... They are kept in a time sorted, per-CPU list, implemented as a red-black tree."
The high-resolution timer API [LWN.net] - "At its core, the hrtimer mechanism remains the same. Rather than using the "timer wheel" data structure, hrtimers live on a time-sorted linked list, with the next timer to expire being at the head of the list. A separate red/black tree is also used to enable the insertion and removal of timer events without scanning through the list. But while the core remains the same, just about everything else has changed, at least superficially."
Software interrupts and realtime [LWN.net] - "The softirq mechanism is meant to handle processing that is almost — but not quite — as important as the handling of hardware interrupts. Softirqs run at a high priority (though with an interesting exception, described below), but with hardware interrupts enabled. They thus will normally preempt any work except the response to a "real" hardware interrupt... Starting with the 3.0 realtime patch set, though, that capability went away... In response, in 3.6.1-rt1, the handling of softirqs has changed again."
High- (but not too high-) resolution timeouts [LWN.net] - "poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution. They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might program a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. ... It's a useful feature, but it comes at the cost of some significant API changes."
One thing clear from the quotes is that high-resolution timing facilities are still under active development (with API changes) in the kernel - and I was afraid that maybe I'd have to install a special "real-time patch" kernel. Thankfully, high-resolution timers are seemingly available (and working) in my 2.6.38-16 SMP kernel without any special changes. Below is the listing of the modified testjiffy.c kernel module, which now uses high-resolution timers, but otherwise keeps the same period as determined by jiffies. For testing, I made it loop 200 times (instead of 10 in the OP); and running the rerun.sh script some 20-30 times, this is the worst result I got:
The time sequence is now obviously unreadable, but the histogram can still tell us this: taking 0.00435-0.004 (= 0.004-0.00365) = 350 μs as the max deviation, it represents only 100*(350/4000) = 8.75% of the expected period; which I certainly don't have a problem with. Additionally, I never got a drop (or correspondingly, an entire 2*period = 8 ms delay), nor a 0 ms delay - the captures I got are otherwise of the quality shown in the first image in the OP. Now, of course I could run a longer test and see more precisely how reliable it is - but this is all the reliability I'd expect/need to see for this simple case; contrast that with the OP, where I'd get a drop in just 10 loops with the probability of a coin toss - every second or third run of the rerun.sh script, I'd get a drop - even in the context of low OS resource usage!
Finally, note that the source below should have the problem spotted by @CL. ("Your module is buggy: you must ensure that the timer is not pending before the module is unloaded") fixed, in the context of hrtimer. This seemingly answers my bonus question, as it obviates the need for either of the "MUSTHAVE" sleeps in the rerun.sh script. However, note that as 200 loops @ 4 ms take 0.8 s, the sleep between insmod and rmmod is needed if we want a full 200-tick capture (otherwise, on my machine, I get only some 7 ticks captured).
Well, hope I got this right now (at least most of it) - if not, corrections are welcome :)
testjiffy(-hr).c
#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_INFO */
#include <linux/init.h>		/* Needed for the macros */
#include <linux/jiffies.h>
#include <linux/time.h>
#include <linux/hrtimer.h>

#define MAXRUNS 200

static volatile int runcount = 0;
//~ static struct timer_list my_timer;
static unsigned long period_ms;
static unsigned long period_ns;
static ktime_t ktime_period_ns;
static struct hrtimer my_hrtimer;

//~ static void testjiffy_timer_function(unsigned long data)
static enum hrtimer_restart testjiffy_timer_function(struct hrtimer *timer)
{
	int tdelay = 100;
	unsigned long tjnow;
	ktime_t kt_now;
	int ret_overrun;

	runcount++;
	if (runcount == 5) {
		while (tdelay > 0) { tdelay--; } // small delay
	}
	printk(KERN_INFO
		" %s: runcount %d \n",
		__func__, runcount);
	if (runcount < MAXRUNS) {
		tjnow = jiffies;
		kt_now = hrtimer_cb_get_time(&my_hrtimer);
		ret_overrun = hrtimer_forward(&my_hrtimer, kt_now, ktime_period_ns);
		printk(KERN_INFO
			" testjiffy jiffies %lu ; ret: %d ; ktnsec: %lld \n",
			tjnow, ret_overrun, ktime_to_ns(kt_now));
		return HRTIMER_RESTART;
	}
	else return HRTIMER_NORESTART;
}

static int __init testjiffy_init(void)
{
	struct timespec tp_hr_res;
	period_ms = 1000/HZ;
	hrtimer_get_res(CLOCK_MONOTONIC, &tp_hr_res);

	printk(KERN_INFO
		"Init testjiffy: %d ; HZ: %d ; 1/HZ (ms): %ld ; hrres: %lld.%.9ld\n",
		runcount, HZ, period_ms, (long long)tp_hr_res.tv_sec, tp_hr_res.tv_nsec );

	hrtimer_init(&my_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	my_hrtimer.function = &testjiffy_timer_function;
	period_ns = period_ms*( (unsigned long)1E6L );
	ktime_period_ns = ktime_set(0,period_ns);
	hrtimer_start(&my_hrtimer, ktime_period_ns, HRTIMER_MODE_REL);
	return 0;
}

static void __exit testjiffy_exit(void)
{
	int ret_cancel = 0;

	while( hrtimer_callback_running(&my_hrtimer) ) {
		ret_cancel++;
	}
	if (ret_cancel != 0) {
		printk(KERN_INFO " testjiffy Waited for hrtimer callback to finish (%d)\n", ret_cancel);
	}
	if (hrtimer_active(&my_hrtimer) != 0) {
		ret_cancel = hrtimer_cancel(&my_hrtimer);
		printk(KERN_INFO " testjiffy active hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
	}
	if (hrtimer_is_queued(&my_hrtimer) != 0) {
		ret_cancel = hrtimer_cancel(&my_hrtimer);
		printk(KERN_INFO " testjiffy queued hrtimer cancelled: %d (%d)\n", ret_cancel, runcount);
	}
	printk(KERN_INFO "Exit testjiffy\n");
}

module_init(testjiffy_init);
module_exit(testjiffy_exit);
MODULE_LICENSE("GPL");
It is indeed possible that your timer function gets called at a later jiffy than what expires was set to.
Possible reasons are scheduling delays, other drivers that disable interrupts for too long (graphics and WLAN drivers are the usual culprits), or some crappy BIOS executing SMI code.
If you want to avoid that one late timer function shifts all following timer calls, you have to schedule the respective next timer not relative to the current time (jiffies), but relative to the scheduled time of the last timer (my_timer.expires).
Alternatively, use ten timers that you all schedule at the beginning at jiffies + 1, 2, 3, …
The timestamp in the log is the time when that string was printed to the log.
Your module is buggy: you must ensure that the timer is not pending before the module is unloaded.

How do I get the total CPU usage of an application from /proc/pid/stat?

I was wondering how to calculate the total CPU usage of a process.
If I do cat /proc/pid/stat, I think the relevant fields are (taken from lindevdoc.org):
14. utime - CPU time spent in user code, measured in jiffies
15. stime - CPU time spent in kernel code, measured in jiffies
16. cutime - CPU time spent in user code, including time from children
17. cstime - CPU time spent in kernel code, including time from children
So is the total time spent the sum of fields 14 to 17?
Preparation
To calculate CPU usage for a specific process you'll need the following:
/proc/uptime
#1 uptime of the system (seconds)
/proc/[PID]/stat
#14 utime - CPU time spent in user code, measured in clock ticks
#15 stime - CPU time spent in kernel code, measured in clock ticks
#16 cutime - Waited-for children's CPU time spent in user code (in clock ticks)
#17 cstime - Waited-for children's CPU time spent in kernel code (in clock ticks)
#22 starttime - Time when the process started, measured in clock ticks
Hertz (number of clock ticks per second) of your system.
In most cases, getconf CLK_TCK can be used to return the number of clock ticks.
The sysconf(_SC_CLK_TCK) C function call may also be used to return the hertz value.
Calculation
First we determine the total time spent for the process:
total_time = utime + stime
We also have to decide whether we want to include the time from children processes. If we do, then we add those values to total_time:
total_time = total_time + cutime + cstime
Next we get the total elapsed time in seconds since the process started:
seconds = uptime - (starttime / Hertz)
Finally we calculate the CPU usage percentage:
cpu_usage = 100 * ((total_time / Hertz) / seconds)
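Putting the whole calculation together, here is a minimal C sketch (field positions per proc(5); it assumes the process's comm field contains no spaces, and skips error handling):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long pid = (argc > 1) ? atol(argv[1]) : getpid();
    long hertz = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    char path[64];
    double uptime;
    unsigned long utime, stime;
    long cutime, cstime;
    unsigned long long starttime;
    FILE *fp;

    fp = fopen("/proc/uptime", "r");     /* #1: system uptime, seconds */
    fscanf(fp, "%lf", &uptime);
    fclose(fp);

    snprintf(path, sizeof(path), "/proc/%ld/stat", pid);
    fp = fopen(path, "r");
    /* skip fields 1-13, read 14-17 (utime stime cutime cstime),
     * skip 18-21, read 22 (starttime) */
    fscanf(fp, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
               " %lu %lu %ld %ld %*d %*d %*d %*d %llu",
           &utime, &stime, &cutime, &cstime, &starttime);
    fclose(fp);

    double total_time = utime + stime + cutime + cstime;  /* incl. children */
    double seconds = uptime - (double)starttime / hertz;
    printf("cpu_usage: %.2f%%\n", 100.0 * (total_time / hertz) / seconds);
    return 0;
}

Usage: ./a.out <PID> prints the average CPU percentage of that process (children included) since it started.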
See also
Top and ps not showing the same cpu result
How to get total cpu usage in Linux (c++)
Calculating CPU usage of a process in Linux
Yes, you can say so. You can convert those values into seconds using the formula:
sec = jiffies / HZ ; here HZ = number of ticks per second
The HZ value is configurable - it is set at kernel configuration time.
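From userspace, the safe divisor is the clock-tick rate the kernel actually exports, queried rather than assumed; for example (a one-line C sketch, where ticks is a value read from /proc/[pid]/stat):

double sec = (double)ticks / sysconf(_SC_CLK_TCK);  /* needs <unistd.h> */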
Here is my simple solution written in BASH. It is a Linux/Unix system monitor and process manager built on procfs, like "top" or "ps". There are two versions: a simple monochrome one (fast) and a colored version (a little slower, but useful especially for monitoring the state of processes). Sorting is by CPU usage.
https://github.com/AraKhachatryan/top
utime, stime, cutime, cstime and starttime are used for getting CPU usage and are obtained from the /proc/[pid]/stat file.
The state, ppid, priority, nice and num_threads parameters are also obtained from the /proc/[pid]/stat file.
The resident and data_and_stack parameters are used for getting memory usage and are obtained from the /proc/[pid]/statm file.
function my_ps
{
	pid_array=`ls /proc | grep -E '^[0-9]+$'`
	clock_ticks=$(getconf CLK_TCK)
	total_memory=$( grep -Po '(?<=MemTotal:\s{8})(\d+)' /proc/meminfo )

	cat /dev/null > .data.ps

	for pid in $pid_array
	do
		if [ -r /proc/$pid/stat ]
		then
			stat_array=( `sed -E 's/(\([^\s)]+)\s([^)]+\))/\1_\2/g' /proc/$pid/stat` )
			uptime_array=( `cat /proc/uptime` )
			statm_array=( `cat /proc/$pid/statm` )
			comm=( `grep -Po '^[^\s\/]+' /proc/$pid/comm` )
			user_id=$( grep -Po '(?<=Uid:\s)(\d+)' /proc/$pid/status )
			user=$( id -nu $user_id )

			uptime=${uptime_array[0]}
			state=${stat_array[2]}
			ppid=${stat_array[3]}
			priority=${stat_array[17]}
			nice=${stat_array[18]}
			utime=${stat_array[13]}
			stime=${stat_array[14]}
			cutime=${stat_array[15]}
			cstime=${stat_array[16]}
			num_threads=${stat_array[19]}
			starttime=${stat_array[21]}

			total_time=$(( $utime + $stime ))
			# add $cstime - children's CPU time spent in kernel code
			# (can also add $cutime - children's CPU time spent in user code)
			total_time=$(( $total_time + $cstime ))
			seconds=$( awk 'BEGIN {print ( '$uptime' - ('$starttime' / '$clock_ticks') )}' )
			cpu_usage=$( awk 'BEGIN {print ( 100 * (('$total_time' / '$clock_ticks') / '$seconds') )}' )

			resident=${statm_array[1]}
			data_and_stack=${statm_array[5]}
			memory_usage=$( awk 'BEGIN {print( (('$resident' + '$data_and_stack' ) * 100) / '$total_memory' )}' )

			printf "%-6d %-6d %-10s %-4d %-5d %-4s %-4u %-7.2f %-7.2f %-18s\n" $pid $ppid $user $priority $nice $state $num_threads $memory_usage $cpu_usage $comm >> .data.ps
		fi
	done

	clear
	printf "\e[30;107m%-6s %-6s %-10s %-4s %-3s %-6s %-4s %-7s %-7s %-18s\e[0m\n" "PID" "PPID" "USER" "PR" "NI" "STATE" "THR" "%MEM" "%CPU" "COMMAND"
	sort -nr -k9 .data.ps | head -$1
	read_options
}
If you need to calculate how much CPU% a process used in the last 10 seconds:
get total_time (fields 13+14) in jiffies => t1
and starttime (field 22) in jiffies => s1
-- delay of 10 secs --
total_time (fields 13+14) in jiffies => t2
starttime (field 22) in jiffies => s2
Wouldn't (t2 - t1) * 100 / (s2 - s1) give the %?
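One catch: starttime (field 22) is set once, when the process starts, so s2 - s1 above is always zero; the divisor has to be the elapsed wall-clock time in ticks instead. A minimal C sketch of the windowed measurement under that correction (utime/stime are fields 14 and 15 in proc(5)'s 1-based numbering; comm is assumed to contain no spaces):

#include <stdio.h>
#include <unistd.h>

/* cumulative CPU ticks (utime + stime) of a PID */
static unsigned long cpu_ticks(long pid)
{
    char path[64];
    unsigned long utime, stime;
    snprintf(path, sizeof(path), "/proc/%ld/stat", pid);
    FILE *fp = fopen(path, "r");
    if (!fp) return 0;
    fscanf(fp, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
           &utime, &stime);               /* fields 14 and 15 */
    fclose(fp);
    return utime + stime;
}

int main(void)
{
    long hertz = sysconf(_SC_CLK_TCK);
    unsigned long t1 = cpu_ticks(1);      /* PID 1 as an example */
    sleep(10);                            /* the measurement window */
    unsigned long t2 = cpu_ticks(1);
    /* delta CPU ticks over elapsed wall-clock ticks */
    printf("%.2f%%\n", 100.0 * (t2 - t1) / (double)(hertz * 10));
    return 0;
}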
Here is another way that I got my app's CPU usage. I did this in Android; it makes a call to top and gets the CPU usage for your app's PID from what top returns.
// (needs: java.io.BufferedReader, FileReader, InputStreamReader, IOException)
public void myWonderfulApp()
{
    // Some wonderfully written code here
    Integer lMyProcessID = android.os.Process.myPid();
    int lMyCPUUsage = getAppCPUUsage( lMyProcessID );
    // More magic
}

// Alternate way that I switched to. I found the first version was slower;
// this version only returns a single line for the app, so far less parsing
// and processing.
public static float getTotalCPUUsage2()
{
    try
    {
        // read global stats file for total CPU
        BufferedReader reader = new BufferedReader(new FileReader("/proc/stat"));
        String[] sa = reader.readLine().split("[ ]+", 9);
        long work = Long.parseLong(sa[1]) + Long.parseLong(sa[2]) + Long.parseLong(sa[3]);
        long total = work + Long.parseLong(sa[4]) + Long.parseLong(sa[5]) + Long.parseLong(sa[6]) + Long.parseLong(sa[7]);
        reader.close();

        // calculate and convert to percentage
        return restrictPercentage(work * 100 / (float) total);
    }
    catch (Exception ex)
    {
        Logger.e(Constants.TAG, "Unable to get Total CPU usage");
    }

    // if there was an issue, just return 0
    return 0;
}

// This is an alternate way, but it takes the entire output of
// top, so there is a fair bit of parsing.
public static int getAppCPUUsage( Integer aAppPID )
{
    int lReturn = 0;

    // make sure a valid pid was passed
    if ( null == aAppPID || aAppPID <= 0 )
    {
        return lReturn;
    }

    try
    {
        // Make a call to top so we have all the processes' CPU
        Process lTopProcess = Runtime.getRuntime().exec("top");
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(lTopProcess.getInputStream()));

        String lLine;
        // While we have stuff to read and we have not found our PID, process the lines
        while ( (lLine = bufferedReader.readLine()) != null )
        {
            // Split on 4, the CPU % is the 3rd field.
            // NOTE: We trim because sometimes we had the first field in the split be a "".
            String[] lSplit = lLine.trim().split("[ ]+", 4);

            // Don't even bother if we don't have at least the 4
            if ( lSplit.length > 3 )
            {
                // Make sure we can handle if we can't parse the int
                try
                {
                    // On the line that is our process, field 0 is a PID
                    Integer lCurrentPID = Integer.parseInt(lSplit[0]);

                    // Did we find our process?
                    if (aAppPID.equals(lCurrentPID))
                    {
                        // This is us, strip off the % and return it
                        String lCPU = lSplit[2].replace("%", "");
                        lReturn = Integer.parseInt(lCPU);
                        break;
                    }
                }
                catch( NumberFormatException e )
                {
                    // No op. We expect this when it's not a PID line
                }
            }
        }
        bufferedReader.close();
        lTopProcess.destroy();  // Cleanup the process, otherwise you make a nice hand warmer out of your device
    }
    catch( IOException ex )
    {
        // Log bad stuff happened
    }
    catch (Exception ex)
    {
        // Log bad stuff happened
    }

    // if there was an issue, just return 0
    return lReturn;
}
Here's what you're looking for:
//USER_HZ detection, from openssl code
#ifndef HZ
# if defined(_SC_CLK_TCK) \
&& (!defined(OPENSSL_SYS_VMS) || __CTRL_VER >= 70000000)
# define HZ ((double)sysconf(_SC_CLK_TCK))
# else
# ifndef CLK_TCK
# ifndef _BSD_CLK_TCK_ /* FreeBSD hack */
# define HZ 100.0
# else /* _BSD_CLK_TCK_ */
# define HZ ((double)_BSD_CLK_TCK_)
# endif
# else /* CLK_TCK */
# define HZ ((double)CLK_TCK)
# endif
# endif
#endif
This code is actually from cpulimit, but uses openssl snippets.

How to calculate CPU utilization of a process & all its child processes in Linux?

I want to know the CPU utilization of a process and all the child processes, for a fixed period of time, in Linux.
To be more specific, here is my use-case:
There is a process which waits for requests from users to execute programs. To execute the programs, this process invokes child processes (with a maximum limit of 5 at a time), and each of these child processes executes 1 of the submitted programs (let's say the user submitted 15 programs at once). So, if the user submits 15 programs, then 3 batches of 5 child processes each will run. Child processes are killed as soon as they finish executing their program.
I want to know the % CPU utilization of the parent process and all its child processes during the execution of those 15 programs.
Is there any simple way to do this using top or another command? (Or any tool I should attach to the parent process.)
You can find this information in /proc/PID/stat, where PID is your parent process's process ID. Assuming that the parent process waits for its children, the total CPU usage can be calculated from utime, stime, cutime and cstime:
utime %lu
Amount of time that this process has been scheduled in user mode,
measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This includes
guest time, guest_time (time spent running a virtual CPU, see below),
so that applications that are not aware of the guest time field do not
lose that time from their calculations.
stime %lu
Amount of time that this process has been scheduled in kernel mode,
measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
cutime %ld
Amount of time that this process's waited-for children have been
scheduled in user mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK)). (See also times(2).) This includes guest time,
cguest_time (time spent running a virtual CPU, see below).
cstime %ld
Amount of time that this process's waited-for children have been
scheduled in kernel mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK)).
And of course you can do it the hardcore way, using good old C:
find_cpu.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_CHILDREN 100

/**
 * System command execution output
 * @param command - system command to execute
 * @return execution output
 */
char *system_output (const char *command)
{
	FILE *pipe;
	static char out[1000];
	pipe = popen (command, "r");
	fgets (out, sizeof(out), pipe);
	pclose (pipe);
	return out;
}

/**
 * Finding all of a process's children
 * @param pid - process ID
 * @param children - array of child PIDs
 */
void find_children (int pid, int children[])
{
	char empty_command[] = "/bin/ps h -o pid --ppid ";
	char pid_string[16];

	snprintf(pid_string, sizeof(pid_string), "%d", pid);

	char *command = (char*) malloc(strlen(empty_command) + strlen(pid_string) + 1);
	sprintf(command, "%s%s", empty_command, pid_string);

	FILE *fp = popen(command, "r");
	int child_pid, i = 1;
	while (fscanf(fp, "%i", &child_pid) != EOF)
	{
		children[i] = child_pid;
		i++;
	}
	pclose(fp);
	free(command);
}

/**
 * Parsing `ps` command output
 * @param out - ps command output
 * @return cpu utilization
 */
float parse_cpu_utilization (const char *out)
{
	float cpu;
	sscanf (out, "%f", &cpu);
	return cpu;
}

int main(void)
{
	unsigned pid = 1;

	// getting array with process children
	int process_children[MAX_CHILDREN] = { 0 };
	process_children[0] = pid; // parent PID as first element
	find_children(pid, process_children);

	// calculating summary processor utilization
	unsigned i;
	float common_cpu_usage = 0.0;
	for (i = 0; i < sizeof(process_children)/sizeof(int); ++i)
	{
		if (process_children[i] > 0)
		{
			char command[64];
			snprintf (command, sizeof(command), "/bin/ps -p %i -o 'pcpu' --no-headers", process_children[i]);
			common_cpu_usage += parse_cpu_utilization(system_output(command));
		}
	}

	printf("%f\n", common_cpu_usage);
	return 0;
}
Compile:
gcc -Wall -pedantic --std=gnu99 find_cpu.c
Enjoy!
It might not be the exact command, but you can do something like the below to get the CPU usage of various processes and add them up:
#ps -C sendmail,firefox -o pcpu= | awk '{s+=$1} END {print s}'
/proc/[pid]/stat - status information about the process; this is what ps reads and turns into human-readable form.
Another way is to use cgroups and use cpuacct.
http://www.kernel.org/doc/Documentation/cgroups/cpuacct.txt
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpuacct.html
Here's a one-liner to compute the total CPU usage of all processes. You can adjust it by passing a column filter into the top output:
top -b -d 5 -n 2 | awk '$1 == "PID" {block_num++; next} block_num == 2 {sum += $9;} END {print sum}'

Accurately Calculating CPU Utilization in Linux using /proc/stat

There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for sleep() */

int main(void)
{
    long double a[7], b[7], loadavg;
    FILE *fp;

    for (;;)
    {
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &a[0], &a[1], &a[2], &a[3], &a[4], &a[5], &a[6]);
        fclose(fp);
        sleep(1);
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &b[6]);
        fclose(fp);

        /* busy (all but idle) over total, across all 7 fields */
        loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
                / ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
        printf("The current CPU utilization is : %Lf\n", loadavg);
    }
    return 0;
}
I think iowait/irq/softirq are not counted in one of the first 4 numbers. You can see the comment for irqtime_account_process_tick in the kernel code for more detail:
(for Linux kernel 4.1.1)
2815 * Tick demultiplexing follows the order
2816 * - pending hardirq update <-- this is irq
2817 * - pending softirq update <-- this is softirq
2818 * - user_time
2819 * - idle_time <-- iowait is included in here, discussed below
2820 * - system time
2821 * - check for guest_time
2822 * - else account as system_time
For the idle time handling, see account_idle_time function:
2772 /*
2773 * Account for idle time.
2774 * #cputime: the cpu time spent in idle wait
2775 */
2776 void account_idle_time(cputime_t cputime)
2777 {
2778 u64 *cpustat = kcpustat_this_cpu->cpustat;
2779 struct rq *rq = this_rq();
2780
2781 if (atomic_read(&rq->nr_iowait) > 0)
2782 cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
2783 else
2784 cpustat[CPUTIME_IDLE] += (__force u64) cputime;
2785 }
If the CPU is idle AND there is some IO pending, the time is counted in CPUTIME_IOWAIT. Otherwise, it is counted in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for the CPU, because it was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle", because the CPU was not doing anything but waiting for a pending IO to complete.
From busybox, its top magic is:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;

if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
	return 0;
ret = sscanf(line_buf, fmt,
	&p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
	&p_jif->iowait, &p_jif->irq, &p_jif->softirq,
	&p_jif->steal);
if (ret >= 4) {
	p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
		+ p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
	/* procps 2.x does not count iowait as busy time */
	p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}
