Profile a program's speed on Linux

I have a couple of variants of a program that I want to compare for performance. Both perform essentially the same task.
One does it all in C and in memory. The other calls an external utility and does file I/O.
How do I reliably compare them?
1) Getting "time on CPU" using time favors the second variant, because it calls system() and does I/O: even if I add "system" time to "user" time, it still won't account for the time spent blocked in wait().
2) I can't just clock them, because they run on a server and can be pushed off the CPU at any time. Averaging across thousands of runs is a weak option, since I have no idea how my server is utilized; it's a VM on a cluster, and it's complicated.
3) Profilers don't help, since they report time spent in my code, which again favors the version that calls system().
I need to add up all the CPU time these programs consume, including user, kernel, I/O, and the children's, recursively.
I expected this to be a common problem, but I still can't seem to find a solution.
(Solved with times() - see below. Thanks, everybody.)

If I've understood correctly, typing "time myapplication" at a bash command line is not what you are looking for.
If you want accuracy, you must use a profiler... You have the source, yes?
Try something like OProfile or Valgrind, or take a look at this for a more extended list.
If you don't have the source, honestly I don't know...

/usr/bin/time (not the built-in time in bash) can give some interesting stats.
$ /usr/bin/time -v xeyes
Command being timed: "xeyes"
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.57
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 517
Voluntary context switches: 243
Involuntary context switches: 0
Swaps: 0
File system inputs: 1072
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Run them a thousand times, measure actual time taken, then average the results. That should smooth out any variances due to other applications running on your server.
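A minimal sketch of such a harness (my own illustration; the program name ./myprogram is a placeholder): time each run with clock_gettime() and average.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch: run a (hypothetical) program under test n times, measure
   wall-clock time per run with CLOCK_MONOTONIC, and print the average. */
int main(void)
{
    const int n = 1000;
    struct timespec t0, t1;
    double total = 0.0;

    for (int i = 0; i < n; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./myprogram") == -1) {  /* program name is illustrative */
            perror("system");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total += (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    printf("average wall time: %.6f s\n", total / n);
    return 0;
}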

I seem to have found it at last.
NAME
    times - get process times
SYNOPSIS
    #include <sys/times.h>
    clock_t times(struct tms *buf);
DESCRIPTION
    times() stores the current process times in the struct tms that buf
    points to. The struct tms is as defined in <sys/times.h>:
        struct tms {
            clock_t tms_utime;  /* user time */
            clock_t tms_stime;  /* system time */
            clock_t tms_cutime; /* user time of children */
            clock_t tms_cstime; /* system time of children */
        };
The children's times are a recursive sum of all waited-for children.
I wonder why it hasn't been made a standard CLI utility yet. Or maybe I'm just ignorant.
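For completeness, a minimal sketch of such a CLI wrapper built on times() (my own illustration; the wrapper itself is hypothetical, not part of the man page):

#include <stdio.h>
#include <stdlib.h>
#include <sys/times.h>
#include <unistd.h>

/* Sketch: run a command via system(), then report the CPU time consumed
   by this process and, recursively, by all waited-for children. */
int main(int argc, char *argv[])
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);  /* clock ticks per second */

    if (argc < 2) {
        fprintf(stderr, "usage: %s <command>\n", argv[0]);
        return 1;
    }
    system(argv[1]);  /* child CPU time is accumulated once it is waited for */

    if (times(&t) == (clock_t)-1) {
        perror("times");
        return 1;
    }
    double self  = (double)(t.tms_utime  + t.tms_stime)  / ticks;
    double child = (double)(t.tms_cutime + t.tms_cstime) / ticks;
    printf("self:     %.3f s CPU\n", self);
    printf("children: %.3f s CPU\n", child);
    printf("total:    %.3f s CPU\n", self + child);
    return 0;
}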

I'd probably lean towards adding "time -o somefile" to the front of the system command, and then adding that to the time reported for your main program to get a total. If I had to do this lots of times, I'd find a way to take the two time outputs and add them up on screen (using awk or shell or perl or something).

Related

Unexpected periodic behaviour of ultra-low-latency hard real-time multi-threaded x86 code

I am running code in a loop for many iterations on a dedicated CPU with RT priority, and I want to observe its behaviour over a long time. I found a very strange periodic behaviour in the code.
Briefly, this is what the code does:
Arraythread
{
    while (1)
    {
        if (flag)
        {
            multiply matrix;
            record time;    /* T1 */
            reset flag;
        }
    }
}
mainthread
{
    for (30 mins)
    {
        set flag;
        record time;        /* T0 */
        busy-wait (500 μs);
    }
}
Here are the details about the machine I am using:
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz
L1 cache: 32K d and 32K i
L2 cache: 1024K
L3 cache: 28160K
Kernel: 3.10.0-693.2.2.rt56.623.el7.x86_64 #1 SMP PREEMPT RT
OS: CentOS
Current active profile: latency-performance
I modified the global limit of Linux real time scheduling (sched_rt_runtime_us) from 95% to 100%
Both of the above-mentioned threads are each bound to a single NUMA node, with priority 99
More details about the code:
mainthread sets a flag every 500 μs. I used CLOCK_MONOTONIC_RAW with the clock_gettime() function to read the time (let's say T0).
I put all the variables in a structure to reduce cache misses.
Arraythread runs a busy while loop, waiting for the flag to be set.
Once the flag is set, it multiplies two big arrays.
Once the multiplication is done, it resets the flag and records the time (let's say T1).
I run this experiment for 30 mins (= 3,600,000 iterations).
I measure the time difference T1-T0 once the experiment is over.
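For concreteness, the 500 μs pacing described above might look like this sketch (my paraphrase, not the poster's code; the full source is linked below):

#include <time.h>

/* Sketch: busy-wait until 500 μs have elapsed on CLOCK_MONOTONIC_RAW,
   the pacing pattern described for mainthread above. */
static void busy_wait_500us(void)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC_RAW, &now);
    } while ((long long)(now.tv_sec - start.tv_sec) * 1000000000LL
             + (now.tv_nsec - start.tv_nsec) < 500000LL);
}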
Here is the clock:
The average time of the clock is ~500.5 microseconds. There are fluctuations, which are expected.
Here is the time taken by the array multiplication:
This is the full 30 minute view of the result.
There are four peaks in the results. The first peak is expected, since the data comes from main memory for the very first time and the CPU was idle.
Apart from the first peak, there are three more peaks; the time difference between peak_3 and peak_2 is 11.99364 mins, while the time difference between peak_4 and peak_3 is 11.99358 mins. (I assumed the clock to be 500 μs.)
If I zoom it further:
This image shows what happened over 5 minutes.
If I zoom it further:
This image shows what happened over ~1.25 mins.
Notice that the average time of the multiplication is around 113 μs, and there are peaks everywhere.
If I zoom it further:
This image shows what happened over 20 seconds.
If I zoom it further:
This image shows what happened over 3.5 seconds.
The time differences between the starting points of these peaks are: 910 ms, 910 ms, 902 ms (assuming two consecutive points are 500 μs apart).
If I zoom it further:
This image shows what happened over 500 ms
~112.6 μs is the average time here, and the complete data lies within a 1 μs range.
Here are my questions:
Given that the L3 cache is big enough to store the complete executable, that there is no file read/write, that nothing else is running on the machine, and that no context switches are happening either, why do some of the executions take almost double (or sometimes more than double) the time? [see the peaks in the first result image]
Setting aside those four peaks from the first image, how do I explain the periodic peaks in the results, which come at an almost constant time difference? What is the CPU doing? These periodic peaks last a few milliseconds.
I expect the results to be nearly constant, as in the last image. Are there OS/CPU settings I can apply so that the code runs like the last image indefinitely?
Here is the complete code:
https://github.com/sghoslya/kite/blob/main/multiThreadProfCheckArray.c

How "real-time" are the FIFO/RR schedulers on non-RT Linux kernel?

Say on a non-RT Linux kernel (4.14, Angstrom distro, running on an iMX6) I have a program that receives UDP packets (< 1400 bytes) arriving at a very steady data rate. Basically,
the essence of the program is:
while (true)
{
    recv(sockFd, ...);
    update_loop_interval_histogram();  // O(1)
}
To minimize the maximum occurring delay time (loop intervals), I started my process with:
chrt --fifo 99 ./programName
setting the scheduler to the "real-time" SCHED_FIFO mode with the highest priority.
The CPU affinity of my process is fixed to the 2nd core.
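For reference, the same setup can be done from inside the process; a minimal sketch (my own illustration, assumed equivalent to chrt --fifo 99 plus pinning to core 1):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: pin the calling process to the 2nd core (CPU 1) and switch it
   to SCHED_FIFO at priority 99. Requires root or CAP_SYS_NICE. */
static int go_realtime(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                       /* 2nd core */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    struct sched_param sp = { .sched_priority = 99 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

int main(void)
{
    if (go_realtime() != 0)
        return 1;
    /* ... receive loop would go here ... */
    return 0;
}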
In addition, I ran one benchmark program instance per core, deliberately driving the CPU load to 100%.
That way, I get a maximum loop interval of ~10 ms (vs. ~25 ms without SCHED_FIFO). This occurrence is rare: during e.g. an hour of runtime, the ratio of the count of all intervals < 400 µs to the count of all other interval times from 400 µs to 10000 µs is over 1.5 million.
But as rare as it is, it's still bad.
Is that the best one can reliably get on a non-real-time Linux kernel, or are there further tweaks to be made to get down to something like a 5 ms maximum interval time?

PEBS records far fewer memory-access samples than are actually present

I have been trying to log the memory accesses made by a program using perf and PEBS counters. My intention was to log all of the memory accesses made by a program (I chose programs from SPEC CPU2006). I know, as has been said previously, that it is tough to record all of the memory-access samples, but leaving that aside, I want to know why, even after tweaking certain parameters, PEBS records far fewer samples than there actually are.
I followed the steps below:
First of all, I modified the /proc/sys/kernel/perf_cpu_time_max_percent value. Initially it was 25%; I changed it to 95%. This was because I wanted to see whether I could record the maximum number of memory-access samples. It would also allow me to use a much higher perf_event_max_sample_rate, which usually maxes out at 100,000, without the value being lowered automatically.
I then used a much higher value for perf_event_max_sample_rate: 244,500, instead of the usual maximum of 100,000.
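(For reference, and not part of the original steps: both knobs live under /proc/sys/kernel/ and can be set as root by writing the desired value to them, e.g. echo 95 > /proc/sys/kernel/perf_cpu_time_max_percent.)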
Next, I used perf stat to record the total count of memory-store events in the program. I got the data below:
./perf stat -e cpu/mem-stores/u ../../.././libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for '../../.././libquantum_base.arnab 100':
158,115,509 cpu/mem-stores/u
0.591718162 seconds time elapsed
There are roughly ~158 million events, as indicated by perf stat, which should be a correct figure since it comes directly from the hardware counter values.
But now, when I run perf record and use PEBS counters to capture all of the memory-store events possible:
./perf record -e cpu/mem-stores/upp -c 1 ../../.././libquantum_base.arnab 100
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 7.827 MB perf.data (254125 samples) ]
I can see 254,125 samples being recorded. This is far, far fewer than what was returned by perf stat. I am recording all of these accesses in userspace only (I am using the u modifier in both cases).
Why does this happen? Am I recording the memory-store events in some wrong way? Or is there a problem with the CPU's behavior?

Measure CPU usage of multithreaded program

In a GCC bug report (Bug 51617), the poster times his asynchronous (C++11) program's execution, with output looking like this:
/tmp/tst 81.54s user 0.23s system 628% cpu 13.001 total
What could the poster be using which gives that (or similar) output?
NB: An inspection of my man entry for time doesn't suggest anything useful to this end.
The time(1) man page on my (Ubuntu) Linux system says:
-f FORMAT, --format FORMAT
    Use FORMAT as the format string that controls the output of time.
    See below for more information.
:
FORMATTING THE OUTPUT
    The format string FORMAT controls the contents of the time output. The
    format string can be set using the `-f' or `--format', `-v' or
    `--verbose', or `-p' or `--portability' options. If they are not
    given, but the TIME environment variable is set, its value is used as
    the format string. Otherwise, a built-in default format is used. The
    default format is:
        %Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
        %Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
:
    The resource specifiers, which are a superset of those recognized by
    the tcsh(1) builtin `time' command, are:
    %   A literal `%'.
    C   Name and command line arguments of the command being timed.
    D   Average size of the process's unshared data area, in Kilobytes.
    E   Elapsed real (wall clock) time used by the process, in
        [hours:]minutes:seconds.
    F   Number of major, or I/O-requiring, page faults that occurred
        while the process was running. These are faults where the page
        has actually migrated out of primary memory.
    I   Number of file system inputs by the process.
    K   Average total (data+stack+text) memory use of the process, in
        Kilobytes.
    M   Maximum resident set size of the process during its lifetime, in
        Kilobytes.
    O   Number of file system outputs by the process.
    P   Percentage of the CPU that this job got. This is just user +
        system times divided by the total running time. It also prints a
        percentage sign.
    R   Number of minor, or recoverable, page faults. These are pages
        that are not valid (so they fault) but which have not yet been
        claimed by other virtual pages. Thus the data in the page is
        still valid but the system tables must be updated.
    S   Total number of CPU-seconds used by the system on behalf of the
        process (in kernel mode), in seconds.
    U   Total number of CPU-seconds that the process used directly (in
        user mode), in seconds.
    W   Number of times the process was swapped out of main memory.
    X   Average amount of shared text in the process, in Kilobytes.
    Z   System's page size, in bytes. This is a per-system constant, but
        varies between systems.
    c   Number of times the process was context-switched involuntarily
        (because the time slice expired).
    e   Elapsed real (wall clock) time used by the process, in seconds.
    k   Number of signals delivered to the process.
    p   Average unshared stack size of the process, in Kilobytes.
    r   Number of socket messages received by the process.
    s   Number of socket messages sent by the process.
    t   Average resident set size of the process, in Kilobytes.
    w   Number of times that the program was context-switched
        voluntarily, for instance while waiting for an I/O operation to
        complete.
    x   Exit status of the command.
So you can get CPU percentage as %P in the format.
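For instance (an illustrative format string of my own, not taken from the bug report), /usr/bin/time -f "%Us user %Ss system %P cpu %es total" ./tst would print a line very close to the one quoted above.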
Note that this is for the /usr/bin/time binary; the shell time builtin is usually different (and less capable).

How to find the processor queue length in Linux

I am trying to determine the Processor Queue Length (the number of processes that are ready to run but currently aren't running) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux, I'm trying to mine /proc and top for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
1) Observe field 8 for each CPU and record the value.
2) Wait for some interval.
3) Observe field 8 for each CPU again, and calculate how much the value has increased.
4) Divide that difference by the length of the interval waited. (The documentation says the field is in jiffies, but it has actually been in nanoseconds since the addition of CFS.) By Little's Law, this yields the mean length of the scheduler run queue over the interval.
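A minimal sketch of that procedure (my own illustration, relying only on the cpu<N> line format shown above):

#include <stdio.h>
#include <unistd.h>

/* Sketch: sum field 8 of every cpu<N> line in /proc/schedstat (time
   tasks spent waiting to run, in nanoseconds on CFS kernels), sample
   twice, and apply Little's Law over the interval. */
static unsigned long long total_wait(void)
{
    FILE *f = fopen("/proc/schedstat", "r");
    char line[512];
    unsigned long long v[9], sum = 0;

    if (!f) {
        perror("/proc/schedstat");
        return 0;
    }
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "cpu%*d %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8]) == 9)
            sum += v[7];    /* field 8: time spent waiting to run */
    }
    fclose(f);
    return sum;
}

int main(void)
{
    unsigned long long before = total_wait();
    sleep(5);                               /* measurement interval */
    unsigned long long after = total_wait();
    /* Little's Law: mean queue length = accumulated wait / interval */
    printf("mean number of waiting tasks: %.2f\n",
           (double)(after - before) / 5e9);
    return 0;
}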
Unfortunately, I'm not aware of any utility to automate this process that is commonly installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the Wayback Machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?
