What do the numbers in /proc/loadavg mean on Linux? - linux

When issuing this command on Linux:
# cat /proc/loadavg
0.75 0.35 0.25 1/25 1747
The first three numbers are load averages. What are the last 2 numbers?
The last one keeps increasing by 2 every second, should I be worried?

/proc/loadavg
The first three fields in this file are load average figures giving
the number of jobs in the run queue (state R) or waiting for disk
I/O (state D) averaged over 1, 5, and 15 minutes. They are the
same as the load average numbers given by uptime(1) and other
programs.
The fourth field consists of two numbers separated by a
slash (/). The first of these is the number of currently executing
kernel scheduling entities (processes, threads); this will be less
than or equal to the number of CPUs. The value after the slash is the
number of kernel scheduling entities that currently exist on the
system.
The fifth field is the PID of the process that was most
recently created on the system.

I would like to comment the accepted answer.
The fourth field consists of two numbers separated by a slash (/). The
first of these is the number of currently executing kernel scheduling
entities (processes, threads); this will be less than or equal to the
number of CPUs.
I did a test program that reads integer N from input and then creates N threads and their run them forever. On RHEL 6.5 computer I have 8 processor and each processor has hyper threading. Anyway if I run my test and it creates 128 threads I see in the fourth field values that are greater than 128, for example 135. It is clearly greater than the number of CPU. This post supports my observation: http://juliano.info/en/Blog:Memory_Leak/Understanding_the_Linux_load_average
It is worth noting that the current explanation in proc(5) manual page
(as of man-pages version 3.21, March 2009) is wrong. It reports the
first number of the forth field as the number of currently executing
scheduling entities, and so predicts it can't be greater than the
number of CPUs. That doesn't match the real implementation, where this
value reports the current number of runnable threads.

The first three columns measure CPU and I/O utilization of the last one, five, and 15 minute periods. The fourth column shows the number of currently running processes and the total number of processes. The last column displays the last process ID used.
https://docs.fedoraproject.org/en-US/Fedora/17/html/System_Administrators_Guide/s2-proc-loadavg.html

The following page explains these in detail:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
Some interpretations:
If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).

You can consult the proc manual page for /proc/loadavg :
$ man proc | sed -n '/loadavg/,/^$/ p'
/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue
(state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as
the load average numbers given by uptime(1) and other programs. The fourth field consists of two num‐
bers separated by a slash (/). The first of these is the number of currently runnable kernel schedul‐
ing entities (processes, threads). The value after the slash is the number of kernel scheduling enti‐
ties that currently exist on the system. The fifth field is the PID of the process that was most
recently created on the system.
For that, you need to install the man-pages package on CentOS7/RedHat7 or the manpages package on Ubuntu 20.04/22.04 LTS.

Related

How can I use linux perf and interpret its output to understand CPU cache misses?

I am trying to measure the number of times memory references miss any CPU cache and need to fetch a cache line from memory. I have a very simple program that loads 100 million 4-byte integers into an array and then scans it or probes it randomly. I measure time, and then use perf to report various cache-related events: LLC-load, LLC-load-misses, LLC-store, LLC-store-misses. I am using Pop OS 18.10 (a variant of Ubuntu 18.10).
I run the program three ways:
1) Just load the array (100m integers).
2) Load the array and scan in physical order.
3) Load the array and read 100m random array locations.
#3 is 40x slower than #2, which is not surprising.
I am having some trouble both knowing what perf events to examine, and how to interpret the results:
I discovered the LLC-* events by googling, but they are not mentioned by "perf list".
I subtract the counts of events of the load-only run (#1) from the load-and-scan runs (#2, #3). The numbers are generally lower from the physical scan (#2) compared to the random access (#3). But from reading the documentation, and looking at the numbers, I don't really understand what the various events represent.
Does perf count events or does it sample them? If it's a true count, then I really can't make sense of the numbers I'm seeing. (E.g. the number of LLC-load-misses events doesn't match the number of cache line transfers that should be needed.)

How "real-time" are the FIFO/RR schedulers on non-RT Linux kernel?

Say on a non-RT Linux kernel (4.14, Angstrom distro, running on iMX6) I have a program that receives UDP packets (< 1400 bytes) that come in at a very steady data rate. Basically,
the essence of the program is:
while (true)
{ recv( sockFd, ... );
update_loop_interval_histogram(); // O(1)
}
To minimize the maximally occuring delay time (loop intervals), I started my process with:
chrt --fifo 99 ./programName
setting the scheduler to a "real-time" mode SCHED_FIFO with highest priority.
the CPU affinity of my process is fixed to the 2nd core.
Next to that, I ran a benchmark program instance per core, deliberately getting the CPU load to 100%.
That way, I get a maximum loop interval of ~10ms (vs. ~25ms without SCHED_FIFO). The occurence of this is rare. During e.g. an hour runtime, the counts sum of all intervals <400µs divided by the sum of all counts of all other interval time occurences from 400µs to 10000µs is over 1.5 million.
But as rare as it is, it's still bad.
Is that the best one can reliably get on a non-RealTime Linux kernel, or are there further tweaks to be made to get to something like 5ms maximum interval time?

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory-accesses using perf stat. I am able to collect stats for memory-stores but the count for memory-loads return me a 0 value.
The below is the details for memory-stores :-
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count as can be seen below :-
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event in any way to get proper data ?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means count the total number of loads and it is not designed for that purpose at all.
If you want to to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem and it achieves its purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
I have used a Broadwell(CPU e5-2620) server machine to collect all of the below events.
To collect memory-load events, I had to use a numeric event value. I basically ran the below command -
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired". "u" as can be understood represents user-space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".

Measure CPU usage of multithreaded program

In a GCC bug report (Bug 51617), the poster times his asynchronous (C++11) program's execution with an output looking like this:
/tmp/tst 81.54s user 0.23s system 628% cpu 13.001 total
What could the poster be using which gives that (or similar) output?
NB: An inspection of my man entry for time doesn't suggest anything useful to this end.
The time(1) man page on my (Ubuntu) Linux system says:
-f FORMAT, --format FORMAT
Use FORMAT as the format string that controls the output of
time. See the below more information.
:
FORMATTING THE OUTPUT
The format string FORMAT controls the contents of the time output. The
format string can be set using the `-f' or `--format', `-v' or `--ver‐
bose', or `-p' or `--portability' options. If they are not given, but
the TIME environment variable is set, its value is used as the format
string. Otherwise, a built-in default format is used. The default
format is:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
:
The resource specifiers, which are a superset of those recognized by
the tcsh(1) builtin `time' command, are:
% A literal `%'.
C Name and command line arguments of the command being
timed.
D Average size of the process's unshared data area, in
Kilobytes.
E Elapsed real (wall clock) time used by the process, in
[hours:]minutes:seconds.
F Number of major, or I/O-requiring, page faults that oc‐
curred while the process was running. These are faults
where the page has actually migrated out of primary memo‐
ry.
I Number of file system inputs by the process.
K Average total (data+stack+text) memory use of the
process, in Kilobytes.
M Maximum resident set size of the process during its life‐
time, in Kilobytes.
O Number of file system outputs by the process.
P Percentage of the CPU that this job got. This is just
user + system times divided by the total running time. It
also prints a percentage sign.
R Number of minor, or recoverable, page faults. These are
pages that are not valid (so they fault) but which have
not yet been claimed by other virtual pages. Thus the
data in the page is still valid but the system tables
must be updated.
S Total number of CPU-seconds used by the system on behalf
of the process (in kernel mode), in seconds.
U Total number of CPU-seconds that the process used direct‐
ly (in user mode), in seconds.
W Number of times the process was swapped out of main memo‐
ry.
X Average amount of shared text in the process, in Kilo‐
bytes.
Z System's page size, in bytes. This is a per-system con‐
stant, but varies between systems.
c Number of times the process was context-switched involun‐
tarily (because the time slice expired).
e Elapsed real (wall clock) time used by the process, in
seconds.
k Number of signals delivered to the process.
p Average unshared stack size of the process, in Kilobytes.
r Number of socket messages received by the process.
s Number of socket messages sent by the process.
t Average resident set size of the process, in Kilobytes.
w Number of times that the program was context-switched
voluntarily, for instance while waiting for an I/O opera‐
tion to complete.
x Exit status of the command.
So you can get CPU percentage as %P in the format.
Note that this is for the /usr/bin/time binary -- the shell time builtin is usually different (and less capable)

How to find the processor queue length in linux

Trying to determine the Processor Queue Length (the number of processes that ready to run but currently aren't) on a linux machine. There is a WMI call in Windows for this metric, but not knowing much about linux I'm trying to mine /proc and 'top' for the information. Is there a way to determine the queue length for the cpu?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli#tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it's actually in nanoseconds since the addition of CFS), by Little's Law, yields the mean length of the scheduler run queue over the interval.
Unfortunately, I'm not aware of any utility to automate this process which is usually installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the wayback machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?

Resources