/proc/stat column number on different platforms - linux

The number of columns in /proc/stat differs across platforms. Will the meaning of the i-th column change?
For example, the man page http://man7.org/linux/man-pages/man5/proc.5.html lists ten value columns, but my Linux system shows only eight:
[root@localhost ~]# cat /proc/stat
cpu 148894509 5214 64962656 4534478045 18407228 6482288 24487520 0
cpu0 71026365 2633 34928452 2246110103 18371398 6482288 21933024 0
cpu1 77868143 2580 30034204 2288367942 35829 0 2554495 0

The linked man page clearly states the kernel version in which each of those columns was introduced:
steal (since Linux 2.6.11)
guest (since Linux 2.6.24)
guest_nice (since Linux 2.6.33)
If your kernel version is older than 2.6.24, you will see only 8 columns, but the order and meaning of the columns that are present remain the same.
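If you read /proc/stat from code, it is therefore safer not to hard-code the column count. A minimal Go sketch (mine, not part of the original answer) that reads the aggregate cpu line and labels whichever of the documented fields happen to be present:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // Field names in the order documented in proc(5); older kernels
    // simply print fewer of them, the order does not change.
    names := []string{"user", "nice", "system", "idle", "iowait",
        "irq", "softirq", "steal", "guest", "guest_nice"}

    f, err := os.Open("/proc/stat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) == 0 || fields[0] != "cpu" {
            continue // only the aggregate "cpu" line is handled here
        }
        for i, v := range fields[1:] {
            if i < len(names) {
                fmt.Printf("%-10s %s\n", names[i], v)
            }
        }
        break
    }
}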

What is the runtime API that provides the number of logical processors?

Below is a Linux command:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
shows:
CPU(s): 4
Thread(s) per core: 2
which I read as 4 × 2 = 8 logical processors. Correct me if I am wrong.
Below is another Linux command:
$ cat /proc/cpuinfo
processor : 0
....
cpu cores : 2
.....
processor : 1
.....
cpu cores : 2
.....
processor : 2
.....
cpu cores : 2
.....
processor : 3
.....
cpu cores : 2
.....
$
But the below program shows only 4 logical processors:
package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println(runtime.GOMAXPROCS(0)) // gives 4
    fmt.Println(runtime.NumCPU())      // gives 4
}
Output:
$ go install github.com/myhub/cs61a
$ bin/cs61a
4
4
code$
More details:
$ go version
go version go1.14.1 linux/amd64
$ uname -a
Linux mohet01-ubuntu 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Documentation says,
NumCPU returns the number of logical CPUs usable by the current process.
My understanding is that the Go scheduler creates OS threads (M) equal in number to the logical processors.
Why does the runtime API not report 8?
According to all of your listings above, you have:
two cores per socket
and one socket
which means you have only two CPU cores. However, you also have:
two threads per core
which means your two cores can run up to four threads simultaneously (with some potential drawbacks, but in most cases this should offer significantly more computing power than two threads total).
Since that's the actual hardware limit, the fact that Go computes the number of threads to use as 4 seems correct.
(I think the reason you are counting to eight is that you are assuming that each of the "cpus" that Linux reports supports two threads. That is not the case: there are only two physical cores, but each supports two threads, so Linux reports this as four "cpus".)
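Putting the lscpu figures together: logical CPUs = Socket(s) × Core(s) per socket × Thread(s) per core = 1 × 2 × 2 = 4, which matches both the CPU(s): 4 line and the value runtime.NumCPU() reports.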

Determine syscalls or subsystems a process is spending time waiting in

I'm looking for ways to learn which syscalls or which subsystems a process or thread spends time waiting in, i.e. blocked and not scheduled to run on a CPU.
Specifically if I have some unknown process, or a process where all we know is "it's slow" I'd like to be able to learn things like:
"it spends 80% of its time in sys_write() on fd 13 which is /some/file"
"it's spending a lot of time waiting to read() from a network socket"
"it's sleeping in epoll_wait() for activity on fds [4,5,6] which are [file /boo], [socket 10.1.1.:42], [notifyfd blah]"
In other words when my program is not running on the CPU what is it doing?
This is astonishingly hard to answer with perf, because it does not appear to have any way to record the duration of a syscall from sys_enter to sys_exit, or otherwise keep track of how long an event lasts. Presumably this is due to its sampling nature.
I'm aware of some experimental work with eBPF for Linux 4.6 and above that may help, with Brendan Gregg's off-cpu work. But in the sad world of operations and support a 4.6 kernel is a rare unicorn to be treasured.
What are the real world options?
Do ftrace, systemtap etc offer any insights here?
You can use strace. First, you might want to get a high-level summary of the costs of each type of system call. You can obtain this summary by running strace -c. For example, one possible output is the following:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
90.07 0.000263 26 10 getdents
3.42 0.000010 0 1572 read
3.42 0.000010 0 762 1 stat
3.08 0.000009 0 1574 6 open
0.00 0.000000 0 11 write
0.00 0.000000 0 1569 close
0.00 0.000000 0 48 fstat
The % time value is with respect to overall kernel time, not overall execution time (kernel + user). This summary tells you what the most expensive system calls are. However, if you need to determine which specific instances of system calls are most expensive and what arguments are passed to them, you can run strace -i -T. The -i option shows the instruction addresses of the instructions that performed the system call, and the -T option shows the time spent in each system call. An output might look like this:
[00007f97f1b37367] open("myfile", O_RDONLY|O_CLOEXEC) = 3 <0.000020>
[00007f97f1b372f4] fstat(3, {st_mode=S_IFREG|0644, st_size=159776, ...}) = 0 <0.000018>
[00007f97f1b374ba] mmap(NULL, 159776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f97f1d19000 <0.000019>
[00007f97f1b37467] close(3) = 0 <0.000018>
The first column shows instruction addresses, the second column shows system calls with their arguments, the third column shows the returned value, and the last column shows the time spent in that system call. This list is ordered by the dynamic occurrence of the system calls. You can filter this output using either grep or the -e option. The instruction addresses can help you locate where in the source code these system calls are made. For example, if a long sequence of system calls has the same address, then there is a good chance that you have a loop somewhere in the code that contains the system call. If your executable binary is not a PIE, the dynamic addresses are the same as the static addresses shown by objdump. But even with a PIE, the relative order of the dynamic addresses is the same. I don't know if there is an easy way to map these system calls to source code lines.
If you want to find out things like "it spends 80% of its time in sys_write() on fd 13 which is /some/file", you need to write a script that first extracts the return values of all open calls and the corresponding file name arguments, and then sums up the times of all sys_write calls whose fd argument matches the descriptor for that file.
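As a rough illustration of that approach, the following sketch (mine, not from the answer) reads strace -T output on standard input, remembers which file each successful open() returned, and totals the time spent in write() per file. The regular expressions only cover lines shaped like the examples above; openat(), dup(), and reused descriptors are not handled.

package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
    "strconv"
)

func main() {
    // Matches lines like: open("myfile", O_RDONLY) = 3 <0.000020>
    openRe := regexp.MustCompile(`\bopen\("([^"]+)".*\)\s*=\s*(\d+)\s*<([\d.]+)>`)
    // Matches lines like: write(13, "...", 512) = 512 <0.000010>
    writeRe := regexp.MustCompile(`\bwrite\((\d+),.*\)\s*=\s*-?\d+\s*<([\d.]+)>`)

    fdName := map[string]string{}  // fd -> file name from open()
    waited := map[string]float64{} // file name -> seconds spent in write()

    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        line := sc.Text()
        if m := openRe.FindStringSubmatch(line); m != nil {
            fdName[m[2]] = m[1]
        } else if m := writeRe.FindStringSubmatch(line); m != nil {
            t, _ := strconv.ParseFloat(m[2], 64)
            name := fdName[m[1]]
            if name == "" {
                name = "fd " + m[1]
            }
            waited[name] += t
        }
    }
    for name, secs := range waited {
        fmt.Printf("%-40s %.6f s in write()\n", name, secs)
    }
}

Feed it a trace captured with something like strace -T -e trace=open,write -o trace.log <command>, redirected to standard input.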

Why can't I match jiffies to uptime?

As far as I know, "jiffies" in Linux kernel is the number of ticks since boot, and the number of ticks in one second is defined by "HZ", so in theory:
(uptime in seconds) = jiffies / HZ
But based on my tests, the above is not true. For example:
$ uname -r
2.6.32-504.el6.x86_64
$ grep CONFIG_HZ /boot/config-2.6.32-504.el6.x86_64
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
So "HZ" is 1000, now look at the jiffies and uptime:
$ grep ^jiffies /proc/timer_list
jiffies: 8833841974
jiffies: 8833841974
...
$ cat /proc/uptime
4539183.14 144549693.77
As we can see, jiffies is nowhere near uptime. I have tested this on many boxes, and on none of them was jiffies even close to uptime. What did I do wrong?
What you're trying to do is how Linux used to work -- 10 years ago.
It's become more complicated since then. Some of the complications that I know of are:
There's an offset of -5 minutes so that the kernel always tests jiffy rollover.
The kernel command line can set a jiffy skip value so a 1000 Hz kernel can run at 250 or 100 or 10.
Various attempts at NoHZ don't use a timer tick at all and rely only on the timer ring and the HPET.
I believe there are some virtual guest extensions that disable the tick and ask the host hypervisor whenever a tick is needed. Such as the Xen or UML builds.
That's why the kernel has functions designed to tell you the time. Use them or figure out what they are doing and copy it.
Well, I hit the same problem. After some research, I finally found the reason why jiffies looks so large compared to uptime.
It is simply because of
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
The real value of INITIAL_JIFFIES is 0xfffb6c20, if HZ is 1000. It's not 0xfffffffffffb6c20.
So if you want to compute uptime from jiffies, you have to do
(jiffies - 0xfffb6c20)/HZ
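A small sketch (mine, not from the answer) of that correction: it reads the first jiffies value from /proc/timer_list, subtracts INITIAL_JIFFIES, divides by HZ (assumed to be 1000, as in the config above), and prints /proc/uptime for comparison.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

const (
    hz             = 1000       // assumed; match your CONFIG_HZ
    initialJiffies = 0xfffb6c20 // (unsigned long)(unsigned int)(-300*HZ) for HZ=1000
)

func main() {
    f, err := os.Open("/proc/timer_list") // may require root on newer kernels
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    for sc.Scan() {
        line := sc.Text()
        if !strings.HasPrefix(line, "jiffies:") {
            continue
        }
        v, err := strconv.ParseUint(strings.TrimSpace(strings.TrimPrefix(line, "jiffies:")), 10, 64)
        if err != nil {
            panic(err)
        }
        fmt.Printf("uptime from jiffies: %.2f s\n", float64(v-initialJiffies)/hz)
        break
    }

    up, _ := os.ReadFile("/proc/uptime")
    fmt.Printf("/proc/uptime:        %s", up)
}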

What do the numbers in /proc/loadavg mean on Linux?

When issuing this command on Linux:
# cat /proc/loadavg
0.75 0.35 0.25 1/25 1747
The first three numbers are load averages. What are the last two numbers?
The last one keeps increasing by 2 every second; should I be worried?
/proc/loadavg
The first three fields in this file are load average figures giving
the number of jobs in the run queue (state R) or waiting for disk
I/O (state D) averaged over 1, 5, and 15 minutes. They are the
same as the load average numbers given by uptime(1) and other
programs.
The fourth field consists of two numbers separated by a
slash (/). The first of these is the number of currently executing
kernel scheduling entities (processes, threads); this will be less
than or equal to the number of CPUs. The value after the slash is the
number of kernel scheduling entities that currently exist on the
system.
The fifth field is the PID of the process that was most
recently created on the system.
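A minimal sketch (mine, not from the answer) that splits those five fields apart, using the layout described above:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    data, err := os.ReadFile("/proc/loadavg")
    if err != nil {
        panic(err)
    }
    // e.g. "0.75 0.35 0.25 1/25 1747"
    f := strings.Fields(string(data))

    fmt.Println("1 min load average: ", f[0])
    fmt.Println("5 min load average: ", f[1])
    fmt.Println("15 min load average:", f[2])

    // Fourth field is "<scheduling entities in use>/<total entities>".
    rt := strings.SplitN(f[3], "/", 2)
    fmt.Println("running/runnable:   ", rt[0])
    fmt.Println("total entities:     ", rt[1])
    fmt.Println("most recent PID:    ", f[4])
}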
I would like to comment on the accepted answer.
The fourth field consists of two numbers separated by a slash (/). The
first of these is the number of currently executing kernel scheduling
entities (processes, threads); this will be less than or equal to the
number of CPUs.
I wrote a test program that reads an integer N from input, then creates N threads and runs them forever. On an RHEL 6.5 machine I have 8 processors, and each processor has hyper-threading. If I run my test so that it creates 128 threads, I see values in the fourth field that are greater than 128, for example 135, which is clearly greater than the number of CPUs. This post supports my observation: http://juliano.info/en/Blog:Memory_Leak/Understanding_the_Linux_load_average
It is worth noting that the current explanation in proc(5) manual page
(as of man-pages version 3.21, March 2009) is wrong. It reports the
first number of the fourth field as the number of currently executing
scheduling entities, and so predicts it can't be greater than the
number of CPUs. That doesn't match the real implementation, where this
value reports the current number of runnable threads.
The first three columns measure CPU and I/O utilization of the last one, five, and 15 minute periods. The fourth column shows the number of currently running processes and the total number of processes. The last column displays the last process ID used.
https://docs.fedoraproject.org/en-US/Fedora/17/html/System_Administrators_Guide/s2-proc-loadavg.html
The following page explains these in detail:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
Some interpretations:
If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).
You can consult the proc manual page for /proc/loadavg :
$ man proc | sed -n '/loadavg/,/^$/ p'
/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue
(state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as
the load average numbers given by uptime(1) and other programs. The fourth field consists of two num‐
bers separated by a slash (/). The first of these is the number of currently runnable kernel schedul‐
ing entities (processes, threads). The value after the slash is the number of kernel scheduling enti‐
ties that currently exist on the system. The fifth field is the PID of the process that was most
recently created on the system.
For that, you need to install the man-pages package on CentOS7/RedHat7 or the manpages package on Ubuntu 20.04/22.04 LTS.

How to find the processor queue length in linux

Trying to determine the Processor Queue Length (the number of processes that are ready to run but currently aren't) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux I'm trying to mine /proc and 'top' for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the interval you waited (the documentation says it is in jiffies, but since the addition of CFS it is actually in nanoseconds) yields, by Little's Law, the mean length of the scheduler run queue over the interval.
Unfortunately, I'm not aware of any utility that automates this process and is commonly installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the Wayback Machine.
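The sampling procedure above is small enough to script yourself. A rough Go sketch (mine, not a standard utility), assuming the cpu<N> lines follow the field layout quoted above and that field 8 is in nanoseconds:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// waitSum returns the sum over all cpu<N> lines of field 8, the total
// time tasks have spent waiting to run on that CPU (nanoseconds on
// CFS-era kernels, as noted above).
func waitSum() uint64 {
    f, err := os.Open("/proc/schedstat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var sum uint64
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) >= 9 && strings.HasPrefix(fields[0], "cpu") {
            v, _ := strconv.ParseUint(fields[8], 10, 64)
            sum += v
        }
    }
    return sum
}

func main() {
    const interval = time.Second

    before := waitSum()
    time.Sleep(interval)
    after := waitSum()

    // By Little's Law, the wait time accumulated per unit of wall-clock
    // time is the mean number of tasks waiting, i.e. the run queue length.
    fmt.Printf("mean run queue length: %.2f\n",
        float64(after-before)/float64(interval.Nanoseconds()))
}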
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?
