why can't I match jiffies to uptime? - linux

As far as I know, "jiffies" in the Linux kernel is the number of timer ticks since boot, and the number of ticks in one second is defined by "HZ", so in theory:
(uptime in seconds) = jiffies / HZ
But based on my tests, the above is not true. For example:
$ uname -r
2.6.32-504.el6.x86_64
$ grep CONFIG_HZ /boot/config-2.6.32-504.el6.x86_64
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
So "HZ" is 1000, now look at the jiffies and uptime:
$ grep ^jiffies /proc/timer_list
jiffies: 8833841974
jiffies: 8833841974
...
$ cat /proc/uptime
4539183.14 144549693.77
As we can see, jiffies / HZ (about 8833842 seconds) is nowhere near the uptime (4539183 seconds). I have tested on many boxes, and on none of them was jiffies even close to the uptime. What did I do wrong?

What you're trying to do is how Linux used to work -- 10 years ago.
It's become more complicated since then. Some of the complications that I know of are:
jiffies is initialized with an offset of -5 minutes, so that the jiffy-counter rollover is always exercised shortly after boot.
The kernel command line can set a jiffy skip value so a 1000 Hz kernel can run at 250 or 100 or 10.
Various attempts at NoHZ don't use a timer tick at all and rely only on the timer ring and the HPET.
I believe there are some virtual-guest extensions that disable the tick and ask the host hypervisor whenever a tick is needed, such as in the Xen or UML builds.
That's why the kernel has functions designed to tell you the time. Use them or figure out what they are doing and copy it.
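For example, here is a minimal kernel-module sketch that asks the timekeeping code for the uptime instead of deriving it from raw jiffies. It assumes a recent kernel that provides ktime_get_boottime_ts64(); kernels of the 2.6.32 era spell these helpers differently.
/* Minimal sketch (assumes a recent kernel): report the boot-based uptime
 * maintained by the timekeeping code, plus the raw jiffies counter. */
#include <linux/module.h>
#include <linux/timekeeping.h>
#include <linux/jiffies.h>

static int __init uptime_demo_init(void)
{
    struct timespec64 up;

    ktime_get_boottime_ts64(&up);    /* uptime, including time in suspend */
    pr_info("uptime: %lld.%09ld s, jiffies_64: %llu\n",
            (long long)up.tv_sec, up.tv_nsec,
            (unsigned long long)get_jiffies_64());
    return 0;
}

static void __exit uptime_demo_exit(void)
{
}

module_init(uptime_demo_init);
module_exit(uptime_demo_exit);
MODULE_LICENSE("GPL");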

Well, I hit the same problem. After some research, I finally found the reason why jiffies looks so much larger than uptime.
It is simply because of
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
With HZ = 1000, the real value of INITIAL_JIFFIES is 0xfffb6c20: the cast to unsigned int truncates -300*HZ to 32 bits, so it is not 0xfffffffffffb6c20.
So if you want to compute uptime from jiffies, you have to do
(jiffies - 0xfffb6c20) / HZ
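As a quick check, here is a minimal user-space sketch, assuming HZ = 1000 (hence INITIAL_JIFFIES = 0xfffb6c20) and the /proc/timer_list format shown in the question:
/* Minimal sketch: read jiffies from /proc/timer_list, subtract
 * INITIAL_JIFFIES (0xfffb6c20 for HZ = 1000) and divide by HZ to
 * approximate the uptime in seconds. */
#include <stdio.h>

int main(void)
{
    const unsigned long long initial_jiffies = 0xfffb6c20ULL; /* (unsigned int)(-300 * 1000) */
    const unsigned long long hz = 1000;                       /* CONFIG_HZ on this box */
    unsigned long long jiffies = 0;
    char line[256];
    FILE *f = fopen("/proc/timer_list", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "jiffies: %llu", &jiffies) == 1)
            break;
    }
    fclose(f);

    printf("uptime ~= %.2f s\n", (double)(jiffies - initial_jiffies) / hz);
    return 0;
}
On the numbers from the question, (8833841974 - 0xfffb6c20) / 1000 is about 4539171, which closely matches the 4539183 s reported by /proc/uptime.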

Related

How "real-time" are the FIFO/RR schedulers on non-RT Linux kernel?

Say on a non-RT Linux kernel (4.14, Angstrom distro, running on iMX6) I have a program that receives UDP packets (< 1400 bytes) that come in at a very steady data rate. Basically, the essence of the program is:
while (true) {
    recv(sockFd, ...);
    update_loop_interval_histogram();  // O(1)
}
To minimize the maximum occurring delay (loop interval), I started my process with:
chrt --fifo 99 ./programName
setting the scheduling policy to the "real-time" SCHED_FIFO class with the highest priority.
The CPU affinity of my process is fixed to the 2nd core.
In addition, I ran one benchmark program instance per core, deliberately driving the CPU load to 100%.
That way, I get a maximum loop interval of ~10 ms (vs. ~25 ms without SCHED_FIFO). Such long intervals are rare: during e.g. an hour of runtime, the count of all intervals below 400 µs divided by the count of all other intervals (400 µs to 10000 µs) is over 1.5 million.
But as rare as it is, it's still bad.
Is that the best one can reliably get on a non-RealTime Linux kernel, or are there further tweaks to be made to get to something like 5ms maximum interval time?
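For reference, here is a minimal sketch of doing the same setup programmatically instead of via chrt; core index 1 stands in for the "2nd core", and this needs root or CAP_SYS_NICE:
/* Minimal sketch: pin the calling thread to core 1 and switch it to
 * SCHED_FIFO priority 99, mirroring "chrt --fifo 99" plus a fixed CPU
 * affinity. Error handling is kept to a minimum. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = 99 };

    CPU_ZERO(&set);
    CPU_SET(1, &set);                                  /* 2nd core */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)   /* real-time FIFO, prio 99 */
        perror("sched_setscheduler");

    /* ... the recv() loop from the question would run here ... */
    return 0;
}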

PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats on memory accesses using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads returns a 0 value.
Below are the details for memory-stores:
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count, as can be seen below:
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event to get proper data?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves that purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you used in the answer you posted, but are more portable because they also work on AMD processors.
I used a Broadwell (CPU E5-2620) server machine to collect all of the events below.
To collect memory-load events, I had to use a raw (numeric) event code. I basically ran the command below:
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads among all instructions retired"; the "u" modifier restricts counting to user space.
The command below, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
uses "r82d0:u" as the raw event representing "memory stores among all instructions retired", again restricted to user space.

How to determine timer frequency in linux

I need to write a kernel module to calculate the Linux kernel timer (interrupt) frequency.
Somebody told me I need to use a timer in my module, but I don't clearly know how to do that :(
My final goal is to write the result (the frequency) to some file (for example in /proc/osfreq/).
=)
There are several ways to get the kernel's timer tick frequency:
1. zcat /proc/config.gz |grep CONFIG_HZ
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
means 250 Hz
2. cat /proc/interrupts |grep LOC; sleep 1;cat /proc/interrupts |grep LOC
LOC: 43986173 44089526 43986113 44089117
LOC: 43986424 44089777 43986364 44089368
means there are 4 logical CPUs, each receiving 43986424 - 43986173 ~= 250 local timer interrupts per second.
Also, you can read the value of the cpu_khz variable (used in proc.c) in kernel space; note that this is the CPU clock frequency in kHz, not the tick rate.
You can also just print the value of HZ from your module using printk, then check the kernel log with dmesg after loading the module to find the value of HZ.
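A minimal module sketch of that suggestion follows; exposing the value under /proc as in the question would additionally need proc_create(), which is left out here.
/* Minimal sketch: print HZ (the compile-time tick rate) to the kernel log. */
#include <linux/module.h>
#include <linux/kernel.h>

static int __init hzdemo_init(void)
{
    pr_info("hzdemo: HZ = %d (timer interrupts per second)\n", HZ);
    return 0;
}

static void __exit hzdemo_exit(void)
{
    pr_info("hzdemo: unloaded\n");
}

module_init(hzdemo_init);
module_exit(hzdemo_exit);
MODULE_LICENSE("GPL");
After insmod, dmesg should show the HZ line.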

How to find the processor queue length in linux

Trying to determine the Processor Queue Length (the number of processes that are ready to run but currently aren't) on a Linux machine. There is a WMI call in Windows for this metric, but not knowing much about Linux I'm trying to mine /proc and 'top' for the information. Is there a way to determine the queue length for the CPU?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."
sar -q will report queue length, task list length and three load averages.
Example:
matli@tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2
The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it's actually in nanoseconds since the addition of CFS), by Little's Law, yields the mean length of the scheduler run queue over the interval.
Unfortunately, I'm not aware of any utility that automates this process and is commonly installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the Wayback Machine.
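In the meantime, here is a minimal sketch of the procedure described above, assuming the cpu<N> line layout quoted from sched-stats.txt and treating the values as nanoseconds, as noted:
/* Minimal sketch: sample field 8 (time tasks spent waiting to run) of every
 * cpu<N> line in /proc/schedstat twice, and divide the growth by the
 * sampling interval to get the mean run-queue (waiting) length. */
#include <stdio.h>
#include <unistd.h>

static unsigned long long total_wait_ns(void)
{
    FILE *f = fopen("/proc/schedstat", "r");
    char line[512];
    unsigned long long sum = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long long v[9];
        if (sscanf(line, "cpu%*d %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8]) == 9)
            sum += v[7];        /* field 8: time spent waiting to run */
    }
    fclose(f);
    return sum;
}

int main(void)
{
    const unsigned interval_s = 5;
    unsigned long long before = total_wait_ns();

    sleep(interval_s);

    unsigned long long after = total_wait_ns();
    printf("mean run queue length: %.2f\n",
           (double)(after - before) / (interval_s * 1e9));
    return 0;
}
The printed value is the mean number of tasks that were waiting for a CPU over the sampling window.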
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.
uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?

profile program's speed on Linux

I have a couple of variants of a program that I want to compare for performance. Both perform essentially the same task.
One does it all in C and memory. The other calls an external utility and does file IO.
How do I reliably compare them?
1) Getting "time on CPU" using "time" favors the second variant for calling system() and doing IO. Even if I add "system" time to "user" time, it'll still not count for time spent blocked on wait().
2) I can't just clock them for they run on a server and can be pushed off the CPU any time. Averaging across 1000s of experiments is a soft option, since I have no idea how my server is utilized - it's a VM on a cluster, it's kind of complicated.
3) profilers do not help since they'll give me time spent in the code, which again favors the version that does system()
I need to add up all CPU time that these programs consume, including user, kernel, IO, and children's recursively.
I expected this to be a common problem, but still don't seem to find a solution.
(Solved with times() - see below. Thanks everybody)
If I've understood, typing "time myapplication" on a bash command line is not what you are looking for.
If you want accuracy, you must use a profiler... You have the source, yes?
Try something like Oprofile or Valgrind, or take a look at this for a more extended list.
If you haven't the source, honestly I don't know...
/usr/bin/time (not the built-in "time" in bash) can give some interesting stats.
$ /usr/bin/time -v xeyes
Command being timed: "xeyes"
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.57
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 517
Voluntary context switches: 243
Involuntary context switches: 0
Swaps: 0
File system inputs: 1072
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Run them a thousand times, measure actual time taken, then average the results. That should smooth out any variances due to other applications running on your server.
I seem to have found it at last.
NAME
times - get process times
SYNOPSIS
#include <sys/times.h>
clock_t times(struct tms *buf);
DESCRIPTION
times() stores the current process times in the struct tms that buf
points to. The struct tms is as defined in <sys/times.h>:
struct tms {
    clock_t tms_utime;  /* user time */
    clock_t tms_stime;  /* system time */
    clock_t tms_cutime; /* user time of children */
    clock_t tms_cstime; /* system time of children */
};
The children's times are a recursive sum of all waited-for children.
I wonder why it hasn't been made a standard CLI utility yet. Or maybe I'm just ignorant.
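A minimal usage sketch follows; the system() call is just a hypothetical stand-in for the external-utility variant:
/* Minimal sketch: run some work, then report user/system time of the
 * process and of its waited-for children via times(2). */
#include <sys/times.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */

    system("ls -lR / > /dev/null 2>&1"); /* placeholder child workload */

    times(&t);
    printf("self:     user %.2fs, sys %.2fs\n",
           (double)t.tms_utime / ticks, (double)t.tms_stime / ticks);
    printf("children: user %.2fs, sys %.2fs\n",
           (double)t.tms_cutime / ticks, (double)t.tms_cstime / ticks);
    return 0;
}
Note that tms_cutime and tms_cstime only include children that have already been waited for, which system() does.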
I'd probably lean towards adding "time -o somefile" to the front of the system() command, and then adding its time to the time reported for your main program to get a total. Unless I had to do this lots of times, in which case I'd find a way to take the two time outputs and add them up automatically (using awk or shell or Perl or something).
