How do proc stats work - Linux

I have done a lot of reading and testing around the /proc directory on operating systems that use the Linux kernel. I have been using Linux myself for many years, but I needed to get into more detail for a small private project, specifically how the stat files work. I knew the basics, but not enough to do actual calculations with the data in them.
The problem is that the files in /proc do not seem to contain what they should, at least not according to what I have read versus what my tests show.
For example: the cpu line in the root stat file should contain the total uptime of the CPU multiplied by the number of cores (and/or the number of CPUs), in jiffies. So to get the system uptime, you would add all the numbers in that row together, divide by the number of cores/CPUs, and divide again by whatever a jiffy is defined to be on that particular system. At least, this is the formula I keep finding when searching on this subject. If that were true, the result should equal the first number in /proc/uptime. But this is not the case, and I have tested it on several machines with different numbers of cores, both 32-bit and 64-bit systems. I can never get the two to match up.
Also, the stat file for each PID has an uptime part (field 21, I think). But I cannot figure out what this number should be matched against to calculate a process's uptime in seconds. From what I have read, it should contain the total CPU jiffies as they were when the process was started. If that is true, then one would simply subtract it from the current total CPU jiffies and divide by whatever a jiffy is on that system? But again, I cannot seem to get this to add up to reality.
Then there is the problem of finding out what a jiffy is. I found a formula that used /proc/stat together with /proc/uptime and some dividing by the number of cores/CPUs to get that number. But this does not work, and I would not expect it to when the values in those two files do not add up, as mentioned in my first problem above. I did, however, come up with a different approach: simply read the first line of /proc/stat twice, one second apart. Then I can compare the two reads, see how many jiffies the system added in that second, and divide by the number of cores. This works on normal Linux systems, but it fails on Android in most cases. Android is constantly attaching/detaching cores depending on need, which means the number you have to divide by keeps changing. It is no problem as long as the core count matches in both reads, but if a core goes active during the second read, it does not work.
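Here is a rough sketch of that two-sample approach in Python (the one-second sleep and the simple core-count check are simplifications; on Android the core-count check is exactly where it falls apart):

import time

def read_cpu_totals():
    # Return (total jiffies from the aggregate "cpu" line, number of per-core "cpuN" lines).
    with open("/proc/stat") as f:
        lines = f.readlines()
    total = sum(int(v) for v in lines[0].split()[1:])
    cores = sum(1 for line in lines[1:] if line.startswith("cpu"))
    return total, cores

t1, cores1 = read_cpu_totals()
time.sleep(1)
t2, cores2 = read_cpu_totals()

if cores1 == cores2:
    # Jiffies accumulated per core per second should approximate the kernel's tick rate.
    print("estimated clock ticks per second:", (t2 - t1) / cores1)
else:
    print("core count changed during the sample; the estimate would be unreliable")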
And last: I do not quite get the part about dividing by the number of cores. If each core adds all of its work time and idle time to the total line in /proc/stat, then it would make sense, as that line would then contain the total uptime multiplied by the number of cores. But if that were true, each of the per-core cpu lines would add up to the same number, and they don't. That means dividing by the number of cores should give an incorrect result. But that would also mean that CPU monitoring tools are making calculation errors, since they all seem to use this method.
Example:
/proc/stat
cpu 20455737 116285 4584497 104527701 1388173 366 102373 0 0 0
cpu0 4833292 5490 1413887 91023934 1264884 358 94250 0 0 0
cpu1 5785289 47944 1278053 4439797 45015 1 4235 0 0 0
cpu2 4748431 20922 926839 4552724 33455 2 2745 0 0 0
cpu3 5088724 41928 965717 4511246 44819 3 1141 0 0 0
The lines cpu0, cpu1, cpu2 and cpu3 do not add up to the same total. This means that dividing the total of the general cpu line by 4 should give an incorrect result.
/proc/uptime
1503361.21 3706840.53
All of the above output was taken from a system that should be using a clock tick of 100. Now if you take the total of the general cpu line, divide it by 100 and then by 4 (the number of cores), you do not get the first number in the uptime file.
And if you take the total of the general cpu line, divide it by the uptime from /proc/uptime and then by 4 (the number of cores), you do not get the 100 that is this kernel's clock tick.
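To make the mismatch concrete, here is that check spelled out in Python with the numbers above (assuming the 100 clock ticks and 4 cores just mentioned):

# Values copied from the aggregate "cpu" line of /proc/stat above.
cpu_line = [20455737, 116285, 4584497, 104527701, 1388173, 366, 102373, 0, 0, 0]

total_jiffies = sum(cpu_line)      # 131175132
seconds = total_jiffies / 100 / 4  # divide by the clock tick rate, then by the core count
print(seconds)                     # roughly 327938, nowhere near the 1503361.21 in /proc/uptime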
So why is nothing adding up as it should? How do I get the clock tick rate of a kernel, even on systems that attach/detach cores constantly? How do I get the total real uptime of a process? How do I get the real uptime from /proc/stat?

(This answer is based on the 4.0 kernel.)
The first number on each line of /proc/stat is the total time which each CPU has spent executing non-"nice" tasks in user mode. The second is the total time spent executing "nice" tasks in user mode.
Naturally, there will be random variations across CPUs. For example, the processes running on one CPU may make more syscalls, or slower syscalls, than those running on another CPU. Or one CPU may happen to run more "nice" tasks than another.
The first number in /proc/uptime is the "monotonic boot time" -- the amount of time which has passed since the system was last booted (including time which passed while the system was suspended). The second number is the total amount of time which all CPUs have spent idling.
There is also a task-specific stat file for each PID in the corresponding subdirectory under /proc. This one starts with a PID number, a name in brackets, and a state code (represented by a character). The 19th number after that is the start time of the process, in ticks.
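Putting those pieces together, a minimal sketch of the process-uptime calculation might look like this (Python; the field index follows the description above and proc(5), and SC_CLK_TCK gives the tick rate):

import os

def process_uptime_seconds(pid):
    # Seconds since the given process was started, using the start-time field described above.
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The process name can contain spaces, so split after the closing bracket first.
    rest = data.rsplit(")", 1)[1].split()
    start_ticks = int(rest[19])  # the 19th number after the state code: start time in ticks since boot
    ticks_per_second = os.sysconf("SC_CLK_TCK")

    with open("/proc/uptime") as f:
        boot_seconds = float(f.read().split()[0])

    return boot_seconds - start_ticks / ticks_per_second

print(process_uptime_seconds(os.getpid()))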
All of this information is not very hard to find simply by browsing the Linux source code. I recommend cloning a local copy of Linus' repo and using grep to find the details you need. As a tip, the process-specific files in /proc are implemented in fs/proc/base.c. The /proc/stat file you asked about is implemented in fs/proc/stat.c, and /proc/uptime is implemented in fs/proc/uptime.c.
man proc is also a good source of information.

Related

What is the load on the system

I have a Red Hat server where, using the top command, I can see that the load average on the system is 23 24 23 (1 min, 5 min, 15 min). I can also see in /proc/cpuinfo that there are 24 processor entries (0-23). But in each processor entry the cpu cores value is 6, and the physical id in each entry is either 0 or 1.
I want to know whether my system is overloaded. Can anyone please tell me?
It seems you have a system with two processors, each with 6 cores. Each core can likely run two hyperthreads => 2 x 6 x 2 = 24. In /proc/cpuinfo, top, etc., you will see each hyperthread: that is the number of parallel processes or threads your hardware can run.
The quick answer is that your system is probably not overloaded and that it is processing a relatively stable amount of work over time (as the 1, 5 and 15 minute values are about the same). A rule of thumb is that the load average should stay below the number of hyperthreads -- though this is not exactly true.
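As a quick way to apply that rule of thumb, a small Python sketch like this compares the three load averages with the number of logical CPUs (hyperthreads) the kernel exposes:

import os

load1, load5, load15 = os.getloadavg()  # the same three values top shows
threads = os.cpu_count()                # logical CPUs (hyperthreads), e.g. 24 on this machine

print(f"load averages: {load1:.2f} {load5:.2f} {load15:.2f} on {threads} logical CPUs")
if load15 < threads:
    print("sustained load is below the hyperthread count -> probably not overloaded")
else:
    print("sustained load at or above the hyperthread count -> worth investigating further")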
You'll find a more in-depth discussion here:
https://unix.stackexchange.com/questions/303699/how-is-the-load-average-interpreted-in-top-output-is-it-the-same-for-all-di
and here:
https://superuser.com/questions/23498/what-does-load-average-mean-on-unix-linux
and perhaps here:
https://linuxhint.com/load_average_linux/
However, please keep in mind that the load average does not tell you everything about your system -- though it is usually a pretty good indicator in my experience. You would have to check many other factors to decide whether it is overloaded, such as memory pressure, I/O wait times and I/O bandwidth utilization. It also depends on what kind of processing the system is performing.

Linux command that tracks statistics of CPU usage while running application on HPC/HTC

In my PBS script, I am running MATLAB and would like to know how many cores were actually used while it ran. In particular, I would like to know the maximum number of cores used at any one time.
If I only allocate x cores but MATLAB at any point uses more than x cores, my job will be stopped and cancelled by the HPC/HTC system.
Ideally the command and output would be as simple as
cpustats matlab -nojvm -r "someExperiment(params);exit()"
Max CPU usage: 12.5 cores
Average CPU usage: 6 cores
Min CPU usage: 0.5 cores
I can't monitor the progress manually because it is a batch script, so I am planning to run it once with plenty of cores and then adjust the rest so I don't have to wait so long.
I have searched and searched for a command like this, but the following don't seem to be what I am looking for:
top shows the current CPU usage, which I can't watch while the batch job runs
ps shows the CPU allotted to a process, not the actual usage
watch might be useful for querying CPU usage at intervals and printing it, but I would like a continuous record if possible
time is really close to what I want, but it doesn't keep track of peak CPU usage
The most similar question I could find was this one about peak memory usage

Linux acceptable load average

I have a dedicated Linux server (8 cores, 8 GB RAM) where I run some crawler PHP scripts. The load on the system ends up being around 200, which sounds like a lot. Since I am not using the machine to host content, what could be the side effects of such a high load for the purposes stated above?
Machines were made to work, so there is no issue with a high load average per se. But a high load average is often an indicator of a performance problem. Such an investigation is usually application-specific, but here is a very general guideline:
Since the load average is a combined metric (CPU, I/O, etc.), you want to examine each part separately. I would start by making sure the machine is not thrashing, by checking swap usage (vmstat comes in handy) and disk performance (using iostat). You may also check whether your operations are CPU-intensive.
You should read the load average as a three-component value (the 1-minute, 5-minute and 15-minute load, respectively).
Take a look at this example taken from the Wikipedia article:
For example, one can interpret a load average of "1.73 0.60 7.98" on a single-CPU system as:
during the last minute, the system was overloaded by 73% on average (1.73 runnable processes, so that 0.73 processes had to wait for a turn for a single CPU system on average).
during the last 5 minutes, the CPU was idling 40% of the time on average.
during the last 15 minutes, the system was overloaded 698% on average (7.98 runnable processes, so that 6.98 processes had to wait for a turn for a single CPU system on average).
Full article
Please note that how you interpret these values depends on the resources of your machine (in particular, how many CPUs it has).
Cheers!

How is nice cpu percentage calculated, e.g. in top?

My research group is sharing time on a CentOS server, and we've been using renice +15 to try to lower the priority of long-running background tasks. When running top, these processes do show up as having a nice value of 15, but the "%ni" measure of nice cpu load is always very low (less than 1%) even when these processes are churning along on 30 cores (as reported in the %CPU column). This has made us think that we are not actually using renice correctly (though the nice processes do seem to yield to higher-priority tasks). How exactly is the nice cpu percentage calculated in top?
The numbers in top come from reading the file /proc/stat. The first line contains a summary for all the CPUs combined. The first column is usr time, the second nice time. These times are in clock ticks, usually 100 per second, and are cumulative, so you have to look over an interval and subtract the start value from the end value. You can see the docs for more details; I like http://man7.org/linux/man-pages/man5/proc.5.html
The Linux kernel adds CPU time to the nice column if the nice value is greater than 0, otherwise it puts it in the usr column.
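A small sketch of that interval calculation in Python (the column order -- usr, nice, then the rest -- is as described above):

import time

def cpu_fields():
    # The aggregate "cpu" line of /proc/stat as integers: usr, nice, system, idle, ...
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

start = cpu_fields()
time.sleep(1)
end = cpu_fields()

delta = [e - s for s, e in zip(start, end)]
total = sum(delta) or 1  # guard against a zero-length interval
print(f"%us {100 * delta[0] / total:.1f}  %ni {100 * delta[1] / total:.1f}")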
The nice value for an individual process can be found by looking at column 19 of /proc/[pid]/stat. That number should be 15 for you, and the number in column 18 should be 35 (the kernel's internal representation of a nice value of 15). However, if top is showing 15 in the NI column, it is already getting that value from /proc/[pid]/stat.
Comparing the CPU time used in usr and sys in /proc/[pid]/stat with usr, nice and sys in /proc/stat will give you a good idea of where the time is going. Maybe there are simply a lot of CPUs on the system, diluting the percentage.
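For the per-process side of that comparison, a minimal Python sketch (utime and stime are the user and system CPU time fields of /proc/[pid]/stat, per proc(5)):

import os

def process_cpu_seconds(pid):
    # Return (user, system) CPU seconds consumed so far by the given PID.
    with open(f"/proc/{pid}/stat") as f:
        rest = f.read().rsplit(")", 1)[1].split()
    ticks = os.sysconf("SC_CLK_TCK")
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15 of the stat line
    return utime / ticks, stime / ticks

user, system = process_cpu_seconds(os.getpid())
print(f"user {user:.2f}s, system {system:.2f}s")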

Is the UNIX `time` command accurate enough for benchmarks? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Let's say I wanted to benchmark two programs: foo.py and bar.py.
Are a couple thousand runs and the respective averages of time python foo.py and time python bar.py adequate for profiling and comparing their speed?
Edit: Additionally, if the execution of each program was sub-second (assume it wasn't for the above), would time still be okay to use?
time produces good enough timings for benchmarks that run for over one second; otherwise, the time spent exec()ing the process may be large compared to its run time.
However, when benchmarking you should watch out for context switching. That is, another process may be using the CPU, contending with your benchmark and increasing its run time. To avoid contention with other processes, you should run a benchmark like this:
sudo chrt -f 99 /usr/bin/time --verbose <benchmark>
Or
sudo chrt -f 99 perf stat -ddd <benchmark>
sudo chrt -f 99 runs your benchmark in the FIFO real-time scheduling class with priority 99, which makes your process the top-priority process and avoids context switching (you can change /etc/security/limits.conf so that using real-time priorities does not require a privileged process).
It also makes time report all the available stats, including the number of context switches your benchmark incurred, which should normally be 0, otherwise you may like to rerun the benchmark.
perf stat -ddd is even more informative than /usr/bin/time and displays such information as instructions-per-cycle, branch and cache misses, etc.
It is also better to disable CPU frequency scaling and turbo boost, so that the CPU frequency stays constant during the benchmark and you get consistent results.
Nowadays, in my opinion, there is no reason to use time for benchmarking purposes. Use perf stat instead. It gives you much more useful information, can repeat the benchmark any given number of times, and does statistics on the results, i.e. calculates the variance and the mean value. It is much more reliable and just as simple to use as time:
perf stat -r 10 -d <your app and arguments>
The -r 10 will run your app 10 times and do statistics over the runs. -d outputs some more data, such as cache misses.
So while time might be reliable enough for long-running applications, it definitely is not as reliable as perf stat. Use that instead.
Addendum: If you really want to keep using time, at least don't use the bash built-in, but the real deal in verbose mode:
/usr/bin/time -v <some command with arguments>
The output is then e.g.:
Command being timed: "ls"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 93
Voluntary context switches: 1
Involuntary context switches: 2
Swaps: 0
File system inputs: 8
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Especially note how this is capable of measuring the peak RSS, which is often enough if you want to compare the effect of a patch on peak memory consumption. That is, use that value to compare before and after, and if there is a significant decrease in the peak RSS, then you did something right.
Yes, time is accurate enough. And you'll need to run your programs only a dozen times or so (provided each run lasts more than a second, or at least a significant fraction of a second, i.e. more than 200 milliseconds). Of course, the file system will be hot (i.e. small files will already be cached in RAM) for most runs except the first, so take that into account.
The reason you want the timed run to last at least a few tenths of a second is the accuracy and granularity of the time measurement. Don't expect better than a hundredth of a second of accuracy (you need a special kernel option to get one millisecond).
From inside the application, you could use clock, clock_gettime, gettimeofday,
getrusage, times (they surely have a Python equivalent).
Don't forget to read the time(7) man page.
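For the Python programs in the question, rough equivalents of those calls might look like this (a sketch; time.monotonic, time.process_time and resource.getrusage are the Python-level counterparts I would reach for):

import resource
import time

wall_start = time.monotonic()    # roughly what clock_gettime(CLOCK_MONOTONIC) gives you
cpu_start = time.process_time()  # CPU time of this process, similar to clock()/times()

# ... the code you want to measure ...
sum(i * i for i in range(10**6))

usage = resource.getrusage(resource.RUSAGE_SELF)  # Python's wrapper around getrusage()
print("wall seconds:", time.monotonic() - wall_start)
print("CPU seconds:", time.process_time() - cpu_start)
print("user CPU seconds:", usage.ru_utime, "peak RSS (kB):", usage.ru_maxrss)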
Yes. The time command gives both the elapsed time and the consumed CPU time. The latter is probably what you should focus on, unless you're doing a lot of I/O. If elapsed time is important, make sure the system doesn't have other significant activity while running your test.

Resources