uptime vs. top CPU usage: which should I believe, and why the difference? [linux]

I'm having some performance issue on my embedded device:
# uptime
14:59:39 up 5:37, load average: 1.60, 1.50, 1.53
Very bad for a single-core system ... :-p! However, if I check with the top utility, I always see an idle time of around 80%!
Mem: 49020K used, 75960K free, 0K shrd, 0K buff, 21476K cached
CPU: 12.5% usr 4.8% sys 0.0% nic 81.7% idle 0.0% io 0.9% irq 0.0% sirq
Load average: 1.30 1.42 1.51 1/80 18696
After reading some articles, I would rather believe the uptime command. But why this difference? Is my CPU really idle?!

Load is not just a measure of how many processes are in the R state (runnable, could use CPU time); it also counts processes in the D state (uninterruptible sleep, usually waiting for I/O). You likely have a process in the D state which is contributing to load but not using CPU. This command will show you all the current processes which are contributing to load:
ps aux | awk '$8~/[RD]/'
Have a look at that output and see if you have commands in the D state (in the 8th column).
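If you want to watch for this over time, a minimal sketch is to sample repeatedly (the 5-second interval is just an example):
while true; do date; ps -eo state,pid,cmd | awk '$1 ~ /[RD]/'; sleep 5; done
A PID that keeps showing up in the D state across samples is almost certainly blocked on I/O and inflating the load average without using any CPU.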

You'd do well to learn what 'load average' stands for.
In short, it is the number of processes waiting for some resource, and that resource may be the CPU, a hard disk, a serial port, ...
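For reference, both uptime and busybox top read these numbers straight from the kernel; you can inspect the raw values yourself (the output below is illustrative, mirroring the busybox top line above):
# cat /proc/loadavg
1.30 1.42 1.51 1/80 18696
The first three fields are the 1/5/15-minute load averages, the fourth is runnable/total scheduling entities, and the last is the PID of the most recently created process.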

The load average seems a little high; that could mean the CPU is busy with things like I/O (disk/network) or thread management (you may have too many threads running).

Related

Weird EC2 CPU usage

I'm really confused. Why do the load average and %CPU not match the process CPU usage below? It seems like the process is eating up a lot of CPU while the AWS EC2 meter says only 25% CPU is used.
%CPU -- CPU Usage: The percentage of your CPU that is being used by the process. By default, top displays this as a percentage of a single CPU. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, top will show a CPU use of 180%.
You can toggle this behavior by hitting Shift+i while top is running to show the overall percentage of available CPUs in use.
load average: 22.56, 24.99, 26.51
From left to right, these numbers show you the average load over the last 1 minute, the last 5 minutes, and the last 15 minutes.
us -- User CPU time
The time the CPU has spent running users' processes that are not niced.
sy -- System CPU time
The time the CPU has spent running the kernel and its processes.
ni -- Nice CPU time
The time the CPU has spent running users' processes that have been niced.
wa -- iowait
Amount of time the CPU has been waiting for I/O to complete.
hi -- Hardware IRQ
The amount of time the CPU has been servicing hardware interrupts.
si -- Software Interrupts
The amount of time the CPU has been servicing software interrupts.
st -- Steal Time
The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).
See more details in In Linux "top" command what are us, sy, ni, id, wa, hi, si and st (for CPU usage).
After you run the "top" command you can press "1" on your keyboard to see individual CPU utilization; there are more details when you run "man top".
Note that a process such as "mysqld" can use CPU on several cores, so its utilization % can easily go beyond 100% in the "top" display.
Maybe your app is using a single core and the other cores are free. I think your instance has 4 CPU cores and one is at 100% utilization. Can you check the utilization of each core?
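To check per-core utilization from a script rather than interactively, mpstat (from the sysstat package, assuming it is installed) gives the same per-CPU breakdown as pressing "1" in top:
mpstat -P ALL 1 5
One core pinned near 100% while the others stay idle would confirm a single-threaded bottleneck even though the instance-wide meter only reads about 25%.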

Track down high CPU load average

Trying to understand what's going on with my server.
It's a 2 cpu server, so:
$> grep 'model name' /proc/cpuinfo | wc -l
2
While the load average (run queue) is showing ~8:
$> uptime
16:31:30 up 123 days, 9:04, 1 user, load average: 8.37, 8.48, 8.55
So you can assume load is really high and things are piling up; there is sustained load on the system and it's not just a spike.
However, looking at the top CPU consumers:
> ps -eo pcpu,pid,user,args | sort -k 1 -r | head -6
%CPU PID USER COMMAND
8.3 27187 **** server_process_c
1.0 22248 **** server_process_b
0.5 22282 **** server_process_a
0.0 31167 root head -6
0.0 31166 root sort -k 1 -r
0.0 31165 root ps -eo pcpu,pid,user,args
Results of free command:
total used free shared buffers cached
Mem: 7986 7934 52 0 9 2446
-/+ buffers/cache: 5478 2508
Swap: 17407 60 17347
This is the result on an ongoing basis, i.e. not even a single CPU is fully used; the top consumer is always at ~8.5%.
My Question: What are my ways to track down the root of the high load?
Based on your free output, there are times when system memory is exhausted, so swap gets used (see the Swap row: used = 60). Total memory minus used memory comes out to almost zero, which means there are times when all physical RAM is consumed.
For a server, try as much as possible to avoid page faults that swap data between system memory and swap space, because accessing the hard drive is far slower than system RAM.
In your top output, look at the wa column. A higher percentage means the CPU spends more of its time waiting for data I/O from disk rather than doing meaningful computation.
Cpu(s): 87.3%us, 1.2%sy, 0.0%ni, 27.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Try to disable daemons or services that you do not need in order to reduce the memory footprint, and consider adding more RAM to the system.
For a 2-CPU server, the ideal load is below 2.0 (each CPU's load below 1.0). A load of 8.0 means each CPU's load is roughly 4.0, which is not good.
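To check whether the load is coming from I/O wait rather than computation, a quick sketch using standard tools (vmstat ships with procps, iostat with sysstat):
vmstat 1 5      # the 'b' column counts processes blocked in uninterruptible sleep; 'wa' is I/O wait
iostat -x 1 5   # %util near 100% on a device means that disk is saturated
Sustained blocked processes plus a saturated disk would explain a load of ~8 with almost no per-process CPU usage.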
Have you tried the htop command? It shows more information in a helpful way sometimes.

memory usage more than 100%

I am using an ARM processor and a Qt-based GUI application.
There is an issue with a slow process.
Mem: 36272K used, 24692K free, 0K shrd, 188K buff, 19544K cached
CPU: 6.1% usr 1.3% sys 0.0% nic 92.4% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 0.25 0.18 0.07 1/43 553
PID : 512
PPID : 1
USER : root
STAT : S
VSZ : 62368
%MEM : 102.0
CPU : 0
%CPU : 5.5
COMMAND : ./gopaljeearm -qws -nomouse
This is the status when I use the top command.
There is a very nice answer for Android applications which in turn should be applicable to most Linux applications. Quoting:
Note that memory usage on modern operating systems like Linux is an extremely complicated and difficult to understand area. In fact the chances of you actually correctly interpreting whatever numbers you get is extremely low.
You can read the rest of it here.
Another nice read is ELC: How much memory are applications really using? from LWN.
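Note that the %MEM shown by busybox top here appears to be VSZ relative to total RAM (62368K against roughly 60964K total is about 102%), and VSZ counts every mapped page, shared or not, which is why it can exceed 100%. If your kernel exposes /proc/<pid>/smaps, a rough sketch of a more honest per-process figure (using PID 512 from the output above) is:
awk '/^Pss:/ {sum += $2} END {print sum " kB PSS"}' /proc/512/smaps
PSS (proportional set size) splits shared pages among the processes mapping them, which is the kind of accounting the LWN article above discusses, so per-process totals add up to something sensible.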

Understanding Linux top CPU utilisation output [closed]

I'm using a small single-core ARM processor running under Debian and have problems understanding the CPU utilisation output of top, see:
top - 15:31:54 up 30 days, 23:00, 2 users, load average: 0.90, 0.89, 0.87
Tasks: 44 total, 1 running, 43 sleeping, 0 stopped, 0 zombie
Cpu(s): 65.0%us, 20.3%sy, 0.0%ni, 14.5%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 61540k total, 40056k used, 21484k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 22260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26028 root 20 0 2536 1124 912 R 1.9 1.8 0:00.30 top
31231 root 19 -1 45260 964 556 S 1.9 1.6 1206:15 owserver
3 root 15 -5 0 0 0 S 0.3 0.0 0:08.68 ksoftirqd/0
694 root 20 0 28640 840 412 S 0.3 1.4 468:26.74 rsyslogd
The %CPU column is very low across all processes; in this example it is altogether 4.4% (all other processes below were at 0%).
But the overall CPU on line 3 shows 65% us and 20% sy, both very high values - and by the way, this is how the system feels: very slow :-(
The system is almost always in this condition: very low CPU for all processes, but high user+system CPU.
Can anybody explain why there is such a big inconsistency within the top tool output?
And what tool can I use to better find out what causes the high user+system CPU utilization - top seems to be useless here.
Update: meanwhile I've found this thread here, which discusses a similar question, but I can't verify what is written there:
The command uptime shows the average CPU utilization per 1/5/15 minutes
This is close to what the first line of top outputs as sum of %us+%sy. But this is changing much more, maybe it is an average per 10s?
Even when looking at the top output for a longer time, the sum of %us+%sy is always several times higher than the sum of all the %CPU values.
Thanks
Achim
You should read the top manpage to understand its output in more detail. From the manpage:
%CPU -- CPU usage
The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time. The default screen update interval is 3 seconds, which can be changed with top -d ss.tt. To measure cumulative CPU usage, run top -S.
-S : Cumulative time mode toggle
Starts top with the last remembered 'S' state reversed. When 'Cumulative mode' is On, each process is listed with the cpu time that it and its dead children have used.
The CPU states are shown in the Summary Area. They are always shown as a percentage and are for the time between now and the last refresh.
us -- User CPU time
The time the CPU has spent running users' processes that are not niced.
sy -- System CPU time
The time the CPU has spent running the kernel and its processes.
ni -- Nice CPU time
The time the CPU has spent running users' processes that have been niced.
wa -- iowait
Amount of time the CPU has been waiting for I/O to complete.
hi -- Hardware IRQ
The amount of time the CPU has been servicing hardware interrupts.
si -- Software Interrupts
The amount of time the CPU has been servicing software interrupts.
st -- Steal Time
The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).
Under normal circumstances %us+%sy should always be higher, since the summary line counts all CPU time in the interval, including processes that exit between updates and the many small per-process contributions that round down to 0.0 in the task list.
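To see where the 'missing' CPU time goes, it helps to sample both views over the same fixed interval; a sketch using top's batch mode and pidstat (the latter from sysstat, assuming it is installed):
top -b -d 10 -n 2 | head -40    # the second iteration's summary and task list cover the same 10 s window
pidstat 10 3                    # per-process CPU averaged over 10-second intervals
If the per-process numbers still fall well short of the summary %us+%sy, the remainder is typically going to interrupt handling or to processes that start and exit between samples.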

How to measure lock contention?

I'm reading http://lse.sourceforge.net/locking/dcache/dcache_lock.html, in which spinlock time for each functions is measured:
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
5.3% 16.5% 0.6us(2787us) 5.0us(3094us)(0.89%) 15069563 83.5% 16.5% 0% dcache_lock
0.01% 10.9% 0.2us( 7.5us) 5.3us( 116us)(0.00%) 119448 89.1% 10.9% 0% d_alloc+0x128
0.04% 14.2% 0.3us( 42us) 6.3us( 925us)(0.02%) 233290 85.8% 14.2% 0% d_delete+0x10
0.00% 3.5% 0.2us( 3.1us) 5.6us( 41us)(0.00%) 5050 96.5% 3.5% 0% d_delete+0x94
I'd like to know where these statistics come from. I tried oprofile, but it seems oprofile cannot measure the hold and wait times for a specific lock. And valgrind's DRD slows applications down too much, which makes the result less accurate and also takes too much time. mutrace looks good, but as the name suggests, I'm afraid it can only trace mutexes.
So is there any other tool, or how to use the tools I mentioned above, to get lock contention statistics?
Thanks for your reply.
Finally I found the performance measuring tool used in the article; it requires patching the kernel.
The introduction page can be found at http://oss.sgi.com/projects/lockmeter/, and the latest kernel patch corresponds to kernel version 2.6.16, which you can download here.
One way is to just get it running, pause it, and take a random stack snapshot of all the threads. Then do it again, several times. The fraction of stack samples that terminate in locking code is, roughly, the percentage of time you are after. It will also tell you which locations the locking is performed from. If you're after accuracy, take more samples. This works in any language or operating system.
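On Linux, a low-tech way to take such stack snapshots is with gdb (a sketch; <pid> is a placeholder for the process under test):
for i in 1 2 3 4 5; do gdb -p <pid> -batch -ex 'thread apply all bt' > stacks.$i 2>/dev/null; sleep 2; done
grep -l pthread_mutex_lock stacks.*
The fraction of snapshots whose stacks end in locking primitives approximates the contended time, and the stacks show which call sites are responsible. Note this only samples user-space locks; for a kernel spinlock like dcache_lock you still need kernel-side instrumentation such as the lockmeter patch mentioned above.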
