CPU Utilization high for sleeping processes - linux

I have a process that appears to be deadlocked:
# strace -p 5075
Process 5075 attached - interrupt to quit
futex(0x419cf9d0, FUTEX_WAIT, 5095, NULL
It is sitting on the "futex" system call, and seems to be indefinitely waiting on a lock. The process is shown to be consuming a large amount of CPU when "top" is run:
# top -b -n 1
top - 23:13:18 up 113 days, 4:19, 1 user, load average: 1.69, 1.74, 1.72
Tasks: 269 total, 1 running, 268 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.1%us, 0.1%sy, 0.0%ni, 91.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12165696k total, 3810476k used, 8355220k free, 29440k buffers
Swap: 8388600k total, 43312k used, 8345288k free, 879988k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5075 omdb 18 0 2373m 1.7g 26m S 199.7 14.9 102804:11 java
The process is also shown to be in a "S" - Sleep state, which makes sense if it's waiting on some resource. However, I don't understand why CPU utilization would be close to 200% if the process is in the sleep state. Why does top report such high CPU utilization on a sleeping process? Shouldn't its CPU utilization be zero?

There is no correlation between CPU usage as reported by top and process state. The man page says (emphasis mine):
%CPU -- CPU usage
The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time.
So, your process indeed used a huge amount of processor time since the last screen update. It is sleeping, yes, but that's because the currently running process is top itself (which makes sense, since it's currently updating the screen).
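The "share of the elapsed CPU time" in that definition can be reproduced by hand. The following is a simplified sketch of the idea, assuming you sample the process's accumulated CPU ticks twice (utime + stime, fields 14 and 15 of /proc/PID/stat) and that the tick rate is what `getconf CLK_TCK` reports (commonly 100). The function name cpu_percent and the sample numbers are made up for this illustration.

```shell
# cpu_percent TICKS_BEFORE TICKS_AFTER INTERVAL_SECONDS [TICKS_PER_SECOND]
# Prints the CPU time used between two samples as a percentage of one CPU.
cpu_percent() {
    awk -v before="$1" -v after="$2" -v secs="$3" -v hz="${4:-100}" \
        'BEGIN { printf "%.1f\n", 100 * (after - before) / hz / secs }'
}

# Hypothetical numbers: 600 ticks consumed over a 3-second refresh at 100 ticks/s.
cpu_percent 1000 1600 3 100
```

With these numbers the result is 200.0, i.e. two CPUs' worth of time was consumed during the interval, whatever state the process happens to be in at the instant it is sampled.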

Does your application fork child processes? The strace output may indicate that the main process is just waiting for child processes to finish their work. If so, you could try running
strace -f -p 5075
to trace the child processes as well.
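For a multi-threaded program such as a JVM, the thread being traced may be blocked while other threads are running. As far as I know, strace -f on an existing PID also attaches to that process's threads. A quick way to see per-thread states without strace is to read /proc/PID/task/*/stat. The helper below is only a sketch: stat_state is a hypothetical name, and it assumes the Linux stat layout in which the state letter follows the parenthesised command name.

```shell
# stat_state: read /proc-style stat lines on stdin, print "TID STATE" per line.
# The command name is wrapped in parentheses and may itself contain spaces,
# so everything up to the last ") " is stripped before reading the state.
stat_state() {
    while IFS= read -r line; do
        tid=${line%% *}
        rest=${line##*') '}
        printf '%s %s\n' "$tid" "${rest%% *}"
    done
}

# Usage against a live process (5075 is the PID from the question):
#   cat /proc/5075/task/*/stat | stat_state
```

If some threads show R while the main thread shows S, the CPU time is being consumed by those other threads.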

The top output is perfectly normal.
The load average counts tasks that are runnable as well as tasks in uninterruptible sleep (typically waiting on disk I/O), not only tasks that are actually executing on a CPU. Test it by running something like:
dd if=/dev/sda of=/dev/null
and watching the top output to see what happens. The load average should rise by about 1.
If you look at this line:
Cpu(s): 8.1%us, 0.1%sy, 0.0%ni, 91.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
the "id" in "91.8%id" means "idle". So the CPU isn't actually doing much at all.

Let me add my two cents.
top shows the state of the process at one particular moment. That does NOT mean the process spent the whole preceding interval in that state; assuming so is completely wrong.
The process could have switched between the R and S states a million times between the previous top refresh and the current one, so if it switches rapidly between R and S you can easily catch it in the S state.
However, it uses CPU time between those switches.
So please note the difference between CPU usage (which describes a period of time) and state (which describes a particular moment in time).
Let me give a clear example.
Someone has stolen 3 apples from your pocket during the last 10 minutes.
However, right now that person is not stealing apples from your pocket.
stolen apples = cpu_usage,
the fact that the person is not stealing apples right now = state of the process
So it is completely wrong to take one characteristic and try to predict the other from it.
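To see the difference between a point-in-time state and usage over a period, one option is to sample the state repeatedly and tally what was observed. This is a minimal sketch; tally_states is a hypothetical helper, and the sampling loop in the comment assumes the PID from the question.

```shell
# tally_states: read one state string per line (as printed by `ps -o stat=`)
# and count how often each leading state letter was observed.
tally_states() {
    awk '{ c[substr($1, 1, 1)]++ } END { for (s in c) print s, c[s] }' | sort
}

# Usage (assumption: 5075 is the PID of interest; 20 samples, one per second):
#   for i in $(seq 20); do ps -o stat= -p 5075; sleep 1; done | tally_states
```

A process that is mostly caught in S can still have accumulated a lot of CPU time between the samples.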

Related

Track down high CPU load average

Trying to understand what's going on with my server.
It's a 2 cpu server, so:
$> grep 'model name' /proc/cpuinfo | wc -l
2
Meanwhile, the load average shows a queue of about 8:
$> uptime
16:31:30 up 123 days, 9:04, 1 user, load average: 8.37, 8.48, 8.55
So you can assume the load is really high and things are piling up; there is sustained load on the system and it's not just a spike.
However, looking at the top CPU consumers:
> ps -eo pcpu,pid,user,args | sort -k 1 -r | head -6
%CPU PID USER COMMAND
8.3 27187 **** server_process_c
1.0 22248 **** server_process_b
0.5 22282 **** server_process_a
0.0 31167 root head -6
0.0 31166 root sort -k 1 -r
0.0 31165 root ps -eo pcpu,pid,user,args
Results of free command:
             total       used       free     shared    buffers     cached
Mem:          7986       7934         52          0          9       2446
-/+ buffers/cache:       5478       2508
Swap:        17407         60      17347
This is the result on an ongoing basis, i.e. not even a single CPU is fully used; the top consumer is always at ~8.5%.
My question: how can I track down the root cause of the high load?
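Based on your free output, the system has used swap at some point (see the used column of the Swap row: 60), which suggests there were times when physical memory was under pressure. Note, though, that used - (buffers + cached) is 5478, not close to zero: 2455 of the used figure is buffers and cache that the kernel can reclaim, so the low free value (52) alone does not mean that all RAM is consumed by applications.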
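The -/+ buffers/cache row of this older free format can be recomputed from the Mem row, which helps when judging how much memory applications actually hold. A sketch using the numbers from the question (app_memory is an illustrative name, and buffers and page cache are assumed to be reclaimable):

```shell
# app_memory USED BUFFERS CACHED FREE
# Prints memory held by applications and memory effectively available,
# treating buffers and page cache as reclaimable.
app_memory() {
    echo "used_by_apps=$(( $1 - $2 - $3 )) available=$(( $4 + $2 + $3 ))"
}

app_memory 7934 9 2446 52
```

This gives 5479 and 2507, which is within 1 of the 5478 and 2508 that free printed; the small difference is presumably rounding of the individual figures.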
On a server, try to avoid page faults that cause data to be swapped between system memory and swap space as much as possible, because accessing a hard drive is far slower than accessing RAM.
In your top output, look at the wa column. A higher percentage means the CPU spends more time waiting for disk I/O than doing meaningful computation.
Cpu(s): 87.3%us, 1.2%sy, 0.0%ni, 27.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
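If you want to watch the wa value over time rather than read it off the screen, it can be pulled out of the summary line. The helper below is a sketch with a made-up name (iowait_pct), and it assumes the older "0.0%wa" layout shown above.

```shell
# iowait_pct: read a top summary line on stdin and print the %wa value.
# Assumes the older "0.0%wa" layout; newer top versions print "0.0 wa"
# instead and would need a different pattern.
iowait_pct() {
    sed -n 's/.*[ ,]\([0-9.][0-9.]*\)%wa.*/\1/p'
}

echo 'Cpu(s): 8.1%us, 0.1%sy, 0.0%ni, 91.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st' | iowait_pct

# Live usage: top -b -n 1 | iowait_pct
```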
Try to stop daemons or services that you do not need in order to reduce the memory footprint, and consider adding more RAM to the system.
For a 2-CPU server, the ideal load is less than 2.0 (a load of less than 1.0 per CPU). A load of 8.0 means roughly 4.0 per CPU, which is not very good.
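The per-CPU figure is simply the load average divided by the number of CPUs. A minimal sketch (load_per_cpu is a hypothetical helper; the live usage in the comment assumes Linux's /proc/loadavg and /proc/cpuinfo):

```shell
# load_per_cpu LOAD NCPU: print the load average divided by the CPU count.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# Live usage on Linux (first field of /proc/loadavg is the 1-minute average):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"
load_per_cpu 8 2
```

For a load of 8 on 2 CPUs this prints 4.00.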
Have you tried the htop command? It shows more information in a helpful way sometimes.

MATLAB CPU usage out of control, even with -singleCompThread

I've got a user who is asking why his MATLAB processes are reading as utilizing 800% CPU usage in top. He has four such MATLAB processes. Here's some specs regarding the server he's on:
# physical processors: 4
abc#server1[~]$ grep "physical id" /proc/cpuinfo | sort -u | wc -l
4
# cores per processor: 8
abc#server1[~]$ grep "cpu cores" /proc/cpuinfo | sort -u | cut -d ":" -f2
8
# logical cores: 32
abc#server1[~]$ grep -c "processor" /proc/cpuinfo
32
4 processes using 800% each = 3200%. 8 cores x 4 CPUs = 32 cores = 3200%. Coincidence? Somehow I doubt it, but I've really got nothing else to contribute to the idea pile, considering these are running with -singleCompThread enabled. Could his code be inefficient and causing poor performance in some way that is out of our control?
What can I look for/do to help diagnose why his CPU usage is through the roof?
Just for completeness' sake, here's what top looks like (abbreviated to just show his tasks):
Tasks: 768 total, 3 running, 763 sleeping, 2 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 99.9%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132141096k total, 52020588k used, 80120508k free, 3343272k buffers
Swap: 16383992k total, 0k used, 16383992k free, 38806216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16179 user 30 10 3732m 220m 78m S 804.3 0.2 1006:24 MATLAB
16346 user 30 10 3729m 221m 78m S 799.2 0.2 977:04.20 MATLAB
16491 user 30 10 4167m 225m 78m S 788.9 0.2 958:12.45 MATLAB
16623 user 30 10 3473m 227m 78m S 785.1 0.2 960:48.42 MATLAB
Edit: just to clarify, although it says "MATLAB" is his command in top, htop reveals the full command as including -singleCompThread.
Verify that the user is not running multi-threaded MEX functions. The -singleCompThread switch does not control external functions, just built-in MATLAB functions.
There would need to be code changes to the MEX functions to accept an input argument indicating the maximum number of threads. This should be no big deal. I do this in my threaded MEX functions. I'd be surprised if the author did not create some mechanism for specifying the number of threads.
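One way to check whether a single MATLAB process really runs many busy threads is to look at it per thread, for example with top -H -p PID, or by counting threads above some %CPU threshold. The sketch below takes per-thread %CPU values on stdin; busy_threads is a made-up name and the threshold of 50 is arbitrary. Note that ps reports an average over each thread's lifetime, so top -H gives a more current picture.

```shell
# busy_threads THRESHOLD: read per-thread %CPU values (one per line) on stdin
# and print how many threads exceed THRESHOLD percent.
busy_threads() {
    awk -v limit="$1" '$1 + 0 > limit + 0 { n++ } END { print n + 0 }'
}

# Usage (16179 is one of the MATLAB PIDs from the question):
#   ps -L -o pcpu= -p 16179 | busy_threads 50
```

A count of around 8 would be consistent with the roughly 800% shown per process.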
Is your user using functionality (such as matlabpool or parfor) from Parallel Computing Toolbox? These will start up multiple MATLAB Worker processes - typically, and by default, one per processor or per core - each of which is run with -singleCompThread enabled.
This is done to explicitly parallelize a computationally intensive operation across those MATLAB Workers. Those will quite possibly max out the cores they're running on (that's the point of them).
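To compare the total against the machine's capacity (32 logical cores, i.e. 3200%), the per-process values can simply be added up. A small sketch, with sum_pcpu as an illustrative name and the input taken from the top output in the question:

```shell
# sum_pcpu: add up %CPU values read from stdin, one per line.
sum_pcpu() {
    awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# The four MATLAB rows from the top output in the question:
printf '804.3\n799.2\n788.9\n785.1\n' | sum_pcpu

# Live usage: ps -C MATLAB -o pcpu= | sum_pcpu
```

The four rows add up to 3177.5, so nearly all 32 logical cores are busy, which matches the 99.9%ni in the summary line.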

Understanding Linux top CPU utilisation output [closed]

I'm using a small single-core ARM processor running Debian and have problems understanding the CPU utilisation output of top, see:
top - 15:31:54 up 30 days, 23:00, 2 users, load average: 0.90, 0.89, 0.87
Tasks: 44 total, 1 running, 43 sleeping, 0 stopped, 0 zombie
Cpu(s): 65.0%us, 20.3%sy, 0.0%ni, 14.5%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 61540k total, 40056k used, 21484k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 22260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26028 root 20 0 2536 1124 912 R 1.9 1.8 0:00.30 top
31231 root 19 -1 45260 964 556 S 1.9 1.6 1206:15 owserver
3 root 15 -5 0 0 0 S 0.3 0.0 0:08.68 ksoftirqd/0
694 root 20 0 28640 840 412 S 0.3 1.4 468:26.74 rsyslogd
The %CPU column is very low for all processes; in this example it adds up to 4.4% (all other processes below were at 0%).
But the overall CPU on line 3 shows 65%us and 20%sy, so both values are very high. And by the way, this is how the system feels: very slow :-(
The system is almost always in this condition: very low CPU for all processes, but high user+system CPU.
Can anybody explain why there is such a large inconsistency within the top output?
And what tool can I use to better find out what causes the high user+system CPU utilization? top seems to be useless here.
update: meanwhile I've found this thread here, which discusses a similar question, but I can't verify what is written there:
The command uptime shows the average CPU utilization per 1/5/15 minutes
This is close to what the first line of top outputs as sum of %us+%sy. But this is changing much more, maybe it is an average per 10s?
Even when watching the top output for a longer time, the sum of %us+%sy is always several times higher than the sum of all %CPU values
Thanks
Achim
You should read the manpage of top to understand its output better. From the manpage:
%CPU -- CPU usage
The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time. The default screen update time is 3 seconds, which can be changed with top -d ss.tt. To measure cumulative CPU usage, run top -S.
-S : Cumulative time mode toggle
Starts top with the last remembered 'S' state reversed. When 'Cumulative mode' is On, each process is listed with the cpu time that it and its dead children have used.
The CPU states are shown in the Summary Area. They are always shown as a percentage and are for the time between now and the last refresh.
us -- User CPU time
The time the CPU has spent running users' processes that are not niced.
sy -- System CPU time
The time the CPU has spent running the kernel and its processes.
ni -- Nice CPU time
The time the CPU has spent running users' processes that have been niced.
wa -- iowait
Amount of time the CPU has been waiting for I/O to complete.
hi -- Hardware IRQ
The amount of time the CPU has been servicing hardware interrupts.
si -- Software Interrupts
The amount of time the CPU has been servicing software interrupts.
st -- Steal Time
The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).
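Under normal circumstances, the summary %us+%sy should be at least as high as the sum of the per-task %CPU values, because the summary also covers CPU time used by tasks that started and exited between two refreshes.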
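The summary percentages can be cross-checked independently of top by sampling the aggregate cpu line of /proc/stat twice and looking at the difference. This is a sketch under the assumption of the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); cpu_busy_pct is a hypothetical name and the two sample lines are invented.

```shell
# cpu_busy_pct LINE1 LINE2
# LINE1 and LINE2 are two samples of the aggregate "cpu ..." line of /proc/stat.
# Prints the non-idle share of CPU time between the samples, in percent.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) a[i] = $i; n = NF }
        NR == 2 {
            total = 0
            # Fields 2..9 are user..steal; guest time is already part of user.
            for (i = 2; i <= n && i <= 9; i++) total += $i - a[i]
            idle = ($5 - a[5]) + ($6 - a[6])    # idle + iowait
            if (total > 0) printf "%.1f\n", 100 * (total - idle) / total
        }'
}

# Invented samples: 65 user, 20 system and 15 idle ticks elapsed between them.
cpu_busy_pct 'cpu 100 0 50 850 0 0 0 0' 'cpu 165 0 70 865 0 0 0 0'

# Live usage:
#   a=$(grep '^cpu ' /proc/stat); sleep 3; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

For the invented samples this prints 85.0. If the live value stays high while no listed task shows significant %CPU, short-lived processes are a likely suspect.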

uptime VS. top CPU usage : What should I believe, why this difference?

I'm having some performance issue on my embedded device:
# uptime
14:59:39 up 5:37, load average: 1.60, 1.50, 1.53
Very bad for a single-core system ... :-p! However, if I check with the top utility, I always have an idle time around 80%!
Mem: 49020K used, 75960K free, 0K shrd, 0K buff, 21476K cached
CPU: 12.5% usr 4.8% sys 0.0% nic 81.7% idle 0.0% io 0.9% irq 0.0% sirq
Load average: 1.30 1.42 1.51 1/80 18696
After reading some articles, I'm inclined to believe the uptime command. But why this difference? Is my CPU really idle?
Load is not just a measure of how many processes are in the R state (runnable, could use CPU time), but also of processes in the D state (uninterruptible sleep, usually waiting for I/O). You likely have a process in the D state which is contributing to load but not using CPU. This command shows all the current processes which are contributing to load:
ps aux | awk '$8~/[RD]/'
Have a look at that output and see whether you have commands in the D state (shown in the 8th column).
You'd do well to learn what 'load average' stands for.
In short, it is the number of processes waiting for some resource, and that resource may be the CPU, a disk, a serial port, ...
The load average seems a little high, which could mean that the CPU is busy with things like I/O (disk/network) or thread management (you may have too many threads running).
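To get a single number to compare against the load average, the R and D entries can be counted instead of listed. A minimal sketch (load_contributors is a made-up name; it counts processes as ps shows them, whereas the kernel counts individual threads, so the figures need not match exactly):

```shell
# load_contributors: read state strings (as printed by `ps -eo stat=`) on
# stdin and count tasks in R (runnable) or D (uninterruptible sleep).
load_contributors() {
    awk '$1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Usage: ps -eo stat= | load_contributors
```

Running it a few times in a row shows whether the D-state tasks are persistent or only appear briefly.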

Swap related memory problem in Ubuntu

When I run a top and get it to show swap usage, I get the following output. However, I have disabled swap with swapoff -a prior to starting firefox. What is shown in the SWAP field here then?
When I do cat /proc/meminfo, I get a nonzero value for a field named SwapCached. What is this? My guess is that this is the aggregate of all the SWAP values shown in top. How are these related to the total memory used by a process?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
1604 dumrat 20 0 287m 62m 26m R 1 3.1 0:05.03 225m firefox-bin
1415 dumrat 9 -11 94264 4668 3552 S 0 0.2 0:00.10 87m pulseaudio
My best guess is this.
When you say swapoff, it prevents tasks from further 'swapping' (technically, it's paging, not swapping), but does not remove already swapped pages from swap devices. Often various shared libraries go to swap right at the moment of loading: they are here to stay for a long time, so there is no point wasting time swapping them when the load is high. These libraries are in RAM as long as they are needed by active processes, but also in swap space.
Maybe Firefox uses some of these libraries that are already mapped to swap space (Xlib, GTK, etc.), and this swap space is counted toward its 'SWAP' column. Linux tends to count all shared pages to each process that shares them, RAM or not.
Again, this is my guess; take with a grain of salt.
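One observation that can be checked from the numbers in the question: both SWAP values equal VIRT minus RES. Older procps versions of top are reported to derive the SWAP column this way rather than from real swap usage, but treat that as an assumption to verify against the top version in use. The arithmetic, with virt_minus_res as a name made up for this sketch:

```shell
# virt_minus_res VIRT RES: print the difference, in whatever unit was passed in.
virt_minus_res() {
    echo $(( $1 - $2 ))
}

virt_minus_res 287 62                            # firefox-bin, values in MB
echo $(( $(virt_minus_res 94264 4668) / 1024 ))  # pulseaudio, KB converted to MB
```

This yields 225 and 87, matching the 225m and 87m shown in the SWAP column.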
