Linux (Ubuntu) load average higher than total true utilization?

I have a Dell pd2950 (2x4-core) server running Ubuntu Server 12.04 LTS, with a VLC encoder instance running on it. Recently I updated the script (VLM) for VLC to increase quality, which also increases CPU utilization. So I started tuning the script to avoid exceeding the maximum utilization, using top to monitor CPU usage. I found that the load average is higher than 100% (I have 8 cores in total, so 8.00 is 100%), but there is still 20-35% idle, like:
top - 21:41:19 up 2 days, 17:15, 1 user, load average: 9.20, 9.65, 8.80
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.8%us, 0.7%sy, 29.7%ni, 36.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1982680k total, 1735672k used, 247008k free, 126284k buffers
Swap: 0k total, 0k used, 0k free, 774228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9715 wilson RT 0 2572m 649m 13m S 499 33.5 13914:44 vlc
11663 wilson 20 0 17344 1328 964 R 2 0.1 0:02.00 top
1 root 20 0 24332 2264 1332 S 0 0.1 0:01.06 init
2 root 20 0 0 0 0 S 0 0.0 0:00.09 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:27.05 ksoftirqd/0
4 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/0:0
5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
To confirm that my CPUs don't have Hyper-Threading, I tried:
wilson#server:/$ nproc
8
And to reduce the sampling deviation caused by the refresh interval, I also tried:
wilson#server:/$ top -d 0.1
I watched the %id number for a long time; it never went lower than 14.
I also tried:
wilson#server:/$ uptime
21:57:20 up 2 days, 17:31, 1 user, load average: 9.03, 9.12, 9.35
The 1-minute load average often reaches 14-15. So I'm wondering: what's wrong with my system? Has anyone ever had this problem?
More information:
I'm using VLC with the x264 codec to encode a live HTTP stream (application/octet-stream). It uses ffmpeg (libavc) to decode and outputs Apple HLS (.ts segments). I found this problem after I added these arguments for x264:
level=41,ref=5,b-adapt=2,direct=auto,me=umh,subq=8,rc-lookahead=60,analyse=all
This is almost equal to preset=slower. And as you can see, my VLC is running with real-time priority. The scheduling parameter is:
wilson#server:/$ chrt -p -f 99 vlc-wrapper

There does not appear to be anything wrong with your system. What is wrong seems to be your understanding of CPU accounting. In particular, load average has nearly nothing at all to do with CPU usage. Load average is based on the number of processes that are ready to run (not waiting on I/O, network, keyboard input, etc.), that is, processes that would use a CPU if one were available for them to be scheduled on. While it's true that, on an 8-core system where all 8 cores are 100% busy with a single CPU-bound thread each, your load average should be around 8.00, it is entirely possible to have a load average of 200.0 with near-0% CPU utilization. All that would indicate is that you have 200 processes ready to run, but as soon as they get scheduled, they do almost nothing before they go back to waiting for input of some sort.
Your top output shows that vlc seems to be using roughly the equivalent of 5 of your cores, but it doesn't indicate whether you have 5 cores at 100% each or all 8 cores at 62.5% each. All of the other processes listed by top also contribute to your load average as well as to CPU usage. In particular, top running with a short delay, like your example of 0.1 seconds, will probably add almost 1 to your load average by itself, even though overall it's not using a lot of CPU time.
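If you want to sanity-check the load average against actual CPU usage yourself, something like the following should work (mpstat comes from the sysstat package, so it may need installing; the 5-second interval is arbitrary):
cat /proc/loadavg    # the first three fields are the 1/5/15-minute load averages
nproc                # number of online CPUs
mpstat -P ALL 5 1    # per-core utilization averaged over 5 seconds
# Load average counts runnable tasks; "load ~= nproc" only means the CPUs are
# saturated if all of those tasks are actually CPU-bound.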

Read this:
Understanding load average vs. cpu usage
If the load average is at 7, with 4 hyper-threaded processors, shouldn't that mean that the CPU is working at about 7/8 capacity?
No, it just means that you have 7 runnable processes in the job queue on average.
But I think we can't use load average as a reference number to determine whether the system is overloaded or not. So I wonder whether there is a kernel-level CPU utilization statistics tool (kernel-level so as to reduce the performance cost of measuring)?
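For what it's worth, the counters that top, mpstat, etc. are built on already come from the kernel via /proc/stat, so you can compute overall utilization with nothing but the shell. A rough sketch (assumes a reasonably recent kernel that reports all ten fields on the "cpu" line; the 1-second interval is arbitrary):
#!/bin/bash
# Take two samples of the aggregate "cpu" line in /proc/stat and compute
# overall utilization from the deltas of the jiffy counters.
read -r _ u1 n1 s1 i1 w1 irq1 sirq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 irq2 sirq2 st2 _ < /proc/stat
busy=$(( (u2+n2+s2+irq2+sirq2+st2) - (u1+n1+s1+irq1+sirq1+st1) ))
idle=$(( (i2+w2) - (i1+w1) ))
echo "overall CPU utilization: $(( 100 * busy / (busy + idle) ))%"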

Related

Calculate CPU usage through Linux commands

I would like to know the CPU usage, and I tried to get it through the top command.
But it seems the overall CPU usage at the top shows "19%", while in the process list one process shows 100% CPU.
So please let me know how to get the exact value for CPU usage.
top - 05:14:39 up 34 days, 14:57, 1 user, load average: 0.20, 0.31, 0.30
Tasks: 231 total, 2 running, 184 sleeping, 1 stopped, 1 zombie
%Cpu(s): 19.0 us, 2.3 sy, 0.0 ni, 78.4 id, 0.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 16123248 total, 3329216 free, 7078736 used, 5715296 buff/cache
KiB Swap: 1048572 total, 743164 free, 305408 used. 9380980 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27928 root 20 0 415656 10196 5212 R 100.0 0.1 0:00.17 runc:[2:INIT]
27933 karthik+ 20 0 33992 3496 2956 R 6.2 0.0 0:00.01 top
Thanks in Advance
top's summary line only shows total CPU usage, which is different from the per-process detail you see below it; try other commands to show more detailed information.
On Linux: try the mpstat -P ALL 1 command, which shows CPU load per core.
On Mac: try installing htop.
The CPU usage shown by top and ps is not exactly the same statistic.
The top command by default shows the summary of CPU usage across all cores. If you press '1' in top, you will see the usage for each core.
The output of ps shows the total CPU time divided by the process run time. Having 100 may mean that your process kept a single core 100% busy during its run time. On a 4-core CPU, top will then calculate around 25% total CPU usage.
The 'man' page for both commands may provide some additional details as well.
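To see where the 19% vs 100% discrepancy in the question comes from, a quick check along these lines may help (mpstat is in the sysstat package):
nproc               # number of cores, e.g. 8
mpstat -P ALL 1 1   # per-core utilization for one 1-second interval
# A process keeping one core 100% busy shows 100% in its own %CPU column,
# but contributes only ~100/8 = 12.5% to the all-core %Cpu(s) summary line.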

Optimal number of threads for GNU parallel

I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it!
I am using a loop which iterates over my read files and generates the desired output. The command that is executed for each read looks something like this:
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq
As you can see, I specified 8 threads, which is the number of threads my virtual machine has.
My question is the following:
If I use GNU parallel with a command like this:
cat reads | parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?
Or do I need 24 threads (3*8 threads) to properly execute this command?
I'm sorry if this is a basic question; I am very new to the field and any help is much appreciated!
The best advice is simply: Try different values and measure.
In parallelization there are sooo many factors that can affect the results: Disk I/O, shared CPU cache, and shared RAM bandwidth just to name three.
top is your friend when measuring. If you can manage to get all CPUs to have <5% idle you are unlikely to go any faster - no matter what you do.
top - 14:49:10 up 10 days, 5:48, 123 users, load average: 2.40, 1.72, 1.67
Tasks: 751 total, 3 running, 616 sleeping, 8 stopped, 4 zombie
%Cpu(s): 17.3 us, 6.2 sy, 0.0 ni, 76.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 31.239 total, 1.441 free, 21.717 used, 8.081 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.088 used. 4.706 avail Mem
This machine is 76.2% idle. If your processes use loads of CPU then starting more processes in parallel here may help. If they use loads of disk I/O it may or may not help. Only way to know is to test and measure.
top - 14:51:00 up 10 days, 5:50, 124 users, load average: 3.41, 2.04, 1.78
Tasks: 759 total, 8 running, 619 sleeping, 8 stopped, 4 zombie
%Cpu(s): 92.8 us, 6.9 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
GiB Mem : 31.239 total, 1.383 free, 21.772 used, 8.083 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.087 used. 4.649 avail Mem
This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
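For example, a sketch of how you might try a few -j values and compare them with --joblog (the log file names are placeholders, and the STAR command is taken from the question):
for j in 1 2 3 4; do
  cat reads | parallel -j $j --joblog star_j${j}.log \
    STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
done
# The JobRuntime column in each log gives per-job wall-clock time; compare the
# totals (and watch %id in top while the jobs run) to pick the best -j.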
And yes: GNU Parallel is likely to speed up bioinformatics (it was written by a fellow bioinformatician).
Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html download: https://doi.org/10.5281/zenodo.1146014) Read at least chapter 1+2. It should take you less than 20 minutes. Your command line will love you for it.

Understanding CPU usage from the top command on a multi-core processor

I am currently using the top command to fetch the CPU and memory usage of a process. My query here is about understanding the value it displays.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6742 aaaa 20 0 843596 1.0g 238841 S 4.0 1.7 0:49.66 java
14355 aaaa 20 0 658704 749560 234112 S 3.3 1.2 15:45.75 java
2779 aaaa 20 0 688868 846620 160844 S 3.0 1.4 54:30.61 java
2337 aaaa 20 0 701200 1.0g 231923 S 2.3 1.7 13:18.34 java
Let's say I'm monitoring the CPU of process ID 6742; it sometimes shows 4%, sometimes 8%, 6%, and sometimes it shoots up to 200% and comes back.
When I check the number of cores the system has, it says 8.
nproc -> 8
So should I take the CPU value shown by top as-is, or should I calculate it based on the number of cores, i.e. since it has 8 cores, the process is using 200% out of a possible 800%?
How should I calculate this in this scenario?
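Not an authoritative answer, but the usual reading is: top's per-process %CPU column is scaled so that one fully busy core equals 100%, so on an 8-core machine a single process can legitimately show up to 800%. A quick way to express it as a share of the whole machine (PID 6742 is taken from the question):
top -b -n 2 -d 1 -p 6742 | tail -1   # the second sample gives a meaningful %CPU
nproc                                # 8
# whole-machine share = %CPU / (nproc * 100), e.g. 200 / 800 = 25% of the box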

Making sense of the GHC profiler

I'm trying to make sense of the GHC profiler. There is a rather simple app, which uses the wreq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (the time tool, +RTS -p -RTS, and +RTS -p -h) I acquired entirely different numbers for my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on and how much memory the app actually uses.
This situation reminds me of the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you please suggest how I should read all those numbers, and what the meaning of each of them is?
Here are the numbers:
time -l reports around 19M
#/usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using the +RTS -p -RTS flag reports around 92M. Although it says "total alloc", it seems strange to me that a simple app like this one could allocate and release 91M.
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks @ 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main.g Main 60.9 88.8
MAIN MAIN 24.6 2.5
decodeLenient/look Data.ByteString.Base64.Internal 5.8 2.6
decodeLenientWithTable/fill Data.ByteString.Base64.Internal 2.9 0.1
decodeLenientWithTable.\.\.fill Data.ByteString.Base64.Internal 1.4 0.0
decodeLenientWithTable.\.\.fill.\ Data.ByteString.Base64.Internal 1.4 0.1
decodeLenientWithTable.\.\.fill.\.\.\.\ Data.ByteString.Base64.Internal 1.4 3.3
decodeLenient Data.ByteString.Base64.Lazy 1.4 1.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 443 0 24.6 2.5 100.0 100.0
main Main 887 0 0.0 0.0 75.4 97.4
main.g Main 889 0 60.9 88.8 75.4 97.4
object_ Data.Aeson.Parser.Internal 925 0 0.0 0.0 0.0 0.2
jstring_ Data.Aeson.Parser.Internal 927 50 0.0 0.2 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 923 600 0.0 0.3 0.0 0.3
decodeLenient Data.ByteString.Base64.Lazy 891 0 1.4 1.4 14.5 8.1
decodeLenient Data.ByteString.Base64 897 500 0.0 0.0 13.0 6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8Mb on the graph.
And, just in case, here is the app:
module Main where
import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad
main :: IO ()
main = replicateM_ 10 g
where
g = do
r <- get "http://httpbin.org/get"
print $ r ^. responseBody
. key "headers"
. key "User-Agent"
. _String
UPDATE 1: Thanks everyone for the incredibly good responses. As was suggested, I'm adding the +RTS -s output, so that the entire picture builds up for everyone who reads it.
#./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 194 colls, 0 par 0.018s 0.022s 0.0001s 0.0022s
Gen 1 16 colls, 0 par 0.027s 0.031s 0.0019s 0.0042s
UPDATE 2: The size of the executable:
#du -h simple-wreq
63M simple-wreq
A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do the two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time at a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You compare different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19MB. However, the total amount of allocated memory is a lot more, since that's how GHC works: it "allocates" memory for objects that are garbage collected, which is almost everything that's not unpacked.
Let us inspect a C example for this:
#include <stdlib.h>

int main() {
    int i;
    char *mem;
    for (i = 0; i < 5; ++i) {
        mem = malloc(19 * 1000 * 1000);  /* allocate ~19 MB ...     */
        free(mem);                       /* ... and free it at once */
    }
    return 0;
}
Whenever we use malloc, we will allocate 19 megabytes of memory. However, we free the memory immediately after. The highest amount of memory we ever have at one point is therefore 19 megabytes (and a little bit more for the stack and the program itself).
However, in total, we allocate 5 * 19M, 95M total. Still, we could run our little program with just 20 megs of RAM fine. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time is always at least du <executable>, since that has to reside in memory too.
That being said, the easiest way to generate statistics is +RTS -s, which will show the maximum residency from the Haskell program's point of view. In your case, it will be the 1.9M, the number in your heap profile (or double that amount due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.
time -l is displaying the (resident, i.e. not swapped out) size of the process as seen by the operating system (obviously). This includes twice the maximum size of the Haskell heap (due to the way GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself plus the libraries it depends on, etc. I'm guessing that in this case the primary contributor to the 19M is the size of your executable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap and allocation rates of around 1GB/s are typical for a Haskell program.
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).
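For reference, a sketch of how the numbers discussed above are typically produced when building directly with ghc (if you use cabal or stack the build step differs, but the RTS flags are the same):
ghc -O2 -prof -fprof-auto -rtsopts simple-wreq.hs   # build with profiling support
./simple-wreq +RTS -s -RTS        # GC and maximum-residency summary
./simple-wreq +RTS -p -RTS        # writes simple-wreq.prof (time/alloc report)
./simple-wreq +RTS -h -p -RTS     # writes simple-wreq.hp (heap profile)
hp2ps -c simple-wreq.hp           # renders the heap graph to simple-wreq.ps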

How to check usage of different parts of memory?

I have a computer with 2 Intel Xeon CPUs and 48 GB of RAM. The RAM is divided between the CPUs: two parts of 24 GB + 24 GB. How can I check how much of each specific part is used?
So, I need something like htop, which shows how fully each core is used (see this example), but for memory rather than for cores. Or something that would specify which parts (addresses) of memory are used and which are not.
The information is in /proc/zoneinfo, which contains very similar information to /proc/vmstat except broken down by "Node" (NUMA ID). I don't have a NUMA system here to test it for you and provide sample output for a multi-node config; it looks like this on a one-node machine:
Node 0, zone DMA
pages free 2122
min 16
low 20
high 24
scanned 0
spanned 4096
present 3963
[ ... followed by /proc/vmstat-like nr_* values ]
Node 0, zone Normal
pages free 17899
min 932
low 1165
high 1398
scanned 0
spanned 223230
present 221486
nr_free_pages 17899
nr_inactive_anon 3028
nr_active_anon 0
nr_inactive_file 48744
nr_active_file 118142
nr_unevictable 0
nr_mlock 0
nr_anon_pages 2956
nr_mapped 96
nr_file_pages 166957
[ ... more of those ... ]
Node 0, zone HighMem
pages free 5177
min 128
low 435
high 743
scanned 0
spanned 294547
present 292245
[ ... ]
I.e. a small statistic on the usage/availability totals, followed by the nr_* values also found at the system-global level in /proc/vmstat (which then allow a further breakdown of what exactly the memory is used for).
If you have more than one memory node, aka NUMA, you'll see these zones for all nodes.
edit
I'm not aware of a frontend for this (i.e. a per-NUMA-node vmstat, in the way that htop is a fancier top), but please comment if anyone knows of one!
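In the meantime, a quick way to eyeball per-node and per-zone free memory straight from /proc/zoneinfo (the counts are in pages, so multiply by the page size, usually 4 KiB, to get bytes):
grep -E '^Node|nr_free_pages' /proc/zoneinfo   # free pages per node and zone
getconf PAGESIZE                               # bytes per page, typically 4096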
The numactl --hardware command will give you a short answer like this:
node 0 cpus: 0 1 2 3 4 5
node 0 size: 49140 MB
node 0 free: 25293 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 49152 MB
node 1 free: 20758 MB
node distances:
node 0 1
0: 10 21
1: 21 10
