I have Debian jessie installed.
kernel version:
Linux srv1 3.16-3-amd64 #1 SMP Debian 3.16.5-1 (2014-10-10) x86_64 GNU/Linux
It has 32 GB of memory installed, which seems more than enough for my task.
A heavily utilized asterisk process leaks a lot of memory and causes a lot of trouble.
Asterisk itself and bash report "unable to allocate memory" from time to time.
At the same time, according to the attached top report, the server has 7 GB of unused memory.
It would be great if someone could help figure out what is wrong:
- what kind of resource was exhausted
- what needs to be tuned for 100% utilization of server resources.
Top:
Tasks: 130 total, 1 running, 129 sleeping, 0 stopped, 0 zombie
%Cpu0 : 6,0 us, 1,3 sy, 0,0 ni, 21,5 id, 70,8 wa, 0,0 hi, 0,3 si, 0,0 st
%Cpu1 : 70,2 us, 0,3 sy, 0,0 ni, 24,8 id, 4,6 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 3,0 us, 0,7 sy, 0,0 ni, 84,6 id, 11,7 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 6,9 us, 0,7 sy, 0,0 ni, 78,2 id, 14,2 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 3,3 us, 0,7 sy, 0,0 ni, 84,3 id, 11,7 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 4,0 us, 0,7 sy, 0,0 ni, 90,1 id, 5,3 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 32985292 total, 25834636 used, 7150656 free, 38312 buffers
KiB Swap: 58592252 total, 1767420 used, 56824832 free. 37988 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7615 asterisk 20 0 3147628 2,813g 1820 S 69,8 8,9 5:35.84 php
2389 asterisk 20 0 20,150g 1,207g 2176 S 28,5 3,8 247:42.19 asterisk
976 mysql 20 0 1411844 19392 2624 S 1,3 0,1 15:13.28 mysqld
21651 root 20 0 24876 2824 2316 R 0,7 0,0 0:02.08 top
...
Your server is using its memory solely for application data. In your top excerpt, the buffers and cached values are very low. Since asterisk is probably not very disk intensive, that seems fine. But your swap is also being used, which contradicts your assumption that 32 GB of memory is enough.
It would be a good idea to install the sysstat package to monitor what is really going on on your system. Top shows only the current memory and process information. sysstat, with its included sar command, records system information every few minutes so that you can retrieve it later for analysis.
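The arithmetic behind "memory used solely for application data" can be sketched roughly as follows. This is only an illustration using the numbers from the top excerpt in the question; the function name is hypothetical, and it assumes that in this older top layout the "used" figure includes buffers and cached memory.

```python
KIB_PER_GIB = 1024 * 1024


def memory_breakdown(used_kib, buffers_kib, cached_kib):
    """Split top's 'used' figure into application memory and page cache.

    Assumption: 'used' includes buffers and cached memory, as in the
    older top layout shown in the question.
    """
    cache_kib = buffers_kib + cached_kib
    app_kib = used_kib - cache_kib
    return {
        "app_kib": app_kib,
        "cache_kib": cache_kib,
        "app_gib": app_kib / KIB_PER_GIB,
    }


# Values copied from the "KiB Mem" / "KiB Swap" lines in the question.
breakdown = memory_breakdown(used_kib=25834636, buffers_kib=38312, cached_kib=37988)
print(f"application memory: {breakdown['app_gib']:.2f} GiB")
print(f"buffers + cached:   {breakdown['cache_kib']} KiB")
```

Almost none of the used memory is cache, so little of it can be reclaimed cheaply when a process asks for more.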
I am checking the impact of Linux's sched_rt_runtime_us.
My understanding of Linux RT scheduling is that sched_rt_period_us defines the scheduling period for RT processes, and sched_rt_runtime_us defines how long RT processes can run within that period.
On my Linux-4.18.20, kernel.sched_rt_period_us = 1000000 and kernel.sched_rt_runtime_us = 950000, so in each second up to 95% of the time can be used by RT processes and 5% is left for SCHED_OTHER processes.
When I change kernel.sched_rt_runtime_us, the CPU usage of the RT process shown in top should be proportional to sched_rt_runtime_us/sched_rt_period_us.
But my testing does NOT give the expected results. The %CPU values I got are as follows:
kernel.sched_rt_runtime_us = 50000
2564 root rt 0 4516 748 684 R 19.9 0.0 0:37.82 testsched_top
kernel.sched_rt_runtime_us = 100000
2564 root rt 0 4516 748 684 R 40.5 0.0 0:23.16 testsched_top
kernel.sched_rt_runtime_us = 150000
2564 root rt 0 4516 748 684 R 60.1 0.0 0:53.29 testsched_top
kernel.sched_rt_runtime_us = 200000
2564 root rt 0 4516 748 684 R 80.1 0.0 1:24.96 testsched_top
testsched_top is a SCHED_FIFO process with priority 99, and it is running on an isolated CPU.
The cgroup controllers are disabled in grub.cfg with cgroup_disable=cpuset,cpu,cpuacct to turn off the CPU-related ones.
I don't know why this happens. Is there anything missing or wrong in my testing and in my understanding of Linux SCHED_FIFO scheduling?
N.B.: I am running this in an Ubuntu VM configured with 8 vCPUs, of which 4-7 are isolated to run RT processes. The host is an Intel x86_64 with 6 cores (12 threads), and there are NO other VMs running on the host. The testsched_top program above was copied from https://viviendolared.blogspot.com/2017/03/death-by-real-time-scheduling.html?m=0; it sets priority 99 for SCHED_FIFO and loops indefinitely on one isolated CPU. I checked the usage of that isolated CPU and got the results above.
I think I got the answer, and thank Rachid for the question.
In short, kernel.sched_rt_period_us is the sum of the RT time slices of a group of CPUs.
For example, in my 8-vCPU VM configuration, CPU4-7 are isolated for running specific processes. So kernel.sched_rt_period_us is divided evenly among these 4 isolated CPUs, which means kernel.sched_rt_period_us/4 = 250000 is the 100% CPU quota for each CPU in the isolated group. Setting kernel.sched_rt_runtime_us to 250000 lets the SCHED_FIFO process take all of the CPU. Accordingly, 25000 means 10% CPU usage for that CPU, 50000 means 20%, etc.
This was validated with only CPU6 and CPU7 isolated: in that case, 500000 lets the SCHED_FIFO process use 100% of the CPU, and 250000 gives 50% CPU usage.
These two kernel parameters are global, so if the SCHED_FIFO process is placed on CPU0-5, 1000000/6 ≈ 166667 should be the 100% quota for each CPU, and about 83000 gives 50% CPU usage. I validated this as well.
Here is a snapshot of top:
%Cpu4 : 49.7 us, 0.0 sy, 0.0 ni, 50.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16422956 total, 14630144 free, 964880 used, 827932 buff/cache
KiB Swap: 1557568 total, 1557568 free, 0 used. 15245156 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3748 root rt 0 4516 764 700 R 49.5 0.0 30:21.03 testsched_top
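The per-CPU quota arithmetic described in this answer can be sketched as below. This is a minimal illustration of the answer's model only (the period divided evenly across the CPUs in a group), not a description of the kernel's actual accounting; the function name and its parameters are hypothetical.

```python
def expected_rt_cpu_percent(runtime_us, period_us, cpus_in_group):
    """Expected %CPU of one busy SCHED_FIFO task under the answer's model.

    Model (an assumption taken from the answer): each CPU in the group
    treats period_us / cpus_in_group as its 100% quota.
    """
    if period_us <= 0 or cpus_in_group <= 0:
        raise ValueError("period_us and cpus_in_group must be positive")
    percent = 100 * runtime_us * cpus_in_group / period_us
    return min(percent, 100.0)  # a single task cannot exceed one full CPU


# Measurements from the question: (sched_rt_runtime_us, observed %CPU),
# taken with 4 isolated CPUs and sched_rt_period_us = 1000000.
observed = [(50000, 19.9), (100000, 40.5), (150000, 60.1), (200000, 80.1)]

for runtime_us, seen in observed:
    predicted = expected_rt_cpu_percent(runtime_us, 1000000, 4)
    print(f"runtime={runtime_us:>6} us  predicted={predicted:5.1f}%  observed={seen:5.1f}%")
```

Under this model the predictions for the four measurements come out close to the observed values, which is what the answer relies on.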
On Linux, I'm writing a script to log system parameters to a file.
How can I get the name of the task consuming the most CPU resources, and the percentage of CPU used by that task?
For example, using top:
$ top -bin 1
top - 19:11:05 up 2:57, 1 user, load average: 1,43, 1,47, 1,06
Tasks: 178 total, 2 running, 124 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5,8 us, 1,3 sy, 0,0 ni, 92,8 id, 0,0 wa, 0,0 hi, 0,1 si, 0,0 st
KiB Mem : 3892704 total, 1594348 free, 1282992 used, 1015364 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 2335136 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11883 root 20 0 645964 104036 87792 R 93,8 2,7 18:07.03 Xorg
12030 raf 20 0 412824 35632 14860 S 12,5 0,9 2:44.51 xfsettingsd
23468 raf 20 0 39648 3864 3332 R 6,2 0,1 0:00.02 top
From the example above, what I would like to have is a bash command (or a sequence of piped commands) that outputs:
93.8 Xorg
You can try
ps -eo %cpu,comm --sort %cpu | tail -n 1
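Note that ps reports %cpu as CPU time divided by the time the process has been running, so it may not match the instantaneous figure from top. If the value from top itself is wanted, one possible approach is to parse the batch output. The sketch below is only an illustration: the function name is hypothetical, the sample text is copied from the question, and it assumes the column layout shown there (including the comma as decimal separator).

```python
def top_cpu_task(top_output):
    """Return (cpu_percent, command) for the busiest task in `top -b -n 1` text."""
    lines = top_output.splitlines()
    # Locate the process table header, i.e. the line whose first word is PID.
    header_idx = next(i for i, line in enumerate(lines) if line.split()[:1] == ["PID"])
    columns = lines[header_idx].split()
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")

    best = None
    for line in lines[header_idx + 1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # blank or truncated line
        cpu = float(fields[cpu_col].replace(",", "."))  # tolerate "93,8"
        if best is None or cpu > best[0]:
            best = (cpu, fields[cmd_col].strip())
    return best


SAMPLE = """\
%Cpu(s): 5,8 us, 1,3 sy, 0,0 ni, 92,8 id, 0,0 wa, 0,0 hi, 0,1 si, 0,0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11883 root 20 0 645964 104036 87792 R 93,8 2,7 18:07.03 Xorg
12030 raf 20 0 412824 35632 14860 S 12,5 0,9 2:44.51 xfsettingsd
23468 raf 20 0 39648 3864 3332 R 6,2 0,1 0:00.02 top
"""

cpu, command = top_cpu_task(SAMPLE)
print(f"{cpu} {command}")
```

In a real script the text would come from running top in batch mode rather than from a hard-coded sample.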
I run top with the following command:
top -bn1p 20101
and I get the following result:
top - 11:38:34 up 248 days, 1:17, 3 users, load average: 0,09, 0,16, 0,18
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1,8 us, 0,9 sy, 0,0 ni, 97,2 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 24693872 total, 24430392 used, 263480 free, 142532 buffers
KiB Swap: 15625212 total, 17508 used, 15607704 free. 12526360 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20101 root 20 0 11,636g 262944 18260 S 0,0 1,1 88:13.84 java
As you can see, the %CPU value is 0.0. Why? It should be 0.3.
Is this a bug?
I have a question about Apache Spark. I set up an Apache Spark standalone cluster on my Ubuntu desktop. Then I wrote two lines in the spark-env.sh file: SPARK_WORKER_INSTANCES=4 and SPARK_WORKER_CORES=1. (I found that export is not necessary in the spark-env.sh file if I start the cluster after editing it.)
I wanted to have 4 worker instances on my single desktop and let them occupy 1 CPU core each. The result was this:
top - 14:37:54 up 2:35, 3 users, load average: 1.30, 3.60, 4.84
Tasks: 255 total, 1 running, 254 sleeping, 0 stopped, 0 zombie
%Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.7 us, 0.3 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 41.6 us, 0.0 sy, 0.0 ni, 58.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 59.0 us, 0.0 sy, 0.0 ni, 41.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16369608 total, 11026436 used, 5343172 free, 62356 buffers
KiB Swap: 16713724 total, 360 used, 16713364 free. 2228576 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10829 aaaaaa 20 0 42.624g 1.010g 142408 S 101.2 6.5 0:22.78 java
10861 aaaaaa 20 0 42.563g 1.044g 142340 S 101.2 6.7 0:22.75 java
10831 aaaaaa 20 0 42.704g 1.262g 142344 S 100.8 8.1 0:24.86 java
10857 aaaaaa 20 0 42.833g 1.315g 142456 S 100.5 8.4 0:26.48 java
1978 aaaaaa 20 0 1462096 186480 102652 S 1.0 1.1 0:34.82 compiz
10720 aaaaaa 20 0 7159748 1.579g 32008 S 1.0 10.1 0:16.62 java
1246 root 20 0 326624 101148 65244 S 0.7 0.6 0:50.37 Xorg
1720 aaaaaa 20 0 497916 28968 20624 S 0.3 0.2 0:02.83 unity-panel-ser
2238 aaaaaa 20 0 654868 30920 23052 S 0.3 0.2 0:06.31 gnome-terminal
I think the java processes in the first 4 lines are the Spark workers. If that's correct, it's nice that there are four Spark workers and that each of them is using about 1 physical core (e.g., 101.2%).
But I see that 5 physical cores are in use. Among them, CPU0, CPU3 and CPU7 are fully used. I think one Spark worker is using each of those physical cores, which is fine.
However, the usage levels of CPU2 and CPU6 are 41.6% and 59.0%, respectively. They add up to 100.6%, so I think one worker's job is distributed across those 2 physical cores.
With SPARK_WORKER_INSTANCES=4 and SPARK_WORKER_CORES=1, is this a normal situation? Or is it a sign of some error or problem?
This is perfectly normal behavior. Whenever Spark uses the term core, it actually means either a process or a thread, and neither one is bound to a single core or processor.
In any multitasking environment, processes are not executed continuously. Instead, the operating system constantly switches between different processes, with each one getting only a small share of the available processor time.
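That a process is not tied to one core by default can be checked from inside the process itself. The sketch below is only an illustration and is not related to Spark's own code: the helper name is hypothetical, and it uses os.sched_getaffinity, which is available on Linux, falling back to the CPU count elsewhere.

```python
import os


def allowed_cpus():
    """Return the sorted list of CPU indices this process may be scheduled on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the scheduler may run the process on any CPU in this set.
        return sorted(os.sched_getaffinity(0))
    # Fallback (assumption): no affinity API, so assume all CPUs are allowed.
    return list(range(os.cpu_count() or 1))


cpus = allowed_cpus()
print(f"this process may run on {len(cpus)} CPU(s): {cpus}")
```

Unless an affinity mask has been set explicitly (for example with taskset), the list normally contains every online CPU, so the scheduler is free to move a worker's threads between cores, as seen with CPU2 and CPU6 in the question.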
I'm running a CentOS 7.2 VM on Azure and get a "CPU stuck" kernel bug warning. top shows that CPU#0 is 100% in use.
[admin@bench2 ~]$
Message from syslogd@bench2 at Feb 9 10:06:43 ...
kernel:BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u128:1:13777]
This is the top output:
Tasks: 258 total, 7 running, 251 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 28813448 total, 26938144 free, 653860 used, 1221444 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 27557900 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
73 root 20 0 0 0 0 S 0.7 0.0 1:03.03 rcu_sched
1 root 20 0 43668 6204 3796 S 0.0 0.0 0:04.70 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 R 0.0 0.0 0:00.10 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
CentOS and kernel version:
CentOS Linux release 7.1.1503 (Core)
Linux bench2 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
I noticed that this error also appears on CentOS 7.2 versions.
[84176.089080] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u128:1:13777]
[84176.089080] Modules linked in: vfat fat isofs xfs libcrc32c iptable_filter ip_tables udf crc_itu_t hyperv_fb hyperv_keyboard hv_utils i2c_piix4 i2c_core serio_raw pcspkr crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_common hv_netvsc hv_storvsc hid_hyperv sr_mod cdrom ata_generic pata_acpi ata_piix libata floppy hv_vmbus
[84176.089080] CPU: 0 PID: 13777 Comm: kworker/u128:1 Tainted: G W -------------- 3.10.0-229.7.2.el7.x86_64 #1
[84176.089080] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
If this version causes problems on Azure, I have no problem switching to another one. In that case, I would like to know which CentOS version is best suited to run in an Azure environment.
I solved the problem by setting the Host caching on VHD to None. Odd behaviour but it works.
I had the same issue. It's a disk performance issue (high IOPS, latency, etc.), not related to CPU or RAM (at least in my case).
The storage (NetApp) was heavily loaded. I solved it by moving to SSD; even a large RAID group of HDDs (without any special load) didn't help.
We used a K8S setup, but I have seen it on a lot of CentOS machines running simple applications as well.