Low CPU, low RAM, low IO, but bad performance, why? - linux

I have an issue with the performance of my Linux CentOS Apache server. I have a program (written in C) that makes many HTTP requests simultaneously. The program itself seems very efficient: I can make 500 simultaneous requests to an external server with no noticeable time difference compared to a single request.
However, I have many scripts on the same server which I run simultaneously using that same program. The number of scripts varies, but it is around 100 for a single search.
The task of each script is to call an API (on an external server), parse the data it needs, and insert that into the database.
I measured the start time of each script and noticed a large delay between script starts: there is up to 10 seconds between the start of the first script and the start of the last one. This large delay makes the search on my website slow.
I used the top command on my Linux CentOS machine; below are 2 samples taken at 2 different moments during a search.
TOP COMMAND SAMPLE 1:
top - 18:51:18 up 36 days, 3:35, 1 user, load average: 0.02, 0.07, 0.08
Tasks: 182 total, 2 running, 180 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.7%us, 1.3%sy, 0.0%ni, 94.9%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 4194304k total, 3941184k used, 253120k free, 26820k buffers
Swap: 4194296k total, 76k used, 4194220k free, 2069456k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
691 apache 15 0 190m 18m 5052 S 4.0 0.5 0:00.19 httpd
959 apache 17 0 189m 15m 3196 R 3.0 0.4 0:00.09 httpd
702 apache 15 0 185m 101m 5036 S 1.7 2.5 0:00.86 httpd
732 apache 15 0 184m 12m 5036 S 1.7 0.3 0:00.15 httpd
689 apache 15 0 184m 14m 5144 S 0.7 0.3 0:00.87 httpd
734 apache 15 0 184m 100m 4740 S 0.7 2.4 0:00.21 httpd
670 apache 15 0 205m 99m 4992 S 0.3 2.4 0:00.39 httpd
678 apache 15 0 184m 13m 5032 S 0.3 0.3 0:01.05 httpd
795 root 15 0 12764 1356 956 R 0.3 0.0 0:00.03 top
949 apache 15 0 181m 9616 2928 S 0.3 0.2 0:00.01 httpd
951 apache 20 0 180m 8748 2640 S 0.3 0.2 0:00.01 httpd
1 root 15 0 10372 792 664 S 0.0 0.0 0:00.20 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.14 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.04 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 events/0
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenbus
37 root 10 -5 0 0 0 S 0.0 0.0 0:00.03 kblockd/0
42 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0
50 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
52 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
137 root 15 0 0 0 0 S 0.0 0.0 0:00.00 khungtas
TOP COMMAND SAMPLE 2:
top - 18:52:49 up 36 days, 3:36, 1 user, load average: 0.53, 0.21, 0.12
Tasks: 240 total, 8 running, 231 sleeping, 0 stopped, 1 zombie
Cpu(s): 50.4%us, 4.8%sy, 0.0%ni, 43.5%id, 0.5%wa, 0.0%hi, 0.5%si, 0.2%st
Mem: 4194304k total, 4097104k used, 97200k free, 27148k buffers
Swap: 4194296k total, 76k used, 4194220k free, 1965428k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
949 apache 16 0 185m 101m 5328 S 32.1 2.5 0:01.19 httpd
1229 apache 16 0 184m 12m 4580 S 32.1 0.3 0:00.98 httpd
968 apache 17 0 188m 17m 4732 S 30.4 0.4 0:01.92 httpd
1244 apache 17 0 184m 12m 4580 S 27.8 0.3 0:00.86 httpd
994 apache 16 0 190m 19m 5060 S 27.5 0.5 0:01.69 httpd
1222 apache 16 0 218m 44m 4676 R 26.5 1.1 0:00.82 httpd
1657 mysql 15 0 627m 223m 5664 S 23.5 5.5 65:16.63 mysqld
1256 apache 16 0 184m 12m 4580 S 21.2 0.3 0:00.81 httpd
1245 apache 16 0 210m 37m 4084 R 14.6 0.9 0:00.47 httpd
1005 apache 16 0 213m 42m 5308 R 13.6 1.0 0:00.67 httpd
1246 apache 17 0 184m 12m 4580 S 11.3 0.3 0:00.74 httpd
1214 apache 16 0 182m 10m 4060 S 3.3 0.3 0:00.23 httpd
1253 apache 16 0 184m 12m 4580 S 2.3 0.3 0:00.67 httpd
1233 apache 15 0 196m 22m 3696 R 2.0 0.6 0:00.17 httpd
1215 apache 15 0 183m 11m 4060 S 1.7 0.3 0:00.18 httpd
1265 apache 15 0 182m 11m 3444 S 1.7 0.3 0:00.05 httpd
1230 apache 16 0 180m 9644 3436 S 1.3 0.2 0:00.04 httpd
1210 apache 15 0 192m 19m 3620 S 1.0 0.5 0:00.14 httpd
1011 apache 15 0 193m 22m 5356 R 0.7 0.5 0:00.86 httpd
1016 apache 15 0 192m 19m 4092 S 0.7 0.5 0:00.20 httpd
1019 apache 15 0 192m 21m 4972 S 0.7 0.5 0:01.27 httpd
1051 root 15 0 12896 1424 956 R 0.7 0.0 0:00.10 top
1221 apache 15 0 180m 9820 3436 S 0.7 0.2 0:00.03 httpd
989 apache 15 0 193m 21m 5332 R 0.3 0.5 0:01.06 httpd
1000 apache 15 0 208m 102m 5424 S 0.3 2.5 0:00.97 httpd
1032 apache 15 0 190m 18m 4748 S 0.3 0.4 0:00.39 httpd
1213 apache 15 0 0 0 0 Z 0.3 0.0 0:00.15 httpd <defunct>
1251 apache 15 0 184m 11m 3700 S 0.3 0.3 0:00.02 httpd
1 root 15 0 10372 792 664 S 0.0 0.0 0:00.20 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.14 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.04 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 events/0
I also had a look at the disk I/O; there are a lot of processes, but altogether there isn't much I/O:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2804 be/3 root 0.00 B/s 0.00 B/s 0.00 % 1.13 % [ib_cm/1]
32387 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.75 % httpd -k start -DSSL
32344 be/4 apache 0.00 B/s 3.77 K/s 0.00 % 0.75 % httpd -k start -DSSL
32465 be/4 apache 0.00 B/s 0.00 B/s -0.75 % 0.75 % httpd -k start -DSSL
32487 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.75 % httpd -k start -DSSL
32377 be/4 apache 0.00 B/s 3.77 K/s 0.00 % 0.38 % httpd -k start -DSSL
32462 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.38 % httpd -k start -DSSL
32469 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.38 % httpd -k start -DSSL
32445 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.38 % httpd -k start -DSSL
32349 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32436 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % python /usr/bin/iotop
32385 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32382 be/4 apache 0.00 B/s 3.77 K/s 0.38 % 0.00 % httpd -k start -DSSL
32446 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32381 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32375 be/4 apache 0.00 B/s 0.00 B/s 0.75 % 0.00 % httpd -k start -DSSL
32312 be/4 apache 0.00 B/s 3.77 K/s 0.00 % 0.00 % httpd -k start -DSSL
32342 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
1407 be/4 dbus 0.00 B/s 0.00 B/s 0.00 % 0.00 % dbus-daemon --system
32455 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32466 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32470 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32488 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
25045 be/4 dovecot 0.00 B/s 0.00 B/s 0.00 % 0.00 % dovecot/pop3-login
4 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
32514 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32581 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32531 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32471 be/4 apache 0.00 B/s 0.00 B/s 0.38 % 0.00 % httpd -k start -DSSL
32546 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32519 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32521 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32315 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32523 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32524 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32335 be/4 apache 0.00 B/s 3.77 K/s 0.00 % 0.00 % httpd -k start -DSSL
32583 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32566 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32526 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32481 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32557 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32529 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32530 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32541 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32507 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32570 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
32504 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 0.00 % httpd -k start -DSSL
I also tested with different settings of my httpd.conf:
TEST HTTPD.CONF 1:
<IfModule mpm_prefork_module>
    StartServers          2
    MinSpareServers       2
    MaxSpareServers       5
    ServerLimit         200
    MaxClients          200
    MaxRequestsPerChild 100
</IfModule>
TEST HTTPD.CONF 2:
<IfModule mpm_prefork_module>
    StartServers          2
    MinSpareServers       2
    MaxSpareServers       5
    ServerLimit           2
    MaxClients            2
    MaxRequestsPerChild   1
</IfModule>
The above doesn't seem to have any impact on the performance.
I also checked that the prefork MPM module is present (it is, see below), but I really don't understand why the settings above have no impact at all on the performance of my server.
-bash-3.2# httpd -M
[Wed Mar 11 18:59:42 2015] [warn] module php5_module is already loaded, skipping
Loaded Modules:
core_module (static)
authn_file_module (static)
authn_default_module (static)
authz_host_module (static)
authz_groupfile_module (static)
authz_user_module (static)
authz_default_module (static)
auth_basic_module (static)
reqtimeout_module (static)
include_module (static)
filter_module (static)
deflate_module (static)
log_config_module (static)
logio_module (static)
env_module (static)
expires_module (static)
headers_module (static)
unique_id_module (static)
setenvif_module (static)
version_module (static)
proxy_module (static)
proxy_connect_module (static)
proxy_ftp_module (static)
proxy_http_module (static)
proxy_scgi_module (static)
proxy_ajp_module (static)
proxy_balancer_module (static)
ssl_module (static)
mpm_prefork_module (static)
http_module (static)
mime_module (static)
dav_module (static)
status_module (static)
autoindex_module (static)
asis_module (static)
suexec_module (static)
cgi_module (static)
dav_fs_module (static)
dav_lock_module (static)
negotiation_module (static)
dir_module (static)
actions_module (static)
userdir_module (static)
alias_module (static)
rewrite_module (static)
so_module (static)
php5_module (shared)
Syntax OK
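One additional check I could do (a simple sketch; ServerLimit/MaxClients cap the number of prefork worker processes) is to count the running httpd processes while a search is in progress:
ps -C httpd --no-headers | wc -l
With MaxClients 2 that count should stay around 2-3, which would show whether the test config is actually being loaded.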
I am really lost in my search for what could be the bottleneck. Important to note is that when I put an exit at the top of all 100 scripts there is no time delay, but when the scripts actually do all their curl requests and data parsing there is a large delay.
All help is very welcome!

If there are about 100 external requests per operation, that will make your script slow. No matter how efficient your server/script is, those 100 requests have to share the network bandwidth, and there are more complex things going on at the network level that make it slower than the ideal theoretical value.
If you have repeated requests for the same URL, you might be able to cache the response for a short span of time, depending on how quickly the data at the source changes.
For example, if out of the 100 requests for an operation you can serve even about 30-40% of the responses from the cache, that will speed up your script considerably.
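As a minimal sketch of what such a cache could look like, here is a shell wrapper around curl that reuses a response if it is newer than a given TTL (the cache directory, the 60-second TTL, and the md5 keying are just assumptions, not anything from your setup):
#!/bin/bash
# cached_get.sh -- fetch a URL, reusing a cached copy if it is fresh enough.
url="$1"
ttl=60                      # how many seconds a cached response stays valid
cache_dir=/tmp/api-cache
mkdir -p "$cache_dir"
key=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
cache_file="$cache_dir/$key"
# Serve from cache if the file exists and is newer than $ttl seconds.
if [ -f "$cache_file" ] && [ $(( $(date +%s) - $(stat -c %Y "$cache_file") )) -lt "$ttl" ]; then
    cat "$cache_file"
else
    curl -s "$url" | tee "$cache_file"
fi
How long a TTL is safe depends entirely on how stale a response your search can tolerate.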

Related

Docker daemon keeps writing disk and cause shell commands to run slowly

My sandbox computer has been very slow recently, and after digging into it I found that the Docker daemon is writing to disk frequently, as shown below:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
25321 be/4 root 0.00 B/s 297.06 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
25344 be/4 root 0.00 B/s 246.24 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
25351 be/4 root 0.00 B/s 148.53 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
25352 be/4 root 0.00 B/s 328.32 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
25514 be/4 root 0.00 B/s 555.03 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
25536 be/4 root 0.00 B/s 343.96 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
I tried stopping Docker and everything went back to normal, but when I restart Docker it goes back into the slow state and shell commands print their output slowly.
Why does the Docker daemon keep writing to disk, and how can I prevent it from doing so?
My docker version is 19.03.
To find out which files these Docker processes are writing to, you could use the strace command:
strace -e trace=file -p <PID>
The process ID (PID) is the same as the "TID" in your iotop output.
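If the daemon hands the writes off to short-lived children or threads, it can also help to follow them and save the trace to a file for later searching (standard strace options; the output path is just an example):
strace -f -e trace=file -p <PID> -o /tmp/dockerd.trace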

How to get second-level output from sar when used with -f option?

The sar man page says that one can specify the resolution in seconds for its output.
However, I am not able to get second-level resolution with the following command:
sar -i 1 -f /var/log/sa/sa18
11:00:01 AM CPU %user %nice %system %iowait %steal %idle
11:10:01 AM all 0.04 0.00 0.04 0.00 0.01 99.91
11:20:01 AM all 0.04 0.00 0.04 0.00 0.00 99.92
11:30:01 AM all 0.04 0.00 0.04 0.00 0.00 99.92
The following command does not give second-level resolution either:
sar -f /var/log/sa/sa18 1
I am able to get second-level results only if I do not specify the -f option:
sar 1 10
08:34:31 PM CPU %user %nice %system %iowait %steal %idle
08:34:32 PM all 0.12 0.00 0.00 0.00 0.00 99.88
08:34:33 PM all 0.00 0.00 0.12 0.00 0.00 99.88
08:34:34 PM all 0.00 0.00 0.12 0.00 0.00 99.88
But I want to see how system performance varied second by second on some past day.
How do I get sar to print second-level output with the -f option?
Linux version: Linux 2.6.32-642.el6.x86_64
sar version : sysstat version 9.0.4
I think the existing sar report file 'sa18' was collected with a 10-minute interval, so you can't get second-level output from it.
Please check the /etc/cron.d/sysstat file:
[root@testserver ~]# cat /etc/cron.d/sysstat
#run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
#generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A
If you want to reduce the sar sampling interval, you can modify this sysstat cron file.
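For example, one way to get one-second samples going forward (a sketch, using the same sa1 path as above; note that the sa files will grow much faster) is to have cron run sa1 every minute and let each run collect 60 one-second samples:
* * * * * root /usr/lib64/sa/sa1 1 60
This only affects data collected from now on; the existing sa18 file keeps its 10-minute resolution.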
The /var/log/sa directory already has all of the information.
The sar command here just serves as a parser and reads the data in the sa file.
So you can use sar -f /var/log/sa/<sa file> to see the default CPU report, and use other flags, like -r, for other reports:
# sar -f /var/log/sa/sa02
12:00:01 CPU %user %nice %system %iowait %steal %idle
12:10:01 all 14.70 0.00 5.57 0.69 0.01 79.03
12:20:01 all 23.53 0.00 6.08 0.55 0.01 69.83
# sar -r -f /var/log/sa/sa02
12:00:01 kbmemfree kbavail kbmemused kbactive kbinact kbdirty
12:10:01 2109732 5113616 30142444 25408240 2600
12:20:01 1950480 5008332 30301696 25580696 2260
12:30:01 2278632 5324260 29973544 25214788 4112
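If you only care about part of a past day, the -s and -e options restrict the time window read from the file, e.g. (times are just an example):
# sar -f /var/log/sa/sa02 -s 12:00:00 -e 13:00:00
The resolution is still whatever interval the file was collected with.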

Total CPU usage - multicore system

I am using Xen, and with xentop I get the total CPU usage as a percentage:
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
VM1 -----r 25724 299.4 3025244 12.0 20975616 83.4 12 1 14970253 27308358 1 3 146585 92257 10835706 9976308 0
As you can see above, the CPU usage is 299%, but how can I get the total CPU usage of a VM?
top doesn't show me the total usage.
You usually see 100% CPU per core, so I guess there are at least 3 cores/CPUs.
Try this to count the cores:
grep processor /proc/cpuinfo | wc -l
The 299% already is the total CPU usage, summed across all cores.
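If you want that as a fraction of the whole machine instead, you could divide by the core count, e.g. (a rough sketch, plugging in the 299 figure from the xentop output above):
echo "scale=1; 299 / $(grep -c processor /proc/cpuinfo)" | bc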
sar and mpstat are often used to display the CPU usage of a system. Check that the sysstat package is installed and display total CPU usage with:
$ mpstat 1 1
Linux 2.6.32-5-amd64 (debian) 05/01/2016 _x86_64_ (8 CPU)
07:48:51 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
07:48:52 PM all 0.12 0.00 0.50 0.00 0.00 0.00 0.00 0.00 99.38
Average: all 0.12 0.00 0.50 0.00 0.00 0.00 0.00 0.00 99.38
If you agree that CPU utilisation is (100 - %IDLE):
$ mpstat 1 1 | awk '/^Average/ {print 100-$NF,"%"}'
0.52 %

Linux "top" command - want to aggregate resource usage to the process group or user name, especially for postgres

An important topic in software development / programming is to assess the size of the product, and to match the application footprint to the system where it is running. One may need to optimize the product, and/or one may need to add more memory, use a faster processor, etc. In the case of virtual machines, it is important to make sure the application will work effectively by perhaps making the VM memory size larger, or allowing a product to get more resources from the hypervisor when needed and available.
The linux top(1) command is great, with its ability to sort by different fields, add optional fields, highlight sort criteria on-screen, and switch sort field with < and >. On most systems though, there are very many processes running, making "at-a-glance" examination a little difficult. Consider:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ PPID SWAP nFLT COMMAND
2181 root 20 0 7565m 3.2g 7028 S 2.7 58.3 86:41.17 1 317m 10k java
1751 root 20 0 137m 2492 1056 S 0.0 0.0 0:02.57 1 5104 76 munin-node
11598 postgres 20 0 146m 23m 11m S 0.0 0.4 7:51.63 2143 3600 28 postmaster
1470 root 20 0 243m 1792 820 S 0.0 0.0 0:01.89 1 2396 23 rsyslogd
3107 postgres 20 0 146m 26m 11m S 0.0 0.5 7:40.61 2143 936 58 postmaster
3168 postgres 20 0 132m 14m 11m S 0.0 0.2 8:27.27 2143 904 53 postmaster
3057 postgres 20 0 138m 19m 11m S 0.0 0.3 6:55.63 2143 856 36 postmaster
3128 root 20 0 85376 900 896 S 0.0 0.0 0:00.11 1636 852 2 sshd
1728 root 20 0 80860 1080 952 S 0.0 0.0 0:00.61 1 776 0 master
3130 manager 20 0 85532 844 672 S 0.0 0.0 0:01.03 3128 712 36 sshd
436 root 16 -4 11052 264 260 S 0.0 0.0 0:00.01 1 688 0 udevd
2211 root 18 -2 11048 220 216 S 0.0 0.0 0:00.00 436 684 0 udevd
2212 root 18 -2 11048 220 216 S 0.0 0.0 0:00.00 436 684 0 udevd
1636 root 20 0 66176 524 436 S 0.0 0.0 0:00.12 1 620 25 sshd
1486 root 20 0 229m 2000 1648 S 0.0 0.0 0:00.79 1485 596 116 sssd_be
2306 postgres 20 0 131m 11m 9m S 0.0 0.2 0:01.21 2143 572 64 postmaster
3055 postgres 20 0 135m 16m 11m S 0.0 0.3 10:18.88 2143 560 36 postmaster
...etc... This is for about 20 processes, but there are well over 100 processes.
In this example I was sorting by SWAP field.
I would like to be able to aggregate related processes based on the "process group" of which they are a part, or based on the USER running the process, or based on the COMMAND being run. Essentially I want to:
Aggregate by PPID, or
Aggregate by USER, or
Aggregate by COMMAND, or
Turn off aggregation
This would allow me to see more quickly what is going on. The expectation is that all the postgres processes would show up together, as a single line, with the process group leader (2143, not captured in the snippet) displaying aggregated metrics. Generally the aggregation would be a sum (VIRT, RES, SHR, %CPU, %MEM, TIME+, SWAP, nFLT), but sometimes not (as for PR and NI, which might be shown as just --).
For processes whose PPID is 1, it would be nice to have an option of toggling between aggregating them all together, or of leaving them listed individually.
Aggregation by the name of the process (java vs. munin-node vs. postmaster vs. chrome) would also be a nice option. The COMMAND arguments would not be used when aggregating by command name.
This would be very valuable when tuning an application. How can I do this, aggregating top data for at-a-glance viewing in larger scale systems? Has anyone written an app, perhaps that uses top in batch mode, to create a summary view like I'm discussing?
FYI, I'm specifically interested in something for CentOS, but this would be helpful on any OS variant.
Thanks!
...Alan
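To illustrate the kind of summary I am after, here is a rough ps/awk one-liner that sums RSS and %CPU per user (only a sketch; it has none of top's interactivity and the chosen fields are just an example):
ps -eo user:20,rss,pcpu --no-headers | awk '{rss[$1]+=$2; cpu[$1]+=$3} END {for (u in rss) printf "%-20s %10d kB %6.1f %%CPU\n", u, rss[u], cpu[u]}'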

GNU parallel load balancing

I am trying to find a way to execute CPU-intensive parallel jobs over a cluster. My objective is to schedule one job per core, so that every job hopefully gets 100% CPU utilization once scheduled. This is what I have come up with so far:
FILE build_sshlogin.sh
#!/bin/bash
serverprefix="compute-0-"
lastserver=15
function worker {
  server="$serverprefix$1";
  # Ask the remote host how many cores it has and how busy it was over a
  # 2-second window, then print the (rounded) number of cores still free.
  free=$(ssh $server /bin/bash << 'EOF'
cores=$(grep "cpu MHz" /proc/cpuinfo | wc -l)
stat=$(head -n 1 /proc/stat)
work1=$(echo $stat | awk '{print $2+$3+$4;}')
total1=$(echo $stat | awk '{print $2+$3+$4+$5+$6+$7+$8;}')
sleep 2;
stat=$(head -n 1 /proc/stat)
work2=$(echo $stat | awk '{print $2+$3+$4;}')
total2=$(echo $stat | awk '{print $2+$3+$4+$5+$6+$7+$8;}')
util=$(echo " ( $work2 - $work1 ) / ($total2 - $total1) " | bc -l );
echo " $cores * (1 - $util) " | bc -l | xargs printf "%1.0f"
EOF
)
  # Only offer this host if it has at least one free core, in the
  # "slots/host" format expected by GNU parallel's --sshloginfile.
  if [ $free -gt 0 ]
  then
    echo $free/$server
  fi
}
export serverprefix
export -f worker
seq 0 $lastserver | parallel -k worker {}
This script is used by GNU parallel as follows:
parallel --sshloginfile <(./build_sshlogin.sh) --workdir $PWD command args {1} ::: $(seq $runs)
The problem with this technique is that if someone starts another CPU-intensive job on a server in the cluster without checking the CPU usage, the script will end up scheduling jobs onto cores that are already in use. In addition, if the CPU usage has changed by the time the first jobs finish, the newly freed cores will not be included for scheduling by GNU parallel for the remaining jobs.
So my question is the following: Is there a way to make GNU parallel re-calculate the free cores/server before it schedules each job? Any other suggestions for solving the problem are welcome.
NOTE: In my cluster all cores have the same frequency. If someone can generalize to account for different frequencies, that's also welcome.
Look at --load which is meant for exactly this situation.
Unfortunately it does not look at CPU utilization but load average. But if your cluster nodes do not have heavy disk I/O then CPU utilization will be very close to load average.
Since load average changes slowly you probably also need to use the new --delay option to give the load average time to rise.
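A sketch of how that could fit your invocation (here nodes.txt would just be a static list of the compute-0-* hosts, and the --load and --delay values are only examples):
parallel --sshloginfile nodes.txt --load 100% --delay 0.25 --workdir $PWD command args {1} ::: $(seq $runs)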
Try mpstat
mpstat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
10:25:32 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
10:25:32 PM all 5.68 0.00 0.49 2.03 0.01 0.02 0.00 91.77 146.55
And this gives a snapshot on a per-core basis:
$ mpstat -P ALL
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011 _x86_64_ (4 CPU)
10:28:04 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
10:28:04 PM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
10:28:04 PM 0 0.01 0.00 0.01 0.01 0.00 0.00 0.00 0.00 99.98
10:28:04 PM 1 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.98
10:28:04 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:28:04 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
There are a lot of options; these two give a simple actual %idle per CPU. Check the man page.
