LAMP on CentOS 6 sporadic timeouts - linux

We have a few servers that were recently moved to a new provider (a well-known German one).
The configurations are identical: i7-2600 CPU, 16 GB RAM, 1 Gbit network cards (connected to the router at 100 Mbit).
The OS is CentOS 6 and the application stack is LAMP (Apache 2.2.15, PHP 5.3.8 (APC 3.1.9), MySQL 5.5.18), with Memcached daemons running on each machine.
The PHP pages are called by a proxy component written in Java (100-300 times/sec, depending on the number of users).
There is no swapping on the servers and no warnings in /var/log/messages. Load average is about 0.5-1.0 on the application servers
and 2.0-3.0 on the MySQL server. There are no bottlenecks in the application (we gather metrics, and the time needed to render
a response is always around 0.015 seconds).
The problem is the following: sporadically, we see timeouts in the proxy component, occurring in a row for 2-3 seconds.
The timeouts are most often 3000 ms, sometimes 9000 ms and rarely 21000 ms (is this somehow connected to SYN packets?).
This happens even when the proxy component is placed on the same machine as the PHP application (Apache+PHP).
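The 3000 / 9000 / 21000 ms values are consistent with lost TCP SYN packets being retransmitted with exponential backoff. As a rough sketch, assuming an initial retransmission timeout of 3 seconds that doubles on every retry (the value long recommended by RFC 1122 and common on older kernels; the function name `syn_delay_ms` is made up for this illustration), the cumulative connect delay works out like this:

```shell
# Cumulative delay (ms) before a connection attempt gets through when the
# first LOST SYN packets are dropped.
# Assumption: initial RTO of 3000 ms (overridable as $2), doubled per retry.
syn_delay_ms() {
    lost=$1
    rto=${2:-3000}
    total=0
    i=0
    while [ "$i" -lt "$lost" ]; do
        total=$((total + rto))   # wait one full timeout for this lost SYN
        rto=$((rto * 2))         # exponential backoff for the next retry
        i=$((i + 1))
    done
    echo "$total"
}

syn_delay_ms 1   # prints 3000
syn_delay_ms 2   # prints 9000
syn_delay_ms 3   # prints 21000
```

If the numbers line up like this, the listen-queue overflow and dropped-SYN counters in `netstat -s` may be a reasonable next place to look, though that is a guess rather than a diagnosis.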
We also noticed that:
Threads on MySQL have the 'Reading from net' status during these timeouts.
During timeouts the Apache "status" page fills quickly (within 1-3 seconds) with 'W' processes (all processes end up in 'W', some in 'C' status).
Timeouts mostly appear when traffic is increasing (evening), and the problem disappears when traffic starts going down (evening -> night).
During timeouts the load average increases to 5.0-20.0.
Things I tried that did not help:
Tuning sysctl net variables (somaxconn, buffers)
Turning off the firewall
Turning off APC (disabled its usage in code)
Switching to persistent connections (in PHP) and from MySQL to MySQLi
I have just found that iperf shows a drop in bandwidth during the timeouts:
------------------------------------------------------------
Client connecting to localhost, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 54006 connected with 127.0.0.1 port 5001
[ ID] Interval Transfer Bandwidth
...
[ 3] 266.0-266.5 sec 24.0 MBytes 402 Mbits/sec
[ 3] 266.5-267.0 sec 24.4 MBytes 410 Mbits/sec
[ 3] 267.0-267.5 sec 24.0 MBytes 402 Mbits/sec
[ 3] 267.5-268.0 sec 24.4 MBytes 410 Mbits/sec
[ 3] 268.0-268.5 sec 24.0 MBytes 402 Mbits/sec
[ 3] 268.5-269.0 sec 18.6 MBytes 312 Mbits/sec
[ 3] 269.0-269.5 sec 2.42 MBytes 40.6 Mbits/sec
[ 3] 269.5-270.0 sec 7.87 MBytes 132 Mbits/sec
[ 3] 270.0-270.5 sec 2.30 MBytes 38.5 Mbits/sec
[ 3] 270.5-271.0 sec 2.84 MBytes 47.7 Mbits/sec
[ 3] 271.0-271.5 sec 5.59 MBytes 93.8 Mbits/sec
[ 3] 271.5-272.0 sec 3.42 MBytes 57.4 Mbits/sec
[ 3] 272.0-272.5 sec 2.83 MBytes 47.5 Mbits/sec
[ 3] 272.5-273.0 sec 13.5 MBytes 227 Mbits/sec
[ 3] 273.0-273.5 sec 24.2 MBytes 407 Mbits/sec
[ 3] 273.5-274.0 sec 24.1 MBytes 404 Mbits/sec
[ 3] 274.0-274.5 sec 24.3 MBytes 408 Mbits/sec
...
Note that only the iperf client was launched, with the command "iperf -c localhost -i0.5 -b5000000000 -t3000".
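To line such dips up with the proxy timeouts, it can help to pull out only the intervals where bandwidth falls below a threshold. A minimal sketch, assuming the report format shown above (bandwidth in Mbits/sec as the second-to-last field); `iperf_dips` is a hypothetical helper name and the 200 Mbit threshold is arbitrary:

```shell
# Print "<interval> <Mbits/sec>" for each iperf report line whose bandwidth
# is below the threshold given as $1. Reads iperf output on stdin.
iperf_dips() {
    awk -v min="$1" '/Mbits\/sec/ && $(NF-1) + 0 < min { print $(NF-5), $(NF-1) }'
}

# Sample lines copied from the output above.
iperf_dips 200 <<'EOF'
[ 3] 268.0-268.5 sec 24.0 MBytes 402 Mbits/sec
[ 3] 268.5-269.0 sec 18.6 MBytes 312 Mbits/sec
[ 3] 269.0-269.5 sec 2.42 MBytes 40.6 Mbits/sec
[ 3] 269.5-270.0 sec 7.87 MBytes 132 Mbits/sec
EOF
```

For the sample lines this prints the 269.0-269.5 and 269.5-270.0 intervals. On a live run the iperf client output would be piped into the function instead of the here-document.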
What is the issue that leads to such timeouts? Is this CentOS-related?
Thanks,
Arsen

Related

Host memory displayed by CDH is inconsistent with the value reported by the top command

When I was about to free up memory on a Linux host, I used the top command to check memory usage and found that the result was inconsistent with the host memory displayed by CDH.
I don't know why, or how CDH obtains the host's memory usage.
The CDH version is 6.3.2 (parcel).
Tasks: 659 total, 1 running, 655 sleeping, 2 stopped, 1 zombie
%Cpu(s): 9.7 us, 2.0 sy, 0.2 ni, 87.9 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
GiB Mem : 125.2 total, 4.9 free, 84.3 used, 36.0 buff/cache
GiB Swap: 34.0 total, 24.4 free, 9.6 used. 28.5 avail Mem
CDH displays:
96.9 GiB / 125.2 GiB
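One way to compare the two numbers is to work out the different definitions of "used" that the top output supports. The sketch below only does that arithmetic on the figures quoted above; which definition CDH actually uses is not established here, and `mem_views` is a hypothetical helper name.

```shell
# Arguments (all in GiB, taken from top): total free used buff/cache avail.
# buff/cache ($4) is accepted for completeness but not needed for these views.
mem_views() {
    awk -v t="$1" -v f="$2" -v u="$3" -v a="$5" 'BEGIN {
        printf "used as reported by top: %.1f\n", u
        printf "total - free: %.1f\n", t - f
        printf "total - avail Mem: %.1f\n", t - a
    }'
}

mem_views 125.2 4.9 84.3 36.0 28.5
# used as reported by top: 84.3
# total - free: 120.3
# total - avail Mem: 96.7
```

With these figures, total minus "avail Mem" (96.7 GiB) is the closest to the 96.9 GiB shown by CDH. That suggests, but does not prove, that CDH counts memory that is not readily available rather than top's "used" column.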

WHM Server receiving lots of "FAILED: cphulk"

I have a WHM server on GoDaddy.
I'm receiving quite a lot of emails (3-4 a day) about a process failing and recovering itself. It happens mostly to "cphulkd" but also to "lfd".
My server:
WHM version v68.0.33. It hosts two websites (one Moodle and one WordPress). 2 GB RAM, 60 GB HD.
This is the whole email:
Server: s50-62-22-123.secureserver.net
Primary IP Address: 50.62.22.123
Service Name: cphulkd
Service Status: failed ⛔
Notification: The service “cphulkd” appears to be down.
Service Check Method: The system’s command to check or to restart this service failed.
Number of Restart Attempts: 1
Service Check Raw Output: (XID ejd2e7) The “cphulkd” service is down. The subprocess “/usr/local/cpanel/scripts/restartsrv_cphulkd” reported error number 255 when it ended.
Startup Log: Starting cPHulkd... Started. Starting PID 3789: cPhulkd - processor - dormant mode - accepting connections
Memory Information: Used 2.43 GB, Available 1.57 GB, Installed 4 GB
Load Information: 0.17 0.19 0.18
Uptime: 2 days, 18 hours, 59 minutes, and 37 seconds
IOStat Information:
avg-cpu: %user %nice %system %iowait %steal %idle
0.62 0.11 0.12 0.17 0.00 98.99
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
Top Processes:
PID Owner CPU % Memory % Command
18850 root 2.45 2.29 spamd child
3452 root 0.94 2.35 /usr/local/cpanel/3rdparty/perl/524/bin/perl -T -w /usr/local/cpanel/3rdparty/bin/spamd --max-spare=1 --max-children=3 --allowed-ips=127.0.0.1,::1 --pidfile=/var/run/spamd.pid --listen=5
1488 mysql 0.52 7.49 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --log-error=s50-62-22-179.secureserver.net.err --open-files-limit=10000 --pid-file=/var/lib/mysql/s50-62-22-179.secureserver.net.pid
18854 dovecot 0.31 0.06 dovecot/auth
20291 root 0.07 0.71 lfd - sleeping
Any ideas?
What's weird is that the email says I have 4 GB installed, but I only have 2 GB.

Ubuntu 14.04.1 server idle load average 1.00

Scratching my head here. Hoping someone can help me troubleshoot.
I have a Dell PowerEdge SC1435 server which had been running a previous version of Ubuntu for a while (I believe it was 13.10 server x64).
I recently reformatted the drive (SSD) and installed Ubuntu Server 14.04.1 x64.
All seemed fine through the install, but the machine hung on first boot at the end of the kernel output, just before I would expect the screen to clear and a logon prompt to appear. There were no obvious errors at the end of the kernel output that I saw. (There was a message about "not using cpu thermal sensor that is unreliable", but that appears to be there regardless of whether it boots or not.)
I gave it a good 5 minutes and then forced a reboot. To my surprise it booted to the logon prompt about 1-2 seconds after BIOS POST. I rebooted again and it seemed to pause for a few extra seconds where it hung before, but proceeded to the login screen. Rebooting again, it was fast again. So at this point I thought it was just one of those random one-off glitches that I would never explain, so I moved on.
I installed a few packages (the exact same packages installed on the same OS version on other hardware), did apt upgrade and dist-upgrade, then rebooted. It seemed to hang again, so I drove to the datacentre and connected a console, only to get a blank screen. Forced a reboot again. (I also set up IPMI for remote rebooting and got rid of the grub recordfail so it would not wait for me to press enter!)
That was very late last night. I came home, did a few reboots with no issue, so went to bed.
Today I did a reboot again to check it and again it crashed somewhere. I remotely force rebooted it.
At this point I started digging a little more and immediately noticed something really strange.
top - 14:18:35 up 8 min, 1 user, load average: 1.00, 0.85, 0.45
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.3 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 33013620 total, 338928 used, 32674692 free, 9740 buffers
KiB Swap: 3906556 total, 0 used, 3906556 free. 47780 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 33508 2772 1404 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u16:0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.24 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.02 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuos/1
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuos/2
This server is completely unused and idle, yet it has a 1 minute load average of exactly 1.00?
As I watch the other values - the 5 minute and 15 minute also appear to be heading towards 1.00 so I assume they will all reach 1.00 at some point. (The "1 Running" is the top process)
I have never had this before and since I have no idea what is causing the startup crashing, I am assuming at this point that the two are likely related.
What I would like to do is identify (and hopefully eliminate) what is causing that false load average and my crashing issue.
So far I have been unable to identify what process could be waiting for a resource of some kind to generate that load average.
I would very much appreciate it if someone could help me to try and track it down.
top shows all processes pretty much always sleeping. Some occasionally pop up to the top, but I think that's pretty normal. CPU usage mostly shows 100% idle, with very occasional dips to 99% or so.
nmon doesn't show me much. Everything just looks idle.
iotop shows pretty much no traffic whatsoever (again, very occasional spots of disk access).
Interrupt frequency seems low, way below 100/sec from what I can see.
I saw numerous Google discussions suggesting this:
echo 100 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us
This had no effect.
RAM in the server is ECC and passes testing.
The server install was 'minimal' (F4 option) with OpenSSH server ticked during install.
I installed a few packages afterwards, including vim, bcache-tools, bridge-utils, qemu, software-properties-common, open-iscsi, qemu-kvm, cpu-checker, socat, ntp and nodejs. (I think that is about it.)
I have tried disabling and removing the bcache kernel module. No effect.
I stopped the iscsi service. No effect (although there is absolutely nothing configured on this server yet).
I will leave it there before this gets insanely long. If anyone could help me try to figure this out it would be very much appreciated.
Cheers,
James
The load average of 1.0 is an artefact of the bcache writeback thread staying in uninterruptible sleep. It may be corrected in 3.19 kernels or newer. See this Debian bug report for instance.
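On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, which is how an otherwise idle machine can sit at 1.00. A minimal sketch for spotting such tasks, assuming the three-column output of `ps -eo state,pid,comm`; the helper name `d_state` and the sample process list are illustrative and not taken from the affected server:

```shell
# Keep only tasks whose state starts with "D" (uninterruptible sleep).
# Expects "ps -eo state,pid,comm" output on stdin, header line included.
d_state() {
    awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# On a live system: ps -eo state,pid,comm | d_state
# Made-up sample input for illustration:
d_state <<'EOF'
S   PID COMMAND
S     1 init
S     3 ksoftirqd/0
D   412 bcache_writebac
R  1530 ps
EOF
# prints: 412 bcache_writebac
```

A task that shows up here on every run, with no corresponding disk or CPU activity, is a likely source of a constant load figure.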

Cannot Understand the TOP command output on Hadoop Datanode

Hi, I just installed Cloudera Manager on my cluster: 1 namenode and 4 datanodes. Each datanode has 64 GB RAM, 24 Xeon CPU cores, 16 1 TB SAS disks, etc.
I installed a brand new Red Hat Linux and upgraded to 6.5. Each disk has been logically set up as RAID0, since there is no JBOD option available on the array controller.
I am running a Hive query and here is the top output on the datanode. I am so confused and wondering if some experienced Hadoop admin could help me understand whether my cluster is working fine.
Why is there only 1 task running out of 897 while the other 896 are sleeping? There are 2271 mappers for that Hive query and it is only 80% done on the mapper side.
The load average is 8.66. I read here that if your computer is working hard, the load average should be around the number of cores. Is my datanode working hard enough?
About 69 of 70 GB of memory is shown as "used". The active YARN processes seem to have a fairly low memory cost, so how could those 64 GB of memory be used up so easily?
Here is the top output:
top - 22:50:24 up 1 day, 8:24, 3 users, load average: 8.66, 8.50, 7.95
Tasks: 897 total, 1 running, 896 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.3%us, 5.2%sy, 0.0%ni, 62.3%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 70096068k total, 69286800k used, 809268k free, 222268k buffers
Swap: 4194296k total, 0k used, 4194296k free, 61468376k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
439 yarn 20 0 1417m 591m 19m S 193.9 0.9 1:06.12 java
561 yarn 20 0 1401m 581m 19m S 193.2 0.8 0:19.75 java
721 yarn 20 0 1415m 561m 19m S 172.0 0.8 0:08.54 java
611 yarn 20 0 1415m 574m 19m S 127.0 0.8 0:16.87 java
354 yarn 20 0 1428m 595m 19m S 121.4 0.9 0:35.96 java
27418 yarn 20 0 1513m 483m 18m S 13.6 0.7 18:26.14 java
16895 hdfs 20 0 1438m 410m 18m S 9.6 0.6 103:23.70 java
3726 hdfs 20 0 860m 249m 21m S 1.7 0.4 2:12.28 java
I am fairly new at system admin and any metric tool or common sense will be much appreciated! Thanks!
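Two quick calculations on the top output above help put the numbers in perspective. This is a sketch with hypothetical helper names (`app_used_kib`, `load_per_core`); it assumes the older top layout shown here, where "used" includes buffers and page cache.

```shell
# Memory held by processes: used - buffers - cached (all in KiB).
app_used_kib() {
    echo $(( $1 - $2 - $3 ))
}

# Load average divided by the number of cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

app_used_kib 69286800 222268 61468376   # prints 7596156 (about 7.2 GiB)
load_per_core 8.66 24                   # prints 0.36
```

So roughly 7.2 GiB is held by processes and most of the rest is page cache, which the kernel can normally reclaim when applications need it. A load of about 0.36 per core suggests the node is busy but not saturated.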

In Bash, how can I find the upload speed of my system's active internet interface?

I am trying to write a TUI bandwidth trace application which, when queried, can instantly tell me that my download and upload speeds are XXXX. I have figured out that for download I can use wget and parse its output in Bash, but how do I get the upload speed?
Example of download parse method:
1) Remote download : wget http://x.x.com:7007/files/software/vnc.zip
Length: 1594344 (1.5M) [application/zip]
Saving to: `vnc.zip'
100%[==================================================================>] 1,594,344 573K/s in 2.7s
2012-03-24 11:35:22 (573 KB/s) - `vnc.zip' saved [1594344/1594344]
2) Local download:
Length: 1594344 (1.5M) [application/zip]
Saving to: `vnc.zip'
100%[==================================================================>] 1,594,344 --.-K/s in 0.1s
2012-03-24 06:43:04 (11.4 MB/s) - `vnc.zip' saved [1594344/1594344]
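The parsing step for the download case could look roughly like this. It is a sketch that assumes wget's final summary line has the form shown above, with the average speed in parentheses; `wget_speed` is a hypothetical helper name.

```shell
# Extract the average speed, e.g. "573 KB/s", from wget's final summary line.
# Reads wget output on stdin and prints one speed per matching line.
wget_speed() {
    sed -n 's/.*(\([0-9.][0-9.]* [KMG]*B\/s\)).*/\1/p'
}

# Summary lines copied from the two runs above.
wget_speed <<'EOF'
2012-03-24 11:35:22 (573 KB/s) - `vnc.zip' saved [1594344/1594344]
2012-03-24 06:43:04 (11.4 MB/s) - `vnc.zip' saved [1594344/1594344]
EOF
# prints:
# 573 KB/s
# 11.4 MB/s
```

wget writes its progress and summary to stderr, so a live pipeline would need to redirect it with `2>&1` before piping into the function.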
Follow up:
Upload server:
$ iperf -s -p 65000
------------------------------------------------------------
Server listening on TCP port 65000
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local x.238 port 65000 connected with x.96 port 37463
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-11.9 sec 2.00 MBytes 1.40 Mbits/sec
Up-loader:
$ iperf -c x.238 -p 65000
------------------------------------------------------------
Client connecting to x.238, TCP port 65000
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local x.96 port 37463 connected with x.238 port 65000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.4 sec 2.00 MBytes 1.61 Mbits/sec
wput! wget's twin-sister
Here is one sample run!
C:\Users\admin\Desktop\wput-pre0.6>wput C:\wput\pavan.txt ftp://admin:password#example.com
--16:55:00-- `C:/wput\pavan.txt'
=> ftp://padmin:xxxxx#example:21/C:/wput/pavan.txt
Connecting to example.com:21... connected!
Logging in as admin ... Logged in!
Length: 5
100%[===================================] 5
16:55:01 (pavan.txt) - `84.75B/s' [5]
FINISHED --16:55:01--
Transfered 5 bytes in 1 file at 3.73B/s
potential duplicate with this answer
wget --output-document=/dev/null http://speedtest.wdc01.softlayer.com/downloads/test500.zip
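For the upload direction, one option not mentioned above is curl, whose `--write-out` variable `speed_upload` reports the average upload rate in bytes per second. A rough sketch, assuming curl is installed and that you control a server that accepts uploads; `upload_bps`, `bps_to_mbit` and any URL passed in are placeholders:

```shell
# Upload FILE ($1) to URL ($2) and print the average upload speed in bytes/second.
# Not called here because it needs a reachable upload target.
upload_bps() {
    curl -s -o /dev/null -w '%{speed_upload}\n' -T "$1" "$2"
}

# Convert bytes/second to megabits/second (1 Mbit = 1,000,000 bits).
bps_to_mbit() {
    awk -v bps="$1" 'BEGIN { printf "%.2f\n", bps * 8 / 1000000 }'
}

# 2.00 MBytes (2097152 bytes) in 10.4 s is about 201649 bytes/second.
bps_to_mbit 201649   # prints 1.61
```

The sample value reproduces the 1.61 Mbits/sec reported by the iperf client above, which is a quick sanity check on the unit conversion.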
