Cannot understand the top command output on a Hadoop datanode - Linux

Hi, I just installed Cloudera Manager on my cluster: 1 namenode and 4 datanodes, each datanode with 64 GB of RAM, a 24-core Xeon CPU, 16 1 TB SAS disks, etc.
I installed a fresh Red Hat Linux and upgraded it to 6.5; each disk has been set up as a single-disk RAID0 volume since the array controller offers no JBOD option.
I am running a Hive query, and below is the top output on one of the datanodes. I am confused and hoping some experienced Hadoop admin could help me understand whether my cluster is working properly.
Why is there only 1 task running out of 897 while the other 896 are sleeping? There are 2271 mappers for that Hive query and the map phase is only about 80% complete.
The load average is 8.66. I have read that if a machine is working hard, the load average should be around the number of cores. Is my datanode working hard enough?
69 GB of the 70 GB of memory shows as "used", yet the active YARN processes seem to have a fairly low memory cost, so how could those 64 GB be used up so easily?
Here is the top output:
top - 22:50:24 up 1 day, 8:24, 3 users, load average: 8.66, 8.50, 7.95
Tasks: 897 total, 1 running, 896 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.3%us, 5.2%sy, 0.0%ni, 62.3%id, 0.2%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 70096068k total, 69286800k used, 809268k free, 222268k buffers
Swap: 4194296k total, 0k used, 4194296k free, 61468376k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
439 yarn 20 0 1417m 591m 19m S 193.9 0.9 1:06.12 java
561 yarn 20 0 1401m 581m 19m S 193.2 0.8 0:19.75 java
721 yarn 20 0 1415m 561m 19m S 172.0 0.8 0:08.54 java
611 yarn 20 0 1415m 574m 19m S 127.0 0.8 0:16.87 java
354 yarn 20 0 1428m 595m 19m S 121.4 0.9 0:35.96 java
27418 yarn 20 0 1513m 483m 18m S 13.6 0.7 18:26.14 java
16895 hdfs 20 0 1438m 410m 18m S 9.6 0.6 103:23.70 java
3726 hdfs 20 0 860m 249m 21m S 1.7 0.4 2:12.28 java
I am fairly new to system administration, so any metrics tool or common-sense advice would be much appreciated! Thanks!
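Doing the arithmetic on the Mem line above: of the 69,286,800k reported as "used", 61,468,376k is page cache and 222,268k is buffers, so the memory actually held by processes is roughly 69,286,800k - 61,468,376k - 222,268k ≈ 7.6 GB. Linux counts page cache as "used" but reclaims it on demand, so a nearly full Mem line on its own is not a sign of memory pressure.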

Related

The host memory displayed in CDH is inconsistent with what the top command reports

When I was about to clean up the memory of a Linux host, I used the top command to check memory usage and found that the result was inconsistent with the host memory displayed by CDH.
I don't know why, or how CDH obtains the host's memory figure.
The CDH version is 6.3.2 (parcel).
Tasks: 659 total, 1 running, 655 sleeping, 2 stopped, 1 zombie
%Cpu(s): 9.7 us, 2.0 sy, 0.2 ni, 87.9 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
GiB Mem : 125.2 total, 4.9 free, 84.3 used, 36.0 buff/cache
GiB Swap: 34.0 total, 24.4 free, 9.6 used. 28.5 avail Mem
CDH displays:
96.9 GiB / 125.2 GiB
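Going only by the numbers shown, one plausible explanation is that CDH reports "total minus available" rather than top's "used" column: 125.2 GiB total minus 28.5 GiB avail Mem is about 96.7 GiB, which matches the 96.9 GiB CDH displays. In other words, CDH appears to count buff/cache and other allocated-but-reclaimable memory as used, while top's used figure excludes it.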

Java heap out-of-memory exception, Tomcat on Linux

Please help me: my live application sometimes throws an out-of-memory (Java heap space) exception,
even though I set the max heap size to 512M, half of the virtual server's memory.
I've searched on Google and traced my server as in the attached image.
Can anyone tell me where the error is, please?
The console data is below:
System load: 0.01 Processes: 74
Usage of /: 16.2% of 29.40GB Users logged in: 0
Memory usage: 60%
Swap usage: 0%
developer@pc:/$ free -m
total used free shared buffers cached
Mem: 994 754 239 0 24 138
-/+ buffers/cache: 592 401
Swap: 0 0 0
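A quick way to confirm what heap the JVM actually received and how full it gets (a sketch assuming a HotSpot JDK with the jstat/jmap tools on the PATH; <pid> is a placeholder for the Tomcat process ID):
ps -ef | grep tomcat                       # find the Tomcat PID
jstat -gcutil <pid> 1000                   # eden/old-gen occupancy, sampled every second
jmap -heap <pid>                           # shows the MaxHeapSize actually in effect
jmap -dump:format=b,file=heap.bin <pid>    # heap dump for a memory analyser
If the heap really is capped at the 512M you set (typically via JAVA_OPTS or CATALINA_OPTS in Tomcat's bin/setenv.sh, e.g. -Xms256m -Xmx512m), a recurring OutOfMemoryError means the application genuinely needs more than 512M at peak or is leaking objects, and the heap dump will show what is filling it.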

Which CPUs support NUMA, and what is the current server configuration for this kind of CPU?

Which CPUs support NUMA? What is the current server configuration for this kind of CPU? What are the Linux NUMA commands, and how do I enable NUMA?
This is going to depend on your server and whether it is using a multicore CPU that supports NUMA affinity. Type numactl --hardware and you will see the current configuration, for example:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32733 MB
node 0 free: 4027 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32767 MB
node 1 free: 20898 MB
node distances:
node 0 1
0: 10 21
1: 21 10
If you want to check performance with your application, just make sure it is using CPUs from the same NUMA node. You can check this using the ps aux or top commands.
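For example (the binary name and <pid> below are placeholders), to start a program bound to node 0's CPUs and memory, and then check where its pages were actually allocated:
numactl --cpunodebind=0 --membind=0 ./myapp
numastat -p <pid>
ps -o pid,psr,comm -p <pid> shows the CPU the process is currently assigned to, which you can map back to a node via the "node N cpus" lines above.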

Low CPU usage on ubuntu 14.04 and nodejs

I have two servers running the exact same Node.js application. I am doing load testing and I can't figure out why one of my servers will not utilize more CPU and RAM.
It is much slower under the load test, yet it is not even close to utilizing all the free CPU and memory.
If I run top during the load test, these are the numbers I am getting:
PID User PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
1308 ubuntu 20 0 1002524 87508 9788 S 5.3 4.3 0:03.06 nodejs
1307 ubuntu 20 0 925540 75288 9436 S 5.0 3.7 0:02.17 nodejs
1308 ubuntu 20 0 992076 77068 9788 S 14.0 3.8 0:03.48 nodejs
1307 ubuntu 20 0 937140 86904 9436 S 2.7 4.3 0:02.25 nodejs
1308 ubuntu 20 0 1012936 98000 9788 S 14.3 4.8 0:03.91 nodejs
1307 ubuntu 20 0 942940 92644 9436 S 1.0 4.5 0:02.28 nodejs
1307 ubuntu 20 0 943204 92976 9436 S 6.0 4.6 0:02.46 nodejs
1308 ubuntu 20 0 1011764 96804 9788 S 6.0 4.7 0:04.09 nodejs
1307 ubuntu 20 0 933644 83388 9436 S 8.6 4.1 0:02.72 nodejs
1308 ubuntu 20 0 1008720 93556 9788 S 5.3 4.6 0:04.25 nodejs
1308 ubuntu 20 0 1000184 85256 9788 S 8.6 4.2 0:04.51 nodejs
1307 ubuntu 20 0 944092 93988 9436 S 7.6 4.6 0:02.95 nodejs
1307 ubuntu 20 0 941748 91816 9436 S 15.0 4.5 0:03.40 nodejs
1308 ubuntu 20 0 1004832 90008 9788 S 1.3 4.4 0:04.55 nodejs
1307 ubuntu 20 0 933460 82632 9436 S 9.0 4.1 0:03.67 nodejs
Running two processes, I don't see memory getting above 4.7%, and CPU peaks around 14%.
It is taking twice as long to serve the exact same resources as a machine with one core and half the memory.
My other server is using 52% of CPU. Granted, it has one core and the above has two, but it doesn't seem like that would make the difference.
I downloaded cpufrequtils and set the GOVERNOR to performance but I don't think it is working. This is what I get when I run cpufreq-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
maximum transition latency: 4294.55 ms.
analyzing CPU 1:
no or unknown cpufreq driver is active on this CPU
maximum transition latency: 4294.55 ms.
Here is the CPU:
Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
Any ideas or hints would be appreciated
If both servers are running the same Node.js application, then you may want to
compare the other settings on the machines. Are they the same? Check with ulimit -a.
Also, Node.js is single-threaded; on a dual-core or multi-core machine it will not benefit from the extra cores unless you use the cluster module to make use of them (see the sketch below).
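A minimal sketch of the cluster pattern (assuming the server's startup code lives in app.js; adjust to how your application actually starts):
var cluster = require('cluster');
var os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core so every core serves requests.
  for (var i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  // Replace any worker that dies.
  cluster.on('exit', function () {
    cluster.fork();
  });
} else {
  // Each worker runs its own copy of the HTTP server; the master
  // process distributes incoming connections among them.
  require('./app');
}
With this in place you should see one nodejs process per core in top, and combined CPU usage can go well beyond what a single process reaches.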

ubuntu 14.04.1 server idle load average 1.00

Scratching my head here. Hoping someone can help me troubleshoot.
I have a Dell PowerEdge SC1435 server which had been running with a previous version of ubuntu for a while. (I believe it was 13.10 server x64)
I recently reformatted the drive (SSD) and installed ubuntu server 14.04.1 x64.
All seemed fine through the install but the machine hung on first boot at the end of the kernel output, just before I would expect the screen to clear and a logon prompt appear. There were no obvious errors at the end of the kernel output that I saw. (There was a message about "not using cpu thermal sensor that is unreliable" but that appears to be there regardless of whether it boots or not)
I gave it a good 5 minutes and then forced a reboot. To my surprise it booted to the logon prompt in about 1-2 seconds after bios post. I rebooted again and it seemed to pause for a few extra seconds where it hung before, but proceeded to the login screen. Rebooting again it was fast again. So at this point I thought it was just one of those random one-off glitches that I would never explain so I moved on.
I installed a few packages (exact same packages installed on the same OS version on other hardware), did apt upgrade and dist-upgrade then rebooted. It seemed to hang again so I drove to the datacentre and connected a console only to get a blank screen. Forced reboot again. (also setup ipmi for remote rebooting and got rid of the grub recordfail so it would not wait for me to press enter!)
That was very late last night. I came home, did a few reboots with no issue so went to bed.
Today I did a reboot again to check it and again it crashed somewhere. I remotely force rebooted it.
At this point I started digging a little more and immediately noticed something really strange.
top - 14:18:35 up 8 min, 1 user, load average: 1.00, 0.85, 0.45
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.3 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 33013620 total, 338928 used, 32674692 free, 9740 buffers
KiB Swap: 3906556 total, 0 used, 3906556 free. 47780 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 33508 2772 1404 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u16:0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.24 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.02 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuos/1
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuos/2
This server is completely unused and idle, yet it has a 1 minute load average of exactly 1.00?
As I watch the other values, the 5-minute and 15-minute averages also appear to be heading towards 1.00, so I assume they will all reach 1.00 at some point. (The "1 running" task is top itself.)
I have never had this before and since I have no idea what is causing the startup crashing, I am assuming at this point that the two are likely related.
What I would like to do is identify (and hopefully eliminate) what is causing that false load average and my crashing issue.
So far I have been unable to identify what process could be waiting for a resource of some kind to generate that load average.
I would very much appreciate it if someone could help me to try and track it down.
top shows all processes pretty much always sleeping. Some occasionally pop to the top of the list, but I think that's pretty normal. CPU usage mostly shows 100% idle, with very occasional dips to 99% or so.
nmon doesn't show me much. everything just looks idle.
iotop shows pretty much no traffic whatsoever. (again, very occasional spots of disk access)
Interrupt frequency seems low, way below 100/sec from what I can see.
I saw numerous Google discussions suggesting this:
echo 100 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us
...no effect.
RAM in the server is ECC and test passes.
Server install was 'minimal' (F4 option) with OpenSSH server ticked during install.
Installed a few packages afterwards including vim, bcache-tools, bridge-utils, qemu, software-properties-common, open-iscsi, qemu-kvm, cpu-checker, socat, ntp and nodejs. (Think that is about it)
I have tried disabling and removing the bcache kernel module: no effect.
I stopped the iscsi service: no effect (although there is absolutely nothing configured on this server yet).
I will leave it there before this gets insanely long. If anyone could help me try to figure this out it would be very much appreciated.
Cheers,
James
The load average of 1.0 is an artefact of the bcache writeback thread staying in uninterruptible sleep. It may be corrected in 3.19 kernels or newer. See this Debian bug report for instance.
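A quick way to confirm this kind of cause is to look for tasks stuck in uninterruptible sleep (state D), since they count toward the load average even though they use no CPU:
ps -eo state,pid,comm | awk '$1 ~ /^D/'
On an affected kernel this should list the bcache writeback thread; once nothing sits in D state, an otherwise idle box should drift back toward a 0.00 load average.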
