Cryptic Node Tick Result -- CPU Taking 100% - node.js

So I've run into a slight snag with an application that I am writing. The application is a simple relay server using Socket.IO and Node. Everything works great, but under heavy load the process spikes to 100% CPU and stays there.
I ran node-tick on it, and this is the result:
Statistical profiling result from v8.log, (83336 ticks, 0 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
83220 99.9% 0.0% /lib/x86_64-linux-gnu/libc-2.19.so
97 0.1% 0.0% /usr/bin/nodejs
9 0.0% 0.0% /lib/x86_64-linux-gnu/libpthread-2.19.so
[JavaScript]:
ticks total nonlib name
1 0.0% 10.0% Stub: FastNewClosureStub
1 0.0% 10.0% Stub: CallConstructStub_Recording
1 0.0% 10.0% LazyCompile: ~stringify native json.js:308
1 0.0% 10.0% LazyCompile: ~hash /opt/connect/node/node_modules/sticky-session/lib/sticky-session.js:4
1 0.0% 10.0% LazyCompile: ~exports.dirname path.js:415
1 0.0% 10.0% LazyCompile: ~buildFn /opt/connect/node/node_modules/socket.io-redis/node_modules/msgpack-js/node_modules/bops/read.js:5
1 0.0% 10.0% LazyCompile: ~addListener events.js:126
1 0.0% 10.0% LazyCompile: ~ToObject native runtime.js:567
1 0.0% 10.0% LazyCompile: *setTime native date.js:482
1 0.0% 10.0% LazyCompile: *DefineOneShotAccessor native messages.js:767
[C++]:
ticks total nonlib name
[GC]:
ticks total nonlib name
3 0.0%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
83220 99.9% /lib/x86_64-linux-gnu/libc-2.19.so
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
83192 99.8% 83192 99.8% /lib/x86_64-linux-gnu/libc-2.19.so
114 0.1% 0 0.0% Function: ~<anonymous> node.js:27
113 0.1% 0 0.0% LazyCompile: ~startup node.js:30
103 0.1% 0 0.0% LazyCompile: ~Module.runMain module.js:495
98 0.1% 0 0.0% LazyCompile: Module._load module.js:275
90 0.1% 0 0.0% LazyCompile: ~Module.load module.js:346
89 0.1% 0 0.0% LazyCompile: ~Module._extensions..js module.js:472
88 0.1% 0 0.0% LazyCompile: ~Module._compile module.js:374
88 0.1% 0 0.0% Function: ~<anonymous> /opt/connect/node/lib/connect.js:1
83220 99.9% 0.0% /lib/x86_64-linux-gnu/libc-2.19.so
Only one library seems to be the problem, but I can't figure out what the heck it is for. My hunch is the JSON processing I'm doing, but I can't be sure.
Has anyone run into this problem before?
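To test that hunch, one option is to time the serialization directly before digging deeper. A minimal sketch, with a made-up payload standing in for whatever the relay actually forwards:
// Minimal sketch: time a single JSON.stringify call with process.hrtime to
// see whether serialization alone could plausibly account for the CPU spike.
// "payload" is only an example object, not the real relay data.
var payload = { room: 'lobby', data: new Array(10000).join('x') };

var start = process.hrtime();
var body = JSON.stringify(payload);
var diff = process.hrtime(start);

var ms = (diff[0] * 1e9 + diff[1]) / 1e6;
console.log('stringify took ' + ms.toFixed(3) + ' ms for ' +
    Buffer.byteLength(body) + ' bytes');
Multiplying that per-message cost by the message rate under load gives a rough upper bound on how much CPU the JSON work could actually be responsible for.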

Related

boost::asio::io_service performance loss when setting network card IRQ affinity

I've been testing the performance (QPS) of a pbrpc framework (sofa-pbrpc).
When CPU0 handled all the software interrupts, the QPS was over 1,000,000 (100w+).
top command
top - 18:00:54 up 158 days, 18 min, 3 users, load average: 18.09, 19.29, 18.74
Tasks: 793 total, 3 running, 790 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 100.0% si
Cpu1 : 58.7% us, 11.6% sy, 0.7% ni, 28.4% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu2 : 68.0% us, 11.6% sy, 0.0% ni, 19.8% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu3 : 52.3% us, 8.3% sy, 0.0% ni, 39.1% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu4 : 73.5% us, 11.9% sy, 0.0% ni, 13.9% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu5 : 52.5% us, 10.2% sy, 0.3% ni, 36.6% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu6 : 72.4% us, 13.2% sy, 0.0% ni, 13.2% id, 1.3% wa, 0.0% hi, 0.0% si
Cpu7 : 74.7% us, 11.8% sy, 0.0% ni, 12.5% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu8 : 72.5% us, 12.8% sy, 0.0% ni, 14.1% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu9 : 82.9% us, 13.5% sy, 0.0% ni, 3.0% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu10 : 82.1% us, 14.6% sy, 0.0% ni, 3.0% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu11 : 85.9% us, 13.2% sy, 0.0% ni, 0.7% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu12 : 84.2% us, 14.5% sy, 0.0% ni, 0.0% id, 1.3% wa, 0.0% hi, 0.0% si
Cpu13 : 67.4% us, 12.2% sy, 0.0% ni, 19.7% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu14 : 72.6% us, 13.5% sy, 0.0% ni, 13.2% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu15 : 77.4% us, 12.8% sy, 0.0% ni, 8.9% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu16 : 84.5% us, 14.5% sy, 0.0% ni, 0.3% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu17 : 84.1% us, 14.2% sy, 0.0% ni, 1.0% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu18 : 84.1% us, 13.6% sy, 0.0% ni, 2.0% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu19 : 83.2% us, 14.2% sy, 0.0% ni, 1.7% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu20 : 28.7% us, 4.3% sy, 0.0% ni, 63.7% id, 3.3% wa, 0.0% hi, 0.0% si
Cpu21 : 83.2% us, 13.9% sy, 0.0% ni, 2.3% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu22 : 1.0% us, 2.3% sy, 0.3% ni, 96.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu23 : 0.3% us, 0.3% sy, 0.3% ni, 99.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu24 : 84.8% us, 13.2% sy, 0.0% ni, 1.0% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu25 : 84.8% us, 14.6% sy, 0.0% ni, 0.0% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu26 : 84.8% us, 14.5% sy, 0.0% ni, 0.0% id, 0.7% wa, 0.0% hi, 0.0% si
Cpu27 : 83.4% us, 14.9% sy, 0.0% ni, 0.7% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu28 : 83.9% us, 14.8% sy, 0.0% ni, 0.3% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu29 : 23.1% us, 3.6% sy, 0.0% ni, 72.9% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu30 : 2.0% us, 0.7% sy, 0.3% ni, 97.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu31 : 3.9% us, 1.3% sy, 0.0% ni, 94.7% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 132133912k total, 4835356k used, 127298556k free, 153412k buffers
However, after using set_irq_affinity.sh to spread the interrupts across cores, the QPS degraded to about 700,000 (70w+).
top command
top - 18:05:38 up 158 days, 22 min, 3 users, load average: 13.13, 18.03, 18.51
Tasks: 801 total, 2 running, 798 sleeping, 1 stopped, 0 zombie
Cpu0 : 62.2% us, 13.2% sy, 0.0% ni, 4.6% id, 0.7% wa, 0.0% hi, 19.4% si
Cpu1 : 55.8% us, 10.9% sy, 0.3% ni, 15.5% id, 0.0% wa, 0.0% hi, 17.5% si
Cpu2 : 59.5% us, 12.5% sy, 0.0% ni, 12.8% id, 0.3% wa, 0.0% hi, 14.8% si
Cpu3 : 33.7% us, 9.6% sy, 0.0% ni, 47.9% id, 0.3% wa, 0.0% hi, 8.6% si
Cpu4 : 59.1% us, 12.2% sy, 0.0% ni, 11.2% id, 0.7% wa, 0.0% hi, 16.8% si
Cpu5 : 51.0% us, 10.5% sy, 0.0% ni, 27.0% id, 0.3% wa, 0.0% hi, 11.2% si
Cpu6 : 66.9% us, 12.8% sy, 0.0% ni, 3.6% id, 0.7% wa, 0.0% hi, 16.1% si
Cpu7 : 64.8% us, 12.8% sy, 0.0% ni, 3.0% id, 0.3% wa, 0.0% hi, 19.1% si
Cpu8 : 48.0% us, 9.2% sy, 0.3% ni, 32.2% id, 0.3% wa, 0.0% hi, 9.9% si
Cpu9 : 50.0% us, 10.3% sy, 0.0% ni, 25.2% id, 0.3% wa, 0.0% hi, 14.2% si
Cpu10 : 63.0% us, 11.9% sy, 0.0% ni, 7.9% id, 0.3% wa, 0.0% hi, 16.8% si
Cpu11 : 60.9% us, 12.8% sy, 0.0% ni, 8.9% id, 0.3% wa, 0.0% hi, 17.1% si
Cpu12 : 62.6% us, 12.6% sy, 0.0% ni, 3.6% id, 0.3% wa, 0.0% hi, 20.9% si
Cpu13 : 56.2% us, 11.8% sy, 0.0% ni, 19.1% id, 0.3% wa, 0.0% hi, 12.5% si
Cpu14 : 64.0% us, 13.2% sy, 0.0% ni, 6.9% id, 0.3% wa, 0.0% hi, 15.5% si
Cpu15 : 65.0% us, 14.5% sy, 0.0% ni, 3.3% id, 0.7% wa, 0.0% hi, 16.5% si
Cpu16 : 65.8% us, 12.5% sy, 0.0% ni, 1.6% id, 0.3% wa, 0.0% hi, 19.7% si
Cpu17 : 63.2% us, 12.9% sy, 0.0% ni, 2.6% id, 0.3% wa, 0.0% hi, 20.9% si
Cpu18 : 15.2% us, 4.0% sy, 0.0% ni, 77.6% id, 0.0% wa, 0.0% hi, 3.3% si
Cpu19 : 58.4% us, 12.2% sy, 0.0% ni, 12.9% id, 0.3% wa, 0.0% hi, 16.2% si
Cpu20 : 48.8% us, 10.2% sy, 0.0% ni, 27.4% id, 0.3% wa, 0.0% hi, 13.2% si
Cpu21 : 31.6% us, 5.9% sy, 0.0% ni, 54.3% id, 0.3% wa, 0.0% hi, 7.9% si
Cpu22 : 0.7% us, 1.3% sy, 0.0% ni, 98.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu23 : 64.8% us, 12.8% sy, 0.0% ni, 1.6% id, 0.7% wa, 0.0% hi, 20.1% si
Cpu24 : 64.8% us, 12.2% sy, 0.0% ni, 5.3% id, 0.7% wa, 0.0% hi, 17.1% si
Cpu25 : 59.5% us, 14.1% sy, 0.0% ni, 8.9% id, 0.3% wa, 0.0% hi, 17.1% si
Cpu26 : 63.7% us, 13.5% sy, 0.0% ni, 3.0% id, 0.0% wa, 0.0% hi, 19.8% si
Cpu27 : 60.7% us, 12.2% sy, 0.0% ni, 9.2% id, 0.3% wa, 0.0% hi, 17.5% si
Cpu28 : 62.2% us, 13.5% sy, 0.0% ni, 1.3% id, 0.3% wa, 0.0% hi, 22.7% si
Cpu29 : 12.5% us, 4.6% sy, 0.3% ni, 79.2% id, 0.0% wa, 0.0% hi, 3.3% si
Cpu30 : 2.3% us, 2.0% sy, 1.3% ni, 94.1% id, 0.0% wa, 0.0% hi, 0.3% si
Cpu31 : 3.3% us, 1.7% sy, 0.0% ni, 94.7% id, 0.0% wa, 0.0% hi, 0.3% si
Mem: 132133912k total, 4888040k used, 127245872k free, 153440k buffers
Is there some special handling in Boost.Asio that would explain this?

CPU higher than expected in Node running in Docker

I have a Vagrant machine running at 33% CPU on my Mac (10.9.5) when nothing is supposed to be happening. The VM is managed by Kitematic. Looking inside one of the containers, I see two node (v0.12.2) processes running at 3-4% CPU each.
root@49ab3ab54901:/usr/src# top -bc
top - 03:11:59 up 8:31, 0 users, load average: 0.13, 0.18, 0.22
Tasks: 7 total, 1 running, 6 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.7 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 2051824 total, 1942836 used, 108988 free, 74572 buffers
KiB Swap: 1466848 total, 18924 used, 1447924 free. 326644 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 4332 672 656 S 0.0 0.0 0:00.10 /bin/sh -c node -e "require('./seed/seeder.js').seed().then(function (resp) { console.log('successfully seeded!'); pro+
15 root 20 0 737320 81008 13452 S 0.0 3.9 0:32.57 node /usr/local/bin/nodemon app/api.js
33 root 20 0 4332 740 652 S 0.0 0.0 0:00.00 sh -c node app/api.js
34 root 20 0 865080 68952 14244 S 0.0 3.4 0:01.70 node app/api.js
83 root 20 0 20272 3288 2776 S 0.0 0.2 0:00.11 bash
18563 root 20 0 20248 3152 2840 S 0.0 0.2 0:00.11 bash
18575 root 20 0 21808 2308 2040 R 0.0 0.1 0:00.00 top -bc
I went on and ran node --prof, then processed the log with node-tick-processor. It looks like 99.3% of the CPU time is spent in syscall:
(for full output see http://pastebin.com/6qgFuFWK )
root@d6d78487e1ec:/usr/src# node-tick-processor isolate-0x26c0180-v8.log
...
Statistical profiling result from isolate-0x26c0180-v8.log, (130664 ticks, 0 unaccounted, 0 excluded).
...
[C++]:
ticks total nonlib name
129736 99.3% 99.3% syscall
160 0.1% 0.1% node::ContextifyScript::New(v8::FunctionCallbackInfo<v8::Value> const&)
124 0.1% 0.1% __write
73 0.1% 0.1% __xstat
18 0.0% 0.0% v8::internal::Heap::AllocateFixedArray(int, v8::internal::PretenureFlag)
18 0.0% 0.0% node::Stat(v8::FunctionCallbackInfo<v8::Value> const&)
17 0.0% 0.0% __lxstat
16 0.0% 0.0% node::Read(v8::FunctionCallbackInfo<v8::Value> const&)
...
1 0.0% 0.0% __fxstat
1 0.0% 0.0% _IO_default_xsputn
[GC]:
ticks total nonlib name
22 0.0%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
129736 99.3% syscall
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
129736 99.3% 129736 99.3% syscall
865 0.7% 0 0.0% Function: ~<anonymous> node.js:27:10
864 0.7% 0 0.0% LazyCompile: ~startup node.js:30:19
851 0.7% 0 0.0% LazyCompile: ~Module.runMain module.js:499:26
799 0.6% 0 0.0% LazyCompile: Module._load module.js:273:24
795 0.6% 0 0.0% LazyCompile: ~Module.load module.js:345:33
794 0.6% 0 0.0% LazyCompile: ~Module._extensions..js module.js:476:37
792 0.6% 0 0.0% LazyCompile: ~Module._compile module.js:378:37
791 0.6% 0 0.0% Function: ~<anonymous> /usr/src/app/api.js:1:11
791 0.6% 0 0.0% LazyCompile: ~require module.js:383:19
791 0.6% 0 0.0% LazyCompile: ~Module.require module.js:362:36
791 0.6% 0 0.0% LazyCompile: Module._load module.js:273:24
788 0.6% 0 0.0% LazyCompile: ~Module.load module.js:345:33
786 0.6% 0 0.0% LazyCompile: ~Module._extensions..js module.js:476:37
783 0.6% 0 0.0% LazyCompile: ~Module._compile module.js:378:37
644 0.5% 0 0.0% Function: ~<anonymous> /usr/src/app/api.authentication.js:1:11
627 0.5% 0 0.0%
...
An strace showed nothing abnormal:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.51 0.001681 76 22 clone
17.28 0.000533 4 132 epoll_ctl
16.80 0.000518 24 22 wait4
6.39 0.000197 2 110 66 stat
5.03 0.000155 1 176 close
0.00 0.000000 0 176 read
0.00 0.000000 0 88 write
0.00 0.000000 0 44 rt_sigaction
0.00 0.000000 0 88 rt_sigprocmask
0.00 0.000000 0 22 rt_sigreturn
0.00 0.000000 0 66 ioctl
0.00 0.000000 0 66 socketpair
0.00 0.000000 0 88 epoll_wait
0.00 0.000000 0 22 pipe2
------ ----------- ----------- --------- --------- ----------------
100.00 0.003084 1122 66 total
And the other node process:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 14 epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00 0.000000 14 total
Am I missing something?
I wonder if it is VirtualBox's or Docker's layers consuming 4%.
When you have a few containers with 2 processes running at 4%, it adds up quickly.
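Part of that baseline load may be the file watcher itself: nodemon detects changes by watching files, and in its polling (legacy watch) mode that means repeated stat() calls, which would line up with the __xstat and node::Stat entries in the profile above. A minimal sketch of that kind of polling, using fs.watchFile on an example path:
// Minimal sketch: fs.watchFile polls the file with stat() on a fixed interval,
// so even an otherwise idle process keeps issuing syscalls and waking the
// event loop. The watched path is only an example.
var fs = require('fs');

fs.watchFile('./app/api.js', { interval: 500 }, function (curr, prev) {
  if (curr.mtime.getTime() !== prev.mtime.getTime()) {
    console.log('file changed, a watcher like nodemon would restart here');
  }
});
A few percent of CPU for a poll loop like that seems plausible, on top of whatever the virtualization layers add.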

Node.js 100% CPU usage - epoll_wait

I'm trying to track down why my Node.js app all of a sudden uses 100% CPU. The app has around 50 concurrent connections and is running on an EC2 micro instance.
Below is the output of: strace -c node server.js
^C% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
87.32 0.924373 8 111657 epoll_wait
6.85 0.072558 3 22762 pread
2.55 0.026965 0 146179 write
0.92 0.009733 0 108434 1 futex
0.44 0.004661 0 82010 7 read
0.44 0.004608 0 223317 clock_gettime
0.31 0.003244 0 172467 gettimeofday
0.31 0.003241 35 93 brk
0.20 0.002075 0 75233 3 epoll_ctl
0.19 0.002052 0 23850 11925 accept4
0.19 0.001997 0 12302 close
0.19 0.001973 7 295 mmap
0.06 0.000617 4 143 munmap
And here is the output of: node-tick-processor
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
669160 97.4% 669160 97.4% /lib/x86_64-linux-gnu/libc-2.15.so
4834 0.7% 28 0.0% LazyCompile: *Readable.push _stream_readable.js:116
4750 0.7% 10 0.0% LazyCompile: *emitReadable _stream_readable.js:392
4737 0.7% 19 0.0% LazyCompile: *emitReadable_ _stream_readable.js:407
1751 0.3% 7 0.0% LazyCompile: ~EventEmitter.emit events.js:53
1081 0.2% 2 0.0% LazyCompile: ~<anonymous> _stream_readable.js:741
1045 0.2% 1 0.0% LazyCompile: ~EventEmitter.emit events.js:53
960 0.1% 1 0.0% LazyCompile: *<anonymous> /home/ubuntu/node/node_modules/redis/index.js:101
948 0.1% 11 0.0% LazyCompile: RedisClient.on_data /home/ubuntu/node/node_modules/redis/index.js:541
This is my first time debugging a node app. Are there any conclusions that can be drawn from the above debug output? Where could the error be?
Edit
My node version: v0.10.25
Edit 2
After updating Node to v0.10.33, here is the output:
^C% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
91.81 1.894522 8 225505 45 epoll_wait
3.58 0.073830 1 51193 pread
1.59 0.032874 0 235054 2 write
0.98 0.020144 0 1101789 clock_gettime
0.71 0.014658 0 192494 1 futex
0.57 0.011764 0 166704 21 read
This seems like a Node.js v0.10.25 bug with the event loop; look here.
Note, from this GitHub pull request:
If the same file description is open in two different processes, then
closing the file descriptor is not sufficient to deregister it from
the epoll instance (as described in epoll(7)), resulting in spurious
events that cause the event loop to spin repeatedly. So always
explicitly deregister it.
So as a solution, you can try updating your OS or updating Node.js.
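If you want to confirm that the CPU is going to that native spin rather than to your own JavaScript, one rough check before and after upgrading is timer drift. A hedged sketch, assuming a 1-second interval is acceptable granularity:
// Minimal sketch: log how late a 1-second timer fires. Large drift suggests
// the loop is busy running JavaScript handlers; near-zero drift while CPU sits
// at 100% points at the native epoll_wait spin described in the pull request.
var last = Date.now();
setInterval(function () {
  var now = Date.now();
  console.log('timer drift: ' + (now - last - 1000) + ' ms');
  last = now;
}, 1000);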

How to balance the Cassandra cluster

We have 30 Hadoop production nodes and the cluster is unbalanced.
Datacenter: Hadoop
==================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN x.x.x.1 13.64 GB 1 0.0% 16e90f33-6f79-41e8-bd25-22e5eea6707b RAC-1
UN x.x.x.2 11.27 GB 1 0.0% 3a9da644-2587-42b3-9162-7fb504cfbf97 RAC-3
UN x.x.x.3 10.76 GB 1 0.0% e0b67015-bb37-466b-b837-17d758535a4d RAC-1
UN x.x.x.4 10.07 GB 1 0.0% 2110d060-3d6f-404a-8979-a4c98b1e63ae RAC-1
UN x.x.x.5 260.31 GB 1 0.0% 48da7d4d-f396-4d12-a481-c316b367194b RAC-1
UN x.x.x.6 17.32 GB 1 0.0% 45806806-5c84-4898-9835-9e0874a48ded RAC-2
UN x.x.x.7 7.66 GB 1 0.0% 2724dddf-011e-431f-8936-0dc576ee833e RAC-2
UN x.x.x.8 11.95 GB 1 0.0% 61e83114-32ad-4abe-b6bf-228792bcd0fa RAC-2
UN x.x.x.9 16.38 GB 1 0.0% 988ec834-c4e5-4170-9c04-c5a7d20b3094 RAC-3
UN x.x.x.10 10.53 GB 1 0.0% 0b53cf24-541f-4e25-810b-7020e10817ee RAC-3
UN x.x.x.11 10.3 GB 1 0.0% ae14518d-c1c2-4c21-998b-273e961cd1c0 RAC-2
UN x.x.x.12 15.42 GB 1 0.0% 019f1a17-11c6-4a38-b7f9-658030fe0eac RAC-2
UN x.x.x.13 11.1 GB 1 0.0% 0cdabec3-95e2-4451-ba69-a659874fe300 RAC-2
UN x.x.x.14 9.58 GB 1 0.0% 4c43064c-be29-4723-be2d-cd0e3cd92909 RAC-3
UN x.x.x.15 11.73 GB 1 0.0% cc469cee-1ca0-45a9-bfd0-c182b0727238 RAC-2
UN x.x.x.16 9.55 GB 1 0.0% 0ccd790e-7d1c-4cc8-8ebb-9ec786ee9962 RAC-2
UN x.x.x.17 9.44 GB 1 0.0% 3a60244e-8af9-45a4-988a-d5158fbe04e7 RAC-3
UN x.x.x.18 11.44 GB 1 0.0% 0b7508e9-e06f-4532-841c-08047f6ebf35 RAC-3
UN x.x.x.19 13.25 GB 1 0.0% 7648792b-9c92-45be-b82a-e171d21756d6 RAC-3
UN x.x.x.20 256.81 GB 1 0.0% 92033ad7-d60f-4a89-9439-0af7e744a246 RAC-2
UN x.x.x.21 10.03 GB 1 0.0% e494a90f-64b1-4f84-94fa-b228a8ef3160 RAC-1
UN x.x.x.22 9.32 GB 1 0.0% 64f9a2e4-2aab-408c-9d5f-5867ab26398c RAC-3
UN x.x.x.23 14.74 GB 1 0.0% 0ea50e73-b36c-44e9-934f-a92f14acbe23 RAC-1
UN x.x.x.24 12.37 GB 1 0.0% 804927a6-d096-4b6e-92af-43ad13e10504 RAC-1
UN x.x.x.24 258.85 GB 1 0.0% 7c1bc96c-4806-4216-bd6a-a28db4b528d1 RAC-3
UN x.x.x.26 10.38 GB 1 0.0% 2932bae4-c656-4378-9570-0f79131fe3a8 RAC-2
UN x.x.x.27 11.67 GB 1 0.0% 918bc253-40b6-4a56-ab4a-e50953ffb355 RAC-1
UN x.x.x.28 8.17 GB 1 0.0% bb302317-3671-4174-82b0-43f53c683f44 RAC-1
UN x.x.x.29 10.57 GB 1 0.0% ff4a1e2e-249b-44d7-b488-0acd99a6db86 RAC-1
UN x.x.x.30 12.27 GB 1 0.0% df75362f-24c0-4783-a03c-b2a578c0927d RAC-3
Almost all the data is going to only three nodes (5, 20, and 24) in the cluster. I cannot use vnodes because DSE Hadoop nodes do not support them.
I know we can use nodetool move to rearrange the tokens, but how do I calculate the new tokens?
How do I balance the cluster? How do I calculate the tokens and move the nodes onto them?
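For a cluster without vnodes, a balanced layout just divides the partitioner's token range evenly by the node count, so the i-th node gets token i * range / 30. A minimal sketch of that arithmetic, assuming RandomPartitioner (range 0 to 2^127); if the cluster uses Murmur3Partitioner, the range is -2^63 to 2^63 - 1 instead:
// Minimal sketch: evenly spaced initial tokens for a 30-node cluster.
// Assumes RandomPartitioner, whose token range is [0, 2^127); adjust the
// range if the cluster runs a different partitioner.
const nodes = 30n;
const range = 2n ** 127n;

for (let i = 0n; i < nodes; i++) {
  console.log('node ' + (i + 1n) + ': ' + (i * range / nodes));
}
Each result would then be applied with nodetool move on the corresponding node, one node at a time, followed by nodetool cleanup once the moves settle.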

When using MTR, why do farther hops show lower values?

The mtr report looks like this:
shell> mtr --report ec2-122-248-229-83.ap-southeast-1.compute.amazonaws.com
HOST: macserver.local Loss% Snt Last Avg Best Wrst StDev
1.|-- 192.168.12.1 0.0% 10 1.2 2.9 0.9 7.4 2.3
2.|-- 101.36.89.49 0.0% 10 6.8 5.7 2.1 16.6 4.3
3.|-- 192.168.17.37 0.0% 10 53.8 164.9 4.9 904.4 304.0
4.|-- 220.181.105.25 0.0% 10 5.1 11.1 5.1 26.9 7.1
5.|-- 220.181.0.5 0.0% 10 68.5 15.1 4.9 68.5 19.4
6.|-- 220.181.0.41 0.0% 10 12.6 10.2 5.0 27.1 6.5
7.|-- 202.97.53.82 0.0% 10 7.2 9.9 4.9 28.1 6.7
8.|-- 202.97.58.94 0.0% 10 16.5 10.0 5.2 16.5 3.9
9.|-- 202.97.61.98 0.0% 10 49.2 46.4 39.0 76.7 11.2
10.|-- 202.97.121.98 0.0% 10 41.1 43.5 41.1 46.3 1.6
11.|-- 63-218-213-206.static.pcc 0.0% 10 87.2 77.6 70.3 92.2 7.4
12.|-- 203.83.223.62 0.0% 10 71.9 74.8 69.9 87.2 5.1
13.|-- 203.83.223.77 0.0% 10 73.6 73.8 70.2 80.9 3.0
14.|-- ec2-175-41-128-238.ap-sou 0.0% 10 70.4 73.9 70.4 84.1 4.0
15.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
16.|-- ec2-122-248-229-83.ap-sou 10.0% 10 82.3 76.0 70.6 88.7 6.1
Why is the average of line 16 lower than that of line 11?
Routers are designed to route packets as quickly as possible. They're not designed to generate and transmit ICMP errors as quickly as possible. Apparently, the machine at line 11 is very slow at generating ICMP errors.
When you see a lower time past a hop than at that hop, you know that most likely it took that router a significant amount of time to generate an ICMP error and get it going back to you.
And, of course, you have to ignore line 15. You didn't get any replies from that router.
