Add node(s), new rack - cassandra

I have a 16-node Cassandra 2.1.11 cluster, divided into 2 racks (R0 and R1), 8 nodes per rack. Each node serves about 700 GB of data, and the cluster looks pretty balanced. Each node has 2x3 TB HDDs.
Datacenter: DC0
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.21.72 677.37 GB 256 ? 0260665f-5f5c-4fbc-9583-8da86848713a R0
UN 192.168.21.73 658.8 GB 256 ? ed1e7814-f715-41c8-97c9-8b52164835f9 R1
UN 192.168.21.74 676.97 GB 256 ? 62833182-b339-46f2-9370-0e23f6bb7eab R1
UN 192.168.21.75 657.1 GB 256 ? ab31f28b-ffea-489f-a4da-3b5120760b8e R1
UN 192.168.21.76 690.29 GB 256 ? e636bf2e-89d6-4bf3-9263-cf1ed67fcbd9 R1
UN 192.168.21.77 679.77 GB 256 ? 959e5207-1251-4c58-afa9-3910b5a27ff5 R1
UN 192.168.21.78 648.85 GB 256 ? 6f650315-1cd1-4169-b300-391800be974f R1
UN 192.168.21.79 675.96 GB 256 ? 324bd475-b5f6-4b39-a753-0cd2c80a46c4 R1
UN 192.168.21.65 636.01 GB 256 ? 65e3faa1-e8d5-4d78-a87e-bfde1f4095a5 R0
UN 192.168.21.66 674.89 GB 256 ? 213696eb-c4a0-4803-a9b3-0efd04c567f2 R0
UN 192.168.21.67 716.77 GB 256 ? 62542a8e-8177-4f13-9077-ea2426607ace R0
UN 192.168.21.68 666.1 GB 256 ? a9864059-3de2-48a2-a926-00db3f9791ee R0
UN 192.168.21.69 691.9 GB 256 ? 02ea1b28-90f9-4837-8173-ff79fa6966d7 R0
UN 192.168.21.70 681.16 GB 256 ? a9c8adae-e54e-4e8e-a333-eb9b2b52bfed R0
UN 192.168.21.71 653.18 GB 256 ? 6aa8cf0c-069a-4049-824a-8359d1c58e59 R0
UN 192.168.21.80 694.14 GB 256 ? 7abb5609-7dca-465a-a68c-972e54469ad6 R1
Now I'm trying to expand the cluster by adding 16 more nodes, also divided into 2 racks (R2 and R3). After adding all the new nodes I expect a 32-node cluster, divided into 4 racks, with 350 GB of data on each node.
I add one node at a time, following the Cassandra documentation. I started the Cassandra process on the first new node with the same configuration as the existing nodes, but in the new rack R3. This triggered 16 streams from the existing nodes to the new node, about 250 GB each; all data transferred successfully, and at that point the process looked normal.
But after the streamed data landed on the new node, the load shown by nodetool status started to increase; it now reports 1.7 TB and keeps growing.
UJ 192.168.21.89 1.69 TB 256 ? 42a80db9-59d6-44b6-b79c-ac7819f69cee R3
This is the opposite of what I expected (350 GB per node, not 1.7 TB).
The Cassandra data directory on the new node already occupies 4 TB of the 6 TB of disk space.
I decided this wasn't normal and stopped the process.
Now I'm wondering what I'm doing wrong and what I should do to add the 16 nodes properly, ending up with 32 nodes holding about 350 GB each. Should I expand the existing racks instead of adding new ones? Should I calculate tokens for the new nodes? Any other options?
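For reference, this is how the new node gets its rack assignment in my setup (a minimal sketch assuming GossipingPropertyFileSnitch is in use; I haven't pasted my real snitch settings, so treat the file below as illustrative):
# cassandra-rackdc.properties on the first new node
# dc must match the existing datacenter; rack names the new rack
dc=DC0
rack=R3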

Related

Cassandra deleted rows coming back (reappear) when nodes have NTP sync issue

I have a 3-node Cassandra setup, and it seems some nodes had time sync issues, i.e. some nodes were 10 minutes ahead of others.
CT-Cass2:/root>nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.94.1.22 14.15 GB 256 ? db37ca57-c7c9-4c36-bac3-f0cbd8516143 RAC1
UN 172.94.1.23 14.64 GB 256 ? b6927b2b-37b2-4a7d-af44-21c9f548c533 RAC1
UN 172.94.1.21 14.42 GB 256 ? e482b781-7e9f-43e2-82f8-92901be48eed RAC1
I have created the table below.
CREATE TABLE test_users (
userid text PRIMARY KEY,
omavvmon int,
vvmon int
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 48000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
In the customer setup I can see some of the deleted records coming back, and writetime(omavvmon) shows a write time 10 minutes later than the row's delete time. I am almost certain the records are coming back because of the time sync issue (after correcting the time it no longer happens). But when I tried to reproduce the issue locally, it never happens.
I set the Cassandra node's system time 10 minutes ahead and created a row, and writetime shows a timestamp 10 minutes ahead:
update test_users set omavvmon=1 where userid='4444';
I then set the system time back to normal, i.e. 10 minutes earlier, and deleted userid 4444.
As I understand it, this delete carries a timestamp 10 minutes lower than the original write, so I should see the record come back again, but it doesn't. Can anyone explain why deleted records come back in the production setup but not in my local setup? Also, why is Cassandra not showing the record locally even though the delete has a lower timestamp than the insert? Isn't that similar to a delete followed by an insert?
In production I check after a few hours, but in my local setup I check immediately after the delete.
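To make the comparison deterministic, the same skew can be simulated with explicit timestamps instead of changing the system clock (a sketch; the microsecond timestamp values are arbitrary):
-- write with a higher (future) timestamp, then delete with a lower one
UPDATE test_users USING TIMESTAMP 1500000600000000 SET omavvmon = 1 WHERE userid = '4444';
DELETE FROM test_users USING TIMESTAMP 1500000000000000 WHERE userid = '4444';
-- the cell written with the higher timestamp should win over the older tombstone
SELECT userid, omavvmon, writetime(omavvmon) FROM test_users WHERE userid = '4444';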

What does LOAD in nodetool status measure?

I am observing higher load on a Cassandra node (compared to other nodes in the ring) and I am looking for help interpreting this data. I have anonymized my IPs, but the snippet below shows a comparison of the "good" node 199 (load 14G) and the "bad" node 159 (load 25G):
nodetool status|grep -E '199|159'
UN XXXXX.159 25.2 GB 256 ? ffda4798-tokentoken XXXXXX
UN XXXXX.199 13.37 GB 256 ? c3a49dca-tokentoken YYYY
Note that load is almost 2x on .159, yet neither memory nor disk usage explains this:
.199 (low load box) data -- memory at about 34%, disk 50-60G:
top|grep apache_cassan
28950 root 20 0 24.353g 0.010t 1.440g S 225.3 34.2 25826:35 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.448g S 212.4 34.2 25826:41 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.452g S 219.7 34.3 25826:48 apache_cassandr
28950 root 20 0 24.357g 0.011t 1.460g S 250.5 34.3 25826:55 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 559G 47G 513G 9% /cassandra/data_dir_a
/dev/sdf1 559G 63G 497G 12% /cassandra/data_dir_b
/dev/sdg1 559G 54G 506G 10% /cassandra/data_dir_c
/dev/sdh1 559G 57G 503G 11% /cassandra/data_dir_d
.159 (high load box) data -- memory at about 28%, disk 20-40G:
top|grep apache_cassan
25354 root 20 0 36.297g 0.017t 8.608g S 414.7 27.8 170:42.81 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.608g S 272.2 27.8 170:51.00 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.612g S 129.7 27.8 170:54.90 apache_cassandr
25354 root 20 0 36.354g 0.017t 8.625g S 94.1 27.8 170:57.73 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 838G 17G 822G 2% /cassandra/data_dir_a
/dev/sdf1 838G 11G 828G 2% /cassandra/data_dir_b
/dev/sdg1 838G 35G 804G 5% /cassandra/data_dir_c
/dev/sdh1 838G 26G 813G 4% /cassandra/data_dir_d
TL;DR version -- what does the nodetool status 'Load' column actually measure/report?
The nodetool status command provides the following information:
Status - U (up) or D (down)
Indicates whether the node is functioning or not.
Load - updates every 90 seconds
The amount of file system data under the Cassandra data directory, excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up, such as TTL-expired cells or tombstoned data, is counted.
For more information, see the nodetool status output description.
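A rough cross-check of that figure is to sum the data directories while skipping the snapshots subdirectories (a sketch using GNU du and the mount points from the question; the exclude pattern is deliberately broad):
du -sh --exclude='*snapshots*' /cassandra/data_dir_a /cassandra/data_dir_b /cassandra/data_dir_c /cassandra/data_dir_d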

Cassandra issuing an error while selecting data from a table: "NoHostAvailable:"

I have created a keyspace and a table on a Cassandra 3.0 server. I am using a 3-node architecture; all three servers are running and I am able to connect to the 3 nodes. However, when I insert or select data using CQL, it shows the error "NoHostAvailable:". Could anyone please provide the reason and a solution for this issue?
Topology
nodetool status output
UN 172.30.1.7 230.22 KB 256 ? 2103dcd3-f09b-47da-a187-bf28b42b918e rack1
DN 172.30.1.20 ? 256 ? 683db65d-0836-40e4-ab5b-fa0db20bae30 rack1
DN 172.30.1.2 ? 256 ? 2b1f15d1-2f92-41ef-a03e-0e5f5f578cf4 rack1
Schema
Keyspace
CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2};
Table
CREATE TABLE testrep(id INT PRIMARY KEY);
Note that, from nodetool status, 2 of the 3 nodes in your cluster are down (DN).
You might be inserting with a Consistency Level that cannot be satisfied.
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.0.1 237.31 MiB 256 ? 3c8a8d8d-992c-4b7c-a220-6951e37870c6 rack1
cassandra#cqlsh> create KEYSPACE qqq WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
cassandra#cqlsh> use qqq;
cassandra#cqlsh:qqq> CREATE TABLE testrep(id INT PRIMARY KEY);
cassandra#cqlsh:qqq> insert into testrep (id) VALUES ( 1);
cassandra#cqlsh:qqq> CONSISTENCY
Current consistency level is ONE.
cassandra#cqlsh:qqq> CONSISTENCY TWO ;
Consistency level set to TWO.
cassandra#cqlsh:qqq> insert into testrep (id) VALUES (2);
NoHostAvailable:
cassandra#cqlsh:qqq> exit

How to debug why hints don't get processed after all nodes are up again

Did some extended maintenance today on node d1r1n3 of a 14-node DSC 2.1.15 cluster, but finished well within the cluster's max hint window.
After bringing the node back up, most other nodes' hints disappeared within minutes, except on two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After a few hours of still seeing one active hinted handoff task, I restarted node d1r1n7, and then d1r1n4 quickly emptied its hints table.
How can I see which node the hints stored on d1r1n7 are destined for?
And, if possible, how can I get those hints processed?
Update:
We later noticed that d1r1n7's hints had vanished right around the end of the max hint window (counted from when node d1r1n3 was taken offline for maintenance), leaving us unsure whether this was okay or not. Had the hints been processed, or had they simply expired at the end of the max hint window?
If the latter, would we need to run a repair on node d1r1n3 after its maintenance (this takes quite some time and IO... :/)? What if we now read at [LOCAL_]QUORUM instead of our current ONE (one DC, RF=3): could that trigger read-path repairs on an as-needed basis and perhaps spare us a full repair in this case?
Answer: it turned out hinted_handoff_throttle_in_kb was at the default 1024 on these two nodes while the rest of the cluster was at 65536 :)
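To align the two lagging nodes with the rest of the cluster, the throttle can be raised in cassandra.yaml or, on recent 2.1.x releases, adjusted at runtime with nodetool (a sketch; verify against your version before relying on it):
# cassandra.yaml
hinted_handoff_throttle_in_kb: 65536
# or at runtime on the affected node
nodetool sethintedhandoffthrottlekb 65536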
Hints are stored (in Cassandra 2.1.15) in the system.hints table:
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
target_id uuid,
hint_id timeuuid,
message_version int,
mutation blob,
PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = 'hints awaiting delivery'
AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The target_id correlates with the node's host ID. For example, in my sample 2-node cluster with RF=2:
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 was down
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
As can be seen, system.hints.target_id correlates with the Host ID shown in nodetool status (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa).
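So, to see which nodes the hints stored on a given machine are destined for, listing the distinct partition keys and matching them against the Host ID column of nodetool status should be enough (a sketch, untested on 2.1.15):
cqlsh> SELECT DISTINCT target_id FROM system.hints;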

Cassandra lack of scalability

I have a problem with the scalability of my Cassandra database. Despite increasing the number of nodes from 2 to 8, database performance doesn't grow.
Cassandra Version: 3.7
Cassandra Hardware x8: 1vCPU 2.5 Ghz, 900 MB RAM, SSD DISK 20GB, 10 Gbps LAN
Benchmark Hardware x1: 16vCPU 2.5 GHz, 8 GB RAM, SSD DISK 5GB, 10 Gbps LAN
Default settings were changed in cassandra.yaml:
cluster_name: 'tst'
seeds: "192.168.0.101,192.168.0.102,...108"
listen_address: 192.168.0.xxx
endpoint_snitch: GossipingPropertyFileSnitch
rpc_address: 192.168.0.xxx
concurrent_reads: 8
concurrent_writes: 8
concurrent_counter_writes: 8
Keyspace:
create keyspace tst WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '2' };
Example table:
CREATE TABLE shares (
c1 int PRIMARY KEY,
c2 varchar,
c3 int,
c4 int,
c5 int,
c6 varchar,
c7 int
);
Example query used in the tests:
INSERT INTO shares (c1, c2, c3, c4, c5, c6, c7) VALUES (%s, '%s', %s, %s, %s, '%s', %s)
To connect to the database I use https://github.com/datastax/java-driver. Across multiple threads I use a single Cluster object and a single Session object, as the documentation recommends. Connecting:
PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions.setConnectionsPerHost(HostDistance.LOCAL, 5, 300);
poolingOptions.setCoreConnectionsPerHost(HostDistance.LOCAL, 10);
poolingOptions.setPoolTimeoutMillis(5000);
QueryOptions queryOptions = new QueryOptions();
queryOptions.setConsistencyLevel(ConsistencyLevel.QUORUM);
Builder builder = Cluster.builder();
builder.withPoolingOptions(poolingOptions);
builder.withQueryOptions(queryOptions);
builder.withLoadBalancingPolicy(new RoundRobinPolicy());
this.setPoints(builder); // here all of the nodes are added
Cluster cluster = builder.build();
Code of query:
public ResultSet execute(String query) {
ResultSet result = this.session.execute(query);
return result;
}
During the tests, memory usage on all of the nodes is at 80% and CPU at 100%. I am surprised by the connection usage reported by the monitor (it seems too low):
[2016-09-10 09:39:51.537] /192.168.0.102:9042 connections=10, current load=62, max load=10240
[2016-09-10 09:39:51.556] /192.168.0.103:9042 connections=10, current load=106, max load=10240
[2016-09-10 09:39:51.556] /192.168.0.104:9042 connections=10, current load=104, max load=10240
[2016-09-10 09:39:51.556] /192.168.0.101:9042 connections=10, current load=196, max load=10240
[2016-09-10 09:39:56.467] /192.168.0.102:9042 connections=10, current load=109, max load=10240
[2016-09-10 09:39:56.467] /192.168.0.103:9042 connections=10, current load=107, max load=10240
[2016-09-10 09:39:56.467] /192.168.0.104:9042 connections=10, current load=115, max load=10240
[2016-09-10 09:39:56.468] /192.168.0.101:9042 connections=10, current load=169, max load=10240
[2016-09-10 09:40:01.468] /192.168.0.102:9042 connections=10, current load=113, max load=10240
[2016-09-10 09:40:01.468] /192.168.0.103:9042 connections=10, current load=84, max load=10240
[2016-09-10 09:40:01.468] /192.168.0.104:9042 connections=10, current load=92, max load=10240
[2016-09-10 09:40:01.469] /192.168.0.101:9042 connections=10, current load=205, max load=10240
Code of the monitor: https://github.com/datastax/java-driver/tree/3.0/manual/pooling#monitoring-and-tuning-the-pool
I am trying to test the scalability of a few NoSQL databases. With Redis the scaling was linear; here it isn't at all, and I don't know why. Thanks for your help!
1GB RAM on each machine is a very low target. This could be causing too much GC pressure. Check your log to see the GC activity and try to understand if this 100% CPU cap is due to JVM GC'ing all the time.
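A quick way to check that is to look for GCInspector entries in the Cassandra system log and to glance at the GC counters (a sketch; the log location varies by install):
grep GCInspector /var/log/cassandra/system.log | tail -20
nodetool gcstats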
Another quirk: how many threads are you running on each machine? If you are trying to scale with this code (your code):
Code of query:
public ResultSet execute(String query) {
ResultSet result = this.session.execute(query);
return result;
}
then you won't go very far. Synchronous queries are hopelessly slow. Even if you try to use more threads, 1 GB of RAM could be (I already know it is...) too low. You should probably write async queries, both for resource consumption and for scalability.
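A minimal async variant of that method, assuming the same session field, driver 3.x and its bundled Guava (the callback bodies are placeholders):
// requires com.datastax.driver.core.ResultSet, com.datastax.driver.core.ResultSetFuture,
// com.google.common.util.concurrent.Futures and com.google.common.util.concurrent.FutureCallback
public ResultSetFuture executeAsync(String query) {
    // send the query without blocking the calling thread
    ResultSetFuture future = this.session.executeAsync(query);
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        @Override
        public void onSuccess(ResultSet rs) {
            // consume rows / count successes here
        }
        @Override
        public void onFailure(Throwable t) {
            t.printStackTrace(); // a real benchmark should track failures too
        }
    });
    return future;
}
Throttle the number of in-flight futures (for example with a Semaphore) so the client doesn't overrun the cluster.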
