Data Re-Partitioning in Cassandra

As a follow-up to Data Partition in Cassandra, I got the idea of vnodes. Thanks to 'Simon Fontana Oscarsson'.
While exploring data partitioning with vnodes, I have a few questions.
I tried to observe the partition distribution on 2 nodes (./nodetool ring).
Two seed nodes (2-node cluster)
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9207297847862311651
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9185516104965672922
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9135483708655236026
172.30.56.60 rack1 Up Normal 102.77 KiB 100.00% -9106737079237505681
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9078521344187921602
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9051897156173923903
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9049800264451581717
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -9039572936575206977
172.30.56.60 rack1 Up Normal 102.77 KiB 100.00% -9019927187583981555
172.30.56.60 rack1 Up Normal 102.77 KiB 100.00% -9006071175095726599
172.30.56.60 rack1 Up Normal 102.77 KiB 100.00% -8995415525773810853
172.30.56.60 rack1 Up Normal 102.77 KiB 100.00% -8949342263103866059
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -8880432529087253108
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -8859265089807316498
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -8844286905987198633
172.30.56.61 rack1 Up Normal 105.19 KiB 100.00% -8832739468389117376
So per my observation with two nodes, Node 61 owns the range from -9207297847862311651 to -9185516104965672922, and so on for the other ranges...
NOTE: The partition range from -9039572936575206977 to -9019927187583981554 is currently handled by Node 61.
Now I am adding one more node to the cluster (not a seed node), and I run ./nodetool ring again.
Two seed nodes plus one new node (3-node cluster)
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9207297847862311651
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9185516104965672922
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9135483708655236026
172.30.56.60 rack1 Up Normal 102.77 KiB 63.57% -9106737079237505681
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9078521344187921602
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9051897156173923903
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9049800264451581717
172.30.56.61 rack1 Up Normal 104.12 KiB 64.73% -9039572936575206977
172.30.56.62 rack1 Up Normal 103.7 KiB 71.70% -9031848008695747480
172.30.56.62 rack1 Up Normal 103.7 KiB 71.70% -9028974600706382491
172.30.56.60 rack1 Up Normal 102.77 KiB 63.57% -9019927187583981555
Now I observe that part of that same partition range has been given to the new node, Node 62,
i.e., the range from -9039572936575206977 to -9031848008695747480 is still handled by Node 61, but -9031848008695747480 to -9019927187583981555 is now handled by Node 62 (the new node).
1) So does this mean that adding a new node to the cluster redistributes the existing partition ranges?
2) Is there a way to observe the replicated partitions in Cassandra using a utility like nodetool?
3) I have 3 nodes with RF = 2. How can I see the data available on node 62 alone?

1) When adding a node, Cassandra will start by choosing good ranges for the new node to take over. It will then create 256 new token ranges that are just portions of the already existing ones. This means the new node takes tokens from many nodes in the cluster (instead of only one per RF when not using vnodes), which makes streaming a lot faster.
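To watch this redistribution directly, a small sketch (the keyspace name is made up; the IP is the new node from the output above):
# Count how many token ranges the new node now owns (256 with the default num_tokens)
./nodetool ring | grep 172.30.56.62 | wc -l
# List every range as start token, end token and owning endpoints for a keyspace
./nodetool describering mykeyspace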
2 and 3) Does this answer your questions? See: determine node of a partition in Cassandra
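As a hedged sketch for 2) and 3) (the keyspace, table, and key names are made up):
# 2) Show which nodes hold the replicas of a given partition key (with RF = 2 it prints two addresses)
./nodetool getendpoints mykeyspace mytable 'some_partition_key'
# 3) Show each node's effective ownership for that keyspace, including node 62's share
./nodetool status mykeyspace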

Related

Cassandra Data Centers and Cluster(s) Ring(s) relation

I have a Cassandra cluster with 8 nodes in 2 datacenters: 4 nodes each in DC1 and DC2.
I've created a keyspace:
CREATE KEYSPACE mykeyspace
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'DC1' : 2,
  'DC2' : 2
};
As far as I understand, both DC1 and DC2 will have all the data; in other words, if the whole of DC1 goes offline, DC2 will be capable of serving all the data.
Question
Should we say that, based on the previous fact, DC1 and DC2 are each a "complete" ring in their own right? (That is, the whole hash range -2^63 .. +2^63-1 is represented by the nodes in DC1, and the same is true for DC2.)
Why am I asking this?
My answer would be no: this is still one cluster, so one ring, regardless of there being two subsets of nodes that each contain all the data. However, many images and illustrations represent the nodes in the two datacenters as two "circles", which hints at the term two "rings" (obviously not two clusters).
see for example:
DataStax: Multiple datacenter write requests
PS: If possible, do not bring consistency levels into the picture. I understand that the inter-node communication workflow depends on whether the operation is a write or a read, and also on the consistency level.
A practical question which depends on the answer:
Say DC1 has num_tokens: 256 for all nodes and DC2 has num_tokens: 32 for all nodes. Those numbers are relative to each other if the 8 nodes form one token ring, but if DC1 and DC2 are two separate token rings, the numbers (256 and 32) have nothing to do with each other...
If you use SimpleStrategy, it works as just one ring.
If you use NetworkTopologyStrategy, it looks like two or more rings. You can use nodetool ring to see tokens vs. nodes, and nodetool getendpoints <keyspace> <table> <partition_key> to see where your partition key will be located.
[root@ip-20-0-1-226 ~]# nodetool ring
Datacenter: dc1
==========
Address Rack Status State Load Owns Token
8037128101152694619
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% -9050061154907259251
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% -8877859671879922723
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% -8871087231721285506
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% -8594840449446657067
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% -2980375791196469732
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% -2899706862324328975
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% -2406342150306062345
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% -2029972788998320465
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% -1666526652028070649
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% 1079561723841835665
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% 1663305819374808009
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% 4099186620247408174
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% 5181974457141074579
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% 6403842400328155928
20.0.1.226 rack2 Up Normal 608.99 MiB 58.34% 6535209989509674611
20.0.1.50 rack1 Up Normal 456.32 MiB 41.66% 8037128101152694619
[root@ip-20-0-1-44 ~]# nodetool ring
Datacenter: dc1
==========
Address Rack Status State Load Owns Token
8865515588426899552
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% -5830638745978850993
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% -4570936939416887314
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% -4234199013293852138
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% -3212848663801274832
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% -2683544040240894822
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% 6070021776298348267
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% 7319793018057117390
20.0.1.44 rack1 Up Normal 337.81 MiB 100.00% 8865515588426899552
Datacenter: dc2
==========
Address Rack Status State Load Owns Token
7042359221330965349
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% -6507323776677663977
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% -2315356636250039239
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% -2097227748877766854
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% -630561501032529888
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% 2580829093211157045
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% 4687230732027490213
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% 4817758060672762980
20.0.1.150 rack1 Up Normal 323.66 MiB 100.00% 7042359221330965349
[root@ip-20-0-1-44 ~]# nodetool getendpoints qa eventsrawtest "host1","2019-03-29","service1"
20.0.1.150
20.0.1.44
CREATE KEYSPACE qa WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1', 'dc2': '1'} AND durable_writes = true;
CREATE TABLE eventsrawtest (
  host text,
  bucket_time text,
  service text,
  time timestamp,
  metric double,
  state text,
  PRIMARY KEY ((host, bucket_time, service), time)
) WITH CLUSTERING ORDER BY (time DESC);
The short answer is: both DCs will have 2 replicas, so 4 replicas of your data in total.
Cassandra is smart enough to understand your topology and distribute the data.
It is also important to distribute data between racks (rack awareness), since Cassandra will write one replica in each rack. Then you can be sure that your data is spread out, and you can lose up to 6 nodes without losing data (assuming all your keyspaces use the mentioned replication factor). For example, with a layout like the following (a sketch of how it is declared follows the list):
DC1
- rack1
-- 2 nodes
- rack2
-- 2 nodes
DC2
- rack1
-- 2 nodes
- rack2
-- 2 nodes
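One way such a layout is declared per node is cassandra-rackdc.properties, assuming GossipingPropertyFileSnitch is in use (an assumption; the names simply mirror the layout above):
# cassandra-rackdc.properties on a DC1 / rack1 node
dc=DC1
rack=rack1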
Finally, you can have distinct num_tokens values between DCs; this does not affect the replication factor.
If you check the docs, a smaller value is recommended:
https://cassandra.apache.org/doc/latest/cassandra/getting_started/production.html
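For reference, a sketch of the cassandra.yaml settings that guide is talking about (values are illustrative, and the second option only exists on 4.0+):
# cassandra.yaml
num_tokens: 16
# 4.0+: spread this node's tokens evenly for the DC's replication factor
allocate_tokens_for_local_replication_factor: 3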

Is it possible to speed up the Cassandra cleanup process?

I have a Cassandra 3.11.1.0 cluster (6 nodes), and cleanup was not done after 2 nodes were joined.
I started nodetool cleanup on the first node (192.168.20.197), and the cleanup has been in progress for almost 30 days.
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.20.109 33.47 GiB 256 ? 677dc8b6-eb00-4414-8d15-9f1c79171069 rack1
UN 192.168.20.47 35.41 GiB 256 ? df8c1ee0-fabd-404e-8c55-42531b89d462 rack1
UN 192.168.20.98 20.65 GiB 256 ? 70ce02d7-779b-4b5a-830f-add6ed64bcc2 rack1
UN 192.168.20.21 33.03 GiB 256 ? 40863a80-5f25-464f-aa52-660149bc0070 rack1
UN 192.168.20.197 25.98 GiB 256 ? 5420eae3-e643-49e2-b2d8-703bd5a1f2d4 rack1
UN 192.168.20.151 21.9 GiB 256 ? be7d5df1-3edd-4bc3-8f34-867cb3b8bfca rack1
All nodes that were not cleaned are under load now (CPU load ~80-90%), but the newly joined nodes (192.168.20.98 and 192.168.20.151) have a CPU load of ~10-20%.
It looks like the old nodes are loaded because of old data that could be cleaned up.
Each node has 61 GB of RAM and 8 CPU cores. The heap size is 30 GB.
So, my questions are:
Is it possible to speed up the cleanup process?
Is the CPU load related to the old, unused data (which the node no longer owns) on those nodes?
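A hedged sketch for watching and un-throttling a long-running cleanup; cleanup runs through the compaction machinery, so the compaction throughput limit applies (run these on the node doing the cleanup):
# Shows the running Cleanup task, bytes completed vs. total, and remaining time
nodetool compactionstats -H
# Current throttle in MB/s
nodetool getcompactionthroughput
# 0 removes the throttle entirely (at the cost of more I/O pressure)
nodetool setcompactionthroughput 0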

Cassandra data files MUCH larger than expected

I just did an experiment in which I loaded around a dozen CSV files, weighing in at around 5.2 GB (compressed). After they are uploaded to Cassandra, they take up 64 GB! (Actually around 128 GB, but that is due to the replication factor being 2.)
Frankly, I expected Cassandra's data to take up even less than the original 5.2 GB of CSV because:
1. Cassandra should be able to store data (mostly numbers) in binary format instead of ASCII
2. Cassandra should have split a single file into its column constituents and improved compression dramatically
I'm completely new to Cassandra and this was an experiment. It is entirely possible that I misunderstand the product or misconfigured it.
Is it expected that 5.2 GB of CSVs will end up as 64 GB of Cassandra files?
EDIT: Additional info:
[cqlsh 5.0.1 | Cassandra 2.1.11 | CQL spec 3.2.1 | Native protocol v3]
[~]$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN xx.x.xx.xx1 13.17 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx2 14.02 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx3 13.09 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx4 12.32 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx5 12.84 GB 256 ? HOSTID RAC1
UN xx.x.xx.xx6 12.66 GB 256 ? HOSTID RAC1
du -h [directory which contains sstables before they are loaded]: 67GB
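A hedged way to see where the space goes on a Cassandra 2.1 node (the keyspace and table names are made up):
# Per-table disk usage and compression statistics
nodetool cfstats mykeyspace.mytable
# Look at "Space used (live)" and "SSTable Compression Ratio" in the output;
# a ratio close to 1.0 means compression is doing little for this data model.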

What is the meaning of Owns % in Cassandra and how to change it?

I've configured a Cassandra cluster (cassandra-1.1) of 4 instances.
I have 2 PCs and I'm running 2 instances on each PC.
The PCs are identical and have 20 GB of RAM.
But when I run nodetool it shows me different Owns % values. The question is: why?
./bin/nodetool -p 8001 ring
Note: Ownership information does not include topology, please specify a keyspace.
Address DC Rack Status State Load Owns Token 51042355038140769519506191114765231718
172.16.40.32 datacenter1 rack1 Up Normal 11.12 KB 70.00% 0
127.0.0.2 datacenter1 rack1 Up Normal 11.31 KB 10.00% 17014118346046923173168730371588410572
172.16.40.202 datacenter1 rack1 Up Normal 6.7 KB 10.00% 34028236692093846346337460743176821145
127.0.0.3 datacenter1 rack1 Up Normal 11.18 KB 10.00% 51042355038140769519506191114765231718
My free -m output looks like this on both machines:
total used free shared buffers cached
Mem: 20119 9621 10497 0 281 7925
-/+ buffers/cache: 1414 18704
Swap: 2894 2 2892
The percentage is determined by the token distribution across the nodes. The token range for Cassandra goes from 0 to 2^127 (170141183460469231731687303715884105728). Your ring's tokens are not evenly distributed between 0 and 2^127, which is why you have one node with 70% ownership. You can use nodetool move to get your ring back in balance.
There is a simple Python script on the Cassandra wiki that will generate evenly balanced tokens. I also wrote a simple tool to help visualize your ring topology.
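A minimal sketch of such a token generator and the rebalancing step, assuming 4 nodes and the 0..2^127 token space described above (the -p 8001 JMX port and the 127.0.0.2 address come from the question; move one node at a time):
# Generate 4 evenly spaced tokens
python -c 'n = 4; print("\n".join(str(i * (2**127 // n)) for i in range(n)))'
# 0
# 42535295865117307932921825928971026432
# 85070591730234615865843651857942052864
# 127605887595351923798765477786913079296
# Move the second node onto its computed token (repeat per node, waiting for each move to finish)
./bin/nodetool -p 8001 -h 127.0.0.2 move 42535295865117307932921825928971026432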

Uneven data size on some Cassandra nodes after extending and cleanup

1) We had a cluster of 10 nodes; recently we added 20 more nodes to the cluster.
2) After the addition we ran cleanup on all the necessary nodes.
3) In the ring status, "Effective-Ownership" is properly balanced, but the "Load" on two machines is different from the rest of the machines.
rack1 Up Normal 196.38 GB 6.67%
rack2 Up Normal 195.33 GB 6.67%
rack1 Up Normal 191.57 GB 6.67%
rack2 Up Normal 197.83 GB 6.67%
rack1 Up Normal 190.92 GB 6.67%
rack2 Up Normal 194.59 GB 6.67%
rack1 Up Normal 195.66 GB 6.67%
rack2 Up Normal 191.45 GB 6.67%
rack1 Up Normal 197.13 GB 6.67%
rack2 Up Normal 196.19 GB 6.67%
rack1 Up Normal 195.39 GB 6.67%
rack2 Up Normal 199.35 GB 6.67%
rack1 Up Normal 197.71 GB 6.67%
rack2 Up Normal 194.22 GB 6.67%
rack1 Up Normal 192.83 GB 6.67%
rack2 Up Normal 197.17 GB 6.67%
rack1 Up Normal 192.61 GB 6.67%
rack2 Up Normal 193.88 GB 6.67%
rack1 Up Normal 197.3 GB 6.67%
rack2 Up Normal 196.74 GB 6.67%
rack1 Up Normal 194.89 GB 6.67%
rack2 Up Normal 198.47 GB 6.67%
rack1 Up Normal 197.26 GB 6.67%
rack2 Up Normal 345.34 GB 6.67%
rack1 Up Normal 195.68 GB 6.67%
rack2 Up Normal 263.23 GB 6.67%
rack1 Up Normal 190.72 GB 6.67%
rack2 Up Normal 198.98 GB 6.67%
rack1 Up Normal 194.22 GB 6.67%
rack2 Up Normal 191.95 GB 6.67%
4) On one machine the load is 345 GB and on another it is 263 GB, while on the rest of the machines it is around 195 GB.
5) We are using Cassandra 1.1.0, and I have run cleanup on these machines twice, but it is not helping.
Any idea how I could balance this cluster so that each node has the same load?
I had this problem of servers with a much higher load than others.
What happened in my case is that bootstrapping failed for some reason, interrupting the streaming of data. When you resume it, the data streaming starts again from the beginning, but the previously streamed data is not removed and still shows up in the output of nodetool status.
The easiest way for me was to just replace those nodes, following this procedure for dead nodes: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
EDIT: nodetool cleanup just removes keys not belonging to the node; that doesn't mean it frees disk space.
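A hedged sketch of what the linked dead-node replacement procedure boils down to on Cassandra 2.0 (the exact flag may differ on the 1.1.0 release in the question, so check the doc matching your version; the IP is a placeholder):
# On the fresh replacement node, before its first start, in cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<dead_node_ip>"
# Start Cassandra: the node streams the dead node's data and takes over its tokens.
# Remove the flag again before any later restart.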
