I am learning the Cassandra data/storage model. I want to know whether there is a tool that can show which row key or partition lives on which node.
Say I have a 3-node cluster with a keyspace with RF=2.
Thanks in advance.
You can use the nodetool getendpoints command for that. The general form is:
nodetool getendpoints <keyspace> <table> <key>
It will show you which nodes are replicas for that data.
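For example, assuming a hypothetical keyspace my_ks and table users with a partition key value of 'alice' (substitute your own names):
nodetool getendpoints my_ks users alice
The output is simply the IP addresses of the replica nodes, one per line; with RF=2 on a 3-node cluster you should see two addresses.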
Additional information: you can also use the nodetool ring command to show the token distribution. It's a bit harder for newcomers to understand, though.
A quick explanation of tokens, if you're interested: basically, each node is responsible for one token (or several, if vnodes are used). The token range a node owns runs from its own token backwards to (but not including) the previous node's token. That range is then replicated the desired number of times onto the nodes that own the following tokens, going forward (clockwise) around the ring.
The diagram in this DataStax blog post on repair illustrates this well: https://www.datastax.com/dev/blog/repair-in-cassandra
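As a small worked example with made-up tokens (real Murmur3 tokens are much larger numbers): suppose your three nodes own tokens 0, 100 and 200, and the keyspace has RF=2.
node A, token 0   -> owns the range (200, 0]  (wrapping around the ring)
node B, token 100 -> owns the range (0, 100]
node C, token 200 -> owns the range (100, 200]
A key whose hash is 150 falls in (100, 200], so node C is its primary replica, and with RF=2 the next node clockwise (node A) holds the second copy.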
We have a 1 DC cluster running Cassandra 3.11. The DC has 8 nodes total with 16 tokens per node and 3 seed nodes. We use Murmur3Partitioner.
In order to ensure better data distribution in the upcoming cluster in another DC, we want to use the token allocation approach, where you manually specify initial_token for the seed nodes and use allocate_tokens_for_keyspace for the non-seed nodes.
The problem is that our current datacenter's cluster is not well balanced, since we built it without the token allocation approach, which means the tokens are not well distributed. I can't figure out how to calculate initial_token for the new seed nodes in the new datacenter. I probably cannot treat the token range of the new cluster as independent and calculate the initial tokens as I would for a fresh cluster. At this point I am very unsure how to proceed. Any help will be appreciated, thanks.
Currently, I am trying to put together a migration plan, and I have reached the point where I do not know what to do and the documentation is not helpful.
There are scripts available to calculate initial_token values; for example, you could use the one here to quickly calculate them:
https://www.geroba.com/cassandra/cassandra-token-calculator/
You also have the ability to set allocate_tokens_for_keyspace and point it at a keyspace with the replication factor you plan to use for user-created keyspaces in the cluster. If you're adding a new DC, you probably already have such a keyspace, and this should help you get a better distribution. Remember to set this before bootstrapping nodes into the new DC.
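A minimal cassandra.yaml sketch of the two roles, assuming you keep 16 tokens per node as in the existing DC (the keyspace name and token placeholders are illustrative only):
# seed nodes in the new DC: pin their tokens explicitly
num_tokens: 16
initial_token: <token1>,<token2>,...,<token16>
# non-seed nodes in the new DC: let the allocation algorithm pick tokens
num_tokens: 16
allocate_tokens_for_keyspace: my_app_keyspace   # a keyspace with the RF you plan to use
These settings must be in place before each node starts for the first time.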
Another option is to avoid vnodes entirely and go with a single-token architecture by setting num_tokens to 1. This gives you the ability to bootstrap nodes into the new DC, load/stream the data, and then monitor the distribution and make changes as needed using 'nodetool move':
https://cassandra.apache.org/doc/3.11/cassandra/tools/nodetool/move.html
This method requires you to monitor the distribution and adjust the token assignments as needed, and you'd want to follow up each move with 'nodetool repair' and 'nodetool cleanup' on all nodes, but it gives you the ability to rectify uneven distribution quickly without bootstrapping new nodes. You would still want to use the same method for calculating the initial_token values with the single-token architecture, and set them before bootstrap.
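A sketch of one rebalancing step with single-token nodes (host and token are placeholders):
nodetool -h <node_ip> move <new_token>   # reassign this node's token
nodetool -h <node_ip> repair             # re-sync data for the changed range
nodetool cleanup                         # run on all nodes afterwards to drop data they no longer own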
I suspect either method could work well for you, but wanted to give you a second option.
To keep the test and this question simple, I used two nodes (192.168.56.110 and 192.168.56.111) to build a Cassandra 2.1.11 cluster. I have now added one additional node (192.168.56.112) to the cluster, and I hoped to balance the new ring using the nodetool move command, but I ran into trouble with the following steps:
Getting all of 192.168.56.110's token ranges, for example 981588427421702712 -- 1007755089748978774
Getting all of the new node's token ranges, for example 5458173168911717635 -- 5458821955945522089
I executed the command:
[root@test-1 pengcz]# ../cassandra-2.1.11/bin/nodetool -h 192.168.56.110 -u admin -pw admin4587 move 5458173168911717635
error: target token 5458173168911717635 is already owned by another node.
-- StackTrace --
java.io.IOException: target token 5458173168911717635 is already owned by another node.
According to the Load balancing article, which says "If you add nodes to your cluster your ring will be unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command.", I think I have misunderstood the nodetool move command. But I don't know how to understand it correctly or how to balance the new cluster. Any advice will be appreciated!
So, you changed the num_tokens setting to 512. After starting the new nodes, remember that seed nodes don't bootstrap automatically, while the other nodes rebalance data after they start. To balance a seed node you should run nodetool repair on that node. The time this takes depends on the data size.
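For example, on a seed node that joined without bootstrapping (the host is a placeholder):
nodetool -h <seed_node_ip> repair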
Can I force a query to fetch from the local node itself? We have two data centers, each with replication factor 3, and I want to verify that replication is being done properly 1) across nodes and 2) across data centers. Can I force a query to check only a particular node and see if the data is present on that node? I know getendpoints will fetch this if I give it specific ids, but if I want to check table updates in general and see whether the data is being replicated, how can I do this? Apart from LOCAL_QUORUM, do we have any other option? Thanks
I can understand why someone might want to do this (as a sanity check), and I'll put instructions on how to do it below. First, though, it's not really necessary: Cassandra is a distributed system, and under normal circumstances there is no need to check that data is on a given node; replication will place a given row on nodes according to where the snitch determines placement. So for a replication factor like DC1:3 and DC2:3, a row may be on any 3 nodes in each DC. As long as you can query the cluster as a whole, or each DC, and get the right results, you know that your replication is working.
Having said that, here is how you find a key. The caveat is that it has to have been flushed to disk (you might need to invoke nodetool flush). It may seem convoluted, but this is how you trace a key down to an SSTable, so you may find it useful to know:
Use nodetool getendpoints to locate the nodes the key is on
Use nodetool getsstables to find the sstables the key is present in
Locate the file on disk, then use sstable2json to view the contents of the table, or sstablekeys to view just the keys
Note: the sstable2json tool is only documented under Cassandra 1.2, but it is still present in 2.1; at least I have verified that it is, and I have used it in DSE 4.7 and 4.8.
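Putting the steps together, a command sequence might look like this (keyspace, table, key and file path are all placeholders; run getsstables on one of the replica nodes reported by getendpoints):
nodetool flush my_ks my_table
nodetool getendpoints my_ks my_table 'some_key'
nodetool getsstables my_ks my_table 'some_key'
sstablekeys <path-to-the-Data.db-file-reported-by-getsstables>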
I've read quite a few articles and a lot of question/answers on SO about Cassandra but I still can't figure out how Cassandra decides which node(s) to go to when it's reading the data.
First, some assumptions about an imaginary cluster:
Replication Strategy = simple
Using Random Partitioner
Cluster of 10 nodes
Replication Factor of 5
Here's my understanding of how writes work based on various Datastax articles and other blog posts I've read:
Client sends the data to a random node
The "random" node is decided based on the MD5 hash of the primary key.
Data is written to the commit log and memtable, and then propagated 4 times (with RF = 5).
The 4 next nodes in the ring are then selected and data is persisted in them.
So far, so good.
Now the question is, when the client sends a read request (say with CL = 3) to the cluster, how does Cassandra know which nodes (5 out of 10 as the worst case scenario) it needs to contact to get this data? Surely it's not going to all 10 nodes as that would be inefficient.
Am I correct in assuming that Cassandra will again, do an MD5 hash of the primary key (of the request) and choose the node according to that and then walks the ring?
Also, how does the network topology case work? if I have multiple data centers, how does Cassandra know which nodes in each DC/Rack contain the data? From what I understand, only the first node is obvious (since the hash of the primary key has resulted in that node explicitly).
Sorry if the question is not very clear and please add a comment if you need more details about my question.
Many thanks,
Client sends the data to a random node
It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once it connects and understands the topology of your cluster, it may change to a "closer" coordinator.
The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.
In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring from the command line. Although if you are using vnodes, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look-up Malcolm Reynolds. Running this query:
SELECT token(firstname),firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';
...returns this row:
token(firstname) | firstname | id | lastname
----------------------+-----------+----+-----------
4016264465811926804 | Mal | 2 | Reynolds
By running a nodetool ring I can see which node is responsible for this token:
192.168.1.22 rack1 Up Normal 348.31 KB 3976595151390728557
192.168.1.22 rack1 Up Normal 348.31 KB 4142666302960897745
Or even easier, I can use nodetool getendpoints to see this data:
$ nodetool getendpoints stackoverflow usersbyfirstname Mal
192.168.1.22
For more information, check out some of the items linked above, or try running nodetool gossipinfo.
Cassandra uses consistent hashing to map each partition key to a token value. Each node owns ranges of token values as its primary range, so that every possible hash value will map to one node. Extra replicas are then kept in a systematic way (such as the next node in the ring) and stored in the nodes as their secondary range.
Every node in the cluster knows the topology of the entire cluster, such as which nodes are in which data center, where they are in the ring, and which token ranges each node owns. The nodes obtain and maintain this information using the gossip protocol.
When a read request comes in, the node contacted becomes the coordinator for the read. It calculates which nodes have replicas for the requested partition and then picks the required number of nodes to meet the consistency level. It then sends requests to those nodes, waits for their responses, and merges the results based on the column timestamps before sending the result back to the client.
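If you want to watch the coordinator do this, request tracing in cqlsh shows which replicas were contacted for a read (the keyspace, table and key below are hypothetical):
CONSISTENCY QUORUM;
TRACING ON;
SELECT * FROM my_ks.users WHERE user_id = 42;
TRACING OFF;
The trace output lists the coordinator, the replicas it contacted, and the responses it merged before returning the row.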
Cassandra will locate any data based on a partition key that is mapped to a token value by the partitioner. Tokens are part of a finite token ring value range where each part of the ring is owned by a node in the cluster. The node owning the range of a certain token is said to be the primary for that token. Replicas will be selected by the data replication strategy. Basically this works by going clockwise in the token ring, starting from the primary, and stopping depending on the number of required replicas.
What's important to realize is that each node in the cluster is able to identify the nodes responsible for a certain key based on the logic described above. Whenever a value is written to the cluster, the node accepting the request (the coordinator node) will know right away the nodes that need to execute the write.
In the case of multiple data centers, every key maps to the exact same token in each DC, as determined by the partitioner. Cassandra then writes to the replicas in each DC according to that DC's replication factor.
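For reference, the per-DC replica counts come from the keyspace's replication settings; a keyspace spanning two data centers with three replicas in each might be declared like this (names are illustrative):
CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};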
I've read the relevant documentation I could find, but I still have doubts.
What I read
From http://wiki.apache.org/cassandra/Operations#Moving_nodes
If you add nodes to your cluster your ring will be unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command.
and from http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity-to-an-existing-cluster
If you need to increase capacity by a non-uniform number of nodes, you must recalculate tokens for the entire cluster, and then use nodetool move to assign the new tokens to the existing nodes. After all nodes are restarted with their new token assignments, run a nodetool cleanup to remove unused keys on all nodes
But I'm not clear on the order of these things.
Could you explain how to do it in the following scenario?
I'm using cassandra 1.1.9, so no virtual nodes are in use.
I have a cluster ring with 5 nodes, and each owns 20%
Their tokens are
0
34028236692093846346337460743176821145
68056473384187692692674921486353642291
102084710076281539039012382229530463436
136112946768375385385349842972707284582
I want to add 2 additional nodes.
What steps do I have to follow? I know I should install and configure cassandra, use the original 5 as seeds, and calculate their new tokens, but in what order should I move the data using nodetool move? Is it one at a time?
What happens with the data when I move the first one? Is it available at all times?
Should I start the two new nodes before moving the original 5 to their new tokens?
A step by step guide would be ideal.
Please note that I need to do it pre version 1.2
The new tokens should be
0
24305883351495604533098186245126300818
48611766702991209066196372490252601636
72917650054486813599294558735378902454
97223533405982418132392744980505203272
121529416757478022665490931225631504090
145835300108973627198589117470757804908
calculated as i * (2^127 / 7) for i = 0..6.
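If you want to reproduce those values, a quick one-liner (assuming a Python interpreter is available) prints all seven tokens:
python -c 'print("\n".join(str(i * (2**127 // 7)) for i in range(7)))'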
What steps do I have to follow?
in what order should I move the data using nodetool move?
You should
Bootstrap one new node at 48611766702991209066196372490252601636
Bootstrap the other node at 121529416757478022665490931225631504090
Move 34028236692093846346337460743176821145 to 24305883351495604533098186245126300818
Move 68056473384187692692674921486353642291 to 72917650054486813599294558735378902454
Move 102084710076281539039012382229530463436 to 97223533405982418132392744980505203272
Move 136112946768375385385349842972707284582 to 145835300108973627198589117470757804908
(I tried to minimise the amount of data transferred - it might not be optimal, but it is close enough not to make much difference given the imbalance of data you probably have already.)
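As a concrete sketch of one such step (the host IP is a placeholder; do one move at a time and let it finish before starting the next):
nodetool -h <ip_of_node_currently_at_34028236692093846346337460743176821145> move 24305883351495604533098186245126300818
Once every bootstrap and move has completed, run nodetool cleanup on all nodes to reclaim the space used by data they no longer own.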
Is it one at a time?
You should bootstrap one node at a time and move one token at a time. This avoids placing excess load on the cluster while streaming data.
What happens with the data when I move the first one? Is it available at all times?
Data is fully available during the move. The node participates in reads and writes for both the old and new ranges, so you can read and write during the move.
Should I start the two new nodes before moving the original 5 to their new tokens?
Always better to have more nodes in the cluster - if you moved first, you'd have some nodes with twice as much data as the others.
From Cassandra 1.2, keeping a cluster balanced when adding nodes is very easy thanks to the new vnodes (multiple tokens per node) feature. Cassandra now automatically balances the cluster for you. If you upgrade from an earlier version, you will have to activate the vnodes feature yourself.
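A minimal sketch of enabling vnodes in cassandra.yaml on a node, set before it first joins the cluster (256 is the commonly used default):
num_tokens: 256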