I've just started learning about Cassandra and I have a doubt about NetworkTopologyStrategy. As we know, data in Cassandra is distributed across nodes based on the hash value of the partition key. If so, during a write operation with replication factor 3, the data is written to a total of 3 nodes.
The 1st copy goes to the node that owns that hash value, and the next two replicas are written to different racks of the same datacenter. Will this 2nd node have the same hash value index, or will it have a different one?
If different, won't the data be written under another hash value index?
Please provide some clarification on this.
All the copies will have the same hash value. Cassandra uses those hash values to compute binary hash trees, called Merkle trees, which are used by repair to identify when there are discrepancies between the versions of the records.
If you are interested, there is an explanation of this process here.
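If it helps to see the repair idea in code, here is a minimal, hypothetical Python sketch of building a Merkle tree over row hashes. The hash function, leaf layout and sample rows are made up for illustration and are not Cassandra's actual implementation; the point is only that two replicas holding identical data produce identical roots, and a single divergent row changes the root so repair knows something needs to be streamed.

import hashlib

def leaf_hash(token, row_bytes):
    # One leaf per row here; in Cassandra a leaf covers a small token sub-range.
    return hashlib.sha256(str(token).encode() + row_bytes).digest()

def merkle_root(hashes):
    # Pair hashes up level by level until a single root remains.
    while len(hashes) > 1:
        if len(hashes) % 2 == 1:
            hashes.append(hashes[-1])  # duplicate the last hash when the count is odd
        hashes = [hashlib.sha256(a + b).digest()
                  for a, b in zip(hashes[0::2], hashes[1::2])]
    return hashes[0]

rows = [(10, b"alice"), (20, b"bob"), (30, b"carol"), (40, b"dave")]
replica1 = merkle_root([leaf_hash(t, r) for t, r in rows])
replica2 = merkle_root([leaf_hash(t, r) for t, r in rows])
assert replica1 == replica2   # identical data -> identical roots

stale = [(10, b"alice"), (20, b"bob"), (30, b"old-carol"), (40, b"dave")]
assert merkle_root([leaf_hash(t, r) for t, r in stale]) != replica1  # discrepancy detected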
The keyspace created with multiple DC's generally has the following structure:
CREATE KEYSPACE cycling
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3,
'datacenter2' : 2
}
AND DURABLE_WRITES = true ;
Note that datacenter1 will have 3 replicas of each row and datacenter2 will have 2.
When data is written to Cassandra every DC will fulfill the replication factor as defined in the keyspace definition.
A write operation for a row will be done on the node which is responsible for the token of the given partition key. The replicas for that row will be written to the two subsequent nodes in the ring, in a clockwise manner in the best case.
The same sequence is followed within the other DC, datacenter2 in this example, but with only 2 copies of the row.
Will this 2nd node have the same hash value index or will it have a different one?
The assignment of partitions to nodes is done based on the tokens generated by the partitioner, which is Murmur3Partitioner by default.
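To make that concrete, here is a rough Python sketch of the placement idea. The hash function, token values and node names are made up for illustration (the real partitioner is Murmur3 and this ignores rack awareness); the point is that the row's token is computed once from the partition key, and every data centre independently picks its own replica nodes for that same token.

import hashlib
from bisect import bisect_left

def token_for(partition_key):
    # Stand-in for Murmur3Partitioner: any deterministic 64-bit hash illustrates the idea.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def replicas(token, ring, rf):
    # ring: sorted list of (node_token, node_name) for one data centre.
    # The owner is the first node whose token is >= the row's token (wrapping around);
    # the remaining copies go to the following nodes clockwise.
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]

dc1 = [(-4 * 10**18, "dc1-n1"), (-10**18, "dc1-n2"), (2 * 10**18, "dc1-n3"), (5 * 10**18, "dc1-n4")]
dc2 = [(-3 * 10**18, "dc2-n1"), (0, "dc2-n2"), (4 * 10**18, "dc2-n3")]

t = token_for("some-partition-key")
# The same token everywhere; each DC just chooses its own nodes for it.
print(t, replicas(t, dc1, rf=3), replicas(t, dc2, rf=2))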
Setting up the context:
Cassandra currently implements vnodes: 256 per node by default, which is tweakable in the cassandra.yaml file.
Vnodes, as I understand them, are token ranges/hash ranges, e.g. (x...y], where y is the token number of the vnode. Each physical node in Cassandra is assigned 256 random tokens, and each of those tokens is the boundary value of a hash/token range. The tokens assigned are within the range of -2^63 to 2^63-1 (the range of hash numbers which the Murmur3 partitioner may generate). So far so good.
Question:
1. Is a token range (vnode) a fixed range? Once set, will this token range be copied to other Cassandra nodes to satisfy the replication factor, i.e. is a token range (vnode) a fundamental chunk of data (tokens) which moves around together? Only in the case of bootstrapping a new node into the cluster might this token range (vnode) break apart to be assigned to another node.
Riding on the last proposition (say it is true), a vnode must then only contain tokens which belong to a given keyspace, because each keyspace (a container of column families/tables) has a defined replication strategy and replication factor, and it is highly likely that the replication factors of the keyspaces in a Cassandra cluster will vary.
Consider an example: the "system_schema" keyspace has an RF of 1, whereas I created a keyspace "test_ks" with RF 3. If a row of the system_schema keyspace has token number 2 (say) and a row of my test_ks has token number 5 (say), these 2 tokens can't be placed in the same token range (vnode). If a vnode is a consistent chunk of token ranges, say tokens 2 and 5 belong to the vnode with token number 10, then vnode 10 has to be placed on 3 different physical nodes to satisfy RF = 3 for test_ks, but we would be unnecessarily placing token 2 on 3 different nodes even though its RF is supposed to be 1.
Is this proposition correct, that a vnode is only dedicated to a given keyspace?
Which boils down to: out of the 256 tokens on a physical node, 20 vnodes (say) currently belong to the "system" keyspace and 80 vnodes (say) belong to test_ks.
Again riding on the above proposition, this means that each node should have the info about which vnodes in the cluster currently belong to which keyspace.
That way, when a new write comes in for a keyspace, the coordinator node would locate all vnodes in the cluster for that keyspace and assign the new row a token number which falls within the token ranges of that keyspace. That being the case, can I find out how many vnodes currently belong to a keyspace, in the entire cluster or on a given node?
Please do correct me if I'm wrong.
I have been following the below blogs and videos to get an understanding of this concept:
https://www.scribd.com/document/253239514/Virtual-Nodes-Strategies-for-Apache-Cassandra
https://www.youtube.com/watch?v=GddZ3pXiDys&t=11s
Thanks in advance
There is no fixed token-range; the tokens are just generated randomly. This is one of the reasons that vnodes were implemented - the idea being that if there are more tokens it is more likely that the resulting token-ranges will be more evenly distributed across nodes.
Token generation was recently improved in 3.0, allowing Cassandra to place new tokens a little more intelligently (see CASSANDRA-7032). You can also manually configure tokens (see initial_token), although it can become tricky to keep things balanced when it comes time to expand the cluster unless you plan on doubling the number of nodes.
The total number of tokens in a cluster is the number of nodes in the cluster multiplied by the number of vnodes per node.
In regards to placement of replicas, the first copy of a partition is placed on the node that owns that partition's token. The remaining copies required by the replication factor are placed sequentially on the next nodes in the ring that are in the same data centre. There is no relationship between tokens and keyspaces.
When a new write comes into a coordinator node, the coordinator node determines which node owns the partition by hashing the partition key. Note that for better performance this can actually be done by the driver instead if you use TokenAwarePolicy. The coordinator sends the write to the node that owns the partition, and if the data needs to be replicated the coordinator node also writes the replicas to the next two nodes sequentially in the token-space.
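For reference, with the DataStax Python driver token-aware routing is enabled by wrapping the load-balancing policy. A rough sketch (the contact point, data centre name and keyspace are placeholders, and newer driver versions configure this through execution profiles instead):

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# TokenAwarePolicy lets the driver hash the partition key itself and send each
# request straight to a replica, instead of relying on an arbitrary coordinator.
cluster = Cluster(
    ['10.0.0.1'],                        # placeholder contact point
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='datacenter1')))
session = cluster.connect('test_ks')     # placeholder keyspace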
For example, suppose that we have 3 nodes which each have one token: node1: 10, node2: 20 & node3: 30. If we write a record whose partition key hashes to 22, to a keyspace with RF3, then the first copy goes to node2, the second goes to node3 and the third goes to node1. Note that each replica is equally valid - there is nothing special about the "first" replica other than that it happens to be stored on the "first" replica node.
Vnodes do not change this process, they just split up each node's token ranges by allowing each node to have more than one token. For example, if our cluster now has 2 vnodes for each node, it might instead look like this: node1: 10, 25, node2: 20, 3 & node3: 30, 21. Now our write that hashed to 22 goes to node3 (because it owns the range from 21-24), and the copies go to node1 and node2.
In Mongo we can go for any of the below models:
Simple replication (without sharding, where one node works as the master and the others as slaves), or
Sharding (where data is distributed across different shards based on the partition key)
Both 1 and 2
My question: can't we have Cassandra with just replication and no partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above, it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing data between nodes, but this distribution may depend on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor equal to 3, then all nodes will get the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key is also called the row key and forms the first part of the primary key, so you have one anyway. The 1st case of your Mongo example is not doable in Cassandra, mainly because Cassandra does not have the concept of masters and slaves. If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to in order to perform the operation. To my knowledge, the default here would be round-robin load balancing between the two nodes, so either of them will receive roughly the same load. If you have 3 nodes and a replication factor of 2, it becomes a little more tricky. The nice part is, though, that you can determine the set of nodes which hold your data in the client code, so you don't lose any performance by connecting to a "wrong" node.
One more thing about partitions: you can configure some of this, but it would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to Cassandra's default mechanism.
And one word about the secondary index thing: use materialized views.
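In case it helps, a materialized view is just another CQL statement. A hypothetical example (table and column names are made up, and the session object comes from connecting with the Python driver as usual):

# Assumed base table: test_ks.users with primary key user_id and a column email.
session.execute("""
    CREATE MATERIALIZED VIEW test_ks.users_by_email AS
        SELECT * FROM test_ks.users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id)
""")

# Reads by email now hit the view instead of needing a secondary index.
rows = session.execute("SELECT * FROM test_ks.users_by_email WHERE email = %s",
                       ("alice@example.com",))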
I have a development Cassandra cluster of two nodes [let's call them NodeA and NodeB]. I also have a script that is continuously sending data to NodeA. I have created the database with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason NodeB is stopping after some time. But the issue is, as soon as NodeB stops, the script that is sending data to NodeA starts giving data insertion errors.
Can anyone point out a probable reason for this?
Update: Both the nodes are seed nodes.
How Cassandra handles data distribution
Each key in Cassandra can be converted to a token. When you install your cluster, the nodes calculate what range of tokens they will accept.
Let's take a simple example:
You have two nodes, and tokens that go from 0 to 9. A simple distribution would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra handles writes
You choose a coordinator (in your case node A), which receives the data. This node then calculates a token for the key. As seen in the first example, every node has a range of tokens assigned to it. So imagine the key is converted to token 4: then the data goes to node A (here the coordinator). If the token is 8, the data is sent to node B.
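A tiny Python sketch of that routing decision, using the toy 0-9 token space from the example above (real tokens are 64-bit Murmur3 values):

RANGES = {"node A": range(0, 5),    # tokens 0-4
          "node B": range(5, 10)}   # tokens 5-9

def node_for(token):
    # The coordinator forwards the write to whichever node owns the token.
    for node, token_range in RANGES.items():
        if token in token_range:
            return node
    raise ValueError("token outside the toy 0-9 space")

print(node_for(4))  # node A
print(node_for(8))  # node B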
What is the Cassandra replication factor
The replication factor is how many times your data will be stored in your cluster. For a single data center with no racks (your case), the data is first sent to the node that owns the token associated with the key, and the replicas are sent to the next nodes in the topology.
In case of failure of one node, the replicas will help the node to restore its data.
In your case, there are no replicas, and if a node is down, Cassandra can't store the data and throws an error. If you have replication factor 2, Cassandra should be able to store a replica on node A and not fail.
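If you want the cluster to keep accepting writes while one node is down, one option (a sketch using your keyspace name) is to raise the replication factor to 2 and then repair, so that both nodes hold every row:

# Executed here via the Python driver; running the same CQL in cqlsh works too.
session.execute("""
    ALTER KEYSPACE test_database
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'}
""")
# After changing the RF, run `nodetool repair test_database` on the nodes so
# existing rows get copied to their new replica.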
Cassandra's Replication Factor:
Let's say we have 'n' as the replication factor, which means the given input data will be stored on (and retrieved from) 'n' nodes.
If you set the replication factor to '1', only one node will have the data.
Partitioning:
Let's say we have 2 nodes. Whenever you insert data, both of these nodes will have some of it, based on the partitioning algorithm mentioned.
For example:
You are inserting 10 records; based on the hashing and partitioning algorithm, Cassandra chooses which node needs to be written to for each record. Of course, the identification of the node is done by the coordinator :)
Durable Writes:
By default, Cassandra always writes to the commit log before flushing to disk. If you set this to false, it will bypass the commit log and write directly to disk (the SSTables).
Regarding the problem you have mentioned: for example, let's say you are inserting 10 rows.
For simplicity, we can treat the partitioning/hashing calculation as n/2.
So Cassandra's coordinator node splits your data into two pieces (for this simple calculation, 10/2), tries to put the 1st half into the 1st node and succeeds, and tries to put the 2nd half into the 2nd node (writing to its commit log); since that node is unavailable, it throws the error.
So how do we fix this issue? Let's say I want to batch multiple insert queries when 1 node in the cluster is down. It returns:
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not a counter table, you can use a consistency level of ANY, which gives high availability for writes.
Refer to this to learn more about it: https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
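A rough sketch of setting that consistency level per statement with the Python driver (the table and column names are made up):

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# With consistency ANY the write can succeed even if no replica is reachable:
# the coordinator stores a hint and replays it once a replica comes back.
insert = SimpleStatement(
    "INSERT INTO test_ks.events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(insert, (42, "hello"))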
This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and this 25% is replicated 3 times on the other DCs, meaning that every single row has 4 copies overall.
For instance, if my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's data is replicated only once or twice (instead of on the total number of other DCs (x3))?
For example: the 25% of the key space that is stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if the 25% on DC1 gets split over multiple DCs (1, 2, 3); I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible ?
Thank you
Best Regards
Your keyspace command shows that each of the DCs holds 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So I am not sure how you concluded that each of your DCs stores only 25% of the keys, as it is obvious they are storing 100%. Chances are that when you run the nodetool command you are not specifying the keyspace, so the command shows you the load based on the token range assigned to each node, which would be misleading for a NetworkTopology setup. Try running it with your keyspace name and see if you notice the difference.
I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted to have an even distribution and you had an equal number of nodes in each DC with initial tokens spaced evenly, you could have used SimpleSnitch to achieve what you want. You can change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopology works is that if you say you have DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across the 2 nodes, leading to a 50% effective load on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can really discard one DC and save money. I am saying this because I think these DCs of yours are virtual in the notion of your NetworkTopology and not real physical DCs, because no one would want to have only 25% of the data in one DC, as that would not be an available setup. So I recommend that if your nodes are grouped into virtual DCs, you group them into racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a, b, and c will have 100% of your data (each node in each rack 25%).
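For completeness, the keyspace definition matching that single-DC, three-rack suggestion would be along these lines (a sketch, shown via the Python driver to match the other examples; plain cqlsh works just as well):

# With RF 3 and three racks, each rack ends up with a full copy of the data.
session.execute("""
    CREATE KEYSPACE mySpace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3}
""")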
There is one particular table in the system that I actually want to keep unique on a per-server basis.
i.e. http://server1.example.com/some_stuff.html and http://server2.example.com/some_stuff.html should store and show data unique to that particular server. If one of the servers dies, that table and its data go with it.
I think CQL does not support table-level replication factors (see available create table options). One alternative is to create a keyspace with a replication factor = 1:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'<strategy>' [,'<option>':<val>]};
Example:
To create a keyspace with SimpleStrategy and the "replication_factor" option with a value of "1", you would use this statement:
CREATE KEYSPACE <ksname>
WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
Then all tables created in that keyspace will have no replication.
If you would like to have a table for each node, then I think Cassandra does not directly support that. One workaround is to start a separate Cassandra cluster for each node, where each cluster only has one node.
I am not sure why you would want to do that, but here is my take on this:
Distribution of your actual data among the nodes in your Cassandra cluster is determined by the row key.
Just setting the replication factor to 1 will not put all data from one column family/table on 1 node. The data will still be split/distributed according to your row key.
Exactly WHERE your data will be stored is determined by the row key along with the partitioner as documented here. This is an inherent part of a DDBS and there is no easy way to force this.
The only way I can think of to have all the data for one server physically in one table on one node is:
use one row key per server and create (very) wide rows (maybe using composite column keys; see the sketch after this list),
and trick your token selection so that all row key tokens map to the node you are expecting (http://wiki.apache.org/cassandra/Operations#Token_selection)
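A hypothetical sketch of the first idea in that list: key the table by a server identifier so that all of one server's data lives in a single wide partition, which the partitioner will then place on a single node (table, column and server names are made up; the session is assumed to be connected to the relevant keyspace):

# One partition per server; the clustering column keeps the partition "wide".
session.execute("""
    CREATE TABLE stuff_by_server (
        server_name text,        -- e.g. 'server1.example.com'
        item_id     timeuuid,
        payload     text,
        PRIMARY KEY (server_name, item_id)
    )
""")

# Everything written with the same server_name hashes to the same token,
# so it stays together on whichever node owns that token.
session.execute(
    "INSERT INTO stuff_by_server (server_name, item_id, payload) VALUES (%s, now(), %s)",
    ("server1.example.com", "some stuff"))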