I have a development cassandra cluster of two cassandra nodes [Let's call them NodeA and NodeB]. I also have a script that is continuously sending data on NodeA. I have created the database with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason NodeB is stoping after some time. But the issue is, as soon as NodeB stops, the script that is sending data to NodeA starts giving data insertion error.
Can anyone point out a probable reason for the same.
Update: Both the nodes are seed nodes.
How Cassandra handle data repartition
Each key in cassandra can be converted to a token. When you install your cluster, the nodes calculate what range of token they will accept.
Let's take a simple example:
You have two nodes, and a token that goes from 0 to 9. A simple repartition would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra works for write
You choose a Coordinator (in your case node A), that receive the data. This node will then calculate a token. As seen in the first example, every node has a range of token assigned to it. So imagine the key is converted to token 4, then the data goes to node A (here the coordinator). If the token is 8, the data will be sent to node B.
What is cassandra data replication factor
The replication factor is how many time your data will be stored on your cluster. For a single database with no racks (your case), the data is first send to the node who owns the token associated with the key, and the replicas are sent to the next node in the topology.
In case of failure of one node, the replicas will help the node to restore its data.
In your case, there are no replicas, and if a node is down, Cassandra can't store the data and throws an error. If you have replication factor 2, Cassandra should be able to store a replica on node A and not fail.
Cassandra's Replication Factor:
Lets say we have 'n' as replication factor which means given input data will be stored/retrieved from 'n' nodes.
t
If you mention the replication factor as '1' which means only one node will have the data.
Partitioning:
Lets say we have 2 nodes, whenever you are inserting the data. Both these nodes will have some data, based on partitioning algorithm mentioned.
For example:
You are inserting 10 records, based on the hashing and partitioning algorithm, it chooses which node needs to be written for each record. Of-course the identification of node is done by the Coordinator :)
Durable Writes:
By default, cassandra always write in commit-log before flushing to disk. If you set to false, it will bypass commit-log and write directly to disk(SSTable).
The problem you have mentioned, for example lets say you are inserting 10 rows.
For simplicity, we can make the partitioning/hashing calculation as n/2.
So, Cassandra's Coordinator node splits up your data into two pieces(for simple calculation it will be 10/2) and tries to put 1st half in to 1st node and succeeds and tries to put the 2nd half into the second node(writing to commit-log), since it is unavailable it is throwing error.
So how do we fix this issue? lets say I want to batch insert multiple insert queries when 1 node in a cluster is down? It returns me
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not counter table , you can use consistency level of ANY which gives high availaiblity for write.
Refer this to learn more about it => https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
Related
Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, Replication Factor (RF) determines how many copies of data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". They could receive the data several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours), cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) will determine how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied to move on (for writes) - or how many nodes to compare with when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified or the application will error (for example it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes are not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting. That will always hold true. What we're specifying here is how many must acknowledge each write or compare for each read in order the client to be happy at that moment. Hopefully that makes sense.
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had a RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm will chose where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose a RF=3 (3 copies) then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.
I have 2 nodes with replication factor = 1, it means will have 1 set of data copy in each node.
Based on above description when i use murmur3partitioner,
Will data shared among nodes ? like 50% of data in node1 and 50% of data in node 2?
when i read request to node 1 , will it internally connect with node 2 for consistency ?
And my intention is to make a replica and both nodes should server the request independently without inter communication.
First of all, please try to ask only one question at per post.
I have 2 nodes with replication factor = 1, it means will have 1 set of data copy in each node.
Incorrect. A RF=1 indicates that your entire cluster will have 1 copy of the data.
Will data shared among nodes ? like 50% of data in node1 and 50% of data in node 2?
That is what it will try to do. Do note that it probably won't be exact. It'll probably be something like 49/51-ish.
when i read request to node 1 , will it internally connect with node 2 for consistency ?
With a RF=1, no it will not. Based on the hashed token value of your partition key, it will be directed only to the node which contains the data.
As an example, with a RF=2 with 2 nodes, it would depend on the consistency level set for your operation. Reading at ONE will always read only one replica. Reading at QUORUM will always read from 2 replicas with 2 nodes (after all, QUORUM of 2 equals 2). Reading at ALL will require a response from all replicas, and initiate a read repair if they do not agree.
Important to note, but you cannot force your driver to connect to a specific Cassandra node. You may provide one endpoint, but it will find the other via gossip, and use it as it needs to.
I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.
(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key was unique enough, when using RF=1 each of your 3 nodes contains 1/3 of your data. BTW, in this case CL=ONE/ALL is basically the same as there's only 1 replica for your data and no High Availability (HA).
Requests for "existing" data from the 2 up nodes will succeed. Still, when one of the 3 nodes is down a 1/3 of your client requests (for the existing data) will not succeed, as basically 1/3 of you data is not available, until the down node comes up (note that nodetool repair is irrelevant when using RF=1), so I guess restore from snapshot (if you have one available) is the only option.
While the node is down, once you execute nodetool decommission, the token ranges will re-distribute between the 2 up nodes, but that will apply only for new writes and reads.
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/
I know that Cassandra have different read consistency levels but I haven't seen a consistency level which allows as read data by key only from one node. I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read. Even if we choose a consistency level of one we will ask all nodes but wait for the first response from any node. That is why we will load not only one node when we read but 3 (4 with a coordinator node). I think we can't really improve a read performance even if we set a bigger replication factor.
Is it possible to read really only from a single node?
Are you using a Token-Aware Load Balancing Policy?
If you are, and you are querying with a consistency of LOCAL_ONE/ONE, a read query should only contact a single node.
Give the article Ideology and Testing of a Resilient Driver a read. In it, you'll notice that using the TokenAwarePolicy has this effect:
"For cases with a single datacenter, the TokenAwarePolicy chooses the primary replica to be the chosen coordinator in hopes of cutting down latency by avoiding the typical coordinator-replica hop."
So here's what happens. Let's say that I have a table for keeping track of Kerbalnauts, and I want to get all data for "Bill." I would use a query like this:
SELECT * FROM kerbalnauts WHERE name='Bill';
The driver hashes my partition key value (name) to the token of 4639906948852899531 (SELECT token(name) FROM kerbalnauts WHERE name='Bill'; returns that value). If I am working with a 6-node cluster, then my primary token ranges will look like this:
node start range end range
1) 9223372036854775808 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to 1844674407370955161
5) 1844674407370955162 to 5534023222112865484
6) 5534023222112865485 to 9223372036854775807
As node 5 is responsible for the token range containing the partition key "Bill," my query will be sent to node 5. As I am reading at a consistency of LOCAL_ONE, there will be no need for another node to be contacted, and the result will be returned to the client...having only hit a single node.
Note: Token ranges computed with:
python -c'print [str(((2**64 /5) * i) - 2**63) for i in range(6)]'
I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read
Wrong, with Consistency Level ONE the coordinator picks the fastest node (the one with lowest latency) to ask for data.
How does it know which replica is the fastest ? By keeping internal latency stats for each node.
With consistency level >= QUORUM, the coordinator will ask for data from the fastest node and also asks for digest from other replicas
From the client side, if you choose the appropriate load balancing strategy (e.g. TokenAwareStrategy) the client will always contact the primary replica when using consistency level ONE
I've read quite a few articles and a lot of question/answers on SO about Cassandra but I still can't figure out how Cassandra decides which node(s) to go to when it's reading the data.
First, some assumptions about an imaginary cluster:
Replication Strategy = simple
Using Random Partitioner
Cluster of 10 nodes
Replication Factor of 5
Here's my understanding of how writes work based on various Datastax articles and other blog posts I've read:
Client sends the data to a random node
The "random" node is decided based on the MD5 hash of the primary key.
Data is written to the commit_log and memtable and then propagated 4 times (with RF = 5).
The 4 next nodes in the ring are then selected and data is persisted in them.
So far, so good.
Now the question is, when the client sends a read request (say with CL = 3) to the cluster, how does Cassandra know which nodes (5 out of 10 as the worst case scenario) it needs to contact to get this data? Surely it's not going to all 10 nodes as that would be inefficient.
Am I correct in assuming that Cassandra will again, do an MD5 hash of the primary key (of the request) and choose the node according to that and then walks the ring?
Also, how does the network topology case work? if I have multiple data centers, how does Cassandra know which nodes in each DC/Rack contain the data? From what I understand, only the first node is obvious (since the hash of the primary key has resulted in that node explicitly).
Sorry if the question is not very clear and please add a comment if you need more details about my question.
Many thanks,
Client sends the data to a random node
It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based-on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once it connects and understands the topology of your cluster, it may change to a "closer" coordinator.
The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.
In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring from the command line. Although if you are using vnodes, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look-up Malcolm Reynolds. Running this query:
SELECT token(firstname),firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';
...returns this row:
token(firstname) | firstname | id | lastname
----------------------+-----------+----+-----------
4016264465811926804 | Mal | 2 | Reynolds
By running a nodetool ring I can see which node is responsible for this token:
192.168.1.22 rack1 Up Normal 348.31 KB 3976595151390728557
192.168.1.22 rack1 Up Normal 348.31 KB 4142666302960897745
Or even easier, I can use nodetool getendpoints to see this data:
$ nodetool getendpoints stackoverflow usersbyfirstname Mal
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
192.168.1.22
For more information, check out some of the items linked above, or try running nodetool gossipinfo.
Cassandra uses consistent hashing to map each partition key to a token value. Each node owns ranges of token values as its primary range, so that every possible hash value will map to one node. Extra replicas are then kept in a systematic way (such as the next node in the ring) and stored in the nodes as their secondary range.
Every node in the cluster knows the topology of the entire cluster, such as which nodes are in which data center, where they are in the ring, and which token ranges each nodes owns. The nodes get and maintain this information using the gossip protocol.
When a read request comes in, the node contacted becomes the coordinator for the read. It will calculate which nodes have replicas for the requested partition, and then pick the required number of nodes to meet the consistency level. It will then send requests to those nodes and wait for their responses and merge the results based on the column timestamps before sending the result back to the client.
Cassandra will locate any data based on a partition key that is mapped to a token value by the partitioner. Tokens are part of a finite token ring value range where each part of the ring is owned by a node in the cluster. The node owning the range of a certain token is said to be the primary for that token. Replicas will be selected by the data replication strategy. Basically this works by going clockwise in the token ring, starting from the primary, and stopping depending on the number of required replicas.
What's important to realize is that each node in the cluster is able to identify the nodes responsible for a certain key based on the logic described above. Whenever a value is written to the cluster, the node accepting the request (the coordinator node) will know right away the nodes that need to execute the write.
In case of multiple data-centers, all keys will be mapped across all DCs to the exact same token determined by the partitioner. Cassandra will try to write to each DC and each DC's replicas.