Consider the following Cassandra setup:
ring of 6 nodes: A, B, D, E, F, G
replication factor: 3
partitioner: RandomPartitioner
placement strategy: SimpleStrategy
My Test-Column is stored on node B and replicated to nodes D and E.
Now I have multiple Java processes reading my Test-Column through the Hector API (Thrift) with read CL.ONE.
There are two possibilities:
1) Hector will forward all calls to node B, because B is the data master.
2) Hector will load-balance read calls across nodes B, D and E (master and replicas). In this case my Test-Column would be loaded into the cache on each Cassandra instance.
Which one is it, 1) or 2)?
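For reference, a minimal sketch of how the reads are issued (the cluster, keyspace, column family and row key names here are placeholders):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.ColumnQuery;
    import me.prettyprint.hector.api.query.QueryResult;

    public class TestColumnReader {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
                    new CassandraHostConfigurator("nodeA:9160,nodeB:9160"));

            // Read at CL.ONE
            ConfigurableConsistencyLevel cl = new ConfigurableConsistencyLevel();
            cl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
            Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, cl);

            ColumnQuery<String, String, String> q = HFactory.createColumnQuery(
                    keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
            q.setColumnFamily("MyCF");
            q.setKey("my-row-key");
            q.setName("Test-Column");
            QueryResult<HColumn<String, String>> result = q.execute();

            System.out.println(result.get() == null ? "not found" : result.get().getValue());
        }
    }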
Thanks and regards,
Maciej
I believe it is: 3) Cassandra forwards all calls to the closest node that is alive, where "closeness" is determined by the Snitch currently being used (set in cassandra.yaml).
SimpleSnitch chooses the closest node on the token ring.
AbstractNetworkTopologySnitch and derived snitches first try to choose nodes in the same rack, then nodes in the same datacenter.
If DynamicSnitch is enabled, it dynamically adjusts the node closeness returned by the underlying snitch, according to the nodes' recent performance.
See Cassandra ArchitectureInternals under "Read Path" for more information.
(Upvoted Theodore's answer because it is correct.)
Some additional details:
We do nothing on the Hector side to route traffic to a given node based on the key (yet). This was referred to as "client-mediated selects" in section 6.2 of the Amazon Dynamo paper. The research seems to indicate that it is really only useful for very large clusters, by cutting out a network hop.
The downside would be the duplication of hashing computation and partitioner lookup on the client.
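To illustrate what that duplicated work would look like on the client with RandomPartitioner, the token computation is essentially an MD5 of the row key (this is a sketch of the idea, not Hector code):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class TokenSketch {
        // token(key) = abs(MD5(key)), the same computation RandomPartitioner performs
        // server-side. A token-aware client would then walk the ring to the first
        // node whose token is >= this value.
        static BigInteger tokenFor(byte[] rowKey) throws NoSuchAlgorithmException {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(rowKey)).abs();
        }
    }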
Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert new data into the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted into node 0 or node 2 (even though it should not be placed there according to the partition key)?
In Cassandra, the Replication Factor (RF) determines how many copies of the data will ultimately exist, and it is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". The replica nodes can receive the data in several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using hinted handoff - if one or some of the nodes are unavailable for less than a configured amount of time (3 hours by default), Cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair", or if you're using DSE, OpsCenter can repair/reconcile data for a table, keyspace, or the entire cluster (NodeSync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) determines how many nodes must acknowledge that they have successfully received the data before the client can move on (for writes), or how many nodes must be compared when data is read to ensure accuracy (for reads). The number of available nodes must be equal to or greater than the CL specified, or the application will error (for example, it cannot compare a QUORUM of nodes if a QUORUM of nodes is not available). This setting does not dictate how many nodes will receive the data; again, that is the RF keyspace setting, and that always holds true. What we are specifying here is how many nodes must acknowledge each write, or be compared on each read, for the client to be satisfied at that moment. Hopefully that makes sense.
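As a concrete (hedged) illustration with the DataStax Java driver 3.x, using a made-up keyspace and table: RF is fixed when the keyspace is created, while the CL is chosen per request.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class RfVsClExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // RF lives at the keyspace layer: 3 copies of every row will exist.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.kv (id int PRIMARY KEY, v text)");

            // CL is chosen per request: here 2 of the 3 replicas must ack the write...
            session.execute(new SimpleStatement("INSERT INTO demo.kv (id, v) VALUES (1, 'x')")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));

            // ...and 2 of the 3 replicas are consulted and compared on the read.
            Row row = session.execute(new SimpleStatement("SELECT v FROM demo.kv WHERE id = 1")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM)).one();
            System.out.println(row.getString("v"));

            cluster.close();
        }
    }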
Now...
In your scenario with RF=1, the application will receive an error on the write, as the single node that should receive the data (based on a hash algorithm) is down (again, RF=1 means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to live on the unavailable node). Does that make sense?
If you had RF=2 (2 copies of the data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm chooses where the copy or copies go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose RF=3 (3 copies), then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints <keyspace> <table> <partition-key>". The output lists the nodes where all copies will/do reside.
There are four nodes in the cluster; assume they are nodes A, B, C and D. Hinted handoff is enabled.
1) Create a keyspace with RF=2, and create a table.
2) Take nodes B and C down (nodetool stopdaemon).
3) Log in to node A with cqlsh, set CONSISTENCY ANY, and insert a row (assume the row would be stored on nodes B and C). The row is inserted successfully even though nodes B and C are down, because the consistency level is ANY; the coordinator (node A) wrote hints.
4) Take node A down (nodetool stopdaemon), then remove node A (nodetool removenode ${nodeA_hostId}).
5) Bring nodes B and C back up (start Cassandra on them again).
6) Log in to any of nodes B, C or D and run a SELECT with the partition key of the inserted row. There is no trace of the row inserted in step 3.
These steps lead to the data inserted in step 3 being lost.
Is there any problem with the steps I performed above?
If yes, how should I deal with this situation?
Looking forward to your reply, thanks.
CONSISTENCY.ANY will result in data loss in many scenarios. It can be as simple as a polar bear ripping a server off the wall as soon as the write is ACK'd to the client (not even applied to a single commitlog yet). This is for writes that are equivalent to being OK with durable_writes=false, where client latency is more important than actually storing the data.
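To make that concrete, here is roughly what the step-3 write looks like when issued at CL.ANY from the DataStax Java driver (a sketch; the contact point and table name are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class AnyWriteExample {
        public static void main(String[] args) {
            // Connect to node A while B and C (the natural replicas for this key) are down.
            try (Cluster cluster = Cluster.builder().addContactPoint("nodeA").build()) {
                Session session = cluster.connect();
                // CL.ANY: the coordinator (A) acks as soon as it has stored a hint,
                // even though no real replica holds the row yet. If A is removed
                // before the hint is replayed to B or C, that hint, and the row, are gone.
                session.execute(new SimpleStatement("INSERT INTO ks.t (pk, v) VALUES ('k1', 'v1')")
                        .setConsistencyLevel(ConsistencyLevel.ANY));
            }
        }
    }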
If you want to ensure no data loss, have an RF of at least 3 and use QUORUM; then you can be confident that any write you get an ack for will survive a single node failure. RF=2 can work with QUORUM, but that's the equivalent of CL.ALL, which means any node failure, GC pause, or hiccup will cause loss of availability.
It is important to recognize that hints are not about guaranteed delivery, just about possibly reducing the time to convergence when data becomes inconsistent. Repairs within gc_grace_seconds are still necessary to prevent data loss. If you're using weak consistency, weak durability and low replication, you open yourself up to data loss.
Because removenode does not stream the data from the node that is being removed; it just tells the cluster that the node is leaving and rebalances the existing cluster.
Please refer to https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRemoveNode.html
Scenario:
Total Nodes: 3 [ A, B, C ]
Replication factor: 2
Write consistency: Quorum (2 replicas need to ack)
Read consistency: Quorum
Node partition ranges:
A [ primary 1-25, replica 26-50 ]
B [ primary 26-50, replica 51-75 ]
C [ primary 51-75, replica 1-25 ]
Question:
Suppose I need to insert data 30 and node A is down. What would be the behavior of Cassandra in this situation? Would Cassandra be able to write the data and report success back to the driver (even though the replica node is down and Cassandra needs 2 nodes to acknowledge a write)?
You only have 1 replica available for the write (B), so you'll get error on write (UnavailableException).
It's better to design your consistency levels / replication factor so that you can tolerate node's failure for a token range (consider bumping your RF to 3).
It's also better not to try to solve availability by following the eventual-consistency path (R + W <= N), e.g. by putting W=1 in this case. We've tried that and operationally it was not worth the effort.
Is there a strong reason behind RF=2? Given the scenario, QUORUM will not be satisfied when a node is down and your writes will fail. I suggest you revisit your RF.
You have identified one of the key reasons why RF=2 is not an advisable replication factor for highly available Cassandra deployments. What will happen depends on driver behavior (token-aware routing on or off):
Node B or C will be chosen as the coordinator
The coordinator will attempt to write to both B and A, because a quorum of 2 replicas is 2.
The coordinator will note that node A has not acknowledged the write and thus report back to the client that a Quorum was unable to be achieved.
Note, this does not mean that the write to node B failed; in fact the value is written to node B and the coordinator will store a hint for node A. However, you have not achieved your consistency goal, so in most situations it is advisable to retry the write until the node comes back up. In this specific setup you are effectively doing ALL, which is not going to give the expected behavior when a node fails.
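To make the "effectively ALL" point concrete: quorum is floor(RF / 2) + 1, so with RF=2 a quorum is every replica, while with RF=3 one replica can be down and QUORUM reads/writes still succeed. A quick sketch:

    public class QuorumMath {
        // Quorum is floor(RF / 2) + 1.
        static int quorum(int rf) {
            return rf / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println(quorum(2)); // 2 -> same as ALL: no replica may be down
            System.out.println(quorum(3)); // 2 -> one replica may be down and QUORUM still succeeds
        }
    }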
TL;DR: don't use QUORUM with RF=2.
I am a beginner with Hazelcast and am trying to understand the following.
In a normal peer-to-peer setup with 3 members, each being an individual partition: on a request, how is the right partition picked? Is there any router which helps with every request? How is the request served?
Thanks
Hazelcast doesn't use consistent hashing so the answer given by Jeremie B is not exactly accurate.
There's a couple of important concepts in Hazelcast:
Partitions - by default there are 271 partitions, which are spread evenly among the nodes. Each node owns "primary" partitions and contains "backup" partitions.
Hash function - maps a key to a partition, so in a simplified version it looks like this: hash(key) % partitionCount = partition
Partition table - keeps the mapping between partitions and nodes, or to be more precise between partitions and replicas. The first replica of each partition is the "primary" partition, the second, third... are the backups.
In order to contact the right node:
a "smart" client keeps track of the "Partition Table".
it uses the hashing algorithm to calculate the partition where the key is stored.
it looks up that partition in the "Partition Table" and connects to the node that contains the given replica.
There's also the concept of a "dummy" client, which doesn't know which node it should connect to. Requests issued by a dummy client are routed to the right node by the node it connects to (if that isn't already the right node by coincidence).
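A small sketch of the two client modes described above, assuming the Hazelcast 3.x Java client (the member address and map name are placeholders):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class ClientRoutingExample {
        public static void main(String[] args) {
            ClientConfig config = new ClientConfig();
            config.getNetworkConfig().addAddress("10.0.0.1:5701");

            // true (the default) = "smart" client: it keeps the partition table,
            // hashes each key to a partition and sends the operation directly to
            // the member that owns that partition.
            // false = "dummy" (unisocket) client: it talks to a single member,
            // which forwards the request to the right owner when needed.
            config.getNetworkConfig().setSmartRouting(true);

            HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
            IMap<String, String> map = client.getMap("my-map");
            map.put("some-key", "some-value"); // routed straight to the partition owner
            client.shutdown();
        }
    }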
The core of Hazelcast is based on a distributed hash table (DHT), without a master node. It works with two pieces of shared knowledge between nodes:
An ordered list of the nodes participating in the cluster
A hash function
For 1/, Hazelcast uses the list of nodes ordered from the oldest to the youngest. This information is "easy" to get and doesn't need to be synchronized through some election. 2/ is just some code/configuration.
The principle of the DHT is simple: imagine you have three nodes, ordered A, B and C. If you want to know which node is responsible for a key K, you simply hash the key and take this value modulo 3. If you get 0, it's node A; if you get 1, it's node B; and if you get 2, it's node C.
Of course, this is only a simplified view of Hazelcast: for example, each structure is split into X partitions, and each node owns more than one partition. Moreover, each partition is replicated, so for each partition there is one "master" node and several "backup" nodes. But you get the point: no master node, no routing node, every node "knows" where each piece of data belongs.
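For what it's worth, here is the simplified scheme from this answer rendered literally in Java; it is only the idea, not Hazelcast's actual implementation (which hashes keys into 271 partitions and then consults a partition table, as the other answer explains):

    public class SimplifiedDht {
        // Hash the key and take it modulo the member count.
        // With members ordered [A, B, C]: index 0 -> A, 1 -> B, 2 -> C.
        static int ownerIndex(Object key, int memberCount) {
            return Math.floorMod(key.hashCode(), memberCount);
        }

        public static void main(String[] args) {
            System.out.println(ownerIndex("K", 3)); // prints 0, 1 or 2
        }
    }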
Setup: I have 4 nodes in a Cassandra cluster (same datacenter). The replication factor is 3. Write consistency is set to ALL.
As I understand it, Cassandra doesn't have a master node, so I can write data to any node I want. Let's say I have 3 nodes A, B and C. I write record 123 with value 4 to node A.
Question 1: Will the execute() method on the Session object block until the data has been replicated on all replicas?
Another situation: let's say record 123 with a value of 5 is also written, to node B, 100 milliseconds after the request for inserting record 123 with a value of 4 arrived at node A.
Question 2: When B is a replica of A, how does Cassandra handle this situation in its architecture? Will the Cassandra nodes use their internal time to decide which node received the record first? Or will all replicas share the same lock for writing data?
Question 3: When B is not a replica of A, and read consistency is set to ALL, if I query for the value of record 123 on node A or B at random, how does Cassandra handle this?
I'm new to Cassandra, so any answer or help is highly appreciated.
Thank you very much.
Will the execute() method on the Session object block until the data has been replicated on all replicas?
The session object will be blocked until N acknowledgements of your mutation(s) have been received, where N depends on the chosen consistency level. In your case, since you're using ALL, the client will block until acknowledgements are received from all replicas.
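A small sketch of that blocking behavior with the DataStax Java driver 3.x (contact point and table names are made up); the synchronous call blocks until the CL is satisfied, while the async variant returns a future instead:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class BlockingWriteExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("nodeA").build()) {
                Session session = cluster.connect();

                SimpleStatement write = new SimpleStatement("INSERT INTO ks.t (id, v) VALUES (123, 4)");
                write.setConsistencyLevel(ConsistencyLevel.ALL);

                // Blocks the calling thread until ALL replicas have acknowledged the write.
                session.execute(write);

                // Non-blocking alternative: returns immediately, and the future completes
                // once the required acknowledgements have been received.
                ResultSetFuture future = session.executeAsync(write);
                future.getUninterruptibly(); // blocks here instead
            }
        }
    }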
When B is a replica of A, how does Cassandra handle this situation in its architecture? Will the Cassandra nodes use their internal time to decide which node received the record first? Or will all replicas share the same lock for writing data?
The coordinator node (the one which receives the request) will dispatch the write, in parallel, to all replicas. With modern drivers like the Java driver, most of the time the coordinator node is chosen so that it is a replica for the partition being inserted, to avoid one extra network hop.
The role of the coordinator is also to set a timestamp value on each column of your write. This timestamp is the same for every replica and is sent to all of them; when two writes to the same column conflict, the one with the higher timestamp wins.
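That timestamp is what resolves your "value 4 vs value 5" conflict: no lock is shared between replicas, the cell with the higher write timestamp simply wins on every replica. Illustrated here with explicit client-supplied timestamps and a hypothetical ks.t table (normally the coordinator assigns the timestamp for you):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class LastWriteWins {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("nodeA").build()) {
                Session session = cluster.connect();
                // USING TIMESTAMP just makes the mechanism visible;
                // both writes target the same partition key.
                session.execute("INSERT INTO ks.t (id, v) VALUES (123, 4) USING TIMESTAMP 1000");
                session.execute("INSERT INTO ks.t (id, v) VALUES (123, 5) USING TIMESTAMP 2000");
                // Once both writes have reached a replica, reading id = 123 from it
                // returns v = 5: the cell written with the higher timestamp (2000) wins.
            }
        }
    }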
When B is not a replica of A, and read consistency is set to ALL, if I query for the value of record 123 on node A or B at random, how does Cassandra handle this?
In this case, the node which receives the request, called coordinator, will act as a proxy by forwarding the request to the appropriate replica(s) and by forwarding the response(s) it receives back to the client.
Each node knows about the topology of the whole cluster (token range, IP address) so that each node can play the role of a coordinator at any time.
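The "extra network hop" mentioned above is what token-aware routing in the Java driver avoids; a minimal sketch (the contact point is a placeholder):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class TokenAwareExample {
        public static void main(String[] args) {
            // The driver hashes the partition key itself and, when possible, picks a
            // coordinator that is also a replica of that key, so the coordinator does
            // not have to proxy the request to another node.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            cluster.connect();
            cluster.close();
        }
    }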
More details about how the data distribution is handled in Cassandra here: http://www.slideshare.net/doanduyhai/cassandra-introduction-apache-con-2014-budapest/18