I'm trying to explain this to myself.
Here's how I understand it:
Suppose I have 4 nodes, RF = 3 and CL = QUORUM for both read & write.
In my table (id, title) I write the data {id = 1, title = 'mytext'}; the write will return success once 2 nodes have written it successfully. Say it succeeds: we now have (at least) 2 nodes with {id = 1, title = 'mytext'} and potentially one node with {id = 1, title = 'olddata'}.
Then any subsequent read (where id = 1) needs to find 2 nodes (QUORUM) with the same data in order to return successfully, which will never happen with the old data, because there is at most 1 remaining node containing the old value.
Is that accurate?
The number of nodes is not that important; what matters more is the RF, i.e. how many nodes hold a copy of the data. With RF = 3, a quorum is (RF / 2, rounded down) + 1 = 2, so CL QUORUM means (see the client-side sketch after this list):
2 nodes have to confirm it on write
2 nodes have to confirm it on read
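Expressed as a minimal client-side sketch (assuming the Python driver; the contact point, the keyspace name "ks" and the table name "my_table" are placeholders, nothing from the question beyond its (id, title) columns):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])        # contact point is illustrative
session = cluster.connect('ks')         # keyspace name is illustrative

# With RF = 3, QUORUM = (3 // 2) + 1 = 2 replicas.
write = SimpleStatement(
    "INSERT INTO my_table (id, title) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)     # returns once 2 replicas ack
session.execute(write, (1, 'mytext'))

read = SimpleStatement(
    "SELECT title FROM my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)     # waits for 2 replicas to answer
print(session.execute(read, (1,)).one().title)     # the newest value wins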
Under the hood the request will not actually go to all of the nodes. Based on some statistics and the snitch component, the coordinator will choose one of the nodes that holds the data, and the other replicas will only be asked for a digest (hash), not the whole data. If the full data matches the digests, it's returned to the client. If not, the coordinator will request the full data from the other replicas and resolve the conflict using the last-write-wins policy.
In order to be able to do this, clocks have to be in sync, usually via NTP, but some operators go as far as installing GPS receivers on their hosts to keep clock skew really tight.
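A quick way to see last-write-wins in action (a sketch only, reusing the session from the snippet above; the explicit USING TIMESTAMP values are just there to make the ordering obvious):

# The mutation carrying the higher write timestamp wins at read time,
# regardless of the order in which the statements reach the replicas.
session.execute(
    "INSERT INTO my_table (id, title) VALUES (1, 'olddata') USING TIMESTAMP 1000")
session.execute(
    "INSERT INTO my_table (id, title) VALUES (1, 'mytext') USING TIMESTAMP 2000")

row = session.execute(
    "SELECT title, WRITETIME(title) FROM my_table WHERE id = 1").one()
print(row)   # title = 'mytext', writetime(title) = 2000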
In short, your reasoning is totally OK.
And if you want to learn a bit more about all the combinations, it doesn't hurt to look at the following link:
https://www.ecyrd.com/cassandracalculator/
Related
As part of the Cassandra developer training from DataStax, I was given the question below:
"In a full network partition, that is, parts of the cluster are completely disconnected from the whole, only the largest group of nodes can still satisfy queries."
I answered "YES" here, because even though we break the Cassandra cluster apart, the largest group can still reset itself to satisfy the consistency level and serve requests.
But I see my answer was wrong. Can anyone please explain why?
Because of that little thing we call "data."
The problem doesn't mention specific numbers of nodes that went down, or the replication factor (RF) that the keyspaces are defined with. Because of that, you have no guarantees whatsoever as to the specific token ranges (and replicas thereof) that could also be down; in all likelihood, complete sets of data replicas are down in this case.
the largest group can still reset itself
I think I know what you mean here. When nodes are decomm'd or removed, the remaining nodes adjust their token range assignments to ensure 100% data coverage. That's true. However, the data associated with those ranges doesn't automatically move with them.
That doesn't happen until a repair operation is run. And if multiple nodes are down, (again) including complete sets of data replicas, you may not have the nodes you need to stream some of the data.
Example:
Say we have a 12 node cluster (in a single DC), keyspaces defined with RF=3, and the nodes become "split" into groups of 2 (group A), 3 (group B), and 7 (group C).
If group C is still serving queries, there will be some data partitions which originally:
Had all replicas in group C. These queries will still succeed.
Had 1 replica in group A or B. These queries will still succeed at QUORUM or less, but will now fail at ALL.
Had 2 replicas in group A or B (or both). These queries will still succeed at ONE, but will now fail at all other consistency levels.
Had all data on nodes in both groups A and B. All queries for these partitions will fail.
Had all data on nodes in group B. All queries for these partitions will fail.
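The cases above all come down to how many of a partition's RF = 3 replicas ended up in the surviving group C. A small back-of-the-envelope check, purely for illustration:

RF = 3

def consistency_levels_that_still_work(replicas_in_group_c):
    """Which consistency levels can still be satisfied for a partition,
    given how many of its RF replicas sit in the surviving group C."""
    required = {'ONE': 1, 'QUORUM': RF // 2 + 1, 'ALL': RF}
    return [cl for cl, needed in required.items()
            if replicas_in_group_c >= needed]

for reachable in range(RF + 1):
    print(reachable, "replica(s) in group C ->",
          consistency_levels_that_still_work(reachable) or "all queries fail")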
Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert new data into the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted into node 0 or node 2 (even though it should not be placed there according to the partition key)?
In Cassandra, the Replication Factor (RF) determines how many copies of the data will ultimately exist, and it is set/configured at the keyspace level. Its purpose is to define how many nodes/copies should exist if things are operating "normally". Those nodes can receive the data in several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one or more of the nodes are unavailable for less than the configured hint window (3 hours by default), Cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair", or, if you're using DSE, OpsCenter can repair/reconcile data for a table, keyspace, or the entire cluster (NodeSync is a newer DSE tool similar to repair)
During a read repair - read operations, depending on the configurable client consistency level (described next), can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not
The configurable client consistency level (CL) determines how many nodes must acknowledge that they have successfully received the data in order for the client to move on (for writes), or how many nodes to consult and compare when data is read to ensure accuracy (for reads). The number of replica nodes available must be equal to or greater than the CL specified, or the operation will error (for example, it can't compare a QUORUM of nodes if a QUORUM of nodes is not available). This setting does not dictate how many nodes will ultimately receive the data; again, that's the keyspace's RF setting, and that always holds true. What you're specifying here is how many nodes must acknowledge each write, or be compared on each read, for the client to be satisfied at that moment. Hopefully that makes sense.
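To make the split concrete: RF is declared once on the keyspace, while CL travels with each request. A hedged sketch with the Python driver (keyspace/table names and the contact point are made up for illustration):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# RF lives on the keyspace: 3 copies of every row will ultimately exist.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo_ks.t (id int PRIMARY KEY, val text)")

# CL lives on the request: how many replicas must acknowledge right now.
insert = SimpleStatement(
    "INSERT INTO demo_ks.t (id, val) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)   # 2 of the 3 replicas must ack
session.execute(insert, (1, 'hello'))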
Now...
In your scenario with RF=1, the application will receive an error on the write, because the single node that should receive the data (determined by the hash of the partition key) is down. RF=1 again means only a single copy of the data will exist, and the hash algorithm places that single copy on the node that is currently unavailable. Does that make sense?
If you had RF=2 (2 copies of the data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and the replica placement strategy chooses where the copies go), and when the unavailable node came back, it would eventually receive the data (either by hinted handoff or repair). If you chose RF=3 (3 copies), then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does live in a Cassandra cluster, you can run "nodetool getendpoints <keyspace> <table> <partition key value>". The output lists the nodes where all copies will/do reside.
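For completeness, here's roughly what the client sees in the RF=1 case described above (a sketch only; it assumes the Python driver and a hypothetical keyspace rf1_ks defined with RF=1, and the exact exception can vary with the driver's retry policy):

from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

stmt = SimpleStatement(
    "INSERT INTO rf1_ks.t (id, val) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)

try:
    session.execute(stmt, (42, 'payload'))
except Unavailable as exc:
    # With RF=1 there is exactly one replica for this partition; if that node
    # is down, no consistency level can be satisfied and the write errors out.
    print("not enough live replicas:", exc)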
I have 2 nodes with replication factor = 1, which I understand to mean there will be one copy of the data on each node.
Based on the above, when I use the Murmur3Partitioner:
Will the data be shared among the nodes, i.e. 50% of the data on node 1 and 50% on node 2?
When I send a read request to node 1, will it internally connect to node 2 for consistency?
My intention is to have a replica on each node so that both nodes can serve requests independently, without inter-node communication.
First of all, please try to ask only one question per post.
I have 2 nodes with replication factor = 1, which I understand to mean there will be one copy of the data on each node.
Incorrect. An RF=1 indicates that your entire cluster will have exactly 1 copy of the data.
Will the data be shared among the nodes, i.e. 50% of the data on node 1 and 50% on node 2?
That is what it will try to do. Do note that it probably won't be exact. It'll probably be something like 49/51-ish.
When I send a read request to node 1, will it internally connect to node 2 for consistency?
With RF=1, no, it will not. Based on the hashed token of your partition key, the request will be directed only to the node that contains the data.
As an example, with RF=2 on 2 nodes, it would depend on the consistency level set for your operation. Reading at ONE will always read from only one replica. Reading at QUORUM will always read from both replicas (after all, a QUORUM of 2 equals 2). Reading at ALL will require a response from all replicas, and will initiate a read repair if they do not agree.
Important to note: you cannot force your driver to connect to a specific Cassandra node. You may provide one endpoint, but it will find the other via gossip, and use it as it needs to.
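A small sketch of that RF=2, 2-node example with the Python driver (addresses and names are placeholders). Note that only one contact point is given; the second node is discovered automatically, and the per-statement consistency level decides how many of the 2 replicas must answer:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])     # one endpoint; the other node is found via gossip
session = cluster.connect('my_ks')

select = session.prepare("SELECT * FROM my_table WHERE id = ?")

select.consistency_level = ConsistencyLevel.ONE      # 1 of the 2 replicas must respond
row_one = session.execute(select, (1,)).one()

select.consistency_level = ConsistencyLevel.QUORUM   # QUORUM of 2 = 2 replicas
row_quorum = session.execute(select, (1,)).one()

select.consistency_level = ConsistencyLevel.ALL      # all replicas; disagreement triggers read repair
row_all = session.execute(select, (1,)).one()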
We have less than 50GB of data for a table and we are trying to come up with a reasonable design for our Cassandra database. With so little data we are thinking of having all data on each node (2 node cluster with replication factor of 2 to start with).
We want to use Cassandra for easy replication: safeguarding against node failures and keeping copies of the data in different parts of the world, and Cassandra is brilliant for that.
Moreover, the best model we have come up with so far implies that a single query (consistency level 1-2) would involve getting data from multiple partitions (avg = 2, 90th percentile = 20). Most of the queries would ask for data from <= 2 partitions, but some might go up to 5k.
So my question is whether this is really a problem. Is Cassandra slow at retrieving data from multiple partitions if we ensure that all the partitions are on a single node?
EDIT:
I misread the question; my apologies to other folks coming here later. Please look at the code for TokenAwarePolicy as a basis for determining replica owners; once you have that, you can combine it with an IN query to get multiple partitions from a single node (see the sketch below). Still, be mindful of the total query size.
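A rough sketch of that approach with the Python driver (keyspace, table and keys are placeholders; the grouping of keys by replica owner is left out, the point is just token-aware routing plus a small, bounded IN list):

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    ['127.0.0.1'],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
session = cluster.connect('my_ks')

# Prepared statements let the driver route by token; binding a list to IN ?
# fetches a small batch of partition keys in one request. Keep the list short.
select_many = session.prepare("SELECT * FROM my_table WHERE pkey IN ?")
rows = session.execute(select_many, (['k1', 'k2', 'k3'],))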
Original for reference:
Don't get data from multiple partitions in a single query; the detail of the why is here.
The TL;DR: you're better off querying multiple partitions asynchronously than requiring the coordinator to do that work (a sketch follows this list):
If the query fails, you have to retry more of it (which is particularly ugly when you have a very large partition or two in that query).
You're waiting on the slowest partition for any response to come back, when you could be returning parts of the answer as they come in (or even showing a progress meter based on the parts that are done).
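A minimal sketch of the asynchronous alternative (again Python driver, placeholder names): fire one single-partition query per key and collect results as they complete, instead of making the coordinator gather everything:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')

select_one = session.prepare("SELECT * FROM my_table WHERE pkey = ?")

keys = ['k1', 'k2', 'k3', 'k4']
futures = [session.execute_async(select_one, (k,)) for k in keys]

for key, future in zip(keys, futures):
    try:
        rows = future.result()       # each partition comes back independently
        print(key, list(rows))
    except Exception as exc:         # a slow or failed partition only costs its own retry
        print("retry just this key:", key, exc)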
I did some testing on my machine, and the results contradict what Ryan Svihla proposed in another answer.
TL;DR: storing the same data across multiple partitions and retrieving it via the IN operator is much slower than storing the data in a single partition and retrieving it in one go. PLEASE NOTE that all of the action happens on a single Cassandra node (as the conclusion should be more than obvious for a distributed Cassandra cluster).
Case A
Insert X rows into a single partition of the table defined below. Retrieve all of them via SELECT specifying the partition key in WHERE.
Case B
Insert X rows each into a separate partition of the table defined below. Retrieve all of them via SELECT specifying multiple partition keys using WHERE pKey IN (...).
Table definition
pKey: Text PARTITION KEY
cColumn: Int CLUSTERING KEY
sParam: DateTime STATIC
param: Text (size of each was 500 B in tests)
Results
Using the Phantom driver:
X = 100:     A - 10 ms,    B - 150 ms      (r = B/A = 15)
X = 1000:    A - 20 ms,    B - 1400 ms     (r = 70)
X = 10000:   A - 100 ms,   B - 14000 ms    (r = 140)
Using DevCenter (it has a limit of 1000 rows retrieved in one go):
X = 100:     A - 20 ms,    B - 900 ms      (r = 45)
X = 1000:    A - 30 ms,    B - 1300 ms     (r = 43)
Technical details:
Phantom driver v 2.13.0
Cassandra 3.0.9
Windows 10
DevCenter 1.6
I need some help with data modelling for Cassandra.
Here is the problem description:
I have 3 servers processing user requests: NodeA, NodeB and NodeC. I have 1000 different developers (potentially 10,000) and must maintain a $ balance for each of them per processing node.
I can see 2 ways of modeling this:
1) CF with developerid+balanceid as the row key. The column names will be NodeA, NodeB and NodeC.
create table developer_balances (   -- table name is illustrative
    developerBalanceid int PRIMARY KEY,
    nodeA varchar,
    nodeB varchar,
    nodeC varchar
);
2) CF with wide rows with node ids as keys. The column name will be developerid+balanceid. This seems similar to time-series data being stored in Cassandra.
create table node_balances (   -- table name is illustrative
    nodeid varchar,
    developerBalanceid int,
    balance varchar,   -- value column holding the per-developer balance
    PRIMARY KEY (nodeid, developerBalanceid)   -- wide row: one partition per node id
);
Operations:
a) Writes: every 5 seconds, every node will update the $ balance for every developer. More specifically, at every time t+5, NodeA will write 1000 balance values, NodeB will write 1000 balance values, and so will NodeC.
b) Reads: Reads also occur every 5 seconds to read a specific developerBalance.
It appears 2) is the best way to model this.
I do have some concerns about how wide rows will work with the query I want to do.
In the worst case, how many IOPS will a wide-row read incur?
Should I be looking at other optimizations like compression on the writes?
I understand that I can run some tests and examine performance. But I would like to hear other experiences too.
The essential rule when modeling with Cassandra is "model from your queries". The main argument in your question is:
read a specific developerBalance.
If you query by developerBalance, then developerBalance must be the beginning of your primary key. Your solution 1 is better to me.
With solution 2 you won't be able to write
select * from my_table where developerBalanceid=?
... without scanning the whole cluster
You must understand what Cassandra querying cannot do, and what the partition key and clustering key are.
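To make that concrete, with solution 1 (developerBalanceid as the partition key) the read the question needs is a plain single-partition lookup. A sketch with the Python driver; the keyspace and the developer_balances table name are the illustrative ones used above:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')   # keyspace name is illustrative

# Solution 1: developerBalanceid is the partition key, so this read is routed
# straight to the replica(s) owning that single partition.
get_balance = session.prepare(
    "SELECT nodeA, nodeB, nodeC FROM developer_balances "
    "WHERE developerBalanceid = ?")
row = session.execute(get_balance, (42,)).one()

# Solution 2 would need the partition key (nodeid) in the WHERE clause;
# filtering on developerBalanceid alone means scanning every partition.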