So I'm taking over our Cassandra cluster after the previous admin left, and I'm trying to learn as much as I can about it. I'm going through all the documentation on DataStax's site, as we're using their product.
That said, on the replication factor part I'm having a bit of trouble understanding why I wouldn't set the replication factor to the number of nodes I have. I currently have four nodes in one datacenter, and all nodes are in the same physical location.
What, if any, benefit would there be to having a replication factor of less than 4?
I'm just thinking it would be beneficial from a fault-tolerance standpoint if each node had its own copy/replica of the data; I'm not sure why I would want fewer replicas than the number of nodes I have. Are there performance tradeoffs or other reasons? Am I COMPLETELY missing the concept here (entirely possible)?
There are a few reasons why you might not want to increase your RF from 3 to 4:
Increasing your RF effectively multiplies your original data volume by that amount. Depending on your data volume and data density, you may not want to incur the additional storage hit. Conversely, keeping RF below the number of nodes is what lets you scale beyond a single node's storage capacity.
Depending on your consistency level you could take a performance hit. For example, when writing at QUORUM consistency level (CL) with an RF of 3, you wait for 2 replicas to respond before confirming the write to the client; with an RF of 4 you would wait for 3.
Regardless of the CL, every write will eventually go to every replica node. That is more activity on your cluster, and it may not perform well if your nodes aren't sized for that workload.
You mentioned fault tolerance. With an RF of 4 and reads at CL ONE, you can absorb up to 3 of your servers being down simultaneously and your app will still be up. From a fault-tolerance perspective this is pretty impressive, but also unlikely to help: if you have 3 nodes down at the same time in the same DC, the 4th is probably also down (natural disaster, flood, who knows...).
At the end of the day it all depends on your needs, and C* is nothing if not configurable. An RF of 3 is very common among Cassandra implementations.
Check out this deck by Joe Chu.
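To make the quorum arithmetic above concrete, here is a quick sketch in plain Python (just arithmetic, nothing Cassandra-specific):

    # A quorum is floor(RF / 2) + 1 replicas.
    def quorum(rf: int) -> int:
        return rf // 2 + 1

    print(quorum(3))  # 2 -> a QUORUM write waits on 2 replicas
    print(quorum(4))  # 3 -> moving to RF=4 means waiting on one more replica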
The reason your RF is often less than the number of nodes in the cluster is explained in the post Cassandra column family bigger than nodes' drive space, which gives some insight into this aspect of Cassandra replication. Here's a summary:
QUESTION: ... every node has 2 TB of drive space and the column family is replicated on every node, so every node contains a full copy of it ... after some years that column family will exceed 2 TB ...
Answer: RF can be less than the number of nodes, and it does not need to grow as you add more nodes.
For example, if today you had 3 nodes with RF 3, each node would contain a copy of all the data, as you say. But if you then add 3 more nodes and keep RF at 3, each node will hold only half the data. You can keep adding more nodes so that each node contains a smaller and smaller proportion of the data ... there is no limit in principle to how big your data can be.
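A quick sketch of that arithmetic (hypothetical numbers): with RF held constant, each node's share of the data shrinks as you add nodes.

    # Each node stores roughly total_data * RF / N.
    def per_node_tb(total_data_tb: float, rf: int, n_nodes: int) -> float:
        return total_data_tb * rf / n_nodes

    print(per_node_tb(2.0, 3, 3))  # 2.0 TB per node: every node holds everything
    print(per_node_tb(2.0, 3, 6))  # 1.0 TB per node: doubling nodes halves the share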
I'd highly appreciate it if someone could help with the questions below.
RF = Replication Factor
CL = Consistency Level
We have a requirement for strong consistency and high availability, so I have been testing RF and CL on a 7-node ScyllaDB cluster, keeping RF=7 (100% of the data on each node) and CL=QUORUM.
What happens to data replication if 2 nodes go down? Does it replicate the down nodes' data (the 6th and 7th copies) onto the remaining 5 nodes, or does it simply discard those copies? What is the effect of RF=7 when there are only 5 active nodes?
I could not find anything in the logs. Is there any document/link that covers this case? And how can I verify and prove this behaviour?
With RF=7, the data is always replicated to 7 nodes.
When a node (or two) goes down, the remaining five nodes already have a copy, and no additional streaming is required.
Using CL=QUORUM, even three nodes down will not hurt your HA or consistency, since a quorum of RF=7 is 4 replicas.
When the failed nodes come back to life, they will be synced, either automatically using Hinted Handoff (for a short outage) or with Repair (for a longer one) [1].
If you replace a dead node [2], the other replicas will stream the data to it until it is up to speed with the rest of the cluster.
[1] https://docs.scylladb.com/architecture/anti-entropy/
[2] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/replace_dead_node/
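On the "how can I verify this" part: one option is to ask the driver which replicas own a given partition key. A sketch using the DataStax Python driver (which also works against ScyllaDB); the contact point, keyspace, table, and key are hypothetical:

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_keyspace')  # hypothetical keyspace

    # Bind a statement so the driver computes the routing key for us.
    bound = session.prepare(
        "SELECT * FROM my_table WHERE id = ?").bind((42,))
    # With RF=7 on a 7-node cluster, this should print all 7 hosts.
    for host in cluster.metadata.get_replicas('my_keyspace', bound.routing_key):
        print(host.address)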
Data will always be replicated to all nodes because you have set RF=7. If 2 nodes go down, the remaining nodes will store hints for them; once the nodes come back up, the data is replicated to them automatically, as long as they return within the hint window. If the hint window (default 3 hours) has expired, you need to run a manual repair to get the data in sync across the cluster.
This may sound like a dumb question, but I still wanted someone/an expert to answer/confirm this.
Let's say I have a 3-node Cassandra cluster with one database and just one table. For this single table, let's say I get a throughput of 1K writes/second with the 3-node cluster. If tomorrow my write load on this table scales to 10K or 20K writes/second, will I be able to handle it by increasing the size of the cluster by, say, 10x or 20x?
My understanding of Cassandra says it is possible (as Cassandra is both read- and write-scalable), but I would want an expert to confirm.
Yes, Cassandra has linear scalability.
The scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Source : https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e
Yes, but only if your data is properly modeled: your data especially needs to be distributed evenly among your partition keys (since they map to specific replica nodes) to avoid hot spots. Given that, yes, Cassandra will scale horizontally well.
A "table" in Cassandra is distributed among all nodes in your cluster. Each node is responsible for a range of tokens, which are hashes of the partition key portion of your primary key.
Now, if you double your node count, the existing token ranges are split in half and redistributed while bootstrapping the new nodes, so each node then handles only half of your initial requests. If you double your requests afterwards, each node carries roughly the same load as before.
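A toy sketch of that token-range idea in plain Python; MD5 stands in here for Cassandra's actual Murmur3 partitioner, and the node names and vnode count are made up:

    import bisect
    import hashlib

    def token(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    # Each node claims several points (vnodes) on the ring.
    ring = sorted(
        (token(f"{node}#{v}"), node)
        for node in ["node1", "node2", "node3", "node4"]
        for v in range(8)
    )
    tokens = [t for t, _ in ring]

    def owner(partition_key: str) -> str:
        """Return the node whose token range covers this key's hash."""
        i = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
        return ring[i][1]

    print(owner("user:42"))  # evenly varied keys spread across all four nodes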
For read-intensive workloads, choosing a higher replication factor helps when you can live with stale data for a while (e.g. reading and writing at a low consistency level).
There are good tutorials from DataStax available here https://academy.datastax.com/
DataStax states that:
What are the benefits of Apache Cassandra?
Massively scalable ring architecture: Based on the best of Amazon Dynamo and Google BigTable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability.
Linear scale performance: Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
So the answer is YES, it is possible. It may take some time to add a new node and redistribute tokens, but it will scale as you change the number of nodes.
If you need more info to understand how it scales, check out the links below:
Benchmarking Cassandra Scalability on AWS
Adding nodes to Cassandra
Adding, replacing, moving and removing nodes
Yes, but with one caveat: you should also consider the replication factor (RF) and consistency level (CL), as they affect scaling behaviour too.
For example, if you initially have 10 nodes with RF=3 and you increase the node count to 20 with the same RF=3, you'll get a linear increase in write throughput.
But if you want to increase read throughput, you need to increase RF, and with an increased RF you would have to decrease the write consistency level to keep write throughput up.
To summarize, you cannot increase both read and write throughput linearly while keeping the same RF and CL params.
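The usual way to reason about that trade-off is the overlap rule: a read is strongly consistent when the replicas it contacts overlap the replicas a write contacted, i.e. R + W > RF. A sketch in plain Python with made-up values:

    # R + W > RF guarantees a read overlaps the latest acknowledged write.
    def strongly_consistent(read_replicas: int, write_replicas: int, rf: int) -> bool:
        return read_replicas + write_replicas > rf

    print(strongly_consistent(2, 2, 3))  # True:  QUORUM reads + QUORUM writes
    print(strongly_consistent(1, 1, 3))  # False: CL=ONE both ways is only eventual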
One of my C* cluster designs expects nodes to hold between 1 and 2 TB of data each, and I expect a huge amount of data within a few months. Pretending I can get to 1 PB of data and that each node will hold 1 TB, that means I should plan for 1000x growth over time: starting from a modest N=3 nodes with RF=3 for 1 TB of data, I would keep adding nodes up to N=3000.
The high number of nodes involved puts pressure on how to deal with disk/server failures, how to keep the cluster healthy, and how to perform backups.
Healthy Cluster
Assuming you don't want any data loss and you perform reads/writes at LOCAL_QUORUM consistency level, RF=3 is very reasonable when you have N<10 nodes. However, as N goes up, the mean time between failures somewhere in the cluster drops accordingly, so keeping RF=3 is asking for trouble, and you may want to "upgrade" to RF=5 or more.
Q1: What's a good RF to counter that shrinking time between failures and keep the cluster healthy (and you sleeping peacefully) with, say, 100 nodes? With 500? With 1000?
BACKUP
Making backups of all the nodes seems unviable, for the following reasons:
It instantly doubles the cost of the solution.
I would be backing up redundant data, due to the RF of the cluster.
I see no way to remove the redundancy introduced by the RF and back up only the data itself, except by adding another DC to C* with RF=2 (I could go with RF=1, but if I lose one node the whole backup cluster is down). That would mean adding 2/RF of the cost of the cluster for backup purposes, which seems to me a decent alternative.
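For what it's worth, that backup-DC idea is just a keyspace-level replication setting. A sketch with the Python driver; the contact point, keyspace, and data-center names are made up:

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect()
    # The live DC keeps RF=3; the backup DC stores 2 copies,
    # i.e. 2/RF of the live cluster's footprint.
    session.execute(
        "ALTER KEYSPACE my_keyspace WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'live_dc': 3, 'backup_dc': 2}"
    )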
Q2: Are there other methods to perform this task without increasing the cost of the solution too much?
I have created a two-node Cassandra cluster and am trying to perform a load test. I find that one node or two nodes doesn't make much difference in throughput. I had supposed that if 1 node gives me 2,000 tps for inserts, two nodes should double that amount. Does it work like that?
If it does not, then what does scaling actually mean, and how can I relate it to latency or throughput?
Cassandra is scalable. Your case is just a bit simplified, since two nodes is not really a case of high scalability. You should be aware of the token partitioning algorithm used by Cassandra; as soon as you understand it, there should not be any questions. There are plenty of presentations about it, e.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
In the case of replication factor 1, everything is simple:
Each key-value pair you save to or read from Cassandra is a query to one of the nodes in the cluster. Data is evenly distributed among the nodes (see the details of the partitioning algorithm), so the total load is always spread evenly across all nodes: the more nodes you have, the more load they can carry, and the relationship is linear. The system should, of course, be configured correctly to avoid network bottlenecks of various kinds.
In the case of a replication factor greater than 1 the situation is a bit more complicated, but the principle is the same.
There are a lot of factors that can contribute to this result.
A) Check your replication factor. Although not generally desirable, in your case you can set it to 1.
B) Look at the partition key in your primary key. If you are not varying it in your tests, then you are loading skewed data and the table is not scaling out to the 2 nodes; see the sketch below.
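As a hypothetical illustration of B): a composite partition key spreads the same write load across the ring instead of hammering a single partition. The table and column names here are made up:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')  # hypothetical
    # The partition key is (sensor_id, day), so even writes for one sensor
    # rotate partitions over time; a fair load test should also vary sensor_id.
    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            sensor_id int,
            day date,
            ts timestamp,
            value double,
            PRIMARY KEY ((sensor_id, day), ts))
    """)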
What does it mean when we say Cassandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing out their available resources, you can simply add another node or two, adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know exactly what is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference when adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra
I am new to Cassandra and would like to learn more about its racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS EC2 instances.
1. How many data centers will be used?
2. How many nodes will each rack have?
3. Is it possible to divide a column family across multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, in which the quorum (floor(replication_factor / 2) + 1) is computed only from the nodes in the same data center as the coordinator node. This reduces the effects of inter-data-center latency.
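A sketch of a LOCAL_QUORUM read with the Python driver, pinning the client to one data center (classic 3.x-style API; the contact point, DC name, keyspace, and query are made up):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
    from cassandra.query import SimpleStatement

    cluster = Cluster(
        ['10.0.0.1'],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='dc1')),
    )
    session = cluster.connect('my_keyspace')

    # The quorum here is computed only against dc1's replicas.
    query = SimpleStatement(
        "SELECT * FROM my_table WHERE id = 42",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(query)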
As for your questions:
How many data centers are used is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you plan to use consistency level ONE. As in, if one instance goes down, your application only needs to be able to find one other replica. But even then, the snitch can only find the data on one instance or the other.
Again, you can define the number of nodes you wish to have in each rack. But as I indicated in #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family across multiple keyspaces, but I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and if you then lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (a replication factor of 2).
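Concretely, that two-instance setup would look something like this (the keyspace name is made up; SimpleStrategy is fine for a single data center):

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect()
    # RF=2 on 2 instances: every row lives on both, so either one can go dark.
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 2}"
    )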
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches