How does Cassandra partitioning work when replication factor == cluster size? - cassandra

Background:
I'm new to Cassandra and still trying to wrap my mind around the internal workings.
I'm thinking of using Cassandra in an application that will only ever have a limited number of nodes (less than 10, most commonly 3). Ideally each node in my cluster would have a complete copy of all of the application data. So, I'm considering setting replication factor to cluster size. When additional nodes are added, I would alter the keyspace to increment the replication factor setting (nodetool repair to ensure that it gets the necessary data).
I would be using the NetworkTopologyStrategy for replication to take advantage of knowledge about datacenters.
In this situation, how does partitioning actually work? I've read about a combination of nodes and partition keys forming a ring in Cassandra. If all of my nodes are "responsible" for each piece of data regardless of the hash value calculated by the partitioner, do I just have a ring of one partition key?
Are there tremendous downfalls to this type of Cassandra deployment? I'm guessing there would be lots of asynchronous replication going on in the background as data was propagated to every node, but this is one of the design goals so I'm okay with it.
The consistency level on reads would probably generally be "one" or "local_one".
The consistency level on writes would generally be "two".
Actual questions to answer:
Is replication factor == cluster size a common (or even a reasonable) deployment strategy aside from the obvious case of a cluster of one?
Do I actually have a ring of one partition where all possible values generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If I were to use a write consistency of "one" does Cassandra always write the data to the node contacted by the client?
Are there other downfalls to this strategy that I don't know about?

Do I actually have a ring of one partition where all possible values
generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If all of my nodes are "responsible" for each piece of data regardless
of the hash value calculated by the partitioner, do I just have a ring
of one partition key?
Not exactly. Cassandra (C*) nodes still have token ranges, and C* still assigns a primary replica to the "responsible" node. But with RF = N (where N is the number of nodes), every node also holds a replica. So in essence the implication is the same as what you described.
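To make the "token ranges still exist, but every node ends up with a replica" point concrete, here is a minimal Python sketch of a ring. It is a simplification under stated assumptions: one token per node (no vnodes), MD5 instead of Cassandra's default Murmur3 partitioner, and plain clockwise placement with no rack or datacenter awareness. The names `token`, `replicas` and the `node-*` labels are made up for illustration.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Hash a string to an integer token (MD5 here, purely for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def replicas(ring, partition_key, rf):
    """Return rf nodes for the key: the token owner first, then the next nodes clockwise."""
    tokens = [t for t, _ in ring]
    # The owner is the first node whose token is >= the key's token, wrapping around.
    start = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((token(n), n) for n in nodes)

for key in ["user:1", "user:2", "user:3", "user:4"]:
    owners = replicas(ring, key, rf=len(nodes))
    print(key, "primary:", owners[0], "all replicas:", sorted(owners))
```

The primary replica can differ from key to key, but with `rf` equal to the number of nodes the replica set is always the whole cluster.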
Are there tremendous downfalls to this type of Cassandra deployment?
Are there other downfalls to this strategy that I don't know about?
Not that I can think of. You might be more susceptible than average to inconsistent data, so use C*'s anti-entropy mechanisms to counter this (repair, read repair, hinted handoff).
Consistency level quorum or all would start to get expensive but I see you don't intend to use them.
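The susceptibility to inconsistent data can be checked with the usual overlap rule: a read is only guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. A minimal Python sketch (the function name is hypothetical):

```python
def overlaps(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must intersect every acknowledged write set."""
    return read_replicas + write_replicas > rf

rf = 3
print("read ONE, write TWO:", overlaps(1, 2, rf))
print("read QUORUM, write QUORUM:", overlaps(2, 2, rf))
```

With the question's settings (RF = 3, writes at TWO, reads at ONE) the rule does not hold, so a read can land on the one replica that has not received the write yet. Repair, read repair and hinted handoff close that gap over time.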
Is replication factor == cluster size a common (or even a reasonable)
deployment strategy aside from the obvious case of a cluster of one?
It's not common. I guess you are looking for super high availability and all your data fits on one box. I don't think I've ever seen a C* deployment with RF > 5. By far the most common setting is RF = 3.
If I were to use a write consistency of "one" does Cassandra always
write the data to the node contacted by the client?
This depends on your load balancing policies at the driver. Often we select token-aware policies (assuming you're using one of the DataStax drivers), in which case requests are routed to the primary replica automatically. You could use round robin in your case and have the same effect, since every node is a replica.

The primary downfall will be increased write costs at the coordinator level as you add nodes. The maximum number of replicas written to that I've seen is around 8 (5 for other data centers and 3 for local replicas).
In practice this will mean reduced stability while performing large or batched writes (greater than 1 MB) or a lower per-node write TPS.
The primary advantage is that you can do a lot of things that would normally be awful or impossible. Want to use secondary indexes? They will probably work reasonably well (assuming cardinality and partition size don't become your bottleneck there). Want to add a custom UDF that does a GroupBy, or use very large IN queries? That will probably work too.
As @Phact mentions, it is not a common usage pattern. I primarily saw it used with DSE Search on low write throughput use cases that had requirements for 'single node' features from Solr, but for those same use cases with pure Cassandra you'd get some benefits on the read side and be able to do expensive queries that are normally impossible in a more distributed cluster.

Related

The relationship between Replication factor and Resource Usage

I am a Cassandra user in China. We recently decided to use Cassandra in our production environment, but I don't understand how the replication factor affects resource consumption.
My stress test shows that a replication factor of 3 uses three times the resources of a replication factor of 1, but I'm not sure that's right.
So, I would like to ask if there is a formula relating replication factor and resource consumption? Or has anyone ever tested it?
I'd be very grateful for any reply.
First of all, RF=3 means you need at least three servers (obviously). But really, it depends on what you mean by "resources." If that's mainly referring to disk space, then "yes" setting a RF=3 will use 3x the disk space that a single copy (RF=1) would.
So why would you want that? Because supporting data loads in highly-available (HA) scenarios is what Cassandra does really well. This means that Cassandra needs to be able to continue to serve requests if a node should fail. Achieving that means setting RF>1.
As for the remaining resources, if you're referring to network, CPU & RAM as well, then the answer is "it depends." An application can choose to query at different consistency levels, such as ONE, QUORUM, or ALL (and others). For ONE, it does just what it says: an operation (read or write) waits for acknowledgement from a single node.
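So if an app is querying at a consistency of ONE, the answer is "no": a read won't use three times the resources at RF=3. Writes are still sent to every replica, though; the consistency level only controls how many acknowledgements the coordinator waits for.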
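The two effects described above can be separated in a back-of-the-envelope Python sketch: disk usage scales with RF, while the number of acknowledgements an operation waits for depends on the consistency level. The function names and the 100 GiB figure are assumptions for illustration, and the disk estimate ignores compression and compaction overhead.

```python
def disk_footprint(raw_bytes: int, rf: int) -> int:
    """Total bytes stored across the cluster for raw_bytes of unreplicated data."""
    return raw_bytes * rf

def acks_waited_for(consistency_level: str, rf: int) -> int:
    """Number of replica acknowledgements the coordinator waits for."""
    levels = {"ONE": 1, "TWO": 2, "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[consistency_level.upper()]

raw = 100 * 1024**3  # 100 GiB of unreplicated data (an example figure)
for rf in (1, 3):
    print(f"RF={rf}: {disk_footprint(raw, rf) / 1024**3:.0f} GiB on disk")
for cl in ("ONE", "QUORUM", "ALL"):
    print(f"RF=3, CL={cl}: wait for {acks_waited_for(cl, 3)} ack(s)")
```

This is not a full resource formula. CPU, memory and network depend on the workload, so the stress test remains the better guide for those.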
Cassandra is a distributed database, so it stores data according to the partition key and a hash algorithm. We can configure the number of replicas based on requirements and the nature of the application. A cluster of at least 3 nodes is recommended for production, but the replication factor (the number of copies of the data) is entirely up to you.
If you use a 3-node cluster with RF=3, then every node holds a full copy of the data (with RF=1, each node would hold approximately 1/3 of it). We need to size resources such as disk, CPU, memory and I/O equally on all 3 nodes for better performance. However, we can tune multiple things inside Cassandra (like consistency, compaction, network, OS) to improve performance and resource efficiency. 3 copies of the data will use more memory and disk than 1 copy. But if you care about availability and performance, you should use at least 2 copies of the data. You can refer to the link below for more details on RF calculation:
https://www.ecyrd.com/cassandracalculator/

Replication without partitioning in Cassandra

In Mongo we can go for any of the models below:
1. Simple replication (without shards, where one node works as master and the others as slaves), or
2. Sharding (where data is distributed across different shards based on the partition key)
3. Both 1 and 2
My question: can't we have Cassandra with just replication and no partitioning, like model 1 in Mongo?
From Cassandra vs MongoDB in respect of Secondary Index?
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing data between nodes. But this distribution may depend on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor equal to 3, then all nodes will get the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key is the first part of every table's primary key (older material calls it the row key), so you have one anyway. Case 1 of your Mongo example is not doable in Cassandra, mainly because Cassandra has no concept of masters and slaves.
If you have a 2-node cluster and a replication factor of 2, then the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and perform the operation on. To my knowledge, the default here would be round-robin load balancing between the two nodes, so each of them receives roughly the same load.
If you have 3 nodes and a replication factor of 2, it becomes a little more tricky. The nice part, though, is that the client code can determine the set of nodes which hold your data, so you don't lose any performance by connecting to a "wrong" node.
One more thing about partitions: you can configure some of this, but this would be per server and not per table. I've never used this, and personally I wouldn't recommend doing so. Just stick to the default mechanism of Cassandra.
And one word about the secondary index thing: use materialized views.

cassandra write throughput and scalability

This may sound like a dumb question but still I wanted someone/expert to answer/confirm this.
Let's say I have a 3-node Cassandra cluster. Let's say I have one database and just one table. For this single table, let's say I get a throughput of 1K writes/second with the 3-node cluster. If tomorrow my write load on this table increases to 10K or 20K, will I be able to handle this write load by increasing the size of the cluster by, say, 10x or 20x?
My understanding of cassandra says it is possible (as cassandra is both read and write scalable) but would want an expert to confirm.
Yes, Cassandra has Linear Scalability.
The scalability is linear, as shown in the chart in the source article. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Source : https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e
Yes - but only if your data is properly modeled - your data especially needs to be distributed evenly among your partition keys (since they map to specific replica nodes) to avoid hot spots. Given that, yes cassandra will scale horizontally well.
A "table" in cassandra is distributed among all nodes in your cluster. Each node is responsible for a range of tokens which are hashes of the partition key portion of your primary key.
Now, if you double your node count, for example, the existing token ranges are split in half and distributed while bootstrapping the new nodes. So each node will only handle half of your initial requests. If you double your requests afterwards, each node will have roughly the same load as before.
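The arithmetic behind that reasoning can be sketched in a few lines of Python. This is an idealized model, assuming perfectly even key distribution and counting only replica writes; the function name and the request figures are made up for illustration.

```python
def writes_per_node(cluster_writes_per_sec: float, node_count: int, rf: int) -> float:
    """Average replica writes each node applies per second, assuming even key distribution."""
    return cluster_writes_per_sec * rf / node_count

print("3 nodes, 1K writes/s:", writes_per_node(1_000, 3, 3))
print("6 nodes, 1K writes/s:", writes_per_node(1_000, 6, 3))
print("6 nodes, 2K writes/s:", writes_per_node(2_000, 6, 3))
```

Doubling the node count at a fixed RF halves the per-node share, and doubling the request rate afterwards brings each node back to its original load.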
For read intensive requests - choosing a higher replication factor helps when you can live with stale data for a while (e.g. read and write at a low consistency level).
There are good tutorials from DataStax available here https://academy.datastax.com/
Datastax states that:
What are the benefits of Apache Cassandra?
Massively scalable ring architecture: Based on the best of Amazon Dynamo and Google BigTable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability.
Linear scale performance: Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
So the answer is YES, it is possible. It may take some time to add a new node and redistribute tokens. But it will scale as you change the number of nodes.
If you need more info to understand how it will scale, check the links below:
Benchmarking Cassandra Scalability on AWS
Adding nodes to Cassandra
Adding, replacing, moving and removing nodes
Yes, that is so, but with one remark: you should also consider replication factor (RF) and consistency level (CL), as they affect the scaling behaviour too.
For example, if you initially have 10 nodes with RF=3 and you increase the node count to 20 with the same RF=3, you'll get a linear increase in write throughput.
But if you want to increase the read throughput, you need to increase RF. And with the increased RF you have to decrease the write consistency level to improve write throughput.
To summarize, you cannot increase both read and write throughput in a linear way with the same RF and CL params.

What does it mean when we say cassandra is scalable?

I have created a two-node Cassandra cluster and am trying to run a load test. I find that one node or two nodes doesn't make much difference in throughput. I had supposed that if 1 node can give me 2000 TPS for inserts, two nodes should double that. Does it work like that?
If not, then what does scaling actually mean, and how can I relate it to latency or throughput?
Cassandra is scalable. Your case is just a bit simplified, since two nodes is not really a case of high scalability. You should be aware of the token partitioning algorithm used by Cassandra. As soon as you understand it, there should not be any questions. There are plenty of presentations about that, e.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
In case of replication factor 1 everything is simple:
Each key-value pair you save/read from/to Cassandra is a query to one of Cassandra nodes in the cluster. Data is evenly distributed among nodes (see details of partitioning algorithm). So you always have total load evenly distributed among all nodes -> more nodes you have more load they can carry (and it is linear). In this case the system should of course be configured in a right way to avoid different kinds of network bottlenecks.
In case of replication factor more than 1 the situation is a bit more complicated, however the principle is the same.
There are a lot of factors that contribute to this result.
A) Check your replication factor. Although not desirable, in your case you can set it to 1.
B) Look into the shard (partition key) in your primary key. If you are not changing it in your tests, then you are loading skewed data and the table is not scaling out to 2 nodes.
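Point B can be illustrated with a small Python sketch that counts which node each write lands on. It is a deliberate simplification: keys are placed by hash modulo the node count rather than by Cassandra's token ring, MD5 stands in for the real partitioner, and the `sensor-*` keys and the `owner` function are made up for illustration.

```python
import hashlib
from collections import Counter

def owner(partition_key: str, node_count: int) -> int:
    """Pick a node index from the key's hash (modulo placement, for illustration only)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % node_count

# A load test that varies the partition key spreads writes over both nodes.
varied = Counter(owner(f"sensor-{i}", 2) for i in range(10_000))
# A load test that reuses one partition key sends every write to the same node.
constant = Counter(owner("sensor-0", 2) for _ in range(10_000))

print("varied keys:", dict(varied))
print("constant key:", dict(constant))
```

If the stress test looks like the second case, adding a node cannot raise throughput, because the extra node never receives any of the writes.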
What does it mean when we say Cassandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing-out their available resources, you can simply add another node(s), adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know what exactly is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference by adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra

Is it possible to read from a Cassandra cluster even when a node has failed?

I have a Cassandra cluster with 4 nodes. Is it possible to read data only from the available nodes when one node is down? Or is there a configurable property to handle this type of scenario?
Thanks
You can do this with replication, yes. There are a few things you need:
Set the replication factor to at least 2. The more replicas, the more failed nodes you can cope with. However, the more replicas you have, the worse your performance is, since more nodes duplicate the work.
Choose an appropriate consistency level. The consistency level (CL) determines how many nodes need to be involved in a read or write operation. CL.ALL means use all replicas, so you can't tolerate any failures. CL.ONE means use just one node. CL.QUORUM means a majority of replicas (floor(RF/2) + 1).
You can read and write data from any node, not just ones containing that data. If you use a client library like Hector, you should tell it about all nodes and it will avoid ones that are down, as well as load balance amongst the available nodes.
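The trade-off between replication factor and consistency level in the list above can be tabulated with a short Python sketch. The function names are hypothetical, and the count is per replica set: it says how many of the replicas of a given row may be down, not how many nodes in the whole cluster.

```python
def required_replicas(consistency_level: str, rf: int) -> int:
    """Replicas that must respond for an operation at the given consistency level."""
    levels = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}
    return levels[consistency_level.upper()]

def tolerated_failures(consistency_level: str, rf: int) -> int:
    """Replicas of a row that can be down while the operation still succeeds."""
    return rf - required_replicas(consistency_level, rf)

for rf in (2, 3, 5):
    for cl in ("ONE", "QUORUM", "ALL"):
        print(f"RF={rf} CL={cl}: tolerates {tolerated_failures(cl, rf)} down replica(s)")
```

For the 4-node cluster in the question, RF = 2 with reads at ONE is enough to keep serving reads while any single node is down.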
