I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS instances.
1. How many data centers will be used?
2. How many nodes will each rack have?
3. Is it possible to divide a column family across multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise
until reaching the first node in another rack. NetworkTopologyStrategy
attempts to place replicas on distinct racks because nodes in the same
rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, in which the quorum ((replication_factor / 2) + 1, using integer division) is computed only from the replicas in the same data center as the coordinator node. This reduces the effects of inter-data-center latency.
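As a concrete sketch of both ideas (NetworkTopologyStrategy placement plus LOCAL_QUORUM reads), here is roughly what it looks like with the DataStax Java driver. The keyspace, table, data center names, and contact point below are placeholders rather than anything from the question, and the 3-replicas-per-DC layout is just an assumption for illustration:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // NetworkTopologyStrategy: 3 replicas in each logical data center.
        session.execute("CREATE KEYSPACE IF NOT EXISTS app_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS app_ks.users "
                + "(user_id text PRIMARY KEY, name text)");

        // LOCAL_QUORUM only waits for a quorum of replicas in the coordinator's
        // data center: (3 / 2) + 1 = 2, so cross-DC latency stays off the read path.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM app_ks.users WHERE user_id = ?", "some-id");
        read.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(read);

        cluster.close();
    }
}
```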
As for your questions:
1. How many data centers are used is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you are planning to use consistency level ONE. That is, if one instance goes down, your application only needs to be able to find one other replica. But even then, the snitch can only find the data on one instance or the other.
2. Again, you can define the number of nodes that you wish to have in each rack. But as I indicated in #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
3. I do not believe it is possible to divide a column family over multiple keyspaces, but I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances and need to be able to handle one going dark, you will want to make sure both instances have a copy of every row (replication factor of 2).
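For the two-instance case specifically, here is a minimal sketch (assuming the DataStax Java driver and a placeholder keyspace name) of a keyspace where both instances hold every row:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TwoNodeKeyspace {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // RF = 2 on a 2-node cluster: every row is stored on both instances,
        // so the application keeps working if either one goes dark.
        session.execute("CREATE KEYSPACE IF NOT EXISTS app_ks WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 2}");

        cluster.close();
    }
}
```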
Really, the logical data center/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches
Background:
I'm new to Cassandra and still trying to wrap my mind around the internal workings.
I'm thinking of using Cassandra in an application that will only ever have a limited number of nodes (less than 10, most commonly 3). Ideally each node in my cluster would have a complete copy of all of the application data. So, I'm considering setting the replication factor to the cluster size. When additional nodes are added, I would alter the keyspace to increment the replication factor setting (running nodetool repair to ensure that the new node gets the necessary data).
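Concretely, that step would look something like this (a sketch assuming the DataStax Java driver and a placeholder keyspace and data center name); the ALTER only changes metadata, so the nodetool repair is what actually streams existing data to the new replicas:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GrowReplicationFactor {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // After adding a fourth node, raise RF so every node keeps a full copy.
        // Follow up with `nodetool repair` on each node to stream existing data.
        session.execute("ALTER KEYSPACE app_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'dc1': 4}");

        cluster.close();
    }
}
```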
I would be using the NetworkTopologyStrategy for replication to take advantage of knowledge about datacenters.
In this situation, how does partitioning actually work? I've read about a combination of nodes and partition keys forming a ring in Cassandra. If all of my nodes are "responsible" for each piece of data regardless of the hash value calculated by the partitioner, do I just have a ring of one partition key?
Are there tremendous downfalls to this type of Cassandra deployment? I'm guessing there would be lots of asynchronous replication going on in the background as data was propagated to every node, but this is one of the design goals so I'm okay with it.
The consistency level on reads would probably generally be "one" or "local_one".
The consistency level on writes would generally be "two".
Actual questions to answer:
Is replication factor == cluster size a common (or even a reasonable) deployment strategy aside from the obvious case of a cluster of one?
Do I actually have a ring of one partition where all possible values generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If I were to use a write consistency of "one" does Cassandra always write the data to the node contacted by the client?
Are there other downfalls to this strategy that I don't know about?
Do I actually have a ring of one partition where all possible values
generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If all of my nodes are "responsible" for each piece of data regardless
of the hash value calculated by the partitioner, do I just have a ring
of one partition key?
Not exactly: C* nodes still have token ranges, and C* still assigns a primary replica to the "responsible" node. But with RF = N (where N is the number of nodes), every node will also hold a replica of every row. So in essence the implication is the same as what you described.
Are there tremendous downfalls to this type of Cassandra deployment?
Are there other downfalls to this strategy that I don't know about?
Not that I can think of. I guess you might be more susceptible than average to inconsistent data, so use C*'s anti-entropy mechanisms to counter this (repair, read repair, hinted handoff).
Consistency level QUORUM or ALL would start to get expensive, but I see you don't intend to use them.
Is replication factor == cluster size a common (or even a reasonable)
deployment strategy aside from the obvious case of a cluster of one?
It's not common; I guess you are looking for super high availability and all your data fits on one box. I don't think I've ever seen a C* deployment with RF > 5. By far the most common is RF = 3.
If I were to use a write consistency of "one" does Cassandra always
write the data to the node contacted by the client?
This depends on the load-balancing policy in your driver. Often we select a token-aware policy (assuming you're using one of the DataStax drivers), in which case requests are routed to the primary replica automatically. You could use round robin in your case and have the same effect.
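For instance, with the DataStax Java driver (the 3.x API), a token-aware policy wrapping a DC-aware round-robin child policy looks roughly like this; the contact point and data center name are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareClient {
    public static void main(String[] args) {
        // TokenAwarePolicy routes each request to a replica owning the partition's
        // token; with RF = N every node is a replica anyway, so plain round-robin
        // would have much the same effect here.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().withLocalDc("dc1").build()))
                .build();
        Session session = cluster.connect();
        // ... issue queries through the session ...
        cluster.close();
    }
}
```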
The primary downfall will be increased write costs at the coordinator level as you add nodes. The maximum number of replicas written to that I've seen is around 8 (5 for other data centers and 3 for local replicas).
In practice this will mean reduced stability when performing large or batched writes (greater than 1 MB) and a lower per-node write TPS.
The primary advantage is you can do a lot of things that would normally be awful or impossible to do. Want to use secondary indexes? They will probably work reasonably well (assuming cardinality and partition size don't become your bottleneck there). Want to add a custom UDF that does a GroupBy, or use very large IN queries? They'll probably work too.
As @Phact mentions, it is not a common usage pattern, and I primarily saw it used with DSE Search in low-write-throughput use cases that had requirements for 'single node' features from Solr. But for those same use cases with pure Cassandra, you'd get some benefits on the read side and be able to do expensive queries that are normally impossible in a more distributed cluster.
I'm taking over our Cassandra cluster after the previous admin left, so I'm busy trying to learn as much as I can about it. I'm going through all the documentation on DataStax's site, as we're using their product.
That said, on the replication factor part I'm having a bit of trouble understanding why I wouldn't set the replication factor to the number of nodes I have. I currently have four nodes and one data center, and all nodes are located in the same physical location as well.
What, if any, benefit would there be to having a replication factor of less than 4?
I'm just thinking that it would be beneficial from a fault tolerance standpoint if each node had its own copy/replica of the data; I'm not sure why I would want fewer replicas than the number of nodes I have. Are there performance tradeoffs or other reasons? Am I COMPLETELY missing the concept here (entirely possible)?
There are a few reasons why you might not want to increase your RF from 3 to 4:
Increasing your RF effectively multiplies your original data volume by that amount. Depending on your data volume and data density you may not want to incur the additional storage hit. Keeping RF below the number of nodes is also what lets the cluster hold more data than one node's capacity.
Depending on your consistency level you could experience a performance hit. For example, when writing at QUORUM consistency level (CL) with an RF of 3 you wait for 2 replicas to respond before confirming the write to the client; with an RF of 4 you would be waiting for 3 (see the short sketch after these points).
Regardless of the CL, every write will eventually go to every node. This is more activity on your cluster and may not perform well if your nodes aren't scaled for that workload.
You mentioned fault tolerance. With an RF of 4 and reads on CL one, you can absorb up to 3 of your servers being down simultaneously and your app will still be up. From a fault tolerance perspective this is pretty impressive, but also unlikely. My guess would be if you have 3 nodes down at the same time in the same dc, the 4th is probably also down (natural disaster, flood, who knows...).
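To make the quorum arithmetic from the second point above concrete, here is a tiny standalone sketch (plain Java, no driver needed):

```java
public class QuorumMath {
    // Quorum for a given replication factor: (RF / 2) + 1, using integer division.
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println("RF 3 -> quorum of " + quorum(3)); // 2 replicas must ack
        System.out.println("RF 4 -> quorum of " + quorum(4)); // 3 replicas must ack
    }
}
```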
At the end of the day it all depends on your needs, and C* is nothing if not configurable. An RF of 3 is very common among Cassandra implementations.
Check out this deck by Joe Chu.
The reason why your RF is often less than the number of nodes in the cluster is explained in the post: Cassandra column family bigger than nodes drive space. This post provides insight into this interesting aspect of Cassandra replication. Here's a summary of the post:
QUESTION: ...every node has 2Tb drive space and the column family is replicated on every node, so every node contains a full copy of it... after some years that column family will exceed 2Tb...
Answer: RF can be less than the number of nodes and does not need to grow when you add more nodes.
For example, if you today had 3 nodes with RF 3, each node will
contain a copy of all the data, as you say. But then if you add 3 more
nodes and keep RF at 3, each node will have half the data. You can
keep adding more nodes so each node contains a smaller and smaller
proportion of the data . . . no limit in principle to
how big your data can be.
I have created a two-node Cassandra cluster and am trying to perform a load test. I find that one node or two nodes does not make much difference in the throughput. I had supposed that if 1 node can give me 2000 TPS for inserts, then two nodes should double that amount. Does it work like that?
If it does not, then what does scaling actually mean, and how can I relate it to latency or throughput?
Cassandra is scalable. It's just that your case is a bit simplified, since two nodes is not really a case of high scalability. You should be aware of the token partitioning algorithm used by Cassandra. As soon as you understand it, there should not be any questions. There are plenty of presentations about that, e.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
In the case of replication factor 1, everything is simple:
Each key-value pair you write to or read from Cassandra is a query to one of the Cassandra nodes in the cluster. Data is evenly distributed among nodes (see the details of the partitioning algorithm). So the total load is always evenly distributed across all nodes: the more nodes you have, the more load they can carry (and it scales linearly). The system should of course be configured correctly to avoid various kinds of network bottlenecks.
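As a loose illustration of that even spread (this is not Cassandra's actual Murmur3 partitioner, just a simplified stand-in to show that evenly hashed keys divide the load evenly between two nodes):

```java
import java.util.HashMap;
import java.util.Map;

public class RingSketch {
    public static void main(String[] args) {
        // Simplified stand-in for token-based partitioning: hash each key and
        // assign it to one of two "nodes"; the counts come out roughly equal,
        // which is why adding nodes adds capacity linearly at RF = 1.
        int nodes = 2;
        Map<Integer, Integer> keysPerNode = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            String key = "user-" + i;
            int node = Math.floorMod(key.hashCode(), nodes);
            keysPerNode.merge(node, 1, Integer::sum);
        }
        System.out.println(keysPerNode); // roughly 50,000 keys per node
    }
}
```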
In the case of a replication factor greater than 1 the situation is a bit more complicated, but the principle is the same.
There are a lot of factors that contribute to this result.
A) Check your replication factor. Although not desirable, in your case you can set it to 1.
B) Look at the shard (partition key) part of your primary key. If you are not varying it in your tests, then you are loading the data skewed and the table is not scaling out to 2 nodes.
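A sketch of what point B means in practice, with hypothetical keyspace, table, and key names (DataStax Java driver assumed):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SkewedLoadExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS bench WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS bench.events "
                + "(device_id text, ts timeuuid, payload text, "
                + "PRIMARY KEY (device_id, ts))");

        for (int i = 0; i < 10000; i++) {
            // Varying device_id spreads partitions (and load) across both nodes.
            // If every insert reused the same device_id, all writes would land on
            // the one node owning that token, and a second node would not help.
            String deviceId = "device-" + (i % 100);
            session.execute(
                    "INSERT INTO bench.events (device_id, ts, payload) VALUES (?, now(), ?)",
                    deviceId, "data-" + i);
        }
        cluster.close();
    }
}
```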
What does it mean when we say Cassandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing-out their available resources, you can simply add another node(s), adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know what exactly is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference by adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra
I have configured a Cassandra cluster with 4 nodes and 2 seeds. When I run nodetool status, the 'Owns' values for the individual nodes are as follows:
node1 (seed1) - 24.5%
node2 - 15.0%
node3 (seed2) - 46.1%
node4 - 14.5%
Should the 'Owns' percentages be equal? If so, how can I make them equal? Also, when I take down node2 and node4 I am still able to insert/retrieve data with a replication factor of 2, but when I take down node1 and node2 I cannot. I get the following exception:
SEVERE: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
java.lang.Exception: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
at com.july.storage.cassandra.util.CassandraDBUtil.getData(CassandraDBUtil.java:197)
at com.july.storage.cassandra.util.CassandraDBUtil.doSelect(CassandraDBUtil.java:370)
at com.july.storage.cassandra.action.CassandraHandler.getCall(CassandraHandler.java:127)
at com.july.storage.service.StorageService.GET(StorageService.java:58)
at com.july.storage.cassandra.action.CassandraHandler.main(CassandraHandler.java:571)
Caused by: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:59)
at me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:130)
at me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:100)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.CqlQuery.execute(CqlQuery.java:99)
at com.july.storage.cassandra.util.CassandraDBUtil.getData(CassandraDBUtil.java:179)
Thanks,
Sangeetha
An imbalance can depend on a lot of factors, and you haven't given us very much to go on.
How much data is in your cluster? If there's not very much then this is completely normal. If you only have a thousand rows in the cluster then it's extremely unlikely that you would get an even distribution.
Have you enabled vnodes? If you're using a recent version, like 1.2.5, this is enabled by default. If you have an older version or have disabled vnodes then it's not uncommon to have unbalanced nodes. You can move nodes manually using nodetool, but don't do it on your production system; test it first in a test environment.
Which partitioner are you using? If you don't know, you're using the random partitioner, which should increase the likelihood of an even distribution. But if you've changed to an ordered partitioner you can't expect to get an even distribution; you need to move the nodes manually as you add data to the cluster.
The reason why you can't retrieve data when two nodes are down is probably that the row you're retrieving resides on those two nodes; with only four nodes and a replication factor of two that's quite likely, especially since you can get the data when only the other two nodes are up. Try another row and you will most likely get different results, and try changing the consistency level of the request to ONE (you didn't say what consistency level you were using, so I assume you're reading at QUORUM, which with a replication factor of two means that both replicas must be up).
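As an illustration of that last suggestion (using the DataStax Java driver rather than the Hector client shown in the stack trace above, and placeholder keyspace/table names), a read at consistency ONE only needs one of the two replicas of the row to be alive:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadAtOne {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // With RF = 2, QUORUM = (2 / 2) + 1 = 2, i.e. both replicas must answer.
        // ONE succeeds as long as either replica of the requested row is up.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM app_ks.users WHERE user_id = ?", "some-id");
        read.setConsistencyLevel(ConsistencyLevel.ONE);
        for (Row row : session.execute(read)) {
            System.out.println(row);
        }
        cluster.close();
    }
}
```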
What is the best write/read strategy that is fault tolerant and fast for reads when all nodes are up?
I have 2 replicas in each datacenter and at first I was considering using QUORUM for writes and LOCAL_QUORUM for reads but reads would fail if one node crashes.
The other strategy that I came up with is to use QUORUM for writes and TWO for reads. It should work fast in normal conditions (because we will get results from the nearest nodes first) and it will work slower when any node crashes.
Is this a situation where it is recommended to use consistency level TWO, or is it intended for some other purpose?
When would you use CL THREE?
Do you have a better strategy for consistent and fault tolerant writes/reads?
You first have to choose whether you want consistency or availability. If you choose consistency, then you need to have R + W > N, where R is how many nodes you read from, W is how many nodes you write to, and N is the number of replicas.
Then you have to choose whether you want reads/writes to always span multiple data centers.
Once you make those choices, you can then choose your consistency level (or it will be dictated to you).
If, for example, you decide you need consistency, and you don't want writes/reads to span multiple data centers, then you can read at LOCAL_QUORUM (which is 2 in your case) and write at ONE, or vice versa.
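A sketch of that particular combination with the DataStax Java driver (the keyspace, table, and contact point are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadWriteConsistency {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Write acknowledged by a single replica (W = 1)...
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO app_ks.users (user_id, name) VALUES (?, ?)", "id-1", "Alice");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        // ...read from a local quorum: with 2 replicas per data center,
        // LOCAL_QUORUM = (2 / 2) + 1 = 2, giving R + W = 3 > 2 per the rule above.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM app_ks.users WHERE user_id = ?", "id-1");
        read.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(read);

        cluster.close();
    }
}
```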
2 copies per DC is an odd choice. Typically you want to do LOCAL_QUORUM with 3 replicas in each data center. That lets you read and write using only nodes within a data center, but allows 1 node to go down.