Unexpected unequal sharing of data between nodes - cassandra

I have configured cassandra cluster with 4 nodes with 2 seeds. When I run nodetool status, the owns for the individual nodes are as follows,
node1 (seed1) - 24.5%
node2 - 15.0%
node3(seed2) - 46.1%
node4 - 14.5%
should owns should have equal %. If so how can i make that equal. And when i make down node2 and node4 i can able to insert/retrieve data with replication factor 2. But when i make node1 or node2 i can not.Getting the following exception,
SEVERE: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
java.lang.Exception: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
at com.july.storage.cassandra.util.CassandraDBUtil.getData(CassandraDBUtil.java:197)
at com.july.storage.cassandra.util.CassandraDBUtil.doSelect(CassandraDBUtil.java:370)
at com.july.storage.cassandra.action.CassandraHandler.getCall(CassandraHandler.java:127)
at com.july.storage.service.StorageService.GET(StorageService.java:58)
at com.july.storage.cassandra.action.CassandraHandler.main(CassandraHandler.java:571)
Caused by: me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:59)
at me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:130)
at me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:100)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.CqlQuery.execute(CqlQuery.java:99)
at com.july.storage.cassandra.util.CassandraDBUtil.getData(CassandraDBUtil.java:179)
Thanks,
Sangeetha

An imbalance can depend on a lot of factors, and you haven't given us very much to go on.
How much data is in your cluster? If there's not very much then this is completly normal. If you only have a thousand rows in the cluster then it's extremely unlikely that you would get an even distribution.
Have you enabled vnodes? If you're using a recent version, like 1.2.5, this is enabled by default. If you have an older version or have disabled vnodes then it's not uncommon to have unbalanced nodes. You can mode nodes manually using nodetool, but don't do it on your production system, test it first in a test environment.
Which partitioner are you using? If you don't know you're using a random partitioner, which should increase the likelihood of an even distribution, but if you've changed to an ordered partitioner you can't expect to get even distribution, you need to move the nodes manually as you add data to the cluster.
The reason why you can't retrieve data when two nodes are down is probably that the row you're retrieving resided on those two nodes, with only four nodes and a replication factor of two it's quite likely -- especially since you can get the data when only the other two nodes are up. Try another row and you will most likely get different results, and try changing the consistency level of the request to one (you didn't say what consistency level you were using, so I assume you're reading at quorum, which with a replication factor of two means that both nodes must be up).

Related

How to determine the sync status is up to date for particular node in a Cassandra cluster?

Suppose I have two node cassandra cluster and they are reside on physically different data-centers. Suppose the database inside that cluster has replication factor is 2 which means every data in that database should be sync with each other. suppose this database is a massive database which have millions of records of its tables. I named those nodes centers as node1 and node2. Suppose node2 is not reliable and there was a crash on that server and take few days to fix and get the server back to up and running state. After that according to my understating there should be a gap between node1 and node2 and it may take significant time to sync node2 with node1. So need a way to measure the gap between node2 and node1 for the mean time of sync happen? After some times how should I assure that node2 is equal to node1? Please correct me if im wrong with this question according to the cassandra architechure.
So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to be able to help control how far out of sync things get. There's a trade off between availability v.s. consistency (with performance implications as well). The CL setting is configured at connection time and/or each statement level. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on (a number of nodes that you're comfortable with - knowing the more nodes you immediately require the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have in nodes becoming unavailable without client failure). If you specify less than RF it doesn't mean that RF won't be met, it just means that they don't need to immediately acknowledge the write to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if cassandra finds a particular row doesn't match from the nodes it's comparing, it will "fix" them during the read before you get your results - this is called read repair). There are a handful of CL options by the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUOURM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or manually run to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC the node resides or the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, Ops Center can run repairs in the background fixing "chunks" of data - cycling the process repetitively.
NodeSync
Similar to repair, this is a DSE specific tool similar to repair that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over for application. If time runs out, hint collection stops and one of the other options, above, need to be performed to catch things up.
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim

Possible to take half of Cassandra nodes down without affecting the application?

If there is a 4 node Cassandra cluster, is it possible to configure Cassandra in a way to have half of the nodes down (two in this case) without affecting the applications?
Also how long can nodes be down without Cassandra cancelling the write queue?
This depends on the client CL and DC replication factor.
Let's assume the RF is 4 (all), if the client has a CL=ONE or LOCAL_ONE, the application would not notice any issues. Any other client CL would have problems (e.g. cl=local_quorum of 4 is 3, allowing only a single node to be down).
Let's assume the RF=1 or 2. If CL=ONE or LOCAL_ONE, the application would be unaffected by queries that only manipulate data on the available nodes. However, any access to rows that only exist on the unavailable nodes would be impacted. In other words, CL=ONE or LOCAL_ONE only works if you're manipulating data that has at least one node available to return the response (You only need ONE to respond in this scenario). If the rows you're querying are on both of the unavailable nodes, you'll get an error stating something like: Expected response of 1, received 0.
Many applications configure CL to be some sort of quorum (local or not) - so in that case, the application would certainly fail unless you had RF=5 (so at least 5 nodes). Quorum of 5 is 3, allowing for 2 nodes to fail.
Hopefully that makes sense.
Yes, assuming you are talking about all four nodes in one data centre, if you set your replication factor to 3 or greater and your read and write consistency level to ONE.
For writes the nodes that are up will store hints for the nodes that are down, so when they come back up they can write the data. How long the nodes store these hints can be set in cassandra.yaml.

How does Cassandra partitioning work when replication factor == cluster size?

Background:
I'm new to Cassandra and still trying to wrap my mind around the internal workings.
I'm thinking of using Cassandra in an application that will only ever have a limited number of nodes (less than 10, most commonly 3). Ideally each node in my cluster would have a complete copy of all of the application data. So, I'm considering setting replication factor to cluster size. When additional nodes are added, I would alter the keyspace to increment the replication factor setting (nodetool repair to ensure that it gets the necessary data).
I would be using the NetworkTopologyStrategy for replication to take advantage of knowledge about datacenters.
In this situation, how does partitioning actually work? I've read about a combination of nodes and partition keys forming a ring in Cassandra. If all of my nodes are "responsible" for each piece of data regardless of the hash value calculated by the partitioner, do I just have a ring of one partition key?
Are there tremendous downfalls to this type of Cassandra deployment? I'm guessing there would be lots of asynchronous replication going on in the background as data was propagated to every node, but this is one of the design goals so I'm okay with it.
The consistency level on reads would probably generally be "one" or "local_one".
The consistency level on writes would generally be "two".
Actual questions to answer:
Is replication factor == cluster size a common (or even a reasonable) deployment strategy aside from the obvious case of a cluster of one?
Do I actually have a ring of one partition where all possible values generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If I were to use a write consistency of "one" does Cassandra always write the data to the node contacted by the client?
Are there other downfalls to this strategy that I don't know about?
Do I actually have a ring of one partition where all possible values
generated by the partitioner go to the one partition?
Is each node considered "responsible" for every row of data?
If all of my nodes are "responsible" for each piece of data regardless
of the hash value calculated by the partitioner, do I just have a ring
of one partition key?
Not exactly, C* nodes still have token ranges and c* still assigns a primary replica to the "responsible" node. But all nodes will also have a replica with RF = N (where N is number of nodes). So in essence the implication is the same as what you described.
Are there tremendous downfalls to this type of Cassandra deployment?
Are there other downfalls to this strategy that I don't know about?
Not that I can think of, I guess you might be more susceptible than average to inconsistent data so use C*'s anti-entropy mechanisms to counter this (repair, read repair, hinted handoff).
Consistency level quorum or all would start to get expensive but I see you don't intend to use them.
Is replication factor == cluster size a common (or even a reasonable)
deployment strategy aside from the obvious case of a cluster of one?
It's not common, I guess you are looking for super high availability and all your data fits on one box. I don't think I've ever seen a c* deployment with RF > 5. Far and wide RF = 3.
If I were to use a write consistency of "one" does Cassandra always
write the data to the node contacted by the client?
This depends on your load balancing policies at the driver. Often we select token aware policies (assuming you're using one of the Datastax drivers), in which case requests are routed to the primary replica automatically. You could use round robin in your case and have the same effect.
The primary downfall will be increased write costs at the coordinator level as you add nodes. The maximum number of replicas written to I've seen is around 8 (5 for other data centers and 3 for local replicas).
In practice this will mean a reduced stability while performing large or batched writes (greater than 1mb) or a lower per node write TPS.
The primary advantage is you can do a lot of things that'd normally be awful and impossible to do. Want to use secondary indexes? probably will work reasonably well (assuming cardinality and partition size doesn't become your bottleneck there). Want to add a custom UDF that does GroupBy or use very large IN queries it'll probably work.
It is as #Phact mentions not a common usage pattern and I primarily saw it used with DSE Search on low write throughput use cases that had requirements for 'single node' features from Solr, but for those same use cases with pure Cassandra you'd get some benefits on the read side and be able to do expensive queries that are normally impossible in a more distributed cluster.

Cassandra rack concept and database structure

I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS2 instances.
How many Data Centres will be used?
How many nodes will each rack have?
Is it possible to divide a column family in multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise
until reaching the first node in another rack. NetworkTopologyStrategy
attempts to place replicas on distinct racks because nodes in the same
rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
In this way, you can also query your data by LOCAL_QUORUM, in which QUORUM ((replication_factor / 2) + 1) is only computed from the nodes present in the same data center as the coordinator node. This reduces the effects of inter-data center latency.
As for your questions:
How many data centers are used are entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but only makes sense if you are planning to use consistency level ONE. As-in, if one instance goes down, your application only needs to worry about finding one other replica. But even then, the snitch can only find data on one instance, or the other.
Again, you can define the number of nodes that you wish to have for each rack. But as I indicated with #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you will be able to specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical datacenter/rack structure becomes more-useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above:
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches

Is to possible to read from cassandra cluster even at any node failure

I have a Cassandra cluster with 4 nodes, is it possible to read the data only from the available nodes, except the node that is down, is this possible? or is there any configurable property to handle this type of scenario.
Thanks
You can do this with replication, yes. There are a few things you need:
Set replication factor at least 2. The more replicas, the more failed nodes you can cope with. However, the more replicas you have the worse your performance is since more nodes duplicate the work.
Choose an appropriate consistency level. The consistency level (CL) determines how many nodes need to be involved with a read or write operation. CL.ALL means use all replicas so you can't tolerate any failures. CL.ONE means use just one node. CL.QUORUM means a majority of replicas (RF/2+1)
You can read and write data from any node, not just ones containing that data. If you use a client library like Hector, you should tell it about all nodes and it will avoid ones that are down, as well as load balance amongst the available nodes.

Resources