Cassandra data not distributed evenly

I have a 3 node cluster with a replication factor of 3. nodetool status shows that one node has 100gb of data, another 90gb, and another 30gb. Each node owns 100% of the data.
I'm using a unique URL as my clustering key, so I would imagine the data should be spread around evenly. Even so, since RF is 3, all nodes should contain the same amount of data. Any ideas what's going on?
Thanks.

What write consistency level is being used? My guess is that it is ONE, in which case the replicas only catch up eventually, especially if the data was dumped in one shot. Try LOCAL_QUORUM to avoid this issue in the future.
Try running nodetool repair; it should bring the data back in sync on all nodes.
Remember that writes from cqlsh use CONSISTENCY ONE by default.
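For completeness, here's how the write consistency can be pinned at the driver level. This is only a minimal sketch assuming the DataStax Python driver (the question doesn't say which client is used), with a made-up keyspace and table; in cqlsh the equivalent is simply running `CONSISTENCY LOCAL_QUORUM;` before your statements.

```python
# Minimal sketch, assuming the DataStax Python driver (pip install cassandra-driver).
# Contact point, keyspace, and table names are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Make LOCAL_QUORUM the default consistency for every request on this session.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(["127.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")

# With RF=3, this write is only acknowledged once 2 replicas have accepted it.
session.execute(
    "INSERT INTO urls (url, payload) VALUES (%s, %s)",
    ("https://example.com/page", "some value"),
)
cluster.shutdown()
```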

Related

Why do tables get out of sync over time when Write Consistency ALL is used?

I am running a Cassandra 3.11.4 cluster with 1 data center, 2 racks, and 11 nodes. My keyspaces and tables are set to a replication factor of 2. I use a Prometheus/Grafana combination to monitor the cluster.
Observation: During (massive) inserts using write consistency level ALL (i.e. 2 nodes), the affected tables/nodes slowly get out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I used ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: if I dare to use write consistency ANY, I get exactly that, and even though all nodes are online Cassandra does not even seem to attempt to write to all nodes. In either case (ANY or ALL) I have to perform incremental repairs.
First of all, your expectation is correct: writes, regardless of the consistency level (ALL or ONE or ANY or whatever), make every attempt to write to all replicas. The different write consistency levels only differ in when "success" is reported to the client: ALL waits until all writes are done, while ONE waits for just one (and does the others in the background). So unless one of your nodes goes down, or is severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, the other nodes save the writes it missed and replay them to it later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that has already been through a repair process) and "unrepaired" data: new data that has not yet been through repair. This does not mean that the new data is inconsistent or differs between nodes; it just means that nobody has checked it yet. To mark this new data as "repaired" you need to run an (incremental) repair; it will verify that the data does not differ between nodes and then mark it as "repaired".
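To make that concrete, here is a minimal sketch (assuming the DataStax Python driver; keyspace and table names are made up) of setting the consistency level per statement. Whatever level you pick, the coordinator still sends the mutation to every replica; the level only controls how many acknowledgements the client waits for before `execute()` returns.

```python
# Minimal sketch, assuming the DataStax Python driver; keyspace/table are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# The CL only changes when "success" is reported back, not how many replicas
# receive the write: ALL waits for every replica, ONE returns after the first ack.
insert_all = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL,
)
session.execute(insert_all, (42, "acked by both replicas before the client sees OK"))
cluster.shutdown()
```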

Cassandra shows incorrect load

As you can see in the output, the second node's Owns shows 66.1% and its Load is 834.12GB, whereas the third node has a lower Load (801.56GB) than node2 but a higher Owns percentage.
Does this mean the output is not accurate?
The percentages will not match with your actual data stored on disk. Note that the heading reads Owns (effective). That column indicates the percentage of the available token ranges that the node is responsible for. As each node is responsible for about two-thirds, I'm going to guess that you have specified a replication factor of two.
While Cassandra's Murmur3 hash does a good job of spreading data around evenly, large partitions can put more load on a small number of nodes (as Alex indicated).
It could be that some of the load is data that the node is no longer responsible for. For example, if you started with a single node loaded with 100GB and then grew the cluster, the first node still holds data for token ranges it no longer owns, even after streaming them to the new nodes. You can remove this data with nodetool cleanup.
Or it could be that a node was down for some time and you haven't run repair yet.
Edit: As Alex mentioned, it's also possible that you have large partitions, in which case the data won't be distributed as evenly.
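If you want to check for large or uneven partitions from the client side, one rough option is to read each node's system.size_estimates table. A minimal sketch, assuming the DataStax Python driver and a hypothetical keyspace name (nodetool tablehistograms gives you similar per-node information):

```python
# Rough sketch, assuming the DataStax Python driver. system.size_estimates is
# node-local, so connect to each node in turn to see that node's view.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

rows = session.execute(
    "SELECT table_name, mean_partition_size, partitions_count "
    "FROM system.size_estimates WHERE keyspace_name = %s",
    ("my_keyspace",),
)

# Aggregate the per-token-range estimates into a crude mean partition size per table.
totals = {}
for r in rows:
    size, count = totals.get(r.table_name, (0, 0))
    totals[r.table_name] = (size + r.mean_partition_size * r.partitions_count,
                            count + r.partitions_count)

for table, (size, count) in totals.items():
    if count:
        print(f"{table}: ~{size / count / 1024**2:.1f} MB mean partition size")
cluster.shutdown()
```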

How to Manage Node Failure with Cassandra Replication Factor 1?

I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.
(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key is unique enough, with RF=1 each of your 3 nodes contains 1/3 of your data. By the way, in this case CL=ONE/ALL is basically the same, as there is only 1 replica for your data and no high availability (HA).
Requests for "existing" data from the 2 up nodes will succeed. Still, when one of the 3 nodes is down, 1/3 of your client requests (for existing data) will not succeed, because basically 1/3 of your data is unavailable until the down node comes back up. Note that nodetool repair is irrelevant when using RF=1, so I guess restoring from a snapshot (if you have one available) is the only option.
While the node is down, once you execute nodetool decommission, the token ranges will be redistributed between the 2 up nodes, but that will apply only to new writes and reads.
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/
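From the application side, this is roughly what it looks like. A minimal sketch assuming the DataStax Python driver, with made-up keyspace and table names: reads for keys whose single replica is down fail with an Unavailable/NoHostAvailable error, while keys owned by the two live nodes keep working.

```python
# Minimal sketch, assuming the DataStax Python driver; keyspace/table names are made up.
# With RF=1 there is exactly one replica per key, so CL=ONE and CL=ALL behave the same.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster, NoHostAvailable
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("my_cache")

query = SimpleStatement(
    "SELECT value FROM entries WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE,
)

def get(key):
    try:
        row = session.execute(query, (key,)).one()
        return row.value if row else None
    except (Unavailable, NoHostAvailable):
        # The single replica owning this key is down: roughly 1/3 of keys
        # will land here until the dead node is replaced or restored.
        return None
```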

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because with CL = ANY the coordinator will be happy to store only a hint (and the data) (we are assuming here that all the other nodes owning the corresponding partition key ranges are down), and we could potentially lose our data if the coordinator fails. But wait a minute... as I understand it, if we used CL = ONE and, for example, had only one (of three) nodes available for this partition key, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations: all nodes for a particular token are gone. Then it's better to reject the write than to accept it with such a big risk of losing it along with the coordinator.
CL=ANY should probably never be used on a production server. The written data will be unreadable until the hint is replayed to a node owning that partition, because you can't read data while it only exists in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have the data stored in both a) the commit log and memtable on one node and b) the hints log. These are likely different nodes, but they could be the same about 1/3 of the time. So, yes, with both CL=ONE and CL=ANY you risk complete loss of the data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, hints are only stored for 3 hours by default; for outages longer than that you have to run repairs. You can only repair if at least one node somewhere in the cluster has a copy of the data (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one node in the cluster has the write in its commit log no matter what. With ANY, the write may in the worst case be stored only as a hint on the coordinator (other nodes can't access it), and hints are kept for 3 hours by default. Once those 3 hours pass, with ANY you lose the data if the other two replicas are still down.
If you are worried about the risk, use QUORUM and 2 nodes will have to acknowledge the write. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latencies than ONE, but you can always add more nodes, etc., should the load increase dramatically.
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, all 3 replica nodes will actually get the write. Consistency is just about how long you want to wait for a response from them. If you use ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all 3. If some of them don't respond, the coordinator saves the missed writes as hints.
Most of the time, ANY in production is a bad idea.

Concurrent writes to cassandra replicas - Is duplication possible?

I have a two-machine cluster running Cassandra 1.2.6. I am using a keyspace with a replication factor of 2, but my application requires me to write to both replicas in parallel while also letting Cassandra do the replication, and I'm hoping that Cassandra does not duplicate the key/value on the replica nodes.
For example:
I have nodes Node1 and Node2. I have a keyspace configured with replication factor 2 and a column family to which I push key/value pairs.
I use a python client (pycassa) to write to the cluster.
A key, "KeyX", hashes to Node1 and Node2. (I find out which key hashes to which servers through the node tool command. (`$nodetool getendpoints KeyspaceName ColumnFamilyName KeyHexString`)
I use a client to write (KeyX, Value) concurrently to the nodes Node1 and Node2. (In the connection pool I give only the specific server name)
When writing, I wait for one write to succeed (to the master). (Consistency level ONE)
Now, I monitor through the `$nodetool status` command the amount of disk space that the cluster uses.
I write around 100 keys each having 2MB values.
Ideally this should store around 400MB on disk, with some overhead for storing keys, which should be marginal compared to the value sizes I am using.
Observations:
If I do not write to all the nodes that the key hashes to, Cassandra internally handles replication and the data size is around 400MB. (200MB on each node for 100 keys with 2MB value)
If I do write to all the nodes the key hashes to, Cassandra writes more than the expected amount of data to disk. It is as high as 15% more: in our tests Cassandra wrote ~460MB instead of 400MB.
My question is: is this behavior (15% overhead) expected? Is there any configuration we need to tweak so that Cassandra properly handles concurrent writes to all the replicas?
Thanks!
There are two possible causes of the 15% extra space that I can think of.
One is because sometimes a replica will store two copies of a column temporarily. If you write a column twice in Cassandra at slightly different times, the two copies may go into separate memtables so end up in separate SSTables on disk. At some point later, when the SSTables get merged through the compaction process, the older value will be discarded, freeing up the space. In your test you could run nodetool compact to force compaction and see if the space usage goes down.
Another possible cause depends on how you did the test when you didn't write to both nodes. If you did this at consistency level ONE, it is possible some of the writes were dropped by the other replica, so it doesn't have all the keys yet. You can be sure it does by running nodetool repair. So the space used in your first observation may not be for all the keys.
You should be aware that writing to all replicas at consistency level ONE does not guarantee that each replica holds a copy. The node that is receiving the data does not have to store it to return success for the write, even if it is a replica. It may be overloaded (in your workload, this would most likely be due to not enough I/O to write the data out) and drop the write, while succeeding in writing it to a different replica. This would cause less space to be used in your second observation, but probably isn't happening in your test since it is a relatively small amount of data.
If you need to guarantee you have two copies you should write at consistency level ALL and only write it once.
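As an illustration of that last point: the question uses pycassa, but the sketch below assumes the newer DataStax Python driver, with made-up table and column names. It looks up the replicas for a key through the driver's token metadata (a rough programmatic analogue of `nodetool getendpoints`, assuming a single text partition key whose routing key is just its UTF-8 bytes) and then issues one CL=ALL write instead of sending the same value to each replica by hand.

```python
# Illustration only: assumes the DataStax Python driver; keyspace/table are made up.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["node1", "node2"])
session = cluster.connect("my_keyspace")

# Rough analogue of `nodetool getendpoints`: which nodes replicate this key?
# Assumes a single text partition key, whose routing key is simply its UTF-8 bytes.
replicas = cluster.metadata.get_replicas("my_keyspace", "KeyX".encode("utf-8"))
print([host.address for host in replicas])

# Write once at CL=ALL and let Cassandra place both copies, instead of writing the
# same (key, value) to each replica yourself, which creates the duplicate columns
# that compaction later has to merge away.
stmt = SimpleStatement(
    "INSERT INTO blobs (key, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL,
)
session.execute(stmt, ("KeyX", b"\x00" * (2 * 1024 * 1024)))
cluster.shutdown()
```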
